An Improved KNN Classification Algorithm based on Sampling

International Conference on Advances in Materials, Machinery, Electrical Engineering (AMMEE 2017)

An Improved KNN Classification Algorithm based on Sampling

Zhiwei Cheng1, a, Caisen Chen1, b, Xuehuan Qiu1, c and Huan Xie1, d
1 The Academy of Armored Forces Engineering, Beijing 100072, China
a cheng.zw@mail.scut.edu.cn, b caisenchen@163.com, c qiuxuehuan@139.com, d 38763316@qq.com

Keywords: KNN, classification algorithm, computational overhead, sampling.

Abstract. The k nearest neighbor (KNN) algorithm is widely used as a simple and effective classification algorithm. To find the k nearest neighbors, the traditional KNN classification algorithm has to calculate the distance from the test sample to every training sample. When the training data set is very large, this produces a high computational overhead and slows down classification. We therefore optimize the distance calculation of the KNN algorithm. Since KNN only considers the k training samples closest to the test sample, training samples that lie far away have no effect on the classification result. The improved method samples the training data around the test data, which reduces the number of distance calculations from the test data to the training data and thus reduces the time complexity of the algorithm. The experimental results show that the optimized KNN classification algorithm is superior to the traditional KNN algorithm.

1. Introduction

In the field of machine learning and data mining, classification is an important way of analyzing data and of making predictions from data. Classification is one of the supervised learning methods: by analyzing the training sample data, a classification rule is obtained to predict the class of test sample data. Common classification algorithms include decision trees, association rules, Bayesian methods, neural networks, genetic algorithms, and the KNN algorithm [1]. This paper studies the traditional KNN algorithm. KNN is an instance-based lazy classification method [2] and has no training process [3]. Because of its simple implementation and good performance, it is widely used in machine learning and data mining. However, the traditional KNN algorithm has a drawback in the testing process: it needs to calculate the distance from the new sample to all training samples during classification. When the test data is large, the traditional KNN algorithm becomes slow and performs poorly in prediction and analysis. At present, many improved methods for the KNN algorithm have been put forward. Ugur et al. [4] proposed a density-based training sample cutting method, which evens out the sample distribution density and reduces the amount of KNN calculation. Nikolay K et al. [5] proposed establishing a classification model that uses a central document instead of the original samples, reducing the number of samples whose similarity must be calculated by the KNN algorithm and thus increasing the classification speed. Wanita S et al. [6] proposed a reduction algorithm and a merging algorithm that lower the computational complexity; without reducing the original accuracy rate, a fast classification algorithm is constructed. Nevertheless, KNN still suffers from a low-performance problem caused by calculating the distance from the test sample to all training samples [7].
To solve this problem, in this paper we propose a method that samples the training samples, which improves the efficiency of KNN classification by reducing the number of distance calculations. During testing, the training samples are sampled along each axis direction (d1, d2, d3, ..., dn) around the value of each test sample. We then calculate the distances from the test sample to the sampled training samples and select the k training samples with the smallest distances to make the decision for the test sample.
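As an illustration of the idea outlined above, the following minimal sketch (Python with NumPy; the function name and array-based interface are illustrative assumptions, not part of the paper) restricts the distance calculation to the training samples that fall inside an axis-aligned window of half-width L around the test point and then applies the usual k-nearest-neighbour majority vote. How L is chosen, and what happens when too few points fall inside the window, are addressed in Section 3.

```python
import numpy as np
from collections import Counter

def sampled_knn_predict(X_train, y_train, x_test, k, L):
    """Classify x_test using only the training points inside an
    axis-aligned window of half-width L centred on x_test."""
    # Keep only the training samples whose every coordinate lies
    # within (x_test - L, x_test + L): the sampling window.
    inside = np.all(np.abs(X_train - x_test) < L, axis=1)
    X_win, y_win = X_train[inside], y_train[inside]

    # Distances are computed only for the sampled points,
    # not for the whole training set.
    dists = np.linalg.norm(X_win - x_test, axis=1)
    nearest = np.argsort(dists)[:k]

    # Majority vote among the k nearest sampled neighbours.
    return Counter(y_win[nearest].tolist()).most_common(1)[0][0]
```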

2. The Principle of the KNN Algorithm

Among all classification algorithms, the k-nearest neighbor method (KNN) is the simplest and most understandable. The input of the algorithm is the feature vector of an instance. By calculating the distance between the new data and the feature values of the training data, we choose the K (K >= 1) nearest neighbors to decide the classification. If K = 1, the new data is simply assigned to the class of its nearest neighbor. The KNN classification algorithm has three elements, namely: the choice of the K value, the distance measure, and the classification decision rule.

(1) K value selection. The choice of the K value has a great influence on the output of the KNN algorithm. When K = 1, the KNN algorithm is called the nearest neighbor algorithm. If the value of K is small, fewer training instances are used for prediction, which increases the estimation error. On the other hand, if the K value is large, more training instances are used for prediction, which increases the approximation error of learning. When K = N, every new instance is assigned to the same class.

(2) Distance measurement. The KNN algorithm requires all features to be quantifiable. If a feature is non-numeric, it must be quantified using a quantization criterion. To prevent features with larger value ranges from overwhelming those with smaller ranges, the sample features must be normalized. In the n-dimensional real vector space, the distance between two instance points reflects their similarity. There are many distance measures, but we generally use the Euclidean distance. Assume the general distance is $L_p$:

$$L_p(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^p \right)^{1/p}, \quad p \ge 1 \qquad (1)$$

where $x_i, x_j \in \mathbb{R}^n$, $x_i = (x_i^{(1)}, x_i^{(2)}, \ldots, x_i^{(n)})^T$ and $x_j = (x_j^{(1)}, x_j^{(2)}, \ldots, x_j^{(n)})^T$.

① When p = 1, the distance $L_p$ is the Manhattan distance:

$$L_1(x_i, x_j) = \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right| \qquad (2)$$

② When p = 2, the distance $L_p$ is the Euclidean distance:

$$L_2(x_i, x_j) = \left( \sum_{l=1}^{n} \left| x_i^{(l)} - x_j^{(l)} \right|^2 \right)^{1/2} \qquad (3)$$

(3) Classification decision rule. The classification decision of the KNN algorithm usually adopts majority voting: among the selected K training samples, the classifier assigns the new instance to the label with the largest proportion. In the decision we can also weight the votes by the distance from the new instance: the smaller the distance, the greater the weight; the larger the distance, the smaller the weight. The purpose of the classification decision is to minimize misclassification. Mathematically, minimizing the misclassification rate is equivalent to minimizing the empirical risk. Take the 0-1 loss function as the classification loss function; then the classification function is $f: \mathbb{R}^n \rightarrow \{c_1, c_2, \ldots, c_K\}$, where $c_1, c_2, \ldots, c_K$ are the training data labels, and the error probability is:

$$P(Y \ne f(X)) = 1 - P(Y = f(X)) \qquad (4)$$
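As a small illustrative sketch of the distance measures above (the helper names and the min-max normalization formula are assumptions for illustration, not taken from the paper), the Minkowski distance of Eq. (1) reduces to the Manhattan distance for p = 1 and to the Euclidean distance for p = 2:

```python
import numpy as np

def min_max_scale(X):
    """Normalise each feature to [0, 1] so that features with large value
    ranges do not overwhelm those with small ranges (Section 2, element (2))."""
    X = np.asarray(X, dtype=float)
    span = X.max(axis=0) - X.min(axis=0)
    return (X - X.min(axis=0)) / np.where(span == 0, 1.0, span)

def minkowski(xi, xj, p=2):
    """General L_p distance of Eq. (1): p = 1 gives the Manhattan distance
    of Eq. (2), p = 2 gives the Euclidean distance of Eq. (3)."""
    return float(np.sum(np.abs(np.asarray(xi) - np.asarray(xj)) ** p) ** (1.0 / p))

# For example, minkowski([0, 0], [3, 4], p=2) == 5.0 (Euclidean)
# and minkowski([0, 0], [3, 4], p=1) == 7.0 (Manhattan).
```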

In the set of the K nearest training points $N_k(x)$, if the category label assigned to the new instance is $c_j$ ($c_j \in \{c_1, c_2, \ldots, c_K\}$), then the misclassification rate is:

$$\frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i \ne c_j) = 1 - \frac{1}{k} \sum_{x_i \in N_k(x)} I(y_i = c_j) \qquad (5)$$

To minimize the empirical risk, we should make $\sum_{x_i \in N_k(x)} I(y_i = c_j)$ as large as possible, so we get:

$$c_j = \arg\max_{c_j} \sum_{x_i \in N_k(x)} I(y_i = c_j) \qquad (6)$$

The three elements of the KNN algorithm determine the classification accuracy, efficiency, speed, and error rate of the algorithm. The distance calculation determines the time and space complexity of the KNN algorithm.

3. A KNN Optimization Algorithm Based on Sampling

Considering the distance calculation rule of the KNN algorithm, training points that are very far from the test point have no influence on the decision of the classification algorithm. So we can cut away these uninfluential training points before calculating the distances. We propose a trimming method that samples within bounds parallel to the coordinate axes: along each dimension we take a distance L as the sampling width. The sampling in two-dimensional coordinates is shown in Figure 1.

Fig. 1 The sampling in two-dimensional coordinates

We count the number of points N in the square (excluding the points on the boundary), and we must ensure that the prediction accuracy after sampling does not change significantly, so we must ensure that N > K. How to determine the sampling width is the difficulty of optimizing the KNN algorithm. In this paper, we take the maximum value range over the dimensions as the upper limit of the cut width, and one quarter of this maximum as the lower limit:

$$L_{\max} = \max_i \left( d_{\max}^i - d_{\min}^i \right) \quad (d^i \text{ is the value of the } i\text{-th dimension}) \qquad (7)$$

$$\frac{1}{4} L_{\max} \le L \le L_{\max} \qquad (8)$$

We take the new instance point E as the center; assuming its coordinates are $E(x_1, x_2, \ldots, x_n)$, the range of training points whose distance we need to calculate is:

$$\left( E_1(x_1 - L, x_2 - L, \ldots, x_n - L),\; E_2(x_1 + L, x_2 + L, \ldots, x_n + L) \right) \qquad (9)$$

When the width L is taken, we must guarantee N > K, where N is the number of sampled points. However, we are sampling in a polygon, and in fact some of the sampled points are invalid, because with the new instance point as the center, points at the same distance form a circle. Figure 2 shows the two-dimensional case of sampling: the outer large polygon is V1, the circle of radius L is V2, and the small polygon inscribed in the circle is V3. We call the points between V1 and V2 dirty points.
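A possible way to realise the width bounds of Eqs. (7)-(8) and the window of Eq. (9) is sketched below (illustrative only; the function names and the strict inequality used to exclude points on the boundary are assumptions):

```python
import numpy as np

def width_bounds(X_train):
    """Sampling-width bounds of Eqs. (7)-(8): L_max is the largest
    per-dimension value range, and L_max / 4 is the lower bound."""
    ranges = X_train.max(axis=0) - X_train.min(axis=0)
    L_max = float(ranges.max())
    return L_max / 4.0, L_max

def window_mask(X_train, x_test, L):
    """Boolean mask of the training points strictly inside the axis-aligned
    window (E1(x - L, ...), E2(x + L, ...)) of Eq. (9)."""
    return np.all(np.abs(X_train - x_test) < L, axis=1)

# Example check of the N > K condition before the sampling is accepted:
# L_min, L_max = width_bounds(X_train)
# N = int(window_mask(X_train, x_test, L_min).sum())
# sampling_ok = N > K
```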

If there are too many dirty points, we cannot guarantee that the number of valid sampled points satisfies Valid(N) > K. To ensure that the number of points inside the circle is greater than K, our approach is to ensure N3 > K, where N3 is the number of training data in V3. From the geometric relationship between the circle and its inscribed square, the width of V3 is $\sqrt{2}L$.

Fig. 2 The sampling in two-dimensional coordinates

To determine the data in V3, we take the new instance point E as the center. Assume its coordinates are $E(x_1, x_2, \ldots, x_n)$; then the coordinate range of E is:

$$\left( E_1\left(x_1 - \tfrac{\sqrt{2}}{2}L, \ldots, x_n - \tfrac{\sqrt{2}}{2}L\right),\; E_2\left(x_1 + \tfrac{\sqrt{2}}{2}L, \ldots, x_n + \tfrac{\sqrt{2}}{2}L\right) \right) \qquad (10)$$

Let $N_2$ be the number of distance calculations performed by the optimized KNN algorithm and $N_3$ the number performed by the traditional KNN algorithm. Comparing the two algorithms on the number of distance calculations, we can draw the following conclusions.

The best situation: $N_2 = \dfrac{N_3}{4^n}$ (n represents the number of dimensions) (11)

The poorest situation: $N_2 = N_3$ (12)

From formula (11) we can see that in the best situation the number of distance calculations of the optimized KNN algorithm is reduced exponentially. The optimized algorithm reduces the time and space complexity, so we achieve the goal of algorithm optimization. However, the density of data differs between data sets, so the same cutting method will achieve different efficiency on different data sets. If there are many training points near the new instance, the sampling meets the requirements; however, when there are too few training points near the new instance, too few training points are collected to meet the requirements. In this case, we increase the sampling width L to ensure that the optimized KNN algorithm can classify correctly. When the new instance lies at the edge of the data and even the maximum width L cannot satisfy the condition, the algorithm automatically falls back to the traditional KNN algorithm for classification.

4. Experiment Verification and Result Analysis

The experiment uses the 2008 Sogou news corpus, which contains seven classes, namely history, culture, reading, military, society and law, entertainment, and IT. The number of training samples for each category is 80, and the rest are used for testing. The experiments were run on an Intel Core i5-4210U 1.7 GHz processor with 4 GB of memory on a 64-bit Linux system. In the experiments, the traditional KNN algorithm and the optimized KNN algorithm are used to classify the data, and their running times are compared. The first experiment changes the value of K: with the training set and testing set fixed, we compare the running time of the two algorithms, as shown in Figure 3. The second experiment compares the classification time for data of different dimensions: with K = 9, we compare the running time of the two algorithms, as shown in Figure 4. As shown in Fig. 3, the consumption time of both algorithms increases with the K value, and the time consumed by the optimized KNN algorithm is significantly lower than that of the traditional KNN algorithm. As shown in Fig. 4, the consumption time of the KNN algorithm increases as the dimension increases; however, the time consumed by the traditional KNN algorithm grows rapidly with the dimension, while that of the optimized KNN algorithm grows slowly.
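The following end-to-end sketch combines the pieces of Section 3: it samples inside the window, widens L when fewer than K points are found, and falls back to the traditional full-distance KNN for edge data. It also returns the number of distance calculations performed, which can be used for comparisons of the kind reported in this section. The widening factor, the function names, and the NumPy-array interface are assumptions for illustration, not details given in the paper.

```python
import numpy as np
from collections import Counter

def optimized_knn(X_train, y_train, x_test, k, L_min, L_max, grow=1.5):
    """Sampling-based KNN with fallback to the traditional algorithm.
    Returns (predicted_label, number_of_distance_calculations)."""
    L = L_min
    while True:
        inside = np.all(np.abs(X_train - x_test) < L, axis=1)
        if int(inside.sum()) > k:                 # N > K: sampling is valid
            X_win, y_win = X_train[inside], y_train[inside]
            d = np.linalg.norm(X_win - x_test, axis=1)
            nearest = np.argsort(d)[:k]
            label = Counter(y_win[nearest].tolist()).most_common(1)[0][0]
            return label, len(X_win)
        if L >= L_max:                            # edge case: give up sampling
            break
        L = min(L * grow, L_max)                  # widen the window and retry

    # Fallback: traditional KNN over all training points.
    d = np.linalg.norm(X_train - x_test, axis=1)
    nearest = np.argsort(d)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0], len(X_train)
```

In the best case only the points inside the initial window are examined (cf. Eq. (11)); in the worst case the fallback computes distances to all N3 training points (Eq. (12)).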

By comparing these two groups of timings, the optimized algorithm is shown to be superior to the traditional KNN algorithm.

Fig. 3 Time consumption of changing the K value (traditional KNN vs. optimized KNN)

Fig. 4 Time consumption of changing the dimension (traditional KNN vs. optimized KNN)

5. Summary

In this paper, the KNN algorithm is optimized by reducing the number of distance calculations from the test point to the training points. The optimized algorithm reduces the running time of the algorithm and is clearly superior to the traditional KNN algorithm, so it has general practical value. However, the algorithm is not ideal for processing edge data: if the sampled data cannot meet the requirements, the traditional KNN algorithm is used automatically for classification. How to handle edge data is a direction for further research.

Acknowledgments

The authors gratefully acknowledge the anonymous reviewers for their valuable comments. This research was supported by the National Natural Science Foundation of China under Grant No. 614058.

References

[1]. Wu X, Kumar V, Quinlan J R, et al. Top 10 algorithms in data mining [J]. Knowledge and Information Systems, 2008, 14(1): 1-37.

[2]. Witten I H, Frank E. Data Mining: Practical Machine Learning Tools and Techniques [M]. 2nd Edition. New York, USA: Elsevier, 2005.

[3]. Tang Jiliang, Gao Huiji, Hu Xia, et al. Exploiting homophily effect for trust prediction [C] // Proc of the 6th ACM International Conference on Web Search and Data Mining. New York: ACM Press, 2013: 53-62.

[4]. Ugur K, Golbeck J. SUNNY: a new algorithm for trust inference in social networks using probabilistic confidence models [C] // Proc of the National Conference on Artificial Intelligence. 2007: 1377-1382.

[5]. Nikolay K, Thomo A. Trust prediction from user-item ratings [J]. Social Network Analysis and Mining, 2013, 3(3): 749-759.

[6]. Wanita S, Nepal S, Paris C. A survey of trust in social networks [J]. ACM Computing Surveys, 2013, 45(4): Article 47.

[7]. Wang Aiping, Xu Xiaoyan, Guo Weiwei, et al. Text categorization method based on improved KNN algorithm [J]. Microcomputer & Its Applications, 2011, 30(18): 8-10. (In Chinese).