Nearest Neighbor Classifiers

Size: px

Start display at page:

Download "Nearest Neighbor Classifiers"

Anthony Blankenship
5 years ago
Views:

1 Nearest Neighbor Classifiers TNM033 Data Mining Techniques Linköping University When I see a bird that walks like a duck and swims like a duck and quacks like a duck, I call that bird a duck. James Whitcomb Riley Christian Alfons Eirik Fredäng Per Lind chral647@student.liu.se eirfr098@student.liu.se perli379@student.liu.se

2 Abstract The famous words by James Whitcomb Riley quoted on the front page actually capture the intuitive essence of nearest neighbor classification; if the nature of an object is unknown to us, we assume it's of the same kind as objects with similar features. Nearest neighbor classification is a technique for dividing datasets into different classes. The categorization of a new object is determined by the labels of the most similar already existing objects. To optimize the technique for a specific dataset, a different number of nearest neighbors to evaluate can be chosen. The technique can be improved by weighting the influence of the nearest neighbors based on their distances. Using a distance function other than regular Euclidean distance, the method can be extended to operate on datasets with symbolic attributes. One of the strengths of the technique is that it is intuitive to understand and easy to implement, and compared to other object classification techniques it performs well, although it can become quite computationally demanding when operating on large datasets.

3 Table of Contents 1 Introduction K-Nearest Neighbor Classification Algorithm Distance Metric Choosing the Number of Neighbors Applications Conclusions...5

4 1 Introduction When gathering data through a data mining process, it is often useful to be able to categorize the sampled entities by arranging them into predefined groups. In order to accomplish satisfactory classifications, well-chosen classification models are often used. Classification is used to categorize an object as belonging to a certain group based on the similarity in attributes with instances of the group. When trying to determine the classification of an object in a set of data, a simple technique commonly used is the nearest neighbor classifier method [1]. As with other classifiers, the data is split into a training set and a test set in nearest neighbor classification. Based on the training items, that have already been correctly classified in the training set, the algorithm predicts which group the test data belongs to [1]. In nearest neighbor classification, a classification model isn't built prior to the labeling of the test items. Instead, existing values are evaluated during the classification process. This kind of approach is known as a lazy learner [1]. In this paper, the k-nearest neighbor (k-nn) classification algorithm is described, and its advantages and disadvantages are presented and discussed. Two methods for comparing symbolic attributes overlapping and a value difference metric are compared [2]. Other classification techniques, such as rule-based classifiers and Bayesian classifiers, provide alternative approaches to solve the same types of problems. Naïve Bayesian classifiers assume independence between different attributes, allowing classification based on attribute probability, whereas rule-based classifiers make classification conclusions based on fulfillment of predefined conditions [1]. 2 K-Nearest Neighbor Classification Suppose that an object is sampled with a set of different attributes, but the group to which the object belongs is unknown. Assuming its group can be determined from its attributes, different algorithms can be used to automate the classification process. As previously mentioned, a nearest neighbor classifier is a technique for classifying elements based on the classification of the elements in the training set that are most similar to the test example. With the k-nearest neighbor technique, this is done by evaluating the k number of closest neighbors [1] Algorithm The k-nearest neighbor classification algorithm is fairly straightforward, and is carried out in a series of simple steps for each item to be classified. In pseudocode, it can be 1

5 expressed fairly compactly [1]: k number of nearest neighbors for each object X in the test set do calculate the distance D(X,Y) between X and every object Y in the training set neighborhood the k neighbors in the training set closest to X X.class SelectClass(neighborhood) end for Since the neighborhood search is an essential part of the algorithm, it is preferable to use methods that are more efficient than simple brute-force comparison of all objects (such as kd-trees). Although the basic algorithm is simple, there exist several different approaches of varying complexity to calculate the distance D(X,Y) and selecting the class based on the neighborhood Distance Metric Calculating the distance between objects with only numerical attributes is straightforward. For instance, it can be done using the Euclidean distance of the normalized attributes as the distance metric [3]: D euclidean X, Y = d x i, y i 2 = i=1 i=1 m m x i y i 2 (1) X and Y are the two compared objects and m is their number of attributes. However, for distance calculations involving symbolic (nominal) data, other methods are necessary; for symbolic data a similarity metric is used to measure distance, the most basic of which is the overlap metric. The overlap metric simply tests for equality between two values, so that different values get distance 1 whereas equal values get distance 0 [3]: d overlap x, y = { 0, x= y 1, x y (2) A more precise metric for symbolic attributes is the value difference metric (VDM). The value difference metric function computes a matrix of the value differences for each attribute in the training set. A simplified VDM measures the distance between two values x and y as follows [3]: C d vdm, a x, y = P P x, c y, c q (3) c=1 P x,c is the conditional probability that the current attribute has the value x when belonging to the class c, C is the total number of classes, and q is a constant (usually 1 or 2) [3]. Neither of the measures mentioned above are able to handle heterogeneous data, i.e. both numerical and symbolic attributes. An obvious solution is using a combination of 2

6 the measurement methods, such as the heterogeneous Euclidean-overlap metric function or the heterogeneous value difference metric, that selects one of two functions depending on data type [3] Choosing the Number of Neighbors Classifying an object based on the classifications of its neighbors can be done by simple majority voting, meaning that the class that is the most common among the neighbors is used [1]: c X = argmax c k n=1 I c, c n (4) Again, k is the number of nearest neighbors, and c is class. I c 1, c 2 = { 1,c 1=c 2 0,c 1 c 2 (5) If there's a tie, the decision can be done by randomly selecting one of the most common classes. However, for binary categorization, where there are only two different classes, odd values of k are often used to ensure that there will always be a majority class, thereby avoiding such conflicts [4]. Although performing the actual classification of an object once its neighbors have been determined is quite simple, choosing a suitable number of neighbors to evaluate, k, is not always trivial. The problem lies in that different values of k may yield significantly different results, which is illustrated in figure 1. Figure 1: Nearest neighborhoods for different values of k; k = 1, k = 2 and k = 3 It is important to find a value of k that is neither too small nor too large. Very small values of k make the classification sensitive to noise, which may result in overfitting, whereas very large values of k may cause misclassifications such as the one illustrated in figure 2. 3

7 Figure 2: Nearest neighborhood for k = 20 To make the algorithm less sensitive to the choice of k, the influence of each selected neighbor can be weighted differently depending on its distance, which is simply called distance-weighted voting. For instance, the following weighting function can be used [1]: c X = argmax c k n=1 w n I c,c n = argmax c k n=1 1 d X,Y n 2 I c,c n (6) Doing this kind of weighting will limit the influence from distant neighbors. 3 Applications The k-nearest neighbor method is a common tool and can be found in various data mining tools, e.g., WEKA 1. The WEKA implementation supports automated selection of the value of k using cross-validation, distance weighting and several accelerated neighbor search algorithms [5]. In its most basic form, k-nearest neighbor classification operates on numerical attributes, as these used to calculate Euclidean distances between objects. However, since the algorithm can be extended to make it possible to process data with symbolic features, it can be utilized in other fields such as recognizing letters in handwriting, computer vision, etc. [2]. Nearest neighbor classifiers can be used in medicine to automate the diagnosis process. In a simplified scenario, high values of properties such as body temperature and C-reactive protein (CRP) content could indicate that the patient in question suffers from an inflammation, as illustrated in figure 3. 1 Waikato Environment for Knowledge Analysis, a Java-based suite of software for data mining tasks 4

8 CRP content body temperature Figure 3: A simple example of how k-nearest neighbor classification can be used for automated diagnosis The k-nearest neighbor classification method has even indicated sufficient accuracy for practical use in medicine. For instance, Alzheimer's disease and frontotemporal dementia are common neurodegenerative cognitive disorders that are difficult to tell apart. Using k-nearest neighbor classification based on analysis of single photon emission computed tomography images of the brain, an 88 % accuracy in automated classification has been achieved [6]. 4 Conclusions The k-nearest neighbor classification algorithm often produces results with an accuracy comparable to that of more complicated classification methods [7, 8]. It is relatively intuitive and straightforward to implement, but it also has a few drawbacks. For instance, as previously mentioned, the value of k must be well-chosen and balanced individually for all different types of datasets in order for the algorithm to perform optimally. When using large datasets, the k-nearest neighbor classification method can become computationally heavy, since a classification model isn't calculated beforehand. However, this can also be an advantage since evaluation can begin without timeconsuming precalculations. Furthermore, the structure of the algorithm allows items to be evaluated independently of each other; this allows the classification process to be carried out in parallel, which in turn can produce significant performance boosts if supported by the hardware [8]. 5

9 References [1] Pang-Ning Tan, Michael Steinbach and Vipin Kumar. Introduction to Data Mining. Addison Wesley, [2] Scott Cost and Steven Salzberg. A Weighted Nearest Neighbor Algorithm for Learning with Symbolic Features. In Machine Learning (January 1993), pp [3] D. Randall Wilson and Tony R. Martinez. Improved Heterogeneous Distance Functions. In Journal of Artificial Intelligence Research (January 1997), pp [4] The StatSoft Electronic Statistics Textbook. K-Nearest Neighbors. Retrieved from [5] Pentaho Wiki. Package weka.classifiers.lazy. Retrieved from [6] Jean-François Horn, Marie-Odile Habert, Aurélie Kas, Zoulikha Malek, Philippe Maksud, Lucette Lacomblez, Alain Giron and Bernard Fertil. Differential automatic diagnosis between Alzheimer's disease and frontotemporal dementia based on perfusion SPECT images. In Artificial Intelligence in Medicine (October 2009), pp [7] Faraz Ahmadi Torshizi. Data Mining, Experiencing with WEKA. Retrieved from [8] Vassilis Athitsos. Nearest Neighbor Retrieval and Classification. Retrieved from 6

Topic 1 Classification Alternatives

Topic 1 Classification Alternatives [Jiawei Han, Micheline Kamber, Jian Pei. 2011. Data Mining Concepts and Techniques. 3 rd Ed. Morgan Kaufmann. ISBN: 9380931913.] 1 Contents 2. Classification Using Frequent