Why Do Nearest-Neighbour Algorithms Do So Well?

Brian D. Ripley
Professor of Applied Statistics, University of Oxford
ripley@stats.ox.ac.uk
http://stats.ox.ac.uk/~ripley

References

Ripley, B. D. (1996) Pattern Recognition and Neural Networks. Cambridge University Press. ISBN 0-521-48086-7.
Devijver, P. A. & Kittler, J. V. (1982) Pattern Recognition: A Statistical Approach. Prentice-Hall.
Fukunaga, K. (1990) Introduction to Statistical Pattern Recognition. Second edition. Academic Press.
Michie, D., Spiegelhalter, D. J. & Taylor, C. C. (eds) (1994) Machine Learning, Neural and Statistical Classification. Ellis Horwood.
Weiss, S. M. & Kulikowski, C. A. (1991) Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning and Expert Systems. Morgan Kaufmann.

Pattern Recognition

Supervised vs unsupervised. Most pattern recognition problems have preassigned classes of patterns, and the task is to classify new observations. Sometimes the task includes finding the patterns.

Statistical pattern recognition (so called by electrical engineers): very little is assumed about the classes of patterns; everything is learned from examples.

Structural pattern recognition: quite a lot is assumed about the classes of patterns. The main example is syntactic pattern recognition using formal languages. Very few working examples.

Examples of Pattern Recognition

Railway crossing
Types of galaxies (SKYCAT)
Medical diagnosis
Detecting abnormal cells in cervical smears
Recognizing dangerous driving conditions
Reading ZIP codes
Reading hand-written symbols (on a penpad)
Logicook microwave oven
Financial trading
Spotting fraudulent use of credit cards
Spotting fake antique furniture
Forensic studies of fingerprints, glass
Fingerprints, retinal patterns for security
Reading a vehicle number plate
Reading circuit diagrams
Industrial inspection
Oil-well lithofacies
Credit allocation rules

Machine Learning

"Machine Learning is generally taken to encompass automatic learning procedures based on logical or binary operations, that learn a task from a series of examples. Machine Learning aims to generate classifying expressions simple enough to be understood easily by humans. They must mimic human reasoning sufficiently well to provide insight into the decision process. Like statistical approaches, background knowledge may be exploited in development, but operation is assumed without human intervention." (Michie et al., 1994, p. 2)

This stresses the need for a comprehensible explanation, which is needed in some but not all pattern recognition tasks. We have already noted that we cannot explain our identification of faces, and to recognize ZIP codes no explanation is needed, just speed and accuracy.

Overview of Approaches to supervised statistical pattern recognition

Linear and logistic discrimination
Nearest-neighbour approaches
Partitioning, especially classification trees
Rule-based expert systems
Neural networks
Bayesian belief nets

There have been a few comprehensive comparisons, especially Michie et al. Nearest-neighbour approaches come at or near the top. Why?

Nearest-Neighbour Approaches

Several variants. The main one, usually attributed to Fix & Hodges (1951), is k-NN. Given a pattern x to classify:

(i) Find the k nearest patterns in the training set.
(ii) Decide the classification by a majority vote amongst these k.

[A less-popular variant is to find the nearest l patterns of each class, and choose the class with the nearest set.]

A minimal code sketch of the basic rule is given after the Theory slide below.

This begs many questions:

(a) What similarity measure?
(b) Should less relevant features be included?
(c) How do we choose k?
(d) What about tied votes (possible for k > 1)?
(e) Should we weight votes by proximity?
(f) Do we want a minimum of consensus ((k, l) rules)?

Theory

There is an elegant large-sample theory, mainly due to Cover & Hart (1967). It looks at an abundance of training cases, so we can assume that there are many examples in a region in which the prevalences of the K classes do not change significantly. In this theory there is no advantage in weighting votes, and the similarity measure is irrelevant, except to define the local region of constant prevalence.

Let E_k be the error rate of k-NN (averaged over training sets) and E* the minimum possible error rate (Bayes risk). For K classes:

    E* ≤ E_1 ≤ E*(2 − K E*/(K − 1)) ≤ 2E*

For K = 2:

    E'_2 ≤ E'_4 ≤ … ≤ E'_{2k} ≤ … ≤ E*
    E_{2k} = E_{2k−1}, so E_2 = E_1 = 2E'_2

where E'_k refers to not declaring an opinion if the vote is tied, and E_k to breaking ties at random.
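The two-class relations can be checked by a standard local argument. The following is an illustrative sketch in my own notation (p is the local posterior probability of class 1 at x), not part of the original slides:

```latex
\begin{aligned}
r^*(x)  &= \min\{p,\, 1-p\}             && \text{Bayes risk at } x \\
r_1(x)  &= p(1-p) + (1-p)p = 2p(1-p)    && \text{1-NN errs when the point and its neighbour disagree} \\
r'_2(x) &= p(1-p)^2 + (1-p)p^2 = p(1-p) && \text{2-NN, declaring a class only on a unanimous vote}
\end{aligned}
```

so r_1(x) = 2 r'_2(x) = 2 r*(x)(1 − r*(x)) ≤ 2 r*(x), and averaging over x gives E_1 = 2E'_2 ≤ 2E*.

The basic rule in steps (i) and (ii) fits in a few lines. A minimal sketch, assuming numeric feature vectors and plain Euclidean distance, i.e. ducking questions (a)-(f); the function name knn_classify and the toy data are mine, not from the slides:

```python
import numpy as np
from collections import Counter

def knn_classify(x, X_train, y_train, k=3):
    """Classify pattern x by a majority vote amongst its k nearest training patterns."""
    # (i) find the k nearest patterns in the training set (Euclidean distance)
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]
    # (ii) decide the classification by a majority vote amongst these k;
    # for equal counts the label seen first (belonging to a nearer neighbour) wins
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]

# toy usage with two made-up Gaussian classes
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
    y = np.array([0] * 20 + [1] * 20)
    print(knn_classify(np.array([2.5, 2.5]), X, y, k=5))
```

Ties (question (d)) are broken here in favour of the class of the nearer neighbours simply because Counter preserves the order in which labels are first seen; this is one of several defensible choices.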

Looking at the performance at particular x's:

[Figure: large-sample risk r_k (k odd) or r'_k (k even) of k-NN rules, for k = 1, 2, 3, 4, 5, plotted against the Bayes risk r* in a two-class problem; both axes run from 0.0 to 0.5.]

There is also a theory of (k, l)-NN rules, where a minimum vote of l is needed.

Choice of Metric

This is often ignored: I have heard k-NN described as an automatic method (as against neural networks, say) without describing how k and the metric are to be chosen.

There are studies on choosing the metric automatically, almost all as a quadratic form in the variables (equivalently, as Euclidean distance on linearly transformed variables), but Euclidean distance is the normal choice. It can be crucial to down-weight less-relevant variables. Hastie & Tibshirani's DANN uses (locally) the best linear discriminator, but good features can be poor linearly. One neat idea is to use the importance of features as measured by a non-linear method (e.g. classification trees). Where there are both continuous and categorical variables, Gower's general dissimilarity measure is the most common choice.

Data Editing

One common complaint about k-NN methods is that they can take too long to compute and need too much storage for the whole training set. The difficulties are sometimes exaggerated, as there are fast ways to find near neighbours. However, in many problems it is only necessary to retain a small proportion of the training set to approximate very well the decision boundary of the k-NN classifier. This concept is known as data editing. It can also be used to improve the performance of the classifier by removing apparent outliers.

Multiedit

1. Put all patterns in the current set.
2. Divide the current set more or less equally into V ≥ 3 sets. Use pairs cyclically as test and training sets.
3. For each pair, classify the test set using the k-NN rule from the training set.
4. Delete from the current set all those patterns in the test set which were incorrectly classified.
5. If any patterns were deleted in the last I passes, return to step 2.

The edited set is then used with the 1-NN rule. The multiedit algorithm aims to form homogeneous clusters. However, only the points on the boundaries of the clusters are really effective in defining the classifier boundaries. Condensing algorithms aim to retain only the crucial exterior points in the clusters.
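A minimal sketch of the multiedit procedure above, under the same assumptions as the earlier k-NN sketch (numeric features, Euclidean distance). Reading "use pairs cyclically" as "part j is classified by the rule built on part j+1 (mod V)" is my interpretation; the helper knn_predict and all default values are illustrative only:

```python
import numpy as np

def knn_predict(X_train, y_train, X_test, k=3):
    """k-NN prediction by majority vote (Euclidean distance)."""
    preds = []
    for x in X_test:
        nearest = np.argsort(np.linalg.norm(X_train - x, axis=1))[:k]
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)

def multiedit(X, y, V=3, I=3, k=3, seed=None):
    """Multiedit: split the current set into V parts, classify each part by the
    k-NN rule built on the next part, delete misclassified patterns, and stop
    once no deletions have occurred for I successive passes."""
    rng = np.random.default_rng(seed)
    current = np.arange(len(y))                              # 1. all patterns in the current set
    passes_without_deletion = 0
    while passes_without_deletion < I and len(current) > V:
        parts = np.array_split(rng.permutation(current), V)  # 2. divide roughly equally
        kept = []
        for j in range(V):                                   # use pairs cyclically
            test, train = parts[j], parts[(j + 1) % V]
            pred = knn_predict(X[train], y[train], X[test], k=k)  # 3. classify the test part
            kept.append(test[pred == y[test]])               # 4. delete misclassified patterns
        new_current = np.concatenate(kept)
        passes_without_deletion = passes_without_deletion + 1 if len(new_current) == len(current) else 0
        current = new_current                                # 5. repeat while deletions keep occurring
    return current   # indices of the edited set, to be used with the 1-NN rule
```

The returned indices define the edited set, which is then used with the 1-NN rule as the slide describes.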

For example, Hart (1968) gives:

1. Divide the current patterns into a store and a grabbag. One possible partition is to put the first point in the store, the rest in the grabbag.
2. Classify each sample in the grabbag by the 1-NN rule using the store as training set. If the sample is incorrectly classified, transfer it to the store.
3. Return to 2 unless no transfers occurred or the grabbag is empty.
4. Return the store.

A refinement, the reduced nearest neighbour rule of Gates (1972), is to go back over the condensed training set and drop any patterns (one at a time) which are not needed to correctly classify the rest of the (edited) training set.

[Figure: a synthetic example of 250 points, 125 of each class. The best possible decision boundary is shown as a continuous curve: this is only known for synthetic problems.]

[Figure, panels (a)-(d): reduction algorithms. The known decision boundary of the Bayes rule is shown with a solid line; the decision boundary for the 1-NN rule is shown dashed. (a) multiedit. (b) The result of retaining only those points whose posterior probability of the actual class exceeds 90% when estimated from the remaining points. (c) condense after multiedit. (d) reduced NN applied after condense to (a).]

Learning Vector Quantization

The refinements of the k-NN rule aim to choose a subset of the training set in such a way that the 1-NN rule based on this subset approximates the Bayes classifier. It is not necessary that the modified training set is a subset of the original. The approach taken in Kohonen's LVQ is to construct a modified training set iteratively. Following Kohonen, we call the modified training set the codebook. This procedure tries to represent the decision boundaries rather than the class distributions.

The original procedure, LVQ1, uses the following update rule. An example x is presented. The nearest codebook vector to x, m_c, is updated by

    m_c ← m_c + α(t)[x − m_c]   if x is classified correctly by m_c
    m_c ← m_c − α(t)[x − m_c]   if x is classified incorrectly

and all other codebook vectors are unchanged. Initially α(t) is chosen smaller than 0.1 and it is reduced linearly to zero during the fixed number of iterations.
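Hart's condensing and Gates's reduced nearest neighbour rule translate almost directly into code. A sketch under the same assumptions as before (numeric features, Euclidean 1-NN); the function names condense and reduce_nn are mine:

```python
import numpy as np

def nn1_predict(X_train, y_train, x):
    """Class of the single nearest training pattern (1-NN rule)."""
    return y_train[np.argmin(np.linalg.norm(X_train - x, axis=1))]

def condense(X, y):
    """Hart (1968): start the store with the first point, then repeatedly move
    grabbag patterns misclassified by the current store into the store, until a
    full pass makes no transfers or the grabbag is empty."""
    store, grabbag = [0], list(range(1, len(y)))
    transferred = True
    while transferred and grabbag:
        transferred, remaining = False, []
        for i in grabbag:
            if nn1_predict(X[store], y[store], X[i]) != y[i]:
                store.append(i)           # incorrectly classified: transfer to the store
                transferred = True
            else:
                remaining.append(i)
        grabbag = remaining
    return store

def reduce_nn(X_edit, y_edit, store):
    """Gates (1972): go back over the condensed set and drop patterns, one at a
    time, whose removal does not cause any pattern of the (edited) training set
    to be misclassified by the remaining subset."""
    store = list(store)
    for i in list(store):
        trial = [j for j in store if j != i]
        if trial and all(nn1_predict(X_edit[trial], y_edit[trial], X_edit[j]) == y_edit[j]
                         for j in range(len(y_edit))):
            store = trial
    return store
```

On the synthetic example these would be applied to the multiedited set, mirroring panels (c) and (d) above: condensed = condense(X[edited], y[edited]) and then reduced = reduce_nn(X[edited], y[edited], condensed).

The LVQ1 update rule can be sketched in the same style. The codebook initialisation (a random subset of the training set), the codebook size, the number of iterations and the starting value of α are my assumptions; only the update rule itself is taken from the slide:

```python
import numpy as np

def lvq1(X, y, codebook_size=30, n_iter=10000, alpha0=0.05, seed=None):
    """LVQ1: present examples one at a time and move only the nearest codebook
    vector, towards the example if it classifies it correctly and away from it
    otherwise, with alpha(t) decreasing linearly to zero."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(y), size=codebook_size, replace=False)
    M, labels = X[idx].astype(float), y[idx]          # initial codebook and its class labels
    for t in range(n_iter):
        alpha = alpha0 * (1.0 - t / n_iter)           # starts below 0.1, decreases linearly to zero
        i = rng.integers(len(y))                      # present an example x
        x, target = X[i], y[i]
        c = np.argmin(np.linalg.norm(M - x, axis=1))  # nearest codebook vector m_c
        if labels[c] == target:
            M[c] += alpha * (x - M[c])                # classified correctly: move towards x
        else:
            M[c] -= alpha * (x - M[c])                # classified incorrectly: move away from x
    return M, labels
```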

[Figure: results of learning vector quantization. The initially chosen codebook is shown by small circles, the result of OLVQ1 by '+', and the result of subsequently applying 25,000 passes of LVQ2.1 by triangles. The known decision boundary of the Bayes rule is also shown.]

[Figure: result of OLVQ1 with a codebook of size 30. The decision boundary is shown as a dashed line.]

[Figure: further results of LVQ with a larger codebook. This time the triangles show the results from LVQ3.]

Comparative Studies

Comparative studies often only show how well a particular class of naïve users might do with particular methods. The most comprehensive study is that published by Michie et al., yet:

"In the trials reported in this book, we used the nearest neighbour (k = 1) classifier with no condensing. ... Distances were scaled using the standard deviation for each attribute, with the calculation conditional on the class." [p. 36]

"Although this method did very well on the whole, as expected it was the slowest of all for the very large datasets. However, it is known that substantial time saving can be effected, at the expense of some slight loss of accuracy, by using a condensed version of the training data. ... It is clear that substantially improved accuracy can be obtained with careful choice of variables, but the current implementation is far too slow. When scaling of attributes is not important, such as in object recognition databases, k-NN is first in the trials. Yet the explanatory power of k-NN might be said to be very small." [p. 216]

So, despite being the simplest of all the methods in the comparison, it was implemented in a cursory way and still came out as one of the best algorithms, consistently beating neural networks by a large margin. In the hands of a less inexperienced user it would do better; in particular, editing helps greatly with overlapping classes, as in our synthetic example.

A Forensic Example

Data on 214 fragments of glass. Each has a measured refractive index and composition (weight percent of oxides of Na, Mg, Al, Si, K, Ca, Ba and Fe). Grouped as window float glass (70), window non-float glass (76), vehicle window glass (17) and other (containers, tableware, headlamps) (22). Used 89 to train, 96 to test, excluding headlamps.

[Figure: classification tree for the forensic glass example, splitting on Mg, RI, Al, Na, Ca and Fe. At each node the majority classification is given, together with the error rate amongst training-set cases reaching that node.]

methods                                 error rates %
linear discriminant                        41    22
nearest neighbour                          26    17
neural network with 2 hidden units         38    16
ditto, averaged over fits                  38    14
neural network with 6 hidden units         33    15
ditto, averaged over fits                  28    12
classification tree                        28    15

In the second column of error rates, confusion between the two types of window glass is allowed.
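For readers who want to reproduce something like the nearest-neighbour row, the glass fragment data are distributed by the UCI repository as glass.data. The sketch below is only indicative: the column names, the scaling and the random split are my assumptions (the slides do not specify which fragments went into the 89/96 split), and it uses scikit-learn's 1-NN classifier rather than the implementation from the Michie et al. trials:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# UCI glass identification data: id, refractive index, 8 oxide weight-percents, glass type
cols = ["Id", "RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe", "Type"]
glass = pd.read_csv("glass.data", names=cols)       # assumes a local copy of the UCI file
glass = glass[glass["Type"] != 7]                   # type 7 = headlamps, excluded as in the slides

X = glass[["RI", "Na", "Mg", "Al", "Si", "K", "Ca", "Ba", "Fe"]].values
y = glass["Type"].values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=96, stratify=y, random_state=0)  # a rough stand-in for the 89/96 split

# scale each attribute (the Michie et al. trials scaled by per-attribute standard deviation)
scaler = StandardScaler().fit(X_train)
knn = KNeighborsClassifier(n_neighbors=1)            # k = 1, no condensing, as in the trials
knn.fit(scaler.transform(X_train), y_train)
print("test error rate: %.1f%%" % (100 * (1 - knn.score(scaler.transform(X_test), y_test))))
```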