Machine Learning: nearest neighbors classification
Luigi Cerulo, Department of Science and Technology, University of Sannio

Nearest Neighbors Classification
The idea is based on the hypothesis that things that are alike are likely to have properties that are alike. We can use this principle to classify data by placing it in the category with the most similar, or "nearest", neighbors. As the saying goes: birds of a feather flock together.

Successful applications
- Computer vision applications, including optical character recognition and facial recognition in both still images and video.
- Predicting whether a person will enjoy a movie that has been recommended to them (as in the Netflix challenge).
- Identifying patterns in genetic data, for use in detecting specific proteins or diseases.

knn algorithm
Example dataset (each example is an ingredient; sweetness and crunchiness are the features; food type is the class):

ingredient   sweetness   crunchiness   food type
apple        10          9             fruit
bacon        1           4             protein
banana       10          1             fruit
carrot       7           10            vegetable
celery       3           10            vegetable
cheese       1           1             protein

Calculating distance
Locating the tomato's nearest neighbors requires a distance function, or a formula that measures the similarity between two instances.

Calculating distance
Mathematically, a distance is a function $D : (X, Y) \to \mathbb{R}$. A value of 0 means the two instances are most similar; a value greater than zero means they are less similar. It has some properties:
$D(X, Y) = D(Y, X)$ (symmetry)
$D(X, X) = 0$
$D(X, Y) \le D(X, Z) + D(Z, Y)$ (triangle inequality)

The most famous distance: Euclidean distance
2 dimensions, with $p = (p_1, p_2)$ and $q = (q_1, q_2)$:
$D(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2}$
N dimensions:
$D(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2}$

Manhattan distance
2 dimensions, with $p = (p_1, p_2)$ and $q = (q_1, q_2)$:
$D(p, q) = |p_1 - q_1| + |p_2 - q_2|$
N dimensions:
$D(p, q) = |p_1 - q_1| + |p_2 - q_2| + \dots + |p_n - q_n|$
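The Euclidean and Manhattan distances can be sketched in a few lines of Python. This is an illustrative sketch rather than code from the slides (which work in R); the function names `euclidean` and `manhattan` are assumptions chosen for this example.

```python
import math


def euclidean(p, q):
    """Euclidean distance: square root of the summed squared differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of features")
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


def manhattan(p, q):
    """Manhattan distance: sum of the absolute differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of features")
    return sum(abs(pi - qi) for pi, qi in zip(p, q))


# Two points in a 2-dimensional feature space.
print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```

Both functions accept points with any number of features, so the same code covers the 2-dimensional and N-dimensional cases.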

Other useful distance measures
- Minkowski distance
- Chebyshev distance
- Hamming distance
- Mahalanobis distance

Calculating Euclidean distance between tomato and green bean
$D(\text{tomato}, \text{green bean}) = \sqrt{(6 - 3)^2 + (4 - 7)^2} \approx 4.2$

             sweetness   crunchiness
tomato       6           4
green bean   3           7

How to classify tomato?
To classify the tomato as a vegetable, protein, or fruit, we'll begin by calculating the distance between tomato and all other examples in the training set.
[Figure: the tomato plotted by sweetness and crunchiness, with distances 4.2, 3.6, 2.2, and 1.4 to neighboring examples.]

How to classify tomato?
To classify the tomato as a vegetable, protein, or fruit, we can assign it the food type of its single nearest neighbor. This is called 1NN classification, since k = 1. Among the distances shown in the figure (4.2, 3.6, 2.2, 1.4), the nearest neighbor is the one at distance 1.4.

Choosing an appropriate k
Usually k is an odd number, so as to avoid a tie vote. Deciding how many neighbors to use for knn determines how well the model will generalize to future data. The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff. Choosing a large k reduces the impact of variance caused by noisy data, but can bias the learner such that it runs the risk of ignoring small but important patterns.

Choosing an appropriate k
Suppose a very large k (k = the total number of observations in the training data). As every training instance is represented in the final vote, the most common training class always has a majority of the voters. The model would thus always predict the majority class, regardless of which neighbors are nearest.
Now suppose a very small k (k = 1). This allows noisy data or outliers to unduly influence the classification of examples: any unlabeled example whose nearest neighbor is a noisy or mislabeled training instance will be predicted incorrectly.

Choosing an appropriate k
In practice, choosing k depends on the difficulty of the concept to be learned and the number of records in the training data. Typically, k is set somewhere between 3 and 10. One common practice is to set k equal to the square root of the number of training examples. An alternative approach is to test several k values on a variety of test datasets and choose the one that delivers the best classification performance.
On the other hand, unless the data is very noisy, larger and more representative training datasets can make the choice of k less important. This is because even subtle concepts will have a sufficiently large pool of examples to vote as nearest neighbors.
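The square-root heuristic is easy to compute. The Python sketch below is illustrative only: the function name `sqrt_rule_k` is hypothetical, and bumping an even result up to the next odd number is an assumption that combines the square-root rule with the odd-k advice, not a rule stated in the slides.

```python
import math


def sqrt_rule_k(n_train):
    """Suggest k as the integer square root of the training-set size.

    Assumption for this sketch: an even result is increased by one so that
    a two-class vote cannot end in a tie.
    """
    k = max(1, math.isqrt(n_train))  # floor of the square root, at least 1
    if k % 2 == 0:
        k += 1
    return k


print(sqrt_rule_k(469))  # 21
```

A value obtained this way is only a starting point; comparing several candidate values of k on held-out data remains the more reliable approach.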

knn algorithm summary
The knn algorithm begins with a training set made up of examples that are classified into several categories (a nominal variable). For an unlabeled example that has the same features as the training data, knn identifies the k examples in the training set that are the "nearest" in similarity. The unlabeled example is assigned the class of the majority of the k nearest neighbors. k is an integer specified in advance.
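The summary above can be sketched as a small Python function. This is a minimal illustration, not the implementation used in the slides (which rely on R): the name `knn_predict` is hypothetical, Euclidean distance is assumed, and the training data are only the six rows of the example dataset, so results may differ from the figures in the slides, which appear to include more examples.

```python
import math
from collections import Counter

# (sweetness, crunchiness) -> food type, from the six-row example dataset.
TRAINING = [
    ((10, 9), "fruit"),      # apple
    ((1, 4), "protein"),     # bacon
    ((10, 1), "fruit"),      # banana
    ((7, 10), "vegetable"),  # carrot
    ((3, 10), "vegetable"),  # celery
    ((1, 1), "protein"),     # cheese
]


def knn_predict(training, query, k):
    """Return the majority class among the k training examples nearest to query."""
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the number of training examples")
    # Sort the training examples by Euclidean distance to the query point.
    by_distance = sorted(training, key=lambda example: math.dist(example[0], query))
    # Count the class labels of the k nearest examples.
    votes = Counter(label for _, label in by_distance[:k])
    # If several classes share the top count, the one counted first is returned.
    return votes.most_common(1)[0][0]


tomato = (6, 4)
print(knn_predict(TRAINING, tomato, k=3))
```

Note that all the work happens at prediction time: there is no training step beyond storing the examples.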

Preparing data for knn
Features are typically transformed to a standard range prior to applying the knn algorithm. The rationale for this step is that the distance formula is dependent on how features are measured. In particular, if certain features have much larger values than others, the distance measurements will be strongly dominated by the larger values.
Min-max normalization: $X_{new} = \frac{X - \min(X)}{\max(X) - \min(X)}$
Z-score standardization: $X_{new} = \frac{X - \mu}{\sigma} = \frac{X - \mathrm{Mean}(X)}{\mathrm{StdDev}(X)}$
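Both rescaling schemes can be sketched in Python for a single feature column. The function names are hypothetical, and the use of the sample standard deviation (dividing by n - 1) is an assumption of this sketch.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("feature is constant, so its range is zero")
    return [(v - lo) / (hi - lo) for v in values]


def z_score_standardize(values):
    """Subtract the mean and divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    if sd == 0:
        raise ValueError("feature is constant, so its standard deviation is zero")
    return [(v - mean) / sd for v in values]


print(min_max_normalize([1, 2, 3, 4, 5]))  # [0.0, 0.25, 0.5, 0.75, 1.0]
print(z_score_standardize([1, 2, 3]))      # [-1.0, 0.0, 1.0]
```

Min-max output is always confined to [0, 1], whereas z-scores have no fixed bounds, so extreme values remain visible as large positive or negative numbers.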

Euclidean distance for nominal data
A typical solution utilizes dummy coding. A dichotomous variable (2 categories) is coded with the value 1 to indicate one category and 0 to indicate the other. For example, for X in {male, female}:
$male = 1$ if $x = \text{male}$, $0$ otherwise
An n-category variable is dummy coded with (n-1) binary variables, each taking the value 1 for exactly one category; the remaining category is coded as all zeros.

X        d1   d2   d3
blue     1    0    0
yellow   0    1    0
red      0    0    1
green    0    0    0
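Dummy coding with (n-1) indicators can be sketched as follows in Python. The function name `dummy_code` is hypothetical, and treating the last listed category as the all-zeros reference is an assumption that matches the color example.

```python
def dummy_code(value, categories):
    """Encode value as (n - 1) binary indicators.

    The last entry of categories acts as the reference category and is
    encoded as all zeros (an assumption of this sketch).
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == category else 0 for category in categories[:-1]]


COLORS = ["blue", "yellow", "red", "green"]
for color in COLORS:
    print(color, dummy_code(color, COLORS))
```

With two categories the same function yields a single 0/1 indicator, which corresponds to the dichotomous case.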

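Euclidean distance for ordinal data
A typical solution is to number the n categories from 0 to (n-1) and then normalize, so that category number $i$ becomes $\frac{i}{n - 1}$. For example, for five ordered categories of X running from cold through warm to hot:

numbered     0   1      2     3      4
normalized   0   0.25   0.5   0.75   1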
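A short Python sketch of this ordinal encoding follows. The function name `ordinal_code` is hypothetical, and the three-level temperature scale is only an illustration; the normalization by (n - 1) is inferred from the example values 0, 0.25, 0.5, 0.75, 1.

```python
def ordinal_code(value, ordered_levels):
    """Map an ordinal category to its rank divided by (n - 1), giving a value in [0, 1]."""
    n = len(ordered_levels)
    if n < 2:
        raise ValueError("at least two ordered levels are required")
    rank = ordered_levels.index(value)  # raises ValueError for an unknown level
    return rank / (n - 1)


TEMPERATURE = ["cold", "warm", "hot"]  # illustrative levels, lowest to highest
print([ordinal_code(level, TEMPERATURE) for level in TEMPERATURE])  # [0.0, 0.5, 1.0]
```

The encoding assumes the levels are evenly spaced, which is a modeling choice rather than a property of the data.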

knn is lazy
It is also known as an instance-based learning, rote learning, or non-parametric learning method.
Strengths:
- Simple and effective
- Makes no assumptions about the underlying data distribution
- Fast training phase
Weaknesses:
- Does not produce a model, which limits the ability to find novel insights in relationships among features
- Slow classification phase
- Requires a large amount of memory
- Nominal features and missing data require additional processing
Because knn does not generate theories about the underlying data, it limits our ability to understand how the classifier is using the data. But this also allows the learner to find natural patterns rather than trying to fit the data into a preconceived form. It proves very powerful in many contexts.

Diagnosing breast cancer with the knn algorithm
Dataset: "Breast Cancer Wisconsin Diagnostic" from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml), file: wdbc.data (W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171).
569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis (M = malignant, B = benign), and 30 are numeric-valued laboratory measurements of the following characteristics:
- Radius
- Texture
- Perimeter
- Area
- Smoothness
- Compactness
- Concavity
- Concave points
- Symmetry
- Fractal dimension

Exploring and preparing the data

Exploring and preparing the data Assigning column names

Exploring and preparing the data
Removing the first column (the identifier). A model that includes an identifier will most likely suffer from overfitting and is unlikely to generalize well to other data.
Proportion of each value of the class variable.

Exploring and preparing the data
Summary of the other variables. The variables have clearly different ranges, so normalization is required.

Exploring and preparing the data Min-max normalization

Creating training and test datasets
Although all 569 biopsies are labeled with a benign or malignant status, it is not very interesting to predict what we already know. A more interesting question is how well our learner performs on a dataset of unseen (unlabeled) data. If we had access to a laboratory, we could apply our learner to measurements taken from the next 100 masses of unknown cancer status and see how well the machine learner's predictions compare to diagnoses obtained using conventional methods.
But we don't have unseen data, so we simulate this scenario by dividing our data into two portions: a training dataset that will be used to build the knn model and a test dataset that will be used to estimate the predictive accuracy of the model. We will use the first 469 records for the training dataset and the remaining 100 to simulate new patients.

Exploring and preparing the data
Building the training and testing sets. Store the class label column in a vector, then remove the class label column from the datasets.
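The split and the separation of labels from features can be sketched in Python as follows. This is an illustration under stated assumptions, not the R code from the slides: the function names are hypothetical, the rows are made-up values in the layout [diagnosis, feature_1, feature_2], and the identifier column is assumed to have been removed already.

```python
def split_first_n(rows, n_train):
    """Use the first n_train rows for training and the remaining rows for testing."""
    if not 0 < n_train < len(rows):
        raise ValueError("n_train must leave at least one row in each set")
    return rows[:n_train], rows[n_train:]


def separate_labels(rows, label_index=0):
    """Return (features, labels), removing the label column from every row."""
    labels = [row[label_index] for row in rows]
    features = [row[:label_index] + row[label_index + 1:] for row in rows]
    return features, labels


# Made-up rows for illustration only: [diagnosis, feature_1, feature_2].
rows = [
    ["B", 0.1, 0.2],
    ["M", 0.9, 0.8],
    ["B", 0.2, 0.1],
    ["M", 0.7, 0.9],
    ["B", 0.3, 0.3],
]

train_rows, test_rows = split_first_n(rows, n_train=3)
train_x, train_y = separate_labels(train_rows)
test_x, test_y = separate_labels(test_rows)
print(train_y, test_y)  # ['B', 'M', 'B'] ['M', 'B']
```

Taking the first rows as the training set gives a fair estimate only if the rows are not sorted in a meaningful order; otherwise the rows should be shuffled before splitting.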

The class package contains the knn function
Install the class package in R.

Train on training data and predict on testing data
Run the knn function with k=3. The result is the predicted class for each test example, to be compared with the true class (from the test set).

How good is the prediction?
An intuitive measure of prediction performance is the extent to which the predicted class equals the true class (aka accuracy).

How good is the prediction?
But usually in a two-class problem one class (positive) is more important than the other (negative). So it is important to know how many positive examples have been correctly predicted as positive (true positives, TP) and how many negative examples have been wrongly predicted as positive (false positives, FP).

How good is the prediction?
But it is also important to know how many negative examples have been correctly predicted as negative (true negatives, TN) and how many positive examples have been wrongly predicted as negative (false negatives, FN).

Performance measures
Confusion table (aka contingency table):

                              True class
Predicted class    negative class   positive class
negative class     TN               FN
positive class     FP               TP

Accuracy: $ACC = \frac{TP + TN}{TP + TN + FP + FN}$
Sensitivity or Recall: $TPR = \frac{TP}{TP + FN}$
False negative rate: $FNR = \frac{FN}{TP + FN} = 1 - TPR$
Specificity: $TNR = \frac{TN}{TN + FP}$
False positive rate: $FPR = \frac{FP}{FP + TN} = 1 - TNR$
Precision: $PPV = \frac{TP}{TP + FP}$
Negative predictive value: $NPV = \frac{TN}{TN + FN}$
F1-score: $F_1 = \frac{2TP}{2TP + FP + FN}$
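These measures can be computed directly from the four confusion-table counts. The Python sketch below is illustrative: the function names and the made-up label vectors are assumptions, and accuracy is taken as the share of correct predictions, (TP + TN) divided by the total number of examples.

```python
def confusion_counts(true_labels, predicted_labels, positive):
    """Count TP, FP, TN, FN, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for truth, predicted in zip(true_labels, predicted_labels):
        if predicted == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def performance_measures(tp, fp, tn, fn):
    """Compute the standard measures; assumes no denominator is zero."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Made-up labels for illustration: M = malignant (positive), B = benign.
true_labels = ["M", "M", "B", "B", "B", "M"]
predicted_labels = ["M", "B", "B", "B", "M", "M"]
counts = confusion_counts(true_labels, predicted_labels, positive="M")
print(counts)  # (2, 1, 2, 1)
print(performance_measures(*counts))
```

Which class is treated as positive changes every measure except accuracy, so it should be chosen to match the outcome that matters most (here, a malignant diagnosis).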

Computing performance measures with R

Exercises
1. Try to improve prediction performance by using z-score standardization and alternative values of k (e.g. 5 or 9).
2. Automatically generate 10 random train/test splits (similar in size to the previous one) and compute the prediction accuracy for each split. Print out the average of these accuracies. (Hint: in R the sample function can generate a random permutation of a vector; see its help page for more details.)