
Machine Learning nearest neighbors classification Luigi Cerulo Department of Science and Technology University of Sannio

Nearest Neighbors Classification The idea is based on the hypothesis that things that are alike are likely to have similar properties. We can use this principle to classify data by placing it in the category of its most similar, or "nearest", neighbors: birds of a feather flock together.

Successful applications Computer vision applications, including optical character recognition and facial recognition in both still images and video. Predicting whether a person will enjoy a movie that has been recommended to him or her (as in the Netflix challenge). Identifying patterns in genetic data, for use in detecting specific proteins or diseases.

knn algorithm Example dataset (features: sweetness, crunchiness; class: food type)

ingredient    sweetness    crunchiness    food type
apple         10           9              fruit
bacon         1            4              protein
banana        10           1              fruit
carrot        7            10             vegetable
celery        3            10             vegetable
cheese        1            1              protein


Calculating distance Locating the tomato's nearest neighbors requires a distance function, or a formula that measures the similarity between two instances.


Calculating distance Mathematically, a distance is a function D : (X, Y) → R, where 0 means the two instances are identical (most similar) and larger values mean less similar. It satisfies the following properties:
D(X, Y) = D(Y, X) (symmetry)
D(X, X) = 0 (identity)
D(X, Y) ≤ D(X, Z) + D(Z, Y) (triangle inequality)

The most famous distance: Euclidean distance. In 2 dimensions, between points p = (p1, p2) and q = (q1, q2):
D(p, q) = sqrt((p1 − q1)^2 + (p2 − q2)^2)
In n dimensions:
D(p, q) = sqrt((p1 − q1)^2 + (p2 − q2)^2 + ... + (pn − qn)^2)

Manhattan distance. In 2 dimensions:
D(p, q) = |p1 − q1| + |p2 − q2|
In n dimensions:
D(p, q) = |p1 − q1| + |p2 − q2| + ... + |pn − qn|

Other useful distance measures: Minkowski distance, Chebyshev distance, Hamming distance, Mahalanobis distance.

Calculating the Euclidean distance between tomato and green bean. Given sweetness and crunchiness values of tomato = (6, 4) and green bean = (3, 7):
D(tomato, green bean) = sqrt((6 − 3)^2 + (4 − 7)^2) ≈ 4.2
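As a quick arithmetic check, here is a minimal R sketch of the same computation (the feature values are the ones from the table above; dist() is only shown as an equivalent built-in alternative):

# sweetness and crunchiness of the two ingredients
tomato     <- c(sweetness = 6, crunchiness = 4)
green_bean <- c(sweetness = 3, crunchiness = 7)

# Euclidean distance written out explicitly: sqrt((6-3)^2 + (4-7)^2) = 4.24
sqrt(sum((tomato - green_bean)^2))

# the same result with R's built-in dist(); Manhattan distance for comparison
dist(rbind(tomato, green_bean), method = "euclidean")
sum(abs(tomato - green_bean))   # |6-3| + |4-7| = 6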

How to classify tomato? To classify the tomato as a vegetable, protein, or fruit, we'll begin by calculating the distance between the tomato and all other examples in the training set. (Figure: the tomato plotted among its nearest ingredients, at distances 1.4, 2.2, 3.6, and 4.2.)

How to classify tomato? To classify the tomato as a vegetable, protein, or fruit, we'll begin by assigning the tomato the food type of its single nearest neighbor. This is called 1NN classification, as k = 1.

How to classify tomato? With k = 3, a vote among the three nearest neighbors, orange, grape, and nuts, is performed.

Choosing an appropriate k Usually k is an odd number so as to avoid a tie vote. Deciding how many neighbors to use for knn determines how well the model will generalize to future data. The balance between overfitting and underfitting the training data is a problem known as the bias-variance tradeoff. Choosing a large k reduces the variance caused by noisy data, but can bias the learner so that it runs the risk of ignoring small but important patterns.

Choosing an appropriate k Suppose a very large k (k = the total number of observations in the training data). As every training instance is represented in the final vote, the most common training class always has a majority of the voters. The model would thus always predict the majority class, regardless of which neighbors are nearest. Suppose a very small k (k = 1). A single noisy data point or outlier can then unduly influence the classification, and the prediction for any unlabeled example can be affected negatively.

Choosing an appropriate k In practice, choosing k depends on the difficulty of the concept to be learned and the number of records in the training data. Typically, k is set somewhere between 3 and 10. One common practice is to set k equal to the square root of the number of training examples. An alternative approach is to test several k values on a variety of test datasets and choose the one that delivers the best classification performance. On the other hand, unless the data is very noisy, larger and more representative training datasets can make the choice of k less important. This is because even subtle concepts will have a sufficiently large pool of examples to vote as nearest neighbors.
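For instance, with the 469 training examples used later in these slides, the square-root rule would suggest a value around 21 (this is only an illustration of the rule, not a value prescribed here):

n_train <- 469           # size of the training set used later
floor(sqrt(n_train))     # 21, conveniently an odd number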

knn algorithm summary The knn algorithm begins with a training set made up of examples that are classified into several categories (a nominal variable). For an unlabeled example that has the same features as the training data, knn identifies the k examples in the training set that are "nearest" in similarity. The unlabeled example is assigned the class of the majority of the k nearest neighbors. k is an integer specified in advance.
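To make the procedure concrete, here is a minimal, self-contained R sketch of the algorithm just summarized, using the six-ingredient toy table from earlier (because that toy table is only a small subset of the full ingredient dataset pictured in the figures, its majority vote need not match the figure; the implementation is illustrative and is not the knn function used later):

# training data: the example dataset of ingredients
train  <- data.frame(sweetness   = c(10, 1, 10,  7,  3, 1),
                     crunchiness = c( 9, 4,  1, 10, 10, 1))
labels <- c("fruit", "protein", "fruit", "vegetable", "vegetable", "protein")

query <- c(sweetness = 6, crunchiness = 4)   # the unlabeled tomato
k <- 3

# 1. Euclidean distance from the query to every training example
d <- sqrt(rowSums((train - matrix(query, nrow(train), 2, byrow = TRUE))^2))

# 2. find the k nearest neighbors and 3. let them vote
nearest <- order(d)[1:k]
votes   <- table(labels[nearest])
names(which.max(votes))   # class predicted for the tomato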

Preparing data for knn Features are typically transformed to a standard range prior to applying the knn algorithm. The rationale for this step is that the distance formula is dependent on how features are measured. In particular, if certain features have much larger values than others, the distance measurements will be strongly dominated by the larger values.
min-max normalization: X_new = (X − min(X)) / (max(X) − min(X))
z-score standardization: X_new = (X − μ) / σ = (X − Mean(X)) / StdDev(X)
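A minimal R sketch of the two rescaling formulas, applied to an arbitrary toy vector chosen only to show the effect:

x <- c(1, 2, 3, 6, 10)

# min-max normalization: rescales to the [0, 1] interval
(x - min(x)) / (max(x) - min(x))     # 0.00 0.11 0.22 0.56 1.00

# z-score standardization: zero mean, unit standard deviation
(x - mean(x)) / sd(x)                # equivalent to as.numeric(scale(x))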

Euclidean distance for nominal data A typical solution utilizes dummy coding. A dichotomous variable (2 categories) is coded with the value 1 to indicate one category and 0 to indicate the other:
X_male = 1 if X = male, 0 otherwise
An n-category variable is dummy coded with (n − 1) binary variables, with an exclusive 1 indicating each category:
X         d1   d2   d3
blue      1    0    0
yellow    0    1    0
red       0    0    1
green     0    0    0
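A sketch of dummy coding in R; the variable names gender and colour are hypothetical, and model.matrix is shown only as one standard way to obtain the same coding automatically:

gender <- c("male", "female", "male")
male   <- ifelse(gender == "male", 1, 0)    # dichotomous case: a single 0/1 variable

colour <- factor(c("blue", "yellow", "red", "green"),
                 levels = c("green", "blue", "yellow", "red"))
model.matrix(~ colour)[, -1]   # n - 1 = 3 dummies; green is the all-zero baseline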

Euclidean distance for ordinal data A typical solution is to number the n categories from 0 to (n − 1) and then normalize: normalized_i = numbered_i / (n − 1). For example, five categories numbered 0, 1, 2, 3, 4 map to 0, 0.25, 0.5, 0.75, 1; a three-category variable X = {cold, warm, hot} is numbered 0, 1, 2 and normalized to 0, 0.5, 1.
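A sketch for the ordinal case, using the cold/warm/hot example (the variable name temperature is hypothetical):

temperature <- factor(c("hot", "cold", "warm"),
                      levels = c("cold", "warm", "hot"), ordered = TRUE)
numbered   <- as.integer(temperature) - 1             # cold = 0, warm = 1, hot = 2
normalized <- numbered / (nlevels(temperature) - 1)   # 1, 0, 0.5 on the [0, 1] scale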

knn is lazy It is also known as instance-based learning, rote learning, or a non-parametric learning method.
Strengths: simple and effective; makes no assumptions about the underlying data distribution; fast training phase.
Weaknesses: does not produce a model, which limits the ability to find novel insights in relationships among features; slow classification phase; requires a large amount of memory; nominal features and missing data require additional processing.
Without generating theories about the underlying data, it limits our ability to understand how the classifier is using the data. On the other hand, this allows the learner to find natural patterns rather than trying to fit the data into a preconceived form, and it proves very powerful in many contexts.

Diagnosing breast cancer with the knn algorithm Dataset: "Breast Cancer Wisconsin (Diagnostic)" from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml), file wdbc.data (W.H. Wolberg, W.N. Street, and O.L. Mangasarian. Machine learning techniques to diagnose breast cancer from fine-needle aspirates. Cancer Letters 77 (1994) 163-171). It contains 569 examples of cancer biopsies, each with 32 features. One feature is an identification number, another is the cancer diagnosis (M = malignant, B = benign), and the remaining 30 are numeric-valued laboratory measurements derived from ten characteristics of the cell nuclei: radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, and fractal dimension.

Exploring and preparing the data
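The R code on this and the following slides is not reproduced in the transcription; the sketches below show what each step might look like, assuming the file wdbc.data has been downloaded from the UCI repository into the working directory (the object name wbcd is an assumption used throughout):

# wdbc.data ships without a header row: column names are assigned in the next step
wbcd <- read.csv("wdbc.data", header = FALSE, stringsAsFactors = FALSE)
str(wbcd)     # 569 observations of 32 variables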

Exploring and preparing the data Assigning column names
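One possible way to assign the names, following the dataset documentation in which each of the ten nuclear characteristics appears as a mean, a standard error, and a "worst" value (the exact naming scheme below is an assumption of this sketch):

base <- c("radius", "texture", "perimeter", "area", "smoothness",
          "compactness", "concavity", "concave_points", "symmetry",
          "fractal_dimension")
colnames(wbcd) <- c("id", "diagnosis",
                    paste0(base, "_mean"), paste0(base, "_se"), paste0(base, "_worst"))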

Exploring and preparing the data Removing the first column. A model that includes an identifier will most likely suffer from overfitting and is not likely to generalize well to other data. We also examine the proportion of the class variable.
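A sketch of this step, continuing with the assumed wbcd data frame:

wbcd <- wbcd[-1]                                     # drop the identifier column
table(wbcd$diagnosis)                                # counts of B (benign) and M (malignant)
round(prop.table(table(wbcd$diagnosis)) * 100, 1)    # class proportions in percent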

Exploring and preparing the data Summary of the other variables. Very different ranges can be noticed, so normalization is required!
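For example (the three columns are picked arbitrarily; any of the 30 measurements shows the same issue):

summary(wbcd[c("radius_mean", "area_mean", "smoothness_mean")])
# area_mean ranges over hundreds while smoothness_mean stays below 1, so an
# unnormalized Euclidean distance would be dominated almost entirely by area_mean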

Exploring and preparing the data Min-max normalization
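A sketch of the normalization step, reusing the min-max formula introduced earlier (the function name normalize and the object name wbcd_n are assumptions):

normalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# apply min-max normalization to the 30 numeric columns (column 1 is the diagnosis)
wbcd_n <- as.data.frame(lapply(wbcd[2:31], normalize))
summary(wbcd_n$area_mean)   # now confined to the interval [0, 1]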

Creating training and test datasets Although all 569 biopsies are labeled with a benign or malignant status, it is not very interesting to predict what we already know. A more interesting question is how well our learner performs on a dataset of unseen (unlabeled) data. If we had access to a laboratory, we could apply our learner to measurements taken from the next 100 masses of unknown cancer status and see how well the machine learner's predictions compare to diagnoses obtained using conventional methods. But we don't have unseen data, so we can simulate this scenario by dividing our data into two portions: a training dataset that will be used to build the knn model, and a test dataset that will be used to estimate the predictive accuracy of the model. We will use the first 469 records for the training dataset and the remaining 100 to simulate new patients.

Exploring and preparing the data Building training and testing sets: store the class label column in a separate vector and remove it from the feature datasets.
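A sketch of the split described above, taking the first 469 rows for training and the remaining 100 for testing (the object names are assumptions):

wbcd_train <- wbcd_n[1:469, ]      # normalized features only, no class label
wbcd_test  <- wbcd_n[470:569, ]

# class labels kept in separate vectors, coded as factors for the knn function
wbcd_train_labels <- factor(wbcd[1:469, 1])
wbcd_test_labels  <- factor(wbcd[470:569, 1])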

The class package contains the knn function Install the class package in R.
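For example:

install.packages("class")   # needed only once
library(class)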

Train on the training data and predict on the testing data: run the knn function with k = 3. The output is the predicted class for each test example, to be compared with the true class from the test set.
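A sketch of the call; train, test, cl, and k are the actual arguments of class::knn, while the object names are the ones assumed in the earlier sketches:

wbcd_test_pred <- knn(train = wbcd_train, test = wbcd_test,
                      cl = wbcd_train_labels, k = 3)

head(wbcd_test_pred)     # predicted class for the first test patients
head(wbcd_test_labels)   # true class, for comparison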

How good is the prediction? An intuitive measure of prediction performance is to evaluate to what extent the predicted class equals the true class (aka accuracy).
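Given the prediction vector from the previous step, accuracy is a one-liner in R:

mean(wbcd_test_pred == wbcd_test_labels)   # fraction of test cases predicted correctly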

How good is the prediction? But usually in a two-class problem one class (the positive class) is more important than the other (the negative class). So it is important to know to what extent the positive class has been correctly predicted as positive (true positives, TP) and to what extent negative examples have been wrongly predicted as positive (false positives, FP).

It is also important to know to what extent the negative class has been correctly predicted as negative (true negatives, TN) and to what extent positive examples have been wrongly predicted as negative (false negatives, FN).

Performance measures Confusion table (aka contingency table):

                          True class
Predicted class      negative    positive
negative             TN          FN
positive             FP          TP

Accuracy:               ACC = (TP + TN) / (TP + TN + FP + FN)
Sensitivity (Recall):   TPR = TP / (TP + FN)
False negative rate:    FNR = FN / (TP + FN) = 1 − TPR
Specificity:            TNR = TN / (TN + FP)
False positive rate:    FPR = FP / (FP + TN) = 1 − TNR
Precision:              PPV = TP / (TP + FP)
Negative pred. value:   NPV = TN / (TN + FN)
F1-score:               F1 = 2TP / (2TP + FP + FN)

Computing performance measures with R
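A sketch of how these quantities can be computed from the knn predictions above; gmodels::CrossTable is one convenient option for a formatted confusion table, and the manual computation treats M (malignant) as the positive class:

tab <- table(predicted = wbcd_test_pred, true = wbcd_test_labels)
tab

TP <- tab["M", "M"]; TN <- tab["B", "B"]
FP <- tab["M", "B"]; FN <- tab["B", "M"]

accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)     # recall / true positive rate
specificity <- TN / (TN + FP)     # true negative rate
precision   <- TP / (TP + FP)     # positive predictive value
f1          <- 2 * TP / (2 * TP + FP + FN)

# alternatively, a formatted confusion table:
# install.packages("gmodels"); library(gmodels)
# CrossTable(x = wbcd_test_labels, y = wbcd_test_pred, prop.chisq = FALSE)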

Exercises
1. Try to improve prediction performance by using a z-score standardization and alternative values of k (e.g. 5 or 9).
2. Automatically generate 10 random train/test splits (similar in size to the previous one) and compute the prediction accuracy for each split. Print out the average of these accuracies. (Hint: in R the sample function can generate a random permutation of a vector; see its help page for more details.)