
k Nearest Neighbors

k Nearest Neighbors
To classify an observation: look at the labels of some number, say k, of neighboring observations. The observation is then classified based on its nearest neighbors' labels. Super simple idea! Instance-based learning as opposed to model-based (no pre-processing).

Example
Let's try to classify the unknown green point by looking at its k = 3 and k = 5 nearest neighbors. For k = 3, we see 2 triangles and 1 square, so we classify the point as a triangle. For k = 5, we see 2 triangles and 3 squares, so we classify the point as a square. Typically, we use an odd value of k to avoid ties.

Design Decisions
- Choose k
- Define "neighbor":
  - define a measure of distance/closeness
  - choose a threshold
- Decide how to classify based on neighbors' labels:
  - by a majority vote, or
  - by a weighted majority vote (weighted by distance; see the sketch below), or ...
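As an illustration of the last choice, here is a minimal sketch of a distance-weighted vote in base R. The function name weighted_knn, the Euclidean distance, and the inverse-distance weights are illustrative choices, not fixed by k-NN itself:

# Classify one query point by an inverse-distance-weighted vote
# among its k nearest training points (Euclidean distance).
weighted_knn <- function(train, labels, query, k = 3) {
  d  <- sqrt(rowSums(sweep(train, 2, query)^2))  # distance to each training row
  nn <- order(d)[1:k]                            # indices of the k nearest
  w  <- 1 / (d[nn] + 1e-8)                       # closer neighbors get more weight
  names(which.max(tapply(w, labels[nn], sum)))   # label with the largest total weight
}

For example, weighted_knn(as.matrix(iris[, 1:4]), iris$Species, c(7.2, 3.2, 6.4, 2.4), k = 5) votes among the 5 nearest iris observations.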

Classifying iris
We're going to demonstrate the use of k-NN on the iris data set (the flower, not the part of your eye).

iris
- 3 species (i.e., classes) of iris: Iris setosa, Iris versicolor, Iris virginica
- 50 observations per species
- 4 variables per observation: sepal length, sepal width, petal length, petal width
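Both facts are easy to check in R, where iris is a built-in data frame:

> str(iris)   # 150 obs. of 5 variables: 4 numeric measurements plus Species
> table(iris$Species)
    setosa versicolor  virginica
        50         50         50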

[Photos: Iris versicolor, Iris setosa, Iris virginica]

Visualizing the data
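One simple way to get such a plot in base R, using the two features the examples below classify on (petal length and sepal length); the plotting choices here are illustrative:

> plot(iris$Petal.Length, iris$Sepal.Length,
       col = iris$Species, pch = 19,
       xlab = "Petal length", ylab = "Sepal length")
> legend("topleft", legend = levels(iris$Species), col = 1:3, pch = 19)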

knn in R
R provides a knn() function in the class package:

> library(class)  # load the classification package
> knn(train, test, cl, k)

train: training data for the k-NN classifier
test: testing data to classify
cl: class labels
k: the number of neighbors to use

A new observation
New orange point:

new_point <- data.frame(Sepal.Length = 7.2,
                        Sepal.Width  = 3.2,
                        Petal.Length = 6.4,
                        Petal.Width  = 2.4)

Let's run k-NN in R with k = 3.

knn in R (cont'd)

> knn(data.frame(iris$Petal.Length, iris$Sepal.Length),
      data.frame(new_point$Petal.Length, new_point$Sepal.Length),
      iris$Species, k = 3)
[1] virginica
Levels: setosa versicolor virginica

How does it work?
Since k = 3, R finds the new point's 3 closest neighbors. These are all virginica, so it's easy to classify the new point: the new observation also gets classified as virginica.
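Conceptually, knn() is doing something like the following under the hood (a sketch, not the class package's actual implementation):

> # Euclidean distance from the new point to all 150 observations
> d <- sqrt((iris$Petal.Length - new_point$Petal.Length)^2 +
            (iris$Sepal.Length - new_point$Sepal.Length)^2)
> iris$Species[order(d)[1:3]]   # labels of the 3 nearest neighbors
[1] virginica virginica virginica
Levels: setosa versicolor virginica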

Another observation
This one's a bit more ambiguous:

new_point <- data.frame(Sepal.Length = 6.4,
                        Sepal.Width  = 2.8,
                        Petal.Length = 4.9,
                        Petal.Width  = 1.3)

What should we expect?

Another observation
k = 3:  [1] versicolor
k = 5:  [1] virginica
k = 11: [1] versicolor
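The three runs above differ only in k, so they can be reproduced in one loop:

> for (k in c(3, 5, 11))
      print(knn(data.frame(iris$Petal.Length, iris$Sepal.Length),
                data.frame(new_point$Petal.Length, new_point$Sepal.Length),
                iris$Species, k = k))

Note how the prediction flips back and forth as k grows: this new point sits near the versicolor/virginica boundary.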

Decision Regions
Decision regions are regions where observations are classified one way or another. For iris, there are three species, and three (approximate) decision regions.
[Plot: decision regions for setosa, versicolor, and virginica]

Decision Boundaries
(Linear) decision boundaries are lines that separate decision regions. On the boundaries, classifiers may give ambiguous results. Changing parameters, like k in k-NN, changes both the decision regions and the corresponding boundaries. Let's vary k and see what happens.
[Plot: decision boundaries for setosa, versicolor, and virginica]

Decision Boundaries
Let's take a closer look at decision boundaries for k-NN. To do so, let's use a new data set. Here are some random data, classified as either 0 or 1. These two classes overlap quite a bit, compared to the iris data.
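Plots like the ones on the following slides can be produced by classifying a dense grid of points and coloring each grid point by its predicted class. A minimal sketch, assuming a training data frame train with two feature columns x1 and x2 and a label vector cl (these names are illustrative):

> grid <- expand.grid(x1 = seq(min(train$x1), max(train$x1), length.out = 200),
                      x2 = seq(min(train$x2), max(train$x2), length.out = 200))
> pred <- knn(train, grid, cl, k = 15)
> plot(grid, col = pred, pch = ".")    # colored decision regions
> points(train, col = cl, pch = 19)    # overlay the training data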

[Plots: k-NN decision boundaries for k = 1, 3, 7, 15, 25, 51, and 101]

Small k (k = 1)
Models like this are overfit: reading too much into the training data, extrapolating based on things that aren't really there.

Large k (k = 101)
Rather than being overfit, this model is underfit: the decision boundary doesn't capture enough of the information encoded in the training data.

Just right (k = 15)

3D and Beyond
For visualization reasons, our examples were limited to only two features, but k-NN is not limited to two. It can also use 3, 4, ..., n dimensions; knn() will use every numeric column you pass it.

Working in 3D

> knn(data.frame(iris$Petal.Length, iris$Sepal.Length, iris$Sepal.Width),
      data.frame(new_point$Petal.Length, new_point$Sepal.Length, new_point$Sepal.Width),
      iris$Species, k = 11)

k-NN caveats
- k-NN can be very slow, especially for very large data sets.
- k-NN works better with quantitative data than categorical data.
- k-NN is not a learning algorithm in the traditional sense, because it doesn't actually do any learning: i.e., it doesn't preprocess the data. Instead, when given a new observation, it calculates the distance between that observation and every existing observation in the data set.
- Data must be quantitative to calculate distances, so qualitative data must be converted first.
- Without clusters in the training data, k-NN cannot work well.
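A related caveat: because k-NN is distance-based, a feature measured on a large scale can drown out the others. Standardizing the features first is a common remedy; a sketch using base R's scale():

> train_std <- scale(iris[, 1:4])   # z-score each column
> new_std <- scale(new_point,
                   center = attr(train_std, "scaled:center"),
                   scale  = attr(train_std, "scaled:scale"))
> knn(train_std, new_std, iris$Species, k = 11)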

Model Selection
Find a model that appropriately balances complexity and generalization capabilities: i.e., one that optimizes the bias-variance tradeoff.
- High bias, low variance: a high value of k gives a high degree of bias but keeps the variance contained.
- Low bias, high variance: with low values of k, the very jagged decision boundaries are a sign of high variance.
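One standard way to choose k along this tradeoff is cross-validation. The class package ships knn.cv(), which performs leave-one-out cross-validation; a sketch using the two iris features from earlier:

> train <- data.frame(iris$Petal.Length, iris$Sepal.Length)
> ks <- seq(1, 51, by = 2)   # odd values of k, to avoid ties
> acc <- sapply(ks, function(k)
      mean(knn.cv(train, iris$Species, k = k) == iris$Species))
> ks[which.max(acc)]         # the k with the best leave-one-out accuracy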