k-nn classification with R QMMA


Emanuele Taufer

HW (Height and weight) of adults

Statistics on the height and weight of adult males and females tell us that the distributions of these two characteristics in the populations are well approximated by normal variables.

In more detail, the distribution of heights for adult males is Normal with mean 70 in. and sd 4 in. Let's write this in more compact form as

    Height Male ~ N(70, 4)

Analogously,

    Height Female ~ N(65, 3.5)
    Weight Male ~ N(143, 10)
    Weight Female ~ N(121, 8)

The correlation between Height and Weight (for both groups) is 0.5.
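The distributional assumptions above can be turned into a small simulation. What follows is a minimal sketch in base R, assuming that the second parameter of N(., .) is the standard deviation (as the text states for male height) and that each group has 100 units. The helper name simulate_hw and its arguments are illustrative and not part of the lab, and the simulated values are not those of HW.csv.

```r
# Hypothetical helper (not from the lab): draw n (Height, Weight) pairs from a
# bivariate normal with the given means, standard deviations and correlation.
simulate_hw <- function(n, mean_h, sd_h, mean_w, sd_w, rho = 0.5) {
  z1 <- rnorm(n)                                 # standard normal driving height
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)    # standard normal with cor(z1, z2) = rho
  data.frame(Height = mean_h + sd_h * z1,
             Weight = mean_w + sd_w * z2)
}

set.seed(1)
# Assumption: 100 units per group (the lab only states 200 observations overall)
females <- data.frame(Gender = "F", simulate_hw(100, 65, 3.5, 121, 8))
males   <- data.frame(Gender = "M", simulate_hw(100, 70, 4,   143, 10))
sim <- rbind(females, males)
sim$Gender <- factor(sim$Gender)

# Sample correlation within each group; it should be in the vicinity of 0.5
sapply(split(sim, sim$Gender), function(d) cor(d$Height, d$Weight))
```

The construction z2 = rho * z1 + sqrt(1 - rho^2) * e keeps z2 standard normal while giving it correlation rho with z1, so no extra package is needed.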

HW data

The data set HW.csv contains 200 observations of height and weight of male and female adults:

    Y  : M or F (qualitative)
    X1 : Height of unit in inches (quantitative)
    X2 : Weight of unit, presumably in pounds (quantitative)

Objective

Build a model to classify an unknown unit as M or F based on Height and Weight.

This toy example will allow us to visualize the results.

A validation data set is available in the file HWTest.csv.

Load and plot the data

Download the HW.csv dataset into the working directory of R, read it as a data frame and plot the two groups.

    library(ggplot2)
    HW <- read.csv("...")   # file path truncated in the source
    head(HW)
    ## X Gender Height Weight
    ## 1 1 F
    ## 2 2 F
    ## 3 3 F
    ## 4 4 F
    ## 5 5 F
    ## 6 6 F
    str(HW)
    ## 'data.frame': 200 obs. of 4 variables:
    ## $ X : int
    ## $ Gender: Factor w/ 2 levels "F","M":
    ## $ Height: num
    ## $ Weight: num

Scatter plot

    # Color and shape both encode the observed Gender
    gg1 <- ggplot(HW, aes(x=Height, y=Weight, color=Gender, shape=Gender)) + geom_point(size=3)
    gg1

Classify using k-nn classification

We will now perform a k-nn classification using the knn() function, which is part of the class library.

knn() forms predictions using a single command. The function requires four inputs:

    A matrix containing the predictors associated with the training data, labeled XTrain below.
    A matrix containing the predictors associated with the data for which we wish to make predictions, labeled XTest below.
    A vector containing the class labels for the training observations, labeled YTrain below.
    A value for k, the number of nearest neighbors to be used by the classifier.

Note. The output of the function is the vector of predicted classes.
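To make the role of the four inputs concrete, here is a minimal sketch of a k-nn classifier written in base R. It is an illustrative re-implementation, not the code of knn() from the class library (which, among other things, treats ties differently). The function name knn_by_hand and the toy values are made up for this example.

```r
# Illustrative re-implementation (NOT the code of class::knn): classify each
# row of XTest by majority vote among its k nearest rows of XTrain.
knn_by_hand <- function(XTrain, XTest, YTrain, k) {
  XTrain <- as.matrix(XTrain)
  XTest  <- as.matrix(XTest)
  YTrain <- factor(YTrain)
  pred <- apply(XTest, 1, function(x) {
    d2 <- rowSums(sweep(XTrain, 2, x)^2)   # squared Euclidean distances to x
    nearest <- order(d2)[seq_len(k)]       # indices of the k closest training rows
    votes <- table(YTrain[nearest])        # class counts among those neighbours
    names(votes)[which.max(votes)]         # most frequent class (first level on ties)
  })
  factor(pred, levels = levels(YTrain))
}

# Toy data (made up): two well-separated groups of (Height, Weight) pairs
XTrain <- rbind(c(60, 110), c(62, 115), c(64, 120),
                c(70, 140), c(72, 150), c(74, 145))
YTrain <- c("F", "F", "F", "M", "M", "M")
XTest  <- rbind(c(61, 112), c(73, 148))

knn_by_hand(XTrain, XTest, YTrain, k = 3)
```

The first test point has the three F units as its nearest neighbours and the second has the three M units, so the two predictions are F and M.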

Using the knn() function

    library(class)
    knn(XTrain,XTest,YTrain,k)

Important!

The syntax above produces predictions for XTest, given that the k-nn model is built using the XTrain and YTrain data. These predictions can be used to estimate the test error rate.

The syntax below produces predictions for XTrain, given that the k-nn model is built using the XTrain and YTrain data. These predictions can be used to estimate the training error rate.

    library(class)
    knn(XTrain,XTrain,YTrain,k)
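The difference between the two calls is easiest to see with k = 1. The sketch below uses simulated stand-in data rather than HW.csv, and for brevity it draws Height and Weight independently (ignoring the 0.5 correlation). The helper name make_data is hypothetical.

```r
library(class)

# Hypothetical helper: n females followed by n males, simulated from the
# normal distributions quoted in the lab (independent Height and Weight).
make_data <- function(n) {
  Y <- factor(rep(c("F", "M"), each = n))
  X <- cbind(Height = rnorm(2 * n, mean = rep(c(65, 70), each = n),
                            sd = rep(c(3.5, 4), each = n)),
             Weight = rnorm(2 * n, mean = rep(c(121, 143), each = n),
                            sd = rep(c(8, 10), each = n)))
  list(X = X, Y = Y)
}

set.seed(1)
train <- make_data(100)
test  <- make_data(100)

p.train <- knn(train$X, train$X, train$Y, k = 1)   # predictions for the training units
p.test  <- knn(train$X, test$X,  train$Y, k = 1)   # predictions for new units

train.error <- mean(train$Y != p.train)   # 0: each unit is its own nearest neighbour
test.error  <- mean(test$Y  != p.test)    # an honest estimate on unseen data
c(train = train.error, test = test.error)
```

With k = 1 the training error is zero whenever no two training points coincide, which is why the training error rate alone cannot be used to choose k.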

k-nn with the HW data

In the HW data, separate the Y and X variables to apply the function knn().

In the code below k-nn classification is performed with k = 5 (arbitrarily chosen). The predicted classes (p.YTrain) for the training data are stored in the HW data set.

    library(class)
    XTrain=HW[,c(3,4)]    # predictors: Height and Weight
    YTrain=HW[,2]         # class labels: Gender
    p.YTrain=knn(XTrain,XTrain,YTrain,k=5)
    HW$Predict=p.YTrain

The training error is

    mean(YTrain != p.YTrain)
    ## [1]
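An error rate of this kind is the share of off-diagonal entries in a table that crosses observed and predicted classes. The following base R sketch shows the relation on two short made-up vectors; the names observed and predicted and their values are purely illustrative and do not come from the HW data.

```r
# Made-up labels for eight units (illustration only, not the HW data)
observed  <- factor(c("F", "F", "F", "M", "M", "M", "M", "F"), levels = c("F", "M"))
predicted <- factor(c("F", "M", "F", "M", "M", "F", "M", "F"), levels = c("F", "M"))

# Confusion table: rows are observed classes, columns are predicted classes
conf <- table(Observed = observed, Predicted = predicted)
conf

# Share of units off the diagonal, i.e. misclassified units
error.rate <- 1 - sum(diag(conf)) / sum(conf)
error.rate

# Same quantity computed as in the lab
mean(observed != predicted)
```

Here two of the eight units are misclassified (one F predicted as M and one M predicted as F), so both expressions give 0.25. The table also shows which class the errors come from, which the single rate hides.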

Plotting the results

Note. Color is the predicted Gender. Shape is the observed Gender. For example, a blue circle is an F incorrectly classified as M.

    gg2 <- ggplot(HW, aes(x=Height, y=Weight, color=Predict, shape=Gender)) + geom_point(size=3)
    gg2

Using k=50 instead of k=5

    XTrain=HW[,c(3,4)]
    YTrain=HW[,2]
    p.YTrain=knn(XTrain,XTrain,YTrain,k=50)
    HW$Predict=p.YTrain

The training error is

    mean(YTrain != p.YTrain)
    ## [1]

    gg2 <- ggplot(HW, aes(x=Height, y=Weight, color=Predict, shape=Gender)) + geom_point(size=3)
    gg2


Choosing the model (i.e. finding the optimal k)

Let's use the HWTest data to obtain estimates of the test error rate for different values of k.

Choose the model (i.e. select k) which has the lowest test error rate.

Prepare the data

    HWTest <- read.csv("...")   # file path truncated in the source
    XTest=HWTest[,c(3,4)]       # predictors: Height and Weight
    YTest=HWTest[,2]            # class labels: Gender

A loop to make everything automatic

The code below, for values of k from 1 to 50, produces an estimate of the test error rate based on the HWTest data.

    p.YTest = NULL
    test.error.rate = NULL
    for(i in 1:50){
      set.seed(1)   # knn() breaks ties at random; fix the seed for reproducibility
      p.YTest = knn(XTrain,XTest,YTrain,k=i)
      test.error.rate[i] = mean(YTest != p.YTest)
    }

The value of k minimizing the test error rate and the estimated test error rate are, respectively,

    which.min(test.error.rate)
    ## [1] 9
    min(test.error.rate)
    ## [1] 0.05

We can do the same in order to compute the training error rate.

    p.YTrain = NULL
    train.error.rate = NULL
    for(i in 1:50){
      set.seed(1)
      p.YTrain = knn(XTrain,XTrain,YTrain,k=i)
      train.error.rate[i] = mean(YTrain != p.YTrain)
    }
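The two loops differ only in the data being predicted, so they can be folded into one function. The sketch below is one possible way to do this, run on simulated stand-in data rather than HW.csv and HWTest.csv. The names knn_error_rates, sim_group and sim_set are hypothetical, and Height and Weight are drawn independently for brevity.

```r
library(class)

# Hypothetical helper: error rate of k-nn on (XNew, YNew) for each k in ks,
# with the classifier built from (XTrain, YTrain).
knn_error_rates <- function(XTrain, YTrain, XNew, YNew, ks) {
  sapply(ks, function(k) {
    set.seed(1)                                   # knn() breaks ties at random
    mean(YNew != knn(XTrain, XNew, YTrain, k = k))
  })
}

# Simulated stand-in data (not the lab files): two Gaussian groups
sim_group <- function(n, label, mh, sh, mw, sw) {
  data.frame(Gender = label,
             Height = rnorm(n, mh, sh),
             Weight = rnorm(n, mw, sw))
}
sim_set <- function(n) {
  d <- rbind(sim_group(n, "F", 65, 3.5, 121, 8),
             sim_group(n, "M", 70, 4,   143, 10))
  d$Gender <- factor(d$Gender)
  d
}

set.seed(2)
tr <- sim_set(100)   # plays the role of HW
te <- sim_set(100)   # plays the role of HWTest

ks <- 1:50
test.err  <- knn_error_rates(tr[, 2:3], tr$Gender, te[, 2:3], te$Gender, ks)
train.err <- knn_error_rates(tr[, 2:3], tr$Gender, tr[, 2:3], tr$Gender, ks)

ks[which.min(test.err)]   # k with the lowest estimated test error
```

Passing the same data as both training and new data gives the training error rates, mirroring the second loop.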


Plot the error rates

    # One row per value of k, with the two estimated error rates
    Error.rates <- data.frame("k"=1:50, "Test.error.rate"=test.error.rate, "Train.error.rate"=train.error.rate)
    # Plot against 1/k, so that model flexibility increases from left to right
    gg4 <- ggplot(Error.rates) +
      geom_line(aes(x=1/k, y=Test.error.rate), color="blue") +
      geom_line(aes(x=1/k, y=Train.error.rate), color="red") +
      xlab("1/k") + ylab("error rates") +
      ggtitle("Test ER (Blue) and Train ER (Red)")
    gg4
