Homework 3: Solutions


Statistics 413, Fall 2017

Data Analysis

Note: All data analysis results are provided by Yixin Chen.

Data Analysis in R

1 Pipeline

(a) Since the purpose is to find the best multi-class classifier, I use the same analysis pipeline as in Assignment 2. I randomly split the dataset into three parts, which are used separately in the three phases of the pipeline: 60% of the data is used in the train/validation phase, where I perform K-fold cross-validation to choose the optimal tuning parameters for all tunable models; 20% of the data goes to the query phase, in which I use each model with its chosen tuning parameters (where applicable) to classify and then conclude which model gives the most accurate classification; the remaining 20% of the data is treated as test data, on which I make the final predictions to assess the chosen classifier.

1.5 Data Preprocessing

(a) Before going into multi-class classification, I need to convert all labels into numbers/factors. Therefore, I used 0-6 as representations of brickface, sky, foliage, cement, window, path and grass. (The reason for starting from 0 is simply that R's packages start classification from 0 as a norm.) Besides the random splitting of the data for the analysis pipeline, I also manipulated one feature column: because one feature has the same value for all observations, I deleted that feature before scaling and centering the data matrix. Below is my code for data preprocessing.

## import the data
dat <- read.csv("data.csv", header = F)
dat2 <- read.csv("test.csv", header = F)
data0 <- as.matrix(rbind(dat, dat2))
data00 <- data0[, -4]            # drop the constant feature column
data00[, 2:19] = scale(data00[, 2:19], center = TRUE, scale = TRUE)

## split data
set.seed(100)
idx <- sample(1:nrow(data00))
idx1 <- idx[1:floor(0.6 * nrow(data00))]
idx2 <- idx[(floor(0.6 * nrow(data00)) + 1):floor(0.8 * nrow(data00))]
idx3 <- idx[(floor(0.8 * nrow(data00)) + 1):nrow(data00)]

train <- data00[idx1, ]          # 60% training data
xtrain <- train[, 2:19]
ytrain <- train[, 1]

query <- data00[idx2, ]          # 20% query data
xquery <- query[, 2:19]
yquery <- query[, 1]

test <- data00[idx3, ]           # 20% test data
xtest <- test[, 2:19]
ytest <- test[, 1]
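All of the classifiers below report a misclassification error through a function ce(actual, predicted). The writeup does not show where this helper comes from; it behaves like ce() from the Metrics package (the mean misclassification rate). A minimal plain-R equivalent, as a sketch in case that package is not available:

# fraction of observations whose predicted label differs from the true label
ce <- function(actual, predicted) {
  mean(as.character(actual) != as.character(predicted))
}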

2 Compare and Contrast

(a) Naive Bayes

## Naive Bayes
# define training control
library(caret)        # train(), trainControl()
library(naivebayes)
train_control <- trainControl(method = "cv", number = 5)
# set up the tuning grid
grid <- expand.grid(.laplace = seq(0, 10, 0.1), .usekernel = c(FALSE, TRUE), .adjust = seq(0, 20, 1))
model_nb = train(as.factor(V1) ~ ., data = as.data.frame(train), trControl = train_control,
                 method = "naive_bayes", tuneGrid = grid)
prob_nb = predict(model_nb$finalModel, newdata = as.data.frame(xquery), type = "prob")
pred_nb = max.col(prob_nb, "first") - 1
error_nb = ce(as.factor(yquery), pred_nb)

In order to tune the parameters of the Naive Bayes classifier, I used the built-in R function "train", which performs 5-fold cross-validation for NBC over three tuning parameters (Laplace smoothing, distribution type, and bandwidth adjustment) and picks the model with the best tuning parameters. For my training dataset, the best tuning parameters were [Laplace smoothing = 0, distribution = use kernel, bandwidth adjustment = 0]. I then used this tuned model to classify the query data. The misclassification error is [value missing].

(b) LDA

library(MASS)         # lda()
model_lda = lda(as.factor(V1) ~ ., data = as.data.frame(train))
result_lda = predict(model_lda, newdata = as.data.frame(xquery))
err_lda = ce(yquery, result_lda$class)

Since LDA has essentially no tuning parameters, I simply fit the built-in LDA function in R on the training data and applied it to the query data. I used the same query data as before for consistency of comparison. Its misclassification error is [value missing].

(c) Non-regularized Multinomial Regression

## Multinomial Regression
# Non-regularized MR
library(nnet)
model_mr0 = multinom(as.factor(V1) ~ ., data = as.data.frame(train), MaxNWts = 3000, maxit = 400)
prob_mr0 = predict(model_mr0, newdata = as.data.frame(xquery), type = "probs")
pred_mr0 = max.col(prob_mr0, "first") - 1
error_mr0 = ce(as.factor(yquery), pred_mr0)

As implied by its name, non-regularized multinomial regression has no tuning parameters. Thus, I used the function "multinom" to fit the model to the training data and then used the fitted model to classify the query data. Its misclassification error is [value missing].
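Beyond the single misclassification-error number, it can be useful to see which classes each model confuses; this becomes relevant in the Reflection section below. A short sketch (not part of the original writeup), using the Naive Bayes query predictions from above:

# cross-tabulate true query labels against Naive Bayes predictions
conf_nb <- table(true = yquery, predicted = pred_nb)
print(conf_nb)
mean(yquery != pred_nb)   # overall query misclassification error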

(d) Regularized MR

# Regularized MR
library(glmnet)
set.seed(100)
error_mr = matrix(0, 11, 1)
for (a in seq(0, 10, 1)) {
  model_mr = cv.glmnet(x = xtrain, y = ytrain, type.measure = "class", alpha = a/10,
                       nfolds = 5, family = "multinomial")
  pred = predict(model_mr, newx = xquery, type = "class")
  error_mr[a + 1] = ce(yquery, pred)
}
idx_opt = which.min(error_mr)
alpha_opt = (idx_opt - 1)/10
model_mr_opt = cv.glmnet(x = xtrain, y = ytrain, type.measure = "class", alpha = alpha_opt,
                         nfolds = 5, family = "multinomial")
pred_mr = predict(model_mr_opt, newx = xquery, type = "class")
error_mr_opt = ce(yquery, pred_mr)

Since there is no guarantee which kind of penalty gives better classification performance, I decided to try 11 alpha values from 0 to 1. For each trial alpha, I found the model with the optimal lambda via the "cv.glmnet" function and then used that model as the classifier for the query data. In the end, I had 11 query errors corresponding to the 11 alpha values.

Table 1: Regularized MR query misclassification errors
Alpha: 0, 0.1, 0.2, ..., 1.0
Error: [values missing]

We can see that when alpha = 1, the model has the best prediction performance. Therefore, I use the Lasso (L1) penalty for regularized multinomial regression. I then retrained the L1-penalized MR model with alpha = 1 and got a misclassification error of [value missing] on the query data.

(e) Linear SVM

As for SVMs in general, I used the one-against-all method as the extension to the multi-class classification problem, since one-against-all requires less computation time and is more robustly implemented across packages.

# Linear SVM
library(LiblineaR)
# one-against-all + CV contained
train_control <- trainControl(method = "cv", number = 5)
# set up the tuning grid
grid_lsvm_ovr <- expand.grid(.cost = seq(1, 30, 0.1), .Loss = "L2")
model_lsvm_ovr2 = train(as.factor(V1) ~ ., data = as.data.frame(train), trControl = train_control,
                        method = "svmLinear3", tuneGrid = grid_lsvm_ovr)
pred_lsvm_ovr = predict(model_lsvm_ovr2$finalModel, newx = xquery, type = "raw")
error_lsvm_ovr = ce(as.factor(yquery), pred_lsvm_ovr$predictions)

Again I used the "train" function in R to tune the cost parameter of the linear SVM from 1 to 30 with a step size of 0.1. The resulting best cost is [value missing]. I then used this final model to classify the query dataset and got a misclassification error of [value missing].
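The per-alpha query errors summarized in Table 1 can be collected directly from the regularized MR loop above; a small sketch (assuming the objects defined in that code) is:

# pair each alpha value with its query misclassification error (the contents of Table 1)
alpha_grid <- seq(0, 1, by = 0.1)
tab1 <- data.frame(alpha = alpha_grid, query_error = as.vector(error_mr))
print(tab1)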

(f) Polynomial SVM

# Polynomial SVM
library(kernlab)
# one-against-all + CV contained
train_control <- trainControl(method = "cv", number = 5)
# set up the tuning grid
grid_psvm_ovr <- expand.grid(.degree = seq(2, 10, 1), .scale = seq(1, 10, 1), .C = seq(1, 20, 1))
model_psvm_ovr = train(as.factor(V1) ~ ., data = as.data.frame(train), trControl = train_control,
                       method = "svmPoly", tuneGrid = grid_psvm_ovr)
pred_psvm_ovr = predict(model_psvm_ovr$finalModel, newdata = xquery, type = "response")
error_psvm_ovr = ce(as.factor(yquery), pred_psvm_ovr)

Similarly, I switched the "train" method to the one for the polynomial SVM and constructed appropriate tuning grids for the polynomial degree, scale and cost. As a result of the 5-fold CV, the best tuned final model has degree = 2, scale = 2 and cost = 1. Using this model to predict the labels of the query data, I got a misclassification error of [value missing].

(g) Radial SVM

# Radial SVM
library(kernlab)
# one-against-all + CV contained
train_control <- trainControl(method = "cv", number = 5)
# set up the tuning grid
grid_rsvm_ovr <- expand.grid(.sigma = seq(1, 10, 1), .C = seq(1, 20, 1))
model_rsvm_ovr = train(as.factor(V1) ~ ., data = as.data.frame(train), trControl = train_control,
                       method = "svmRadial", tuneGrid = grid_rsvm_ovr)
pred_rsvm_ovr = predict(model_rsvm_ovr$finalModel, newdata = xquery, type = "response")
error_rsvm_ovr = ce(as.factor(yquery), pred_rsvm_ovr)

Again, with tuning parameters sigma and cost, I used the "train" function and found that the best tuning parameters are sigma = 10 and cost = 4. The resulting query misclassification error is [value missing].

(h) Testing

As we can see, the polynomial SVM with the best tuning parameters yields the best performance in classifying the query data.

Table 2: Query misclassification errors for the best tuned models above
Model: NB, LDA, Non-regularized MR, L1 MR, Linear SVM, Poly SVM, Radial SVM
Error: [values missing]

Therefore, I continue with this model to classify the test dataset.

## Test
grid_test = model_psvm_ovr$bestTune
old_data = rbind(train, query)
model_test = train(as.factor(V1) ~ ., data = as.data.frame(old_data), trControl = train_control,
                   method = "svmPoly", tuneGrid = grid_test)
pred_test = predict(model_test$finalModel, newdata = xtest, type = "response")
error_test = ce(as.factor(ytest), pred_test)

After combining the training and query data, the retrained polynomial SVM did an even better job of classifying the test data.
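The contents of Table 2 and the selected tuning parameters can be pulled together programmatically; a short sketch (assuming the error objects and fitted models defined above):

# collect the query errors of all tuned models (the contents of Table 2)
tab2 <- data.frame(
  model = c("NB", "LDA", "Non-regularized MR", "L1 MR", "Linear SVM", "Poly SVM", "Radial SVM"),
  query_error = c(error_nb, err_lda, error_mr0, error_mr_opt, error_lsvm_ovr,
                  error_psvm_ovr, error_rsvm_ovr)
)
print(tab2)
print(model_psvm_ovr$bestTune)   # chosen degree / scale / C for the polynomial SVM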

As a final model assessment, I would say that the polynomial SVM with degree = 2, scale = 2 and cost = 1 is our best classifier for this multi-class classification task, with a very low misclassification error of [value missing] on the test data.

3 Visualization

(a) PCA

I used PCA and plotted PC2 vs. PC1 as shown below. The first graph is for the complete dataset; the second is for the training dataset only, and the last one is for the query dataset. In the training and query datasets, we can see that the data are quite separable except for some slight mixing between classes 2, 3 and 4 (which correspond to foliage, cement and window). The query dataset is sparser and also shows a different orientation and shape for each class. Therefore, it is quite reasonable that my polynomial SVM with degree 2 works best (only a small amount of curvature is needed). The detailed interpretation is given in the following section, Reflection.

[Figures: PC2 vs. PC1 scatter plots for the full, training and query datasets; not reproduced in this transcription]
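The code used to produce these PCA plots is not included in the transcription. A minimal sketch of how such plots can be generated (it reuses the objects defined in the preprocessing section; the plotting details are assumptions, not the original code):

# PCA on the (already scaled) feature columns of the full dataset
pca_all <- prcomp(data00[, 2:19])
cols <- data00[, 1] + 1                          # one color per class label 0-6
plot(pca_all$x[, 1], pca_all$x[, 2], col = cols, pch = 19, cex = 0.5,
     xlab = "PC1", ylab = "PC2", main = "PCA: full dataset")
legend("topright", legend = 0:6, col = 1:7, pch = 19, cex = 0.7)

# project the training and query subsets onto the same principal components
for (nm in c("train", "query")) {
  sub <- get(nm)
  pcs <- predict(pca_all, newdata = sub[, 2:19])
  plot(pcs[, 1], pcs[, 2], col = sub[, 1] + 1, pch = 19, cex = 0.5,
       xlab = "PC1", ylab = "PC2", main = paste("PCA:", nm, "data"))
}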

4 Reflection

(a) As discussed above, the degree-2 polynomial SVM makes the best classifier in this case. The reason can be seen in the plots above: data with labels 1, 5 and 6 are easily separated from the rest of the data (recall that we are using the one-vs-rest method) by a quadratic decision boundary, and although the data with labels 2, 3 and 4 are mixed to some extent, they cannot be separated by linear hyperplanes but can still be separated by a soft-margin polynomial SVM. As for the radial (non-linear) SVM, each class of data, especially 2, 3 and 4, does not have a nicely rounded shape; therefore, although a radial SVM can separate the data, it tends to overfit the training data and consequently does not classify the query data as well. As a result, the degree-2 polynomial SVM was the best classifier. As predicted by the PCA plots above, all labels other than 2, 3 and 4 have zero misclassification error.
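The claim about zero per-class error outside classes 2, 3 and 4 can be checked directly from the test predictions; a small sketch (assuming pred_test and ytest from the testing step above):

# misclassification rate within each true class on the test set
pred_num <- as.numeric(as.character(pred_test))   # caret/kernlab return a factor of labels
per_class_err <- tapply(pred_num != ytest, ytest, mean)
round(per_class_err, 3)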

Theoretical Problems

1. Show that the following two optimization problems are equivalent:

$$\min_{\beta_0,\beta} \;\; \frac{1}{2}\|\beta\|_2^2 + \gamma\sum_{i=1}^n \xi_i \quad \text{subject to} \quad y_i(x_i^T\beta + \beta_0) \geq 1 - \xi_i, \; i = 1,\dots,n, \qquad \xi_i \geq 0, \; i = 1,\dots,n \tag{1}$$

$$\min_{\beta_0,\beta} \;\; \sum_{i=1}^n \left(1 - y_i(x_i^T\beta + \beta_0)\right)_+ + \frac{\lambda}{2}\|\beta\|_2^2 \tag{2}$$

Note that the first constraint in Problem (1) can be re-written as

$$\xi_i \geq 1 - y_i(x_i^T\beta + \beta_0), \quad \forall i.$$

Combining this with the second constraint, we see

$$\xi_i \geq \left[1 - y_i(x_i^T\beta + \beta_0)\right]_+, \quad \forall i.$$

Substituting this into the objective function of Problem (1) as an equality (the substitution is justified below), the objective becomes

$$\min_{\beta_0,\beta} \;\; \frac{1}{2}\|\beta\|_2^2 + \gamma\sum_{i=1}^n \left[1 - y_i(x_i^T\beta + \beta_0)\right]_+.$$

Multiplying through by $\lambda = \frac{1}{\gamma}$, we obtain

$$\min_{\beta_0,\beta} \;\; \frac{\lambda}{2}\|\beta\|_2^2 + \sum_{i=1}^n \left[1 - y_i(x_i^T\beta + \beta_0)\right]_+,$$

which is Problem (2), as desired.

For completeness, we now need to justify the substitution (the constraints only give an inequality, not an equality):

$$\xi_i = \left[1 - y_i(x_i^T\beta + \beta_0)\right]_+, \quad \forall i.$$

Suppose, for the sake of argument, that this were not true: that is, that $(\beta_0, \beta, \xi)$ forms a solution to Problem (1) with $\xi_i = \delta + \left[1 - y_i(x_i^T\beta + \beta_0)\right]_+$ for some fixed $i$ and some $\delta > 0$. (We may assume $\delta > 0$ from the constraints, since $\xi_i$ cannot be smaller than the positive part.) Then $(\beta_0, \beta, \xi' = \xi - \delta e_i)$ is also a feasible point for Problem (1) with a strictly lower value of the objective function. To see this, note that the only value to have changed is $\xi_i' = \xi_i - \delta$, and that the constraint involving this quantity,

$$\xi_i' \geq \left[1 - y_i(x_i^T\beta + \beta_0)\right]_+,$$

remains true by construction. Next note that the objective function is lower by

$$f(\beta_0,\beta,\xi) - f(\beta_0,\beta,\xi') = \left(\frac{1}{2}\|\beta\|_2^2 + \gamma\sum_{j=1}^n \xi_j\right) - \left(\frac{1}{2}\|\beta\|_2^2 + \gamma\sum_{j=1}^n \xi_j'\right) = \gamma\big(\xi_i - (\xi_i - \delta)\big) = \gamma\delta > 0, \quad \text{assuming } \gamma > 0.$$

This contradicts our assumption that $(\beta_0, \beta, \xi)$ is a solution to Problem (1), however, and hence we must have $\delta = 0$, giving

$$\xi_i = \left[1 - y_i(x_i^T\beta + \beta_0)\right]_+, \quad \forall i,$$

and our substitution is valid.

It is also possible to prove the validity of the substitution using the KKT conditions. We can write Problem (1) in the standard form for convex optimization problems:

$$\min_{\beta_0,\beta,\xi} \;\; \frac{1}{2}\|\beta\|_2^2 + \gamma\sum_{i=1}^n \xi_i \quad \text{subject to} \quad 1 - \xi_i - y_i(x_i^T\beta + \beta_0) \leq 0, \quad -\xi_i \leq 0, \quad i \in \{1,\dots,n\}.$$

The corresponding Lagrangian is

$$
\begin{aligned}
L &= \frac{1}{2}\|\beta\|_2^2 + \gamma\sum_{i=1}^n \xi_i + \sum_{i=1}^n \lambda_i(-\xi_i) + \sum_{i=1}^n \mu_i\big(1 - \xi_i - y_i(x_i^T\beta + \beta_0)\big) \\
  &= \frac{1}{2}\|\beta\|_2^2 + \sum_{i=1}^n \xi_i(\gamma - \lambda_i - \mu_i) + \sum_{i=1}^n \mu_i\big(1 - y_i(x_i^T\beta + \beta_0)\big) \\
  &= \frac{1}{2}\|\beta\|_2^2 + \sum_{i=1}^n \xi_i(\gamma - \lambda_i - \mu_i) + \sum_{i=1}^n \mu_i - \Big(\sum_{i=1}^n \mu_i y_i x_i^T\Big)\beta - \beta_0\sum_{i=1}^n \mu_i y_i.
\end{aligned}
$$

From the KKT conditions, we get

$$
\begin{aligned}
1 - \xi_i - y_i(x_i^T\beta + \beta_0) &\leq 0, \quad i \in \{1,\dots,n\} && \text{(Primal feasibility)}\\
\xi_i &\geq 0, \quad i \in \{1,\dots,n\} && \\
\lambda_i &\geq 0, \quad i \in \{1,\dots,n\} && \text{(Dual feasibility)}\\
\mu_i &\geq 0, \quad i \in \{1,\dots,n\} && \\
\mu_i\big(1 - \xi_i - y_i(x_i^T\beta + \beta_0)\big) &= 0, \quad i \in \{1,\dots,n\} && \text{(Complementary slackness)}\\
\lambda_i\,\xi_i &= 0, \quad i \in \{1,\dots,n\} && \\
\beta^T - \sum_{i=1}^n \mu_i y_i x_i^T &= 0 && \text{(Gradient condition: } \beta)\\
\sum_{i=1}^n \mu_i y_i &= 0 && \text{(Gradient condition: } \beta_0)\\
\gamma - \lambda_i - \mu_i &= 0, \quad i \in \{1,\dots,n\} && \text{(Gradient condition: } \xi_i)
\end{aligned}
$$

Complementary slackness gives us

$$\mu_i > 0 \implies \xi_i = 1 - y_i(x_i^T\beta + \beta_0), \qquad \lambda_i > 0 \implies \xi_i = 0,$$

so it suffices to show that at least one of $\lambda_i, \mu_i$ is positive for each $i$. The final KKT condition gives us $\gamma = \lambda_i + \mu_i$ for all $i$, and we have $\lambda_i, \mu_i \geq 0$ from dual feasibility, so at least one of $\lambda_i, \mu_i$ must be positive whenever $\gamma > 0$, giving

$$\xi_i = \left[1 - y_i(x_i^T\beta + \beta_0)\right]_+ \quad \text{for all } i,$$

as desired.
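As an aside (not part of the original solution), the equivalence can be sanity-checked numerically: for any fixed (beta0, beta), the optimal slacks in Problem (1) are xi_i = [1 - y_i(x_i'beta + beta0)]_+, and the profiled objective then equals 1/gamma times the objective of Problem (2) with lambda = 1/gamma. A small R sketch:

# numerical check of the hinge-loss reformulation on random data
set.seed(1)
n <- 5; p <- 3
X <- matrix(rnorm(n * p), n, p)
y <- sample(c(-1, 1), n, replace = TRUE)
beta <- rnorm(p); beta0 <- 0.2
gamma <- 2; lambda <- 1 / gamma

margins <- y * (X %*% beta + beta0)
xi_opt  <- pmax(0, 1 - margins)                  # optimal slacks in Problem (1) for fixed (beta0, beta)

obj1 <- 0.5 * sum(beta^2) + gamma * sum(xi_opt)                    # Problem (1) objective at optimal xi
obj2 <- sum(pmax(0, 1 - margins)) + lambda / 2 * sum(beta^2)       # Problem (2) objective
all.equal(lambda * obj1, obj2)                   # TRUE: the objectives agree up to the factor lambda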
