Stat 45/75 Homework 4

Instructions: Please list your name and student number clearly. In order to receive credit for a problem, your solution must show sufficient detail so that the grader can determine how you obtained your answer.

1. The file tumor.csv was created from data compiled in the mid 1990s. Each record was generated from a digitized image of a fine needle aspirate (FNA) of a breast mass. The features describe characteristics of the cell nuclei present in the image. It is of interest to classify the mass as benign or malignant based on a number of features which describe the mass. The columns of the dataset are as follows:

Radius (mean of distances from center to points on the perimeter)
Texture (standard deviation of gray-scale values)
Perimeter
Area
Smoothness (local variation in radius lengths)
Compactness (perimeter^2 / area - 1.0)
Concavity (severity of concave portions of the contour)
Concave points (number of concave portions of the contour)
Symmetry
Fractal dimension ("coastline approximation" - 1)

(a). Explore the data graphically in order to investigate the association between Diagnosis and the other features. Which features seem useful in predicting Diagnosis? Are any features highly related to each other? Describe your findings.

tumor = read.csv("tumor.csv")
pairs(tumor)
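The strong pairwise relationships among the size-related features are geometrically expected: for a roughly circular mass, perimeter is about 2*pi*r and area about pi*r^2, both monotone in the radius. A quick illustration with simulated radii (hypothetical values, not the tumor data):

```r
# Simulated radii (hypothetical, not the tumor data) showing why Radius,
# Perimeter, and Area are almost perfectly correlated for round masses
set.seed(1)
r <- runif(100, min = 5, max = 25)
round(cor(cbind(Radius = r, Perimeter = 2 * pi * r, Area = pi * r^2)), 2)
```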
[Figure: scatterplot matrix (pairs plot) of Diagnosis, Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave.Points, Symmetry, and Fractal.Dimension]

Radius, Perimeter, Area, Compactness, and Concave.Points all appear to be useful in classifying Diagnosis. Radius, Perimeter, and Area, however, are very highly correlated with one another.

(b). Create a plot of Radius vs Symmetry, coloring the points based on diagnosis. Do these variables seem to do a good job of explaining diagnosis? Additionally, try to predict which classification method will perform best and briefly explain why.

plot(tumor$Symmetry, tumor$Radius, xlab = "Symmetry", ylab = "Radius",
     col = tumor$Diagnosis)
legend("topright", col = 1:2, legend = c("benign", "malignant"), pch = 21)
[Figure: Radius vs. Symmetry, colored by Diagnosis (black = benign, red = malignant)]

Yes, large values of Symmetry and Radius tend to align with malignant tumors, whereas small values of the two variables are more likely benign. I wouldn't expect much difference among the approaches - the two groups appear to form reasonably well-separated clusters - so I would likely choose logistic regression; LDA should give similar results.

(c). Split the data into a 90% training set and a 10% test set, being sure to set a seed of 1 for consistency. How many rows are in the test set?

set.seed(1)
train.obs = sample(1:nrow(tumor), floor(0.9 * nrow(tumor)), replace = FALSE)
tumor.train = tumor[train.obs, ]
tumor.test = tumor[-train.obs, ]
Diag.test = tumor.test$Diagnosis
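A quick arithmetic check of the split sizes, assuming the full dataset has 569 rows (an assumption, but one that matches the 57 test observations reported below):

```r
# Split-size check, assuming the full dataset has n = 569 rows
n <- 569
floor(0.9 * n)      # rows in the training set
n - floor(0.9 * n)  # rows in the test set
```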
The test dataset has 57 observations.

(d). Using the training data, fit a logistic regression model predicting the probability of a malignant tumor. Which features matter? Note: Be sure to remove features which are highly correlated!

glm.fit = glm(Diagnosis ~ Texture + Area + Smoothness + Concave.Points + Symmetry,
              data = tumor, family = binomial, subset = train.obs)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(glm.fit)

Call:
glm(formula = Diagnosis ~ Texture + Area + Smoothness + Concave.Points +
    Symmetry, family = binomial, data = tumor, subset = train.obs)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.72846  -0.11662  -0.02843     0.075     3.787

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)    -27.80904    5.00349  -5.558 2.73e-08 ***
Texture          0.39509    0.07016   5.631 1.79e-08 ***
Area             0.01114    0.00234   4.762 1.92e-06 ***
Smoothness      47.67845   28.50206   1.673   0.0944 .
Concave.Points  88.21814   20.12835   4.383 1.17e-05 ***
Symmetry        19.54694   11.43733   1.709   0.0874 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 674.30  on 511  degrees of freedom
Residual deviance: 124.74  on 506  degrees of freedom
AIC: 136.74

Number of Fisher Scoring iterations: 8

Texture, Area, and Concave.Points are clearly significant; Smoothness and Symmetry are only marginally significant (p-values of about 0.09). [Note: Answers may vary slightly.]

(e). If a tumor is truly malignant, the cost of an initial misclassification is very high, but if the tumor is benign, a misclassification is not so severe because further testing would discover
this. Using a probability threshold of 0.5, report the test misclassification rate and create a confusion matrix for these predictions and discuss, keeping this in mind.

glm.probs = predict(glm.fit, tumor.test, type = "response")
glm.pred = rep(0, length(glm.probs))
glm.pred[glm.probs > 0.5] = 1
mean(glm.pred != as.numeric(Diag.test) - 1)

[1] 0.122807

table(glm.pred, Diag.test)

        Diag.test
glm.pred Benign Malignant
       0     31         4
       1      3        19

The misclassification rate is 0.1228, with 4 false negatives and 3 false positives. Since false negatives are the costly errors here, we might want to change our decision rule, at the risk of more false positives, to limit the false negatives.

(f). Repeat part (e) but decrease the threshold to 0.25 and discuss the differences. What is the misclassification rate for this threshold?

glm.pred = rep(0, length(glm.probs))
glm.pred[glm.probs > 0.25] = 1
mean(glm.pred != as.numeric(Diag.test) - 1)

[1] 0.1052632

table(glm.pred, Diag.test)

        Diag.test
glm.pred Benign Malignant
       0     30         2
       1      4        21

Using the 0.25 threshold, the misclassification rate decreases slightly to 0.1053. The number of false negatives drops to 2, while the number of false positives rises to 4.

(g). Perform LDA on the training data to predict Diagnosis based on the same variables used in part (d). What is the test error?

library(MASS)
lda.fit = lda(Diagnosis ~ Texture + Area + Smoothness + Concave.Points + Symmetry,
              data = tumor, subset = train.obs)
lda.pred = predict(lda.fit, tumor.test)
mean(lda.pred$class != Diag.test)

[1] 0.122807

The misclassification rate is 0.1228.

(h). Perform QDA on the training data to predict Diagnosis based on the same variables used in part (d). What is the test error?

library(MASS)
qda.fit = qda(Diagnosis ~ Texture + Area + Smoothness + Concave.Points + Symmetry,
              data = tumor, subset = train.obs)
qda.pred = predict(qda.fit, tumor.test)
mean(qda.pred$class != Diag.test)

[1] 0.1052632

The misclassification rate is 0.1053.

(i). Perform KNN on the training data using several values of k to predict Diagnosis based on the same variables used in (d). Report test errors and which value of k works best for these data.

library(class)
tumor.train.x = tumor.train[, -1]
tumor.test.x = tumor.test[, -1]
Diag.train = tumor.train$Diagnosis
knn.rate = rep(NA, 10)
for (k in 1:10) {
  knn.pred = knn(tumor.train.x, tumor.test.x, Diag.train, k)
  knn.rate[k] = mean(knn.pred != Diag.test)
}

The minimum misclassification rate is 0.0702, attained when k = 4, 5, or 7.
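The best k can be read off the knn.rate vector with which(); a small sketch using placeholder error values (illustrative only - just the minimum of 0.0702 at k = 4, 5, and 7 matches the reported results):

```r
# Placeholder error rates for k = 1:10 (hypothetical, not the actual knn.rate;
# only the minimum 0.0702 at k = 4, 5, 7 matches the reported results)
knn.rate <- c(0.0877, 0.0877, 0.0789, 0.0702, 0.0702, 0.0789, 0.0702,
              0.0877, 0.0965, 0.0965)
which(knn.rate == min(knn.rate))  # k values attaining the minimum: 4 5 7
```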
2. (Stat 75) Recall that in LDA, the decision rule for assigning an observation x to class k is to choose the class for which

\delta_k(x) = \frac{x \mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)

is largest. Show that for a two-class problem, if the prior probabilities \pi_1 and \pi_2 are each equal to 0.5, the decision boundary is given by x = \frac{\mu_1 + \mu_2}{2}.

With equal priors, the two discriminant functions are

\delta_1(x) = \frac{x \mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} + \log(0.5)

\delta_2(x) = \frac{x \mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2} + \log(0.5)

The decision boundary is where \delta_1(x) = \delta_2(x):

\frac{x \mu_1}{\sigma^2} - \frac{\mu_1^2}{2\sigma^2} + \log(0.5) = \frac{x \mu_2}{\sigma^2} - \frac{\mu_2^2}{2\sigma^2} + \log(0.5)

The \log(0.5) terms cancel, leaving

\frac{x \mu_1}{\sigma^2} - \frac{x \mu_2}{\sigma^2} = \frac{\mu_1^2}{2\sigma^2} - \frac{\mu_2^2}{2\sigma^2}

\frac{x}{\sigma^2}(\mu_1 - \mu_2) = \frac{1}{2\sigma^2}(\mu_1 - \mu_2)(\mu_1 + \mu_2)

Dividing both sides by (\mu_1 - \mu_2)/\sigma^2 (assuming \mu_1 \neq \mu_2) gives

x = \frac{\mu_1 + \mu_2}{2}
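As a numerical sanity check of the boundary formula, with hypothetical values mu1 = 10, mu2 = 14, and sigma^2 = 4 (not from the problem), the two discriminants should coincide exactly at the midpoint x = 12:

```r
# Hypothetical parameters (not from the problem) to verify x = (mu1 + mu2)/2
mu1 <- 10; mu2 <- 14; sigma2 <- 4
delta <- function(x, mu) x * mu / sigma2 - mu^2 / (2 * sigma2) + log(0.5)
x.star <- (mu1 + mu2) / 2                  # the claimed boundary, 12
delta(x.star, mu1) - delta(x.star, mu2)    # 0: discriminants agree at x.star
```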