Stat 4510/7510 Homework 4


Instructions: Please list your name and student number clearly. In order to receive credit for a problem, your solution must show sufficient detail so that the grader can determine how you obtained your answer.

1. The file tumor.csv was created from data compiled in the mid-1990s. Each record was generated from a digitized image of a fine needle aspirate (FNA) of a breast mass, and the features describe characteristics of the cell nuclei present in the image. It is of interest to classify the mass as benign or malignant based on a number of features which describe the mass. The columns of the dataset are as follows:

Radius (mean of distances from center to points on the perimeter)
Texture (standard deviation of gray-scale values)
Perimeter
Area
Smoothness (local variation in radius lengths)
Compactness (perimeter^2 / area - 1.0)
Concavity (severity of concave portions of the contour)
Concave points (number of concave portions of the contour)
Symmetry
Fractal dimension ("coastline approximation" - 1)

(a). Explore the data graphically in order to investigate the association between Diagnosis and the other features. Which features seem useful in predicting Diagnosis? Are any features highly related to each other? Describe your findings.

tumor = read.csv("tumor.csv")
pairs(tumor)

[Figure: scatterplot matrix from pairs(tumor) showing Diagnosis, Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, Concave.Points, Symmetry, and Fractal.Dimension]

Radius, Perimeter, Area, Compactness, and Concave.Points all appear to be useful in classifying Diagnosis. Radius, Perimeter, and Area, however, are very highly correlated with each other.
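To back up this visual impression numerically, here is a minimal sketch (assuming the tumor data frame read in above, with every column other than Diagnosis numeric) that computes the pairwise correlations among the features:

# Correlation matrix of the numeric features (drop the Diagnosis column).
round(cor(tumor[, names(tumor) != "Diagnosis"]), 2)

Large entries, for example between Radius, Perimeter, and Area, flag the redundant features noted above.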

(b). Create a plot of Radius vs Symmetry, coloring the points based on diagnosis. Do these variables seem to do a good job of explaining diagnosis? Additionally, try to predict which classification method will perform best and briefly explain why.

plot(tumor$Symmetry, tumor$Radius, xlab = "Symmetry", ylab = "Radius", col = tumor$Diagnosis)
legend("topright", col = 1:2, legend = c("benign", "malignant"), pch = 21)

[Figure: scatterplot of Radius versus Symmetry, points colored by benign/malignant diagnosis]

Yes, large values of Symmetry and Radius tend to align with malignant tumors, whereas small values of the two variables are more likely benign. I wouldn't expect much difference among the approaches, since the two groups appear clustered, and would therefore likely choose logistic regression; LDA should give similar results.

(c). Split the data into a 90% training set and a 10% test set, being sure to set a seed of 1 for consistency. How many rows are in the test set?

set.seed(1)
train.obs = sample(1:nrow(tumor), .9 * nrow(tumor), replace = FALSE)
tumor.train = tumor[train.obs, ]
tumor.test = tumor[-train.obs, ]
Diag.test = tumor.test$Diagnosis
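As a quick sanity check on the split, a short sketch assuming the objects created above:

nrow(tumor.train)             # number of training rows (512 with this seed)
nrow(tumor.test)              # number of test rows (57 with this seed)
table(tumor.train$Diagnosis)  # class balance in the training set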

The test dataset has 57 observations.

(d). Using the training data, fit a logistic regression model predicting the probability of a malignant tumor. Which features matter? Note: Be sure to remove features which are highly correlated!

glm.fit = glm(Diagnosis ~ Texture + Area + Smoothness + Concave.Points + Symmetry,
              data = tumor, family = binomial, subset = train.obs)

Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

summary(glm.fit)

Call:
glm(formula = Diagnosis ~ Texture + Area + Smoothness + Concave.Points +
    Symmetry, family = binomial, data = tumor, subset = train.obs)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.72846  -0.11662  -0.02843   0.075     3.787

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)    -27.80904    5.00349  -5.558 2.73e-08 ***
Texture          0.39509    0.07016   5.631 1.79e-08 ***
Area             0.01114    0.00234   4.762 1.92e-06 ***
Smoothness      47.67845   28.50206   1.673   0.0944 .
Concave.Points  88.21814   20.12835   4.383 1.17e-05 ***
Symmetry        19.54694   11.43733   1.709   0.0874 .
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 674.30  on 511  degrees of freedom
Residual deviance: 124.74  on 506  degrees of freedom
AIC: 136.74

Number of Fisher Scoring iterations: 8

The variables Texture, Area, Smoothness, Concave.Points, and Symmetry all appear significant, though Smoothness and Symmetry only at the 10% level. [Note: Answers may vary slightly.]
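Since the coefficients are on the log-odds scale, a minimal sketch (assuming glm.fit fit as above) that re-expresses them as odds ratios can make the "which features matter" question more concrete:

exp(coef(glm.fit))              # multiplicative change in the odds of malignancy per one-unit increase
exp(confint.default(glm.fit))   # Wald confidence intervals on the odds-ratio scale

For example, each additional unit of Texture multiplies the estimated odds of malignancy by exp(0.395), roughly 1.5.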

(e). If a tumor is truly malignant, the cost of an initial misclassification is very high, but if the tumor is benign, a misclassification is not so severe because further testing would discover this. Using a probability threshold of 0.5, report the training misclassification rate and create a confusion matrix for these predictions and discuss, keeping this in mind.

glm.probs = predict(glm.fit, tumor.test, type = "response")
glm.pred = rep(0, length(glm.probs))
glm.pred[glm.probs > 0.5] = 1
mean(glm.pred != as.numeric(Diag.test) - 1)

[1] 0.122807

table(glm.pred, Diag.test)

        Diag.test
glm.pred Benign Malignant
       0     31         4
       1      3        19

The misclassification rate is 0.1228, with 4 false negatives and 3 false positives. Since false negatives are the more costly error here, we might want to change our decision rule, at the risk of more false positives, in order to limit false negatives.

(f). Repeat part (e) but decrease the threshold to 0.25 and discuss the differences. What is the misclassification rate for this threshold?

glm.pred = rep(0, length(glm.probs))
glm.pred[glm.probs > 0.25] = 1
mean(glm.pred != as.numeric(Diag.test) - 1)

[1] 0.105263

table(glm.pred, Diag.test)

        Diag.test
glm.pred Benign Malignant
       0     30         2
       1      4        21

Using the 0.25 threshold, the misclassification rate decreases to 0.1053. The number of false negatives decreases to 2, but the number of false positives increases to 4.
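Because parts (e) and (f) trade false negatives against false positives, sweeping the threshold and reporting sensitivity (the rate at which malignant tumors are caught) and specificity (the rate at which benign tumors are left alone) makes that trade-off explicit. A minimal sketch, assuming glm.probs and Diag.test from above and the same 0/1 coding used there (Malignant = 1), over a few illustrative thresholds:

truth <- as.numeric(Diag.test) - 1            # 1 = Malignant, 0 = Benign
for (p in c(0.10, 0.25, 0.50, 0.75)) {
  pred <- as.numeric(glm.probs > p)
  sens <- mean(pred[truth == 1] == 1)         # proportion of malignant tumors predicted malignant
  spec <- mean(pred[truth == 0] == 0)         # proportion of benign tumors predicted benign
  cat("threshold =", p, " sensitivity =", round(sens, 3),
      " specificity =", round(spec, 3), "\n")
}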

(g). Perform LDA on the training data to predict Diagnosis based on the same variables used in part (d). What is the test error?

library(MASS)
lda.fit = lda(Diagnosis ~ Texture + Area + Smoothness + Concave.Points + Symmetry,
              data = tumor, subset = train.obs)
lda.pred = predict(lda.fit, tumor.test)
mean(lda.pred$class != Diag.test)

[1] 0.122807

The misclassification rate is 0.1228.

(h). Perform QDA on the training data to predict Diagnosis based on the same variables used in part (d). What is the test error?

qda.fit = qda(Diagnosis ~ Texture + Area + Smoothness + Concave.Points + Symmetry,
              data = tumor, subset = train.obs)
qda.pred = predict(qda.fit, tumor.test)
mean(qda.pred$class != Diag.test)

[1] 0.105263

The misclassification rate is 0.1053.

(i). Perform KNN on the training data using several values of k to predict Diagnosis based on the same variables used in (d). Report test errors and which value of k works best for these data.

library(class)
tumor.train.x = tumor.train[, -1]
tumor.test.x = tumor.test[, -1]
Diag.train = tumor.train$Diagnosis
knn.rate = NA
for (k in 1:10) {
  knn.pred = knn(tumor.train.x, tumor.test.x, Diag.train, k)
  knn.rate[k] = mean(knn.pred != Diag.test)
}

The minimum misclassification rate is 0.0702, attained at k = 4, 5, or 7.
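To identify the best k programmatically, a short sketch assuming the knn.rate vector filled in by the loop above:

min(knn.rate)                       # smallest test misclassification rate (0.0702 here)
which(knn.rate == min(knn.rate))    # k values attaining it (4, 5, and 7 here)
plot(1:10, knn.rate, type = "b", xlab = "k", ylab = "Test misclassification rate")

Because knn() breaks voting ties at random, the exact rates can vary slightly between runs.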

Stat 45/75 7/7 2. 75 Recall that in LDA, the decision rule for assigning an observation x to class k is to choose the class for which δ k (x) = x µ k σ 2 µ2 k 2σ 2 + log(π k) is largest. Show that for a two class problem, if the prior probabilities π 1 and π 2 are each equal to 0.5, the decision boundary is given by x = µ 1 + µ 2 2 δ 1 (x) =x µ 1 σ 2 µ2 1 2σ 2 + log(.5) δ 2 (x) =x µ 2 σ 2 µ2 2 2σ 2 + log(.5) The decision boundary is given when δ 1 (x) = δ 2 (x). δ 1 (x) = δ 2 (x) = x µ 1 σ 2 µ2 1 2σ 2 + log(.5) = xµ 2 σ 2 µ2 2 2σ 2 + log(.5) = x µ 1 σ 2 µ2 1 2σ 2 = xµ 2 σ 2 µ2 2 2σ 2 = x µ 1 σ 2 xµ 2 σ 2 = µ2 1 2σ 2 µ2 2 2σ 2 = x σ 2 (µ 1 µ 2 ) = 1 2σ 2 [(µ 1 µ 2 ) (µ 1 + µ 2 )] = x = µ 1 + µ 2 2 7