Homework: Data Mining


This homework sheet will test your knowledge of data mining using R.

a) Load the file titanic.csv into R as follows. This dataset provides information on the survival of the passengers of the Titanic.

titanic <- as.data.frame(read.csv("titanic.csv", header = TRUE, sep = ","))
titanic$survived <- as.factor(titanic$survived)

# Remove NA observations
titanic <- na.omit(titanic[, c("survived", "pclass", "sex", "age",
                               "sibsp", "parch", "fare", "embarked")])

# Number of observations
nrow(titanic)
## [1] 1045

# List of column names
colnames(titanic)
## [1] "survived" "pclass"   "sex"      "age"      "sibsp"    "parch"
## [7] "fare"     "embarked"

# View first rows of the dataset
head(titanic)

Column Name  Interpretation
survived     Survival (0 = No; 1 = Yes)
pclass       Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
sex          Sex
age          Age
sibsp        Number of Siblings/Spouses Aboard
parch        Number of Parents/Children Aboard
fare         Passenger Fare
embarked     Port of Embarkation (C = Cherbourg; Q = Queenstown; S = Southampton)
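Before modeling, it is worth verifying that each column was imported with the intended type; a quick check (a sketch, output omitted):

# Check variable types: survived should be a factor, age and fare numeric
str(titanic)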

b) Use the k-nearest neighbor method with k = 3 to predict whether a 35-year-old person from 1st class survived.

library(class)
train <- as.data.frame(cbind(titanic$age, titanic$pclass))
knn(train, c(35, 1), titanic$survived, k = 3)
## [1] 1
## Levels: 0 1
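Note that knn() computes plain Euclidean distances, so age (spanning decades) dominates pclass (ranging only from 1 to 3). A variant with standardized features (a sketch; this normalization step is an addition, not part of the original sheet):

# Standardize both features, then apply the same scaling to the query point
train.scaled <- scale(train)
query <- c(35, 1)
query.scaled <- (query - attr(train.scaled, "scaled:center")) /
  attr(train.scaled, "scaled:scale")
knn(train.scaled, query.scaled, titanic$survived, k = 3)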

c) Build a decision tree that shows which people on the Titanic survived. Use only the variables pclass, age, sex, fare, sibsp, parch and embarked. Plot and interpret your result.

library(rpart)
library(party)
library(partykit)
dt <- rpart(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
            method = "class", data = titanic)
plot(as.party(dt))

[Decision tree plot: the root splits on sex; further splits use age, pclass, sibsp, fare and embarked, with survival proportions shown in the leaf nodes.]

Generally speaking, the strongest distinguishing power comes from the root node. Thus, sex as the root node determines survival to a large extent. The next nodes are age and pclass, respectively. Finally, one can see that females traveling in 1st/2nd class have a very high survival rate.

d) Prune the decision tree. Plot the pruned tree and interpret your result. Use R to calculate the number of nodes in the pruned tree.

# Optimal size according to the cross-validation error
which.min(dt$cptable[, "xerror"])
## 5
## 5
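The cross-validation errors underlying this choice come from the complexity table of the fitted tree, which can also be printed directly (a sketch, output omitted):

# Complexity parameter table: one row per subtree size, with the
# cross-validated relative error in the xerror column
printcp(dt)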

pruned <- prune(dt, cp = dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"])
plot(as.party(pruned))

[Pruned tree plot: the root again splits on sex, followed by splits on age, pclass, sibsp, fare and embarked.]

Sex remains the most distinguishing feature. Older males (above 9.5 years of age) have a relatively low chance of surviving, in contrast to women from 1st/2nd class.
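One way to let R count the nodes of the pruned tree (a sketch; the frame of an rpart object contains one row per node, leaves included):

# Number of nodes in the pruned tree (internal nodes + leaves)
nrow(pruned$frame)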

e) Does the decision tree support the hypothesis that women and children were saved first? Why is it difficult to analyze this hypothesis with OLS?

Looking at the positions of age and sex in the tree, we can see that older males have, overall, a very low probability of survival. As we clearly have a non-linear effect, this hypothesis is difficult to verify with a linear model. As a remedy, one has to use e.g. interaction terms.

f) Split the data into two subsets for training (20 %) and testing (80 %). Recalculate the above pruned decision tree and measure the prediction performance using the confusion matrix, as well as accuracy, precision, sensitivity and the F1 score.

intrain <- runif(nrow(titanic)) < 0.2
dt <- rpart(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
            method = "class", data = titanic[intrain, ])
pruned <- prune(dt, cp = dt$cptable[which.min(dt$cptable[, "xerror"]), "CP"])
pred <- predict(pruned, titanic[!intrain, ], type = "class")

# Confusion matrix
cm <- table(pred = pred, true = titanic$survived[!intrain])
cm

# Accuracy
sum(diag(cm)) / sum(cm)
## [1] 0.793

# Precision
cm[1, 1] / (cm[1, 1] + cm[1, 2])
## [1] 0.7624

# Sensitivity
cm[1, 1] / (cm[1, 1] + cm[2, 1])
## [1] 0.945

# F1 score
2 * cm[1, 1] / (2 * cm[1, 1] + cm[1, 2] + cm[2, 1])
## [1] 0.8439
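The same four metrics are recomputed for every model below; a sketch of a small helper function, assuming the 2x2 confusion-matrix layout used throughout with the first row/column as the positive class:

# Compute accuracy, precision, sensitivity and F1 from a 2x2 confusion matrix
evaluate <- function(cm) {
  accuracy    <- sum(diag(cm)) / sum(cm)
  precision   <- cm[1, 1] / (cm[1, 1] + cm[1, 2])
  sensitivity <- cm[1, 1] / (cm[1, 1] + cm[2, 1])
  f1          <- 2 * precision * sensitivity / (precision + sensitivity)
  c(accuracy = accuracy, precision = precision,
    sensitivity = sensitivity, f1 = f1)
}

evaluate(cm)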

g) Split the data into two subsets for training (20 %) and testing (80 %). Train a neural network with 10 nodes in the hidden layer. Use only the variables pclass, age, sex, fare, sibsp, parch and embarked. Calculate the confusion matrix, as well as accuracy, precision, sensitivity and the F1 score.

library(nnet)
intrain <- runif(nrow(titanic)) < 0.2
ann <- nnet(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
            data = titanic[intrain, ], size = 10, maxit = 100, rang = 0.1,
            decay = 5e-4)
## # weights: 111
## (training trace omitted)
## stopped after 100 iterations

pred <- predict(ann, titanic[!intrain, ], type = "class")

# Confusion matrix
cm <- table(pred = pred, true = titanic$survived[!intrain])
cm

# Accuracy
sum(diag(cm)) / sum(cm)
## [1] 0.7739

# Precision
cm[1, 1] / (cm[1, 1] + cm[1, 2])
## [1] 0.7683

# Sensitivity
cm[1, 1] / (cm[1, 1] + cm[2, 1])
## [1] 0.885

# F1 score
2 * cm[1, 1] / (2 * cm[1, 1] + cm[1, 2] + cm[2, 1])
## [1] 0.8226
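Both the random split via runif() and the random initial weights of nnet() make this result change from run to run. Fixing a seed beforehand makes it reproducible (a sketch; the seed value is arbitrary):

# Fix the RNG state so the split and the initial weights are reproducible
set.seed(42)
intrain <- runif(nrow(titanic)) < 0.2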

h) Now train a neural network with 50 nodes in the hidden layer and a maximum of 1000 iterations (maxit = 1000). Compare the performance to the previous neural network. What is the cause of the drop in prediction performance, even though we increased both the number of neurons and the training effort?

ann2 <- nnet(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
             data = titanic[intrain, ], size = 50, maxit = 1000, rang = 0.1,
             decay = 5e-4)
## # weights: 551
## (training trace omitted)
## stopped after 1000 iterations

pred <- predict(ann2, titanic[!intrain, ], type = "class")

# Confusion matrix
cm <- table(pred = pred, true = titanic$survived[!intrain])
cm

# Accuracy
sum(diag(cm)) / sum(cm)
## [1] 0.7577

# Precision
cm[1, 1] / (cm[1, 1] + cm[1, 2])
## [1] 0.772

# Sensitivity
cm[1, 1] / (cm[1, 1] + cm[2, 1])
## [1] 0.8382

# F1 score
2 * cm[1, 1] / (2 * cm[1, 1] + cm[1, 2] + cm[2, 1])
## [1] 0.837

Both accuracy and precision decrease. A possible cause of this behavior is overfitting: with only 7 input variables, a total of 50 neurons in the hidden layer might be too many.
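To substantiate the overfitting explanation, one can compare in-sample and out-of-sample accuracy; a large gap between the two is the typical symptom (a sketch, reusing the objects from above):

# Accuracy on the training data (in-sample)
pred.train <- predict(ann2, titanic[intrain, ], type = "class")
mean(pred.train == titanic$survived[intrain])

# Accuracy on the test data (out-of-sample), for comparison
pred.test <- predict(ann2, titanic[!intrain, ], type = "class")
mean(pred.test == titanic$survived[!intrain])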

i) Split the data into two subsets for training (20 %) and testing (80 %). Train a support vector machine and predict the survival. Use only the variables pclass, age, sex, fare, sibsp, parch and embarked. Calculate the confusion matrix, as well as accuracy, precision, sensitivity and the F1 score.

library(e1071)
intrain <- runif(nrow(titanic)) < 0.2
model.svm <- svm(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
                 data = titanic[intrain, ], type = "C-classification")
pred <- predict(model.svm, titanic[!intrain, ])

# Confusion matrix
cm <- table(pred = pred, true = titanic$survived[!intrain])
cm

# Accuracy
sum(diag(cm)) / sum(cm)
## [1] 0.7893

# Precision
cm[1, 1] / (cm[1, 1] + cm[1, 2])
## [1] 0.7884

# Sensitivity
cm[1, 1] / (cm[1, 1] + cm[2, 1])
## [1] 0.883

# F1 score
2 * cm[1, 1] / (2 * cm[1, 1] + cm[1, 2] + cm[2, 1])
## [1] 0.838
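The svm() call above relies on the default radial kernel with default cost and gamma. These hyperparameters can be tuned by a cross-validated grid search, e.g. with tune.svm() from e1071 (a sketch; the grid values below are illustrative, not from the original sheet):

# Grid search over gamma and cost with 10-fold cross-validation
tuned <- tune.svm(survived ~ pclass + age + sex + fare + sibsp + parch + embarked,
                  data = titanic[intrain, ],
                  gamma = 10^(-3:0), cost = 10^(0:2))
tuned$best.parameters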

j) Which of the above machine learning approaches has the best performance?

Model                               Accuracy  Precision  Sensitivity  F1
Decision Tree (Pruned)              0.793     0.7624     0.945        0.8439
Neural Network (10 Hidden Neurons)  0.7739    0.7683     0.885        0.8226
Support Vector Machine              0.7893    0.7884     0.883        0.838

In the given setting, one achieves the highest accuracy with decision trees, while support vector machines have the highest precision rate. The best trade-off between precision and sensitivity originates from using the SVM.

k) Plot the receiver operating characteristic (ROC) curve for the trained SVM.

library(pROC)
pred <- predict(model.svm, titanic[!intrain, ], decision.values = TRUE)
dv <- attributes(pred)$decision.values
plot.roc(as.numeric(titanic$survived[!intrain]), dv, xlab = "1 - Specificity")

[ROC curve: sensitivity plotted against 1 - specificity.]

## Area under the curve: 0.84
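plot.roc() reports the AUC when the fitted object is printed; the same value can also be computed directly (a sketch):

# Build the ROC object explicitly and extract the area under the curve
roc.svm <- roc(as.numeric(titanic$survived[!intrain]), as.numeric(dv))
auc(roc.svm)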

l) Perform a k-means clustering to determine the origin of wines. Use k = 3 means with n = 10 random initializations. What is the within-cluster sum of squares (WCSS) value? As input data we use the dataset wines that is included in the kohonen package.

library(kohonen)
data(wines)

The dataset contains 177 rows and thirteen columns; the object vintages contains the class labels. For compatibility with older versions of the package, the variable wine.classes is retained too. This data is the result of chemical analyses of wines grown in the same region of Italy (Piedmont) but derived from three different cultivars: Nebbiolo, Barbera and Grignolino grapes. The wine from the Nebbiolo grape is called Barolo. The data contains the quantities of several constituents found in each of the three types of wines, as well as some spectroscopic variables.

head(wines)
## (first rows omitted; the thirteen columns are alcohol, malic acid, ash,
##  ash alkalinity, magnesium, tot. phenols, flavonoids, non-flav. phenols,
##  proanth, col. int., col. hue, OD ratio and proline)

km <- kmeans(wines, 3, nstart = 10)
km
## K-means clustering with 3 clusters of sizes 62, 46, 69
##
## Cluster means and clustering vector: (numeric output omitted)
##
## Within cluster sum of squares by cluster: (values omitted)
## (between_SS / total_SS = 86.5 %)
##
## Available components:
## [1] "cluster"      "centers"      "totss"        "withinss"
## [5] "tot.withinss" "betweenss"    "size"

The WCSS value is obtained from the tot.withinss component:

# Total within-cluster sum of squares (WCSS)
km$tot.withinss

m) Use the above result from the clustering procedure to plot the data points and clusters in a two-dimensional plot showing only the dimensions alcohol and ash.

plot(wines[, "alcohol"], wines[, "ash"], col = km$cluster)
points(km$centers[, "alcohol"], km$centers[, "ash"],
       col = 1:nrow(km$centers), pch = 8)

[Scatter plot of ash against alcohol, with points colored by cluster and the cluster centers marked by stars.]
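Note that the wine variables live on very different scales (proline runs into the hundreds and thousands, while the color hue is around one), so the unscaled Euclidean distances are dominated by the large-valued columns. A common variant, not used in the solution above, is to standardize the columns first and compare the resulting clusters with the true origins (a sketch):

# k-means on standardized columns; table() contrasts clusters with vintages
km.scaled <- kmeans(scale(wines), 3, nstart = 10)
table(cluster = km.scaled$cluster, vintages)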

n) What is the recommended number of clusters according to the elbow plot?

ev <- c()
for (i in 1:15) {
  km <- kmeans(wines, i, nstart = 10)
  ev[i] <- km$betweenss / km$totss
}
plot(1:15, ev, type = "l", xlab = "Number of Clusters", ylab = "Explained Variance")

[Elbow plot: explained variance against the number of clusters.]

The recommended number seems to be 3 clusters, matching the data description, which names three origins.
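Equivalently, the elbow can be read from the within-cluster sum of squares itself, which links back to the WCSS from l); a sketch:

# WCSS for k = 1..15; the kink ("elbow") suggests the number of clusters
wcss <- sapply(1:15, function(k) kmeans(wines, k, nstart = 10)$tot.withinss)
plot(1:15, wcss, type = "b", xlab = "Number of Clusters",
     ylab = "Within-Cluster Sum of Squares")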
