STAT 408 - December 5, 2017
Here are a few questions to consider: What does statistical learning mean to you? Is statistical learning different from statistics as a whole? And what about terms like data science, data mining, data analytics, machine learning, and predictive analytics: how do these differ from statistics and statistical learning?
Definition - Statistical learning refers to a set of tools for modeling and understanding complex datasets. It is a recently developed area in statistics and blends with parallel developments in computer science and, in particular, machine learning. The field encompasses many methods, such as the lasso and sparse regression, classification and regression trees, and boosting and support vector machines. Courtesy of An Introduction to Statistical Learning: with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. Note: a free e-version of this textbook can be obtained through the MSU Library.
Recall the Seattle housing data set. How would you:
- Build a model to predict housing prices in King County?
- Determine if your model was good or useful?
[Table: sample rows of the data set, with columns price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, and zipcode.]
Loss functions
A loss function is a principled way to compare a set of predictive models.
Squared Error: $(\text{Price}_{pred} - \text{Price}_{actual})^2$
Zero-One Loss (binary setting):
$L(y_{pred}, y_{actual}) = \begin{cases} 1, & \text{if } y_{pred} \neq y_{actual} \\ 0, & \text{if } y_{pred} = y_{actual} \end{cases}$
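Both losses are easy to compute directly. Below is a minimal sketch in R, averaging each loss over a set of predictions; the function names are illustrative, not from the slides.

# Squared error loss, averaged over predictions (mean squared error)
squared.error <- function(actual, predicted) {
  mean((predicted - actual)^2)
}

# Zero-one loss, averaged over predictions (misclassification rate)
zero.one.loss <- function(actual, predicted) {
  mean(predicted != actual)
}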
Model Evaluation
Suppose we fit a model using all of the Seattle housing data. Can that model be used to predict prices for homes in that same data set?
[Figure: map of home locations (longitude vs. latitude), with points shaded by price.]
Model Evaluation
We cannot assess the predictive performance by fitting a model to data and then evaluating the model using the same data.
[Figure: King County Housing Sales, price vs. sqft_living.]
Test / Training and Cross-Validation
There are two common options to give valid estimates of model performance:
- Test / Training approach. Generally 70% of the data is used to fit the model and the other 30% is held out for prediction.
- Cross-Validation. Cross-validation breaks your data into k groups, or folds. A model is fit on the data in k-1 of the groups and then used to make predictions on the data in the held-out k-th group. This process continues until each group has been held out once (see the sketch below).
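A minimal sketch of k-fold cross-validation for a simple linear model, assuming a data frame seattle with columns price and sqft_living; the fold count and the model formula are illustrative.

k <- 5
folds <- sample(rep(1:k, length.out = nrow(seattle)))  # random fold assignment
cv.mad <- numeric(k)
for (fold in 1:k) {
  train <- seattle[folds != fold, ]   # fit on the other k-1 folds
  test <- seattle[folds == fold, ]    # predict on the held-out fold
  fit <- lm(price ~ sqft_living, data = train)
  cv.mad[fold] <- mean(abs(test$price - predict(fit, test)))
}
mean(cv.mad)  # cross-validated mean absolute deviation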
Constructing a test and training set
set.seed(427)
num.houses <- nrow(seattle)
seattle$zipcode <- as.factor(seattle$zipcode)
test.ids <- base::sample(1:num.houses, size = round(num.houses * 0.3))
test.set <- seattle[test.ids, ]
train.set <- seattle[(1:num.houses)[!(1:num.houses) %in% test.ids], ]
dim(seattle)
## [1] 869 14
dim(test.set)
## [1] 261 14
dim(train.set)
## [1] 608 14
Linear Regression
lm.1 <- lm(price ~ bedrooms + bathrooms + sqft_living + zipcode + waterfront, data = train.set)
summary(lm.1)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + zipcode +
##     waterfront, data = train.set)
##
## Residuals:
##    Min     1Q Median     3Q    Max
##  -9295  -6697    455   6875   2735
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -63462.32    48953.9  -3.339     .893 ***
## bedrooms     -45354.47       45.5  -3.23      .384 **
## bathrooms     -3367.6      9383.88  -.74    .86253
## sqft_living      34.          6.2   2.288   <2e-16 ***
## zipcode984    32655.66    39869.47    .89    .4378
## zipcode9824    2444.49    45547.27   2.732    .648 **
## zipcode9832  -36965.37     4636.6    -.9   .363365
## zipcode9839  27586.45      5239.52  24.376  <2e-16 ***
## zipcode987     9925.8      4295.3    2.32     .294 *
## zipcode982    44437.35     4824.37    .625  <2e-16 ***
## zipcode989    49332.89      455.54    .987  <2e-16 ***
## zipcode9848    5752.2     52453.35    .968  .333654
## waterfront    24398.8       6238.9   3.437     .629 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 256 on 595 degrees of freedom
Linear Regression
mad.lm <- mean(abs(test.set$price - predict(lm.1, test.set)))
The mean absolute deviation in housing price predictions using the linear model is $62669.
Polynomial Regression
Now include a squared term for square feet of living space too.
train.set$sqft_living2 <- train.set$sqft_living^2
test.set$sqft_living2 <- test.set$sqft_living^2
lm.2 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_living2 + zipcode + waterfront, data = train.set)
summary(lm.2)
##
## Call:
## lm(formula = price ~ bedrooms + bathrooms + sqft_living + sqft_living2 +
##     zipcode + waterfront, data = train.set)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -98374   -8999     -96     842  929376
##
## Coefficients:
##                   Estimate    Std. Error  t value  Pr(>|t|)
## (Intercept)   36424.967588  44686.362972    3.53      .237
## bedrooms      -2426.578       233.86275    -.22     .8427
## bathrooms      4668.85399     632.873996    .289    .77238
## sqft_living       4.224443     24.5992      .72     .86366
## sqft_living2       .48465       .2973      6.32     <2e-16
## zipcode984    25967.543326   3369.7572     .783     .4342
## zipcode9824   98824.56284   37923.79692    2.66      .939
## zipcode9832  -75745.595     33888.49534   -2.235     .2578
## zipcode9839    9673.3695    43784.38758   27.32     <2e-16
## zipcode987    95397.878255  35693.22367    2.673      .773
Polynomial Regression
mad.lm2 <- mean(abs(test.set$price - predict(lm.2, test.set)))
Including this squared term lowers our predictive error from $62669 in the first case to $622.
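An equivalent and arguably more idiomatic alternative, not shown on the slide, is to build the squared term into the formula itself so that predict() applies the transformation automatically:

lm.2b <- lm(price ~ bedrooms + bathrooms + sqft_living + I(sqft_living^2) +
              zipcode + waterfront, data = train.set)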
Decision Trees
[Figure: fitted regression tree for price, splitting on zipcode and sqft_living, with predicted prices at the leaves.]
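The call that fit this tree is not shown on the slide; a minimal sketch using rpart, assuming the same predictors as the linear models:

library(rpart)
tree <- rpart(price ~ bedrooms + bathrooms + sqft_living + zipcode + waterfront,
              data = train.set)
plot(tree)   # draw the tree
text(tree)   # add split labels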
Decision Trees
rmse.tree <- sqrt(mean((test.set$price - predict(tree, test.set))^2))
mad.tree <- mean(abs(test.set$price - predict(tree, test.set)))
mad.tree
## [1] 68524.9
mad.lm
## [1] 62668.7
mad.lm2
## [1] 62.7
The predictive error for this tree, $68525, is similar to that of the first linear model, $62669, and not quite as good as our second linear model, $622.
Ensemble - Random Forest
Ensemble methods combine a large set of predictive models into a single framework. One example is a random forest, which combines a large number of trees. While these methods are very effective in a predictive setting, it is often difficult to directly assess the impact of particular variables in the model.
Random Forest
One specific kind of ensemble method is known as a random forest, which combines several decision trees.
library(randomForest)
rf <- randomForest(price ~ ., data = train.set)
mad.rf <- mean(abs(test.set$price - predict(rf, test.set)))
The prediction error for the random forest, $46588, is substantially better than that of the other models we have considered.
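As an aside (not on the slide), randomForest does report a rough variable-importance measure, which partially addresses the interpretability concern raised above:

importance(rf)   # increase in node purity attributable to each predictor
varImpPlot(rf)   # plot the importance measures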
Exercise - Prediction for Capital Bike Share
bikes <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/bike.csv')
set.seed(427)
num.obs <- nrow(bikes)
test.ids <- base::sample(1:num.obs, size = round(num.obs * 0.3))
test.bikes <- bikes[test.ids, ]
train.bikes <- bikes[(1:num.obs)[!(1:num.obs) %in% test.ids], ]
dim(bikes)
## [1] 10886 12
dim(test.bikes)
## [1] 3266 12
dim(train.bikes)
## [1] 7620 12
Exercise - Prediction for Capital Bike Share
lm.bikes <- lm(count ~ holiday + atemp, data = train.bikes)
lm.mad <- mean(abs(test.bikes$count - predict(lm.bikes, test.bikes)))
Create another predictive model and compare the results to the MAD of the linear model above (29). However, don't use casual and registered in your model, as those two sum to the total count.
A Solution - Prediction for Capital Bike Share
rf.bikes <- randomForest(count ~ holiday + atemp + humidity + season + workingday + weather, data = train.bikes)
rf.mad <- mean(abs(test.bikes$count - predict(rf.bikes, test.bikes)))
The random forest has a prediction error of (8).
Given new points (*), how do we classify them?
[Figure: scatterplot of labeled points in two classes, with new unlabeled points marked *.]
Logistic Regression
##
## Call:
## glm(formula = labels ~ x + y, family = "binomial", data = supervised)
##
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -.93887    -.46        .     .285    .4429
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   23.898       .83    2.23     .43 *
## x            -47.482     23.973   -.98     .476 *
## y             -6.24       3.456   -.798    .722 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38.629 on 99 degrees of freedom
## Residual deviance: 3.24 on 97 degrees of freedom
## AIC: 9.24
##
## Number of Fisher Scoring iterations:
Logistic Regression
  x     y     Prob[Val = 1]
 .2    .7     .
 .45   .35    .588
 .48   .3     .39
 .9    .4     .
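The code behind the fit and the prediction table is not shown on these slides; a minimal sketch, assuming a data frame supervised with columns x, y, and labels, and a matrix pred.points of new (x, y) locations:

glm.fit <- glm(labels ~ x + y, family = 'binomial', data = supervised)
summary(glm.fit)
# predicted probability that each new point has label 1
round(predict(glm.fit, as.data.frame(pred.points), type = 'response'), 3)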
Decision Trees
[Figure: fitted classification tree with a single split at x >= .4849.]
Decision Trees - code
library(rpart)
library(knitr)
tree <- rpart(labels ~ ., data = supervised, method = 'class')
plot(tree)
text(tree)
kable(cbind(pred.points, round(predict(tree, as.data.frame(pred.points))[, 2], 3)),
      col.names = c('x', 'y', 'prob[val = 1]'))
Decision Trees - Boundary
[Figure: decision boundary implied by the classification tree, with new points marked *.]
Exercise: Predict Titanic Survival
library(dplyr)
titanic <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat408/datasets/titanic.csv')
set.seed(427)
titanic <- titanic %>% filter(!is.na(age))
num.pass <- nrow(titanic)
test.ids <- base::sample(1:num.pass, size = round(num.pass * 0.3))
test.titanic <- titanic[test.ids, ]
train.titanic <- titanic[(1:num.pass)[!(1:num.pass) %in% test.ids], ]
dim(titanic)
## [1] 714 12
dim(test.titanic)
## [1] 214 12
dim(train.titanic)
## [1] 500 12
Exercise: Predict Titanic Survival
See if you can improve the classification error from the model below.
glm.titanic <- glm(survived ~ age, data = train.titanic, family = 'binomial')
Class.Error <- mean(test.titanic$survived != round(predict(glm.titanic, test.titanic, type = 'response')))
The logistic regression model using only age is wrong roughly 40% of the time.
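One possible improvement, not from the slides: sex and passenger class were strong predictors of survival on the Titanic. This sketch assumes the data set contains columns named sex and pclass; adjust to the actual column names in titanic.csv.

# add sex and passenger class to the age-only logistic regression
glm.titanic2 <- glm(survived ~ age + sex + pclass, data = train.titanic,
                    family = 'binomial')
Class.Error2 <- mean(test.titanic$survived !=
                       round(predict(glm.titanic2, test.titanic, type = 'response')))
Class.Error2  # compare to the age-only model's classification error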