STAT Statistical Learning: Overview, Predictive Modeling, and Classification Methods.


STAT 48 - December 5, 27


STAT 48 - Here are a few questions to consider: What does statistical learning mean to you? Is statistical learning different from statistics as a whole? What about terms like data science, data mining, data analytics, machine learning, and predictive analytics: how are these different from statistics and statistical learning?

Definition STAT 48 - Statistical learning refers to a set of tools for modeling and understanding complex datasets. It is a recently developed area in statistics and blends with parallel developments in computer science and, in particular, machine learning. The field encompasses many methods such as the lasso and sparse regression, classification and regression trees, and boosting and support vector machines. Courtesy of An Introduction to Statistical Learning: with Applications in R, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani. Note: a free e-version of this textbook can be obtained through the MSU Library.


STAT 48 - Recall the Seattle housing data set. How would you: build a model to predict housing prices in King County? Determine whether your model was good or useful? (The slide shows the first rows of the data: price, bedrooms, bathrooms, sqft_living, sqft_lot, floors, waterfront, zipcode.)

Loss functions STAT 48 - A loss function is a principled way to compare a set of predictive models. Squared Error: $(\text{Price}_{pred} - \text{Price}_{actual})^2$. Zero-One Loss (binary setting): $f(x) = \begin{cases} 1, & y_{pred} \neq y_{actual} \\ 0, & y_{pred} = y_{actual} \end{cases}$
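As an illustration, both loss functions are easy to compute directly in R; a minimal sketch with made-up predictions (the vectors below are hypothetical, not values from the Seattle data):

# squared error loss for a numeric response (e.g., price)
price.actual <- c(300000, 450000, 620000)   # hypothetical sale prices
price.pred   <- c(310000, 425000, 650000)   # hypothetical predictions
mean((price.pred - price.actual)^2)         # mean squared error

# zero-one loss for a binary response
y.actual <- c(0, 1, 1, 0)                   # hypothetical labels
y.pred   <- c(0, 1, 0, 0)                   # hypothetical predicted labels
mean(y.pred != y.actual)                    # proportion misclassified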

Model Evaluation STAT 48 - Suppose we fit a model using all of the Seattle housing data; can that model be used to predict prices for homes in that data set? (Map of the sales: longitude vs. latitude, points colored by price.)

Model Evaluation STAT 48 - We cannot assess predictive performance by fitting a model to data and then evaluating the model using the same data. (Scatterplot: King County housing sales, price vs. sqft_living.)

Test / Training and Cross-Validation STAT 48 - There are two common options that give valid estimates of model performance: Test / Training approach. Generally 70% of the data is used to fit the model and the other 30% is held out for prediction. Cross-Validation. Cross-validation breaks your data into k groups, or folds. A model is fit on the data in k-1 of the groups and then used to make predictions on the data in the held-out kth group. This process continues until every group has been held out once.
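The slides below use the test/training approach; as a complement, here is a minimal sketch of k-fold cross-validation, assuming the Seattle data frame from these slides is loaded and using a simplified linear model for illustration (the fold assignment, k = 5, and formula are assumptions, not code from the slides):

set.seed(427)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(Seattle)))   # randomly assign each house to a fold
cv.mad <- numeric(k)
for (i in 1:k) {
  train.cv <- Seattle[folds != i, ]                     # fit on k-1 folds
  test.cv  <- Seattle[folds == i, ]                     # predict on the held-out fold
  fit <- lm(price ~ bedrooms + bathrooms + sqft_living, data = train.cv)
  cv.mad[i] <- mean(abs(test.cv$price - predict(fit, test.cv)))
}
mean(cv.mad)   # cross-validated mean absolute deviation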

Constructing a test and training set STAT 48 -
set.seed(427)
num.houses <- nrow(Seattle)
Seattle$zipcode <- as.factor(Seattle$zipcode)
test.ids <- base::sample(1:num.houses, size = round(num.houses * 0.3))
test.set <- Seattle[test.ids, ]
train.set <- Seattle[(1:num.houses)[!(1:num.houses) %in% test.ids], ]
dim(Seattle)
## [1] 869 14
dim(test.set)
## [1] 261 14
dim(train.set)
## [1] 608 14

Linear Regression STAT 48 -
lm.1 <- lm(price ~ bedrooms + bathrooms + sqft_living + zipcode + waterfront, data = train.set)
summary(lm.1)
## (Summary output: coefficient estimates, standard errors, t values, and p-values
##  for the intercept, bedrooms, bathrooms, sqft_living, the zipcode levels, and
##  waterfront; sqft_living, waterfront, and several zipcode levels are highly
##  significant. Residual standard error on 595 degrees of freedom.)

Linear Regression STAT 48 -
mad.lm <- mean(abs(test.set$price - predict(lm.1, test.set)))
The mean absolute deviation in housing price predictions using the linear model is $62669.
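In symbols, the quantity computed above is the mean absolute deviation (MAD) of the test-set predictions:

$\text{MAD} = \frac{1}{n_{\text{test}}} \sum_{i=1}^{n_{\text{test}}} \left| \text{price}_i - \widehat{\text{price}}_i \right|$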

Polynomial Regression STAT 48 - Now include a squared term for square feet of living space too.
train.set$sqft_living2 <- train.set$sqft_living^2
test.set$sqft_living2 <- test.set$sqft_living^2
lm.2 <- lm(price ~ bedrooms + bathrooms + sqft_living + sqft_living2 + zipcode + waterfront, data = train.set)
summary(lm.2)
## (Summary output: coefficient estimates, standard errors, t values, and p-values;
##  the squared term sqft_living2 and several zipcode levels are highly significant.)

Polynomial Regression STAT 48 -
mad.lm2 <- mean(abs(test.set$price - predict(lm.2, test.set)))
Including this squared term lowers our predictive error from $62669 in the first case to $622.

Decision Trees STAT 48 - (Regression tree for price: the tree splits on zipcode and sqft_living, with predicted prices at the terminal nodes.)
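The code that fits this tree is not shown on the slide; a minimal sketch of how it could be produced with the rpart package, assuming the same predictors as the linear models (the slide's exact call may differ):

library(rpart)
tree <- rpart(price ~ bedrooms + bathrooms + sqft_living + zipcode + waterfront,
              data = train.set)   # numeric response, so rpart fits a regression tree
plot(tree)   # draw the tree structure
text(tree)   # label the splits and the predicted prices at the leaves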

Decision Trees STAT 48 -
rmse.tree <- sqrt(mean((test.set$price - predict(tree, test.set))^2))
mad.tree <- mean(abs(test.set$price - predict(tree, test.set)))
mad.tree
## [1] 68524.9
mad.lm
## [1] 62668.7
mad.lm2
## [1] 62.7
The predictive error for this tree, $68525, is similar to that of the first linear model, $62669, and not quite as good as our second linear model, $622.

Ensemble - Random Forest STAT 48 - Ensemble methods combine a large set of predictive models into a single framework. One example is a random forest, which combines a large number of trees. While these methods are very effective in a predictive setting, it is often difficult to directly assess the impact of particular variables in the model.

Random Forest STAT 48 - One specific kind of ensemble method is known as a random forest, which combines several decision trees.
library(randomForest)
rf <- randomForest(price ~ ., data = Seattle)
mad.rf <- mean(abs(test.set$price - predict(rf, test.set)))
The prediction error for the random forest, $46588, is substantially better than that of the other models we have identified.
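Although a forest is harder to interpret than a single tree, the randomForest package does report variable importance measures; a minimal sketch (this was not part of the original slides, and fitting on the training set here is an assumption):

library(randomForest)
set.seed(427)
rf.imp <- randomForest(price ~ ., data = train.set, importance = TRUE)
importance(rf.imp)   # %IncMSE and IncNodePurity for each predictor
varImpPlot(rf.imp)   # plot both importance measures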

Exercise - Prediction for Capital Bike Share STAT 48 -
bikes <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat48/datasets/bike.csv')
set.seed(427)
num.obs <- nrow(bikes)
test.ids <- base::sample(1:num.obs, size = round(num.obs * 0.3))
test.bikes <- bikes[test.ids, ]
train.bikes <- bikes[(1:num.obs)[!(1:num.obs) %in% test.ids], ]
dim(bikes)
## [1] 10886 12
dim(test.bikes)
## [1] 3266 12
dim(train.bikes)
## [1] 7620 12

Exercise - Prediction for Capital Bike Share STAT 48 -
lm.bikes <- lm(count ~ holiday + atemp, data = train.bikes)
lm.mad <- mean(abs(test.bikes$count - predict(lm.bikes, test.bikes)))
Create another predictive model and compare the results to the MAD of the linear model above (29). However, don't use casual and registered in your model, as those two will sum to the total count.

A Solution - Prediction for Capital Bike Share STAT 48 -
rf.bikes <- randomForest(count ~ holiday + atemp + humidity + season + workingday + weather, data = train.bikes)
tree.mad <- mean(abs(test.bikes$count - predict(rf.bikes, test.bikes)))
The random forest has a prediction error of (8).


STAT 48 - Given new points (*), how do we classify them? (Scatterplot of labeled training points, with new unlabeled points marked *.)

Logistic Regression STAT 48 -
## Call:
## glm(formula = labels ~ x + y, family = "binomial", data = supervised)
##
## (Summary output: deviance residuals; coefficient estimates, standard errors,
##  z values, and p-values for the intercept, x, and y; null deviance on 99 and
##  residual deviance on 97 degrees of freedom; AIC; number of Fisher scoring
##  iterations.)

Logistic Regression STAT 48 - (Table: predicted Prob[Val = 1] from the logistic regression for each of the new (x, y) points.)
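A sketch of how the fitted model and the probabilities above could be obtained, assuming a data frame supervised with columns x, y, and labels and a set of new points pred.points (these object names are taken from the rpart code on a later slide):

glm.fit <- glm(labels ~ x + y, family = 'binomial', data = supervised)
new.pts <- as.data.frame(pred.points)
# predicted probability that the label equals 1 for each new point
new.pts$prob1 <- round(predict(glm.fit, newdata = new.pts, type = 'response'), 3)
new.pts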

Decision Trees STAT 48 - (Classification tree plot: a single split on x >= .4849.)

Decision Trees - code STAT 48 -
library(rpart)
library(knitr)
tree <- rpart(labels ~ ., data = supervised, method = 'class')
plot(tree)
text(tree)
kable(cbind(pred.points, round(predict(tree, as.data.frame(pred.points))[, 2], 3)),
      col.names = c('x', 'y', 'prob[val = 1]'))

STAT 48 - Decision Trees - Boundary (Plot of the tree's decision boundary over the (x, y) space, with the new points marked *.)
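A sketch of how a boundary plot like this might be drawn, by predicting the tree's class over a grid of (x, y) values; the unit range for x and y and the plotting choices are assumptions, not the slide's code:

grid <- expand.grid(x = seq(0, 1, length.out = 200),
                    y = seq(0, 1, length.out = 200))
grid$class <- predict(tree, grid, type = 'class')
plot(grid$x, grid$y, col = as.numeric(grid$class), pch = 15, cex = 0.3,
     xlab = 'x', ylab = 'y')                             # shaded decision regions
points(supervised$x, supervised$y, pch = 19,
       col = as.numeric(as.factor(supervised$labels)))   # training points by label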

Exercise: Predict Titanic Survival STAT 48 -
library(dplyr)
titanic <- read.csv('http://www.math.montana.edu/ahoegh/teaching/stat48/datasets/titanic.
set.seed(427)
titanic <- titanic %>% filter(!is.na(Age))
num.pass <- nrow(titanic)
test.ids <- base::sample(1:num.pass, size = round(num.pass * 0.3))
test.titanic <- titanic[test.ids, ]
train.titanic <- titanic[(1:num.pass)[!(1:num.pass) %in% test.ids], ]
dim(titanic)
## [1] 714 12
dim(test.titanic)
## [1] 214 12
dim(train.titanic)
## [1] 500 12

Exercise: Predict Titanic Survival STAT 48 - See if you can improve on the classification error from the model below.
glm.titanic <- glm(survived ~ Age, data = train.titanic, family = 'binomial')
Class.Error <- mean(test.titanic$survived != round(predict(glm.titanic, test.titanic, type = 'response')))
The logistic regression model using only age is wrong 4% of the time.
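One possible direction (a hedged sketch, not the course's official solution): add sex and passenger class as predictors; the column names Sex and Pclass are assumptions about this version of the Titanic data.

# logistic regression with additional predictors (Sex and Pclass assumed to exist)
glm.titanic2 <- glm(survived ~ Age + Sex + Pclass, data = train.titanic,
                    family = 'binomial')
Class.Error2 <- mean(test.titanic$survived !=
                       round(predict(glm.titanic2, test.titanic, type = 'response')))
Class.Error2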