Multiple Linear Regression


Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49

Agenda
How to extend beyond an SLR
Multiple Linear Regression (MLR)
Relationship Between the Response and Predictors
Model Selection: Forward Selection
Model Fit
Predictions
Application
Qualitative Predictors with More than Two Levels
Additivity, Interactions, Polynomial Regression
Other Issues
2 / 49

Multiple Linear Regression (MLR) SLR is a useful approach for predicting a response on the basis of a single predictor variable. However, in practice we often have more than one predictor. 3 / 49

Advertising data set In the Advertising data, we have examined the relationship between sales and TV advertising. We also have data for the amount of money spent advertising on the radio and in newspapers. How can we extend our analysis of the advertising data in order to accommodate these two additional predictors? 4 / 49

Run many SLRs! We could run three SLRs. Not a good idea! Why?
1. It is unclear how to make a single prediction of sales given levels of the three advertising media budgets, since each of the budgets is associated with a separate regression equation.
2. Each of the three regression equations ignores the other two media in forming estimates for the regression coefficients.
5 / 49

MLR Suppose we have p predictors. Then the MLR model takes the form

$Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \epsilon$, (1)

where $X_j$ represents the jth predictor and $\beta_j$ quantifies the association between that predictor and the response. We interpret $\beta_j$ as the average effect on $Y$ of a one-unit increase in $X_j$, holding all other predictors fixed. 6 / 49

Advertising In the advertising example, the MLR becomes

$\text{sales} = \beta_0 + \beta_1 \times \text{TV} + \beta_2 \times \text{radio} + \beta_3 \times \text{newspaper} + \epsilon$

7 / 49

Estimating the Regression Coefficients As was the case in SLR, $\beta_0, \beta_1, \ldots, \beta_p$ are unknown and must be estimated by $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_p$. Given the estimated coefficients (found by minimizing the RSS), we can make predictions using

$\hat y = \hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_p X_p$.

Solving for the least squares estimates in closed form is beyond the scope of this class, since the calculations are quite tedious and require matrix algebra. 8 / 49

Important Questions to Keep in Mind 1. Is at least one of the predictors (X) useful in predicting the response? 2. Do all the predictors help to explain Y, or is only a subset of the predictors useful? 3. How well does the model fit the data? 4. Given a set of predictor values, what response value should we predict, and how accurate is our prediction? We address these now, point by point. 9 / 49

Relationship Between the Response and Predictors In the SLR setting, to determine whether there is a relationship between the response and the predictor we simply check whether $\beta_1 = 0$. In MLR we ask: are all of the regression coefficients zero?

$H_0: \beta_1 = \beta_2 = \cdots = \beta_p = 0$ (2)
$H_a:$ at least one $\beta_j$ is non-zero. (3)

The hypothesis test is performed by computing the F-statistic:

$F = \dfrac{(TSS - RSS)/p}{RSS/(n - p - 1)}$.

Look at the p-value and then draw a conclusion.

Remark: Suppose $p > n$; then there are more coefficients $\beta_j$ to estimate than observations from which to estimate them. One simple approach to overcome this issue is forward selection. This high-dimensional setting is discussed more in Chapter 6. 10 / 49
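
As a quick check of the formula above, the overall F-statistic reported by summary() can be reproduced directly from the TSS and RSS. A minimal sketch, using the Boston housing data that appears in the application later in these slides:

# Reproduce the overall F-statistic from its definition:
# F = ((TSS - RSS)/p) / (RSS/(n - p - 1))
library(MASS)                      # the Boston data set used in the application below
lm.fit <- lm(medv ~ ., data = Boston)

n   <- nrow(Boston)
p   <- length(coef(lm.fit)) - 1    # number of predictors (intercept excluded)
rss <- sum(residuals(lm.fit)^2)
tss <- sum((Boston$medv - mean(Boston$medv))^2)

F_stat <- ((tss - rss) / p) / (rss / (n - p - 1))
F_stat                             # should match summary(lm.fit)$fstatistic[1]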

Deciding on Important Variables The first step in a multiple regression analysis is to compute the F-statistic and to examine the associated p-value. If we conclude on the basis of that p-value that at least one of the predictors is related to the response, then it is natural to wonder which are the guilty ones! The task of determining which predictors are associated with the response, in order to fit a single model involving only those predictors, is referred to as variable selection. (See Chapter 6 for a more in-depth discussion.) 11 / 49

Model selection It is infeasible to look at all possible models, since there are $2^p$ models that contain subsets of $p$ features. For example, if $p = 30$, then we must consider $2^{30}$ = 1,073,741,824 models! We will focus on forward selection since it can be used in high-dimensional settings. 12 / 49

Forward selection
1. We begin with the null model, the model that contains an intercept but no predictors.
2. We then fit p simple linear regressions and add to the null model the variable that results in the lowest RSS.
3. We then add to that model the variable that results in the lowest RSS for the new two-variable model.
4. This approach is continued until some stopping rule is satisfied.
Remark: Forward selection is a greedy approach, and might include variables early that later become redundant. 13 / 49
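
The loop below is a minimal sketch of this greedy procedure on the Boston data (the data set used in the application later in these slides). It adds one predictor at a time by lowest RSS; the fixed number of steps is just a stand-in for whatever stopping rule one prefers.

# Greedy forward selection by RSS on the Boston data (illustrative sketch).
library(MASS)                                  # Boston data set
response   <- "medv"
candidates <- setdiff(names(Boston), response)
selected   <- character(0)

for (step in 1:5) {                            # simple stopping rule: five steps
  rss <- sapply(candidates, function(v) {
    fmla <- reformulate(c(selected, v), response = response)
    sum(residuals(lm(fmla, data = Boston))^2)
  })
  best       <- names(which.min(rss))
  selected   <- c(selected, best)
  candidates <- setdiff(candidates, best)
  cat("Step", step, "added:", best, "RSS:", round(min(rss), 1), "\n")
}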

Model Fit Two of the most common numerical measures of model fit are the RSE and R 2, the fraction of variance explained. These quantities are computed and interpreted in the same fashion as for simple linear regression. Remark: It turns out that R 2 will always increase when more variables are added to the model, even if those variables are only weakly associated with the response. This is due to the fact that adding another variable to the least squares equations must allow us to fit the training data (though not necessarily the testing data) more accurately. 14 / 49
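
A quick illustration of this point, using the Boston fit from the application slides: adding a column of pure noise as a predictor still nudges R-squared up, while the adjusted R-squared (which penalizes extra variables) typically does not improve. The noise column is made up purely for this example.

# R^2 never decreases when a predictor is added, even a useless one.
library(MASS)
set.seed(1)
Boston2       <- Boston
Boston2$noise <- rnorm(nrow(Boston2))          # pure-noise "predictor" (illustrative)

fit0 <- lm(medv ~ ., data = Boston)
fit1 <- lm(medv ~ ., data = Boston2)

c(summary(fit0)$r.squared,     summary(fit1)$r.squared)      # R^2 can only go up
c(summary(fit0)$adj.r.squared, summary(fit1)$adj.r.squared)  # adjusted R^2 need not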

Predictions Predictions are simple to make. But what is the uncertainty about our predictions?
1. The inaccuracy in the coefficient estimates is related to the reducible error from Chapter 2. We can compute a confidence interval in order to see how close $\hat y$ is to $f(X)$.
2. The assumption that $f(X)$ is a linear model is almost never exactly correct, so there is an additional source of reducible error here, called model bias.
3. We can calculate prediction intervals (recall these are always wider than confidence intervals).
15 / 49
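
In R, both kinds of intervals come from predict(). A minimal sketch using the Boston fit from the application slides; the "new observation" below is just the sample medians of the predictors, chosen only for illustration:

# Confidence interval (for the mean response) vs. prediction interval
# (for an individual response) at the same predictor values.
library(MASS)
lm.fit <- lm(medv ~ ., data = Boston)
newobs <- as.data.frame(lapply(Boston[, names(Boston) != "medv"], median))

predict(lm.fit, newdata = newobs, interval = "confidence")
predict(lm.fit, newdata = newobs, interval = "prediction")   # always wider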

MLR in R In order to fit an MLR model using least squares we again use the lm() function. The syntax lm(y ~ x1 + x2 + x3) is used to fit a model with three predictors. The summary() function outputs the regression coefficients for all the predictors. 16 / 49
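
For the Advertising model written out earlier, the corresponding call would look like the sketch below. This assumes Advertising.csv from the ISL website has been downloaded to the working directory; the file location is an assumption.

# Fit sales on all three media budgets (sketch; file location is assumed).
Advertising <- read.csv("Advertising.csv", header = TRUE)
ad.fit <- lm(sales ~ TV + radio + newspaper, data = Advertising)
summary(ad.fit)   # coefficient estimates, standard errors, t-statistics, p-values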

MLR for the Boston data

library(MASS)   # the Boston housing data set lives in the MASS package
attach(Boston)
names(Boston)

##  [1] "crim"    "zn"      "indus"   "chas"    "nox"     "rm"      "age"
##  [8] "dis"     "rad"     "tax"     "ptratio" "black"   "lstat"   "medv"

dim(Boston)

## [1] 506  14

17 / 49

Histograms of Boston data

[Figure: density histograms of each Boston variable (crim, zn, indus, chas, nox, rm, age, dis, rad, tax, ptratio, black, lstat, medv); y-axis Density, x-axis Boston[, i].]

18 / 49
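
The slide does not show the code that produced this grid; a minimal sketch that generates a comparable set of density histograms (the 4-by-4 panel layout is an assumption):

# Density histogram of every column of Boston (illustrative sketch).
library(MASS)
par(mfrow = c(4, 4))                       # 14 panels fit in a 4 x 4 grid
for (i in seq_along(Boston)) {
  hist(Boston[, i], freq = FALSE,          # freq = FALSE gives a density scale
       main = names(Boston)[i], xlab = "Boston[, i]", ylab = "Density")
}
par(mfrow = c(1, 1))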

MLR for the Boston data

lm.fit <- lm(medv ~ ., data = Boston)
summary(lm.fit)

##
## Call:
## lm(formula = medv ~ ., data = Boston)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.595   -2.730   -0.518    1.777   26.199
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
## crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
## zn           4.642e-02  1.373e-02   3.382 0.000778 ***
## indus        2.056e-02  6.150e-02   0.334 0.738288
## chas         2.687e+00  8.616e-01   3.118 0.001925 **
## nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
## rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
## age          6.922e-04  1.321e-02   0.052 0.958229
## dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
## rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
## tax         -1.233e-02  3.760e-03  -3.280 0.001112 **
## ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
## black        9.312e-03  2.686e-03   3.467 0.000573 ***
## lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared:  0.7406, Adjusted R-squared:  0.7338
## F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16

19 / 49

MLR for the Boston data

We can access each part of the summary by typing ?summary.lm to see what is available.

# R-squared
summary(lm.fit)$r.sq

## [1] 0.7406427

# residual standard error (RSE)
summary(lm.fit)$sigma

## [1] 4.745298

summary(lm.fit)$residuals

##            1            2            3            4            5
## -6.003843377 -3.425562379  4.132403281  4.792963511  8.256475767
##            6            7            8            9           10
##  3.443715538 -0.101808268  7.564011571  4.976363147 -0.020262107
##           11           12           13           14           15
## -3.999496511 -2.686795681  0.793478472  0.847097189 -1.083482050
##           16           17           18           19           20
##  0.602516792  2.572490209  0.588598653  4.021988943 -0.206136033
##           21           22           23           24           25
##  1.076142473  1.928963305 -0.632881292  0.693714654 -0.078338315
##           26           27           28           29           30
##  0.513314391  1.136023454  0.091525719 -1.147372851  0.123571798
##           31           32           33           34           35
##  1.244882410 -3.559232946  4.388942638 -1.182758141 -0.206758913
##           36           37           38           39           40
## -4.914635265 -2.341937076 -2.108911425  1.784973884 -0.557625688

20 / 49

MLR for the Boston data

summary(lm.fit)$coefficients

##                  Estimate  Std. Error      t value     Pr(>|t|)
## (Intercept)  3.645949e+01 5.103458811   7.14407419 3.283438e-12
## crim        -1.080114e-01 0.032864994  -3.28651687 1.086810e-03
## zn           4.642046e-02 0.013727462   3.38157628 7.781097e-04
## indus        2.055863e-02 0.061495689   0.33431004 7.382881e-01
## chas         2.686734e+00 0.861579756   3.11838086 1.925030e-03
## nox         -1.776661e+01 3.819743707  -4.65125741 4.245644e-06
## rm           3.809865e+00 0.417925254   9.11614020 1.979441e-18
## age          6.922246e-04 0.013209782   0.05240243 9.582293e-01
## dis         -1.475567e+00 0.199454735  -7.39800360 6.013491e-13
## rad          3.060495e-01 0.066346440   4.61289977 5.070529e-06
## tax         -1.233459e-02 0.003760536  -3.28000914 1.111637e-03
## ptratio     -9.527472e-01 0.130826756  -7.28251056 1.308835e-12
## black        9.311683e-03 0.002685965   3.46679256 5.728592e-04
## lstat       -5.247584e-01 0.050715278 -10.34714580 7.776912e-23

21 / 49

MLR for the Boston data

Note that age has by far the highest p-value. How do we re-run the regression, omitting age?

lm.fit1 <- lm(medv ~ . - age, data = Boston)
summary(lm.fit1)

##
## Call:
## lm(formula = medv ~ . - age, data = Boston)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.6054  -2.7313  -0.5188   1.7601  26.2243
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)  36.436927   5.080119   7.172 2.72e-12 ***
## crim         -0.108006   0.032832  -3.290 0.001075 **
## zn            0.046334   0.013613   3.404 0.000719 ***
## indus         0.020562   0.061433   0.335 0.737989
## chas          2.689026   0.859598   3.128 0.001863 **
## nox         -17.713540   3.679308  -4.814 1.97e-06 ***
## rm            3.814394   0.408480   9.338  < 2e-16 ***
## dis          -1.478612   0.190611  -7.757 5.03e-14 ***
## rad           0.305786   0.066089   4.627 4.75e-06 ***
## tax          -0.012329   0.003755  -3.283 0.001099 **
## ptratio      -0.952211   0.130294  -7.308 1.10e-12 ***
## black         0.009321   0.002678   3.481 0.000544 ***
## lstat        -0.523852   0.047625 -10.999  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

22 / 49
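
An alternative to retyping the formula is the update() function, which refits an existing model with a modified formula; a small sketch:

# Same refit without age, expressed as an update of the original model.
lm.fit1 <- update(lm.fit, ~ . - age)
summary(lm.fit1)$coefficients["rm", ]   # e.g. inspect one coefficient of the refit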

Qualitative Predictors So far, we have only considered quantitative predictors. What if the predictors are qualitative? As an example, the Credit data set records balance (average credit card debt for individuals) for quantitative predictors: age, cards, education, rating qualitative predictors: gender, student (status), status (marital), and ethnicity. 23 / 49

Qualitative Predictors with Two Levels If the qualitative predictor (factor) has two levels, we can create an indicator or dummy variable. Suppose we are dealing with gender (male, female). Then we create a new variable

$x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ 0 & \text{otherwise} \end{cases}$

24 / 49

Qualitative Predictors with Two Levels Using our indicator variable

$x_i = \begin{cases} 1 & \text{if the } i\text{th person is female} \\ 0 & \text{otherwise,} \end{cases}$

we can then use this in our regression model:

$y_i = \beta_0 + \beta_1 x_i + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i\text{th person is female} \\ \beta_0 + \epsilon_i & \text{otherwise.} \end{cases}$ (4, 5)

With this coding, $\beta_0$ is the average credit card balance among males, $\beta_0 + \beta_1$ is the average balance among females, and $\beta_1$ is the average difference in balance between females and males. 25 / 49
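
In R, lm() builds this dummy variable automatically when a predictor is a factor. A minimal sketch, using the Credit file read in on the collinearity slide later on and assuming it has Gender and Balance columns as in the ISL Credit data:

# R creates the 0/1 dummy for a two-level factor automatically.
Credit <- read.csv("data/credit.csv", header = TRUE, sep = ",")
Credit$Gender <- factor(Credit$Gender)

gender.fit <- lm(Balance ~ Gender, data = Credit)
summary(gender.fit)          # intercept = baseline level; slope = difference
contrasts(Credit$Gender)     # shows which level was coded 1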

Qualitative Predictors with More than Two Levels When a qualitative predictor has more than two levels, a single dummy variable cannot represent all possible values. In this situation, we can create additional dummy variables. For example, for the ethnicity variable we create two dummy variables:

$x_{i1} = \begin{cases} 1 & \text{if the } i\text{th person is Asian} \\ 0 & \text{if the } i\text{th person is not Asian} \end{cases}$

and

$x_{i2} = \begin{cases} 1 & \text{if the } i\text{th person is Caucasian} \\ 0 & \text{if the } i\text{th person is not Caucasian.} \end{cases}$

26 / 49

Qualitative Predictors with More than Two Levels Then both of these variables can be used in the regression equation, in order to obtain the model

$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \epsilon_i = \begin{cases} \beta_0 + \beta_1 + \epsilon_i & \text{if the } i\text{th person is Asian} \\ \beta_0 + \beta_2 + \epsilon_i & \text{if the } i\text{th person is Caucasian} \\ \beta_0 + \epsilon_i & \text{if the } i\text{th person is African American.} \end{cases}$ (6, 7)

27 / 49

Qualitative Predictors with More than Two Levels Now β 0 can be interpreted as the average credit card balance for African Americans, β 1 can be interpreted as the difference in the average balance between the Asian and African American categories, and β 2 can be interpreted as the difference in the average balance between the Caucasian and African American categories. There will always be one fewer dummy variable than the number of levels. The level with no dummy variable (African American in this example) is known as the baseline. 28 / 49
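
In R, the baseline is simply the first level of the factor, and relevel() changes it. A small sketch continuing the Credit example above, again assuming an Ethnicity column as in the ISL Credit data:

# Dummy coding for a factor with more than two levels; the first level is the baseline.
Credit$Ethnicity <- factor(Credit$Ethnicity)
contrasts(Credit$Ethnicity)                  # one column per non-baseline level

eth.fit <- lm(Balance ~ Ethnicity, data = Credit)
coef(eth.fit)                                # differences from the baseline level

# Change the baseline, e.g. to "Asian":
Credit$Ethnicity <- relevel(Credit$Ethnicity, ref = "Asian")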

Extensions of the Linear Model We discuss removing two of the most important assumptions of the linear model: additivity and linearity. 29 / 49

Additivity The additive assumption means that the effect of changes in a predictor X j on the response Y is independent of the values of the other predictors. 30 / 49

Removing the Additive Effect Suppose that spending money on radio advertising actually increases the effectiveness of TV advertising, so that the slope term for TV should increase as radio increases. Given a fixed budget of $100,000, spending half on radio and half on TV may increase sales more than allocating the entire amount to either TV or to radio. This is referred to as an interaction effect. 31 / 49

Interaction Effect One way of extending this model to allow for interaction effects is to include a third predictor, called an interaction term, which is constructed by computing the product of $X_1$ and $X_2$. This results in

$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 + \epsilon.$ (8)

How does inclusion of this interaction term relax the additive assumption? Equation 8 can be rewritten as

$Y = \beta_0 + (\beta_1 + \beta_3 X_2) X_1 + \beta_2 X_2 + \epsilon = \beta_0 + \tilde\beta_1 X_1 + \beta_2 X_2 + \epsilon,$ (9, 10)

where $\tilde\beta_1 = \beta_1 + \beta_3 X_2$. 32 / 49

Interaction Effect Since $\tilde\beta_1$ changes with $X_2$, the effect of $X_1$ on $Y$ is no longer constant: adjusting $X_2$ will change the impact of $X_1$ on $Y$. 33 / 49
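
Fitting an interaction in R only requires the : or * formula operators. A sketch on the Advertising data (again assuming Advertising.csv is available, as in the earlier sketch):

# TV * radio expands to TV + radio + TV:radio, i.e. main effects plus interaction.
Advertising <- read.csv("Advertising.csv", header = TRUE)
int.fit <- lm(sales ~ TV * radio, data = Advertising)
summary(int.fit)$coefficients     # the TV:radio row estimates beta_3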

Linear Assumption The linear assumption means that the change in the response Y due to a one-unit change in X j is constant, regardless of the value of X j. In our previous analysis of the Advertising data, we concluded that both TV and radio seem to be associated with sales. For example, the linear model states that the average effect on sales of a one-unit increase in TV is always β 1, regardless of the amount spent on radio. However, the simple model may be incorrect. 34 / 49

Polynomial regression We present a very simple way to directly extend the linear model to accommodate non-linear relationships, using polynomial regression. 35 / 49

Polynomial regression Let us consider a motivating example from the Auto data set.

library(ISLR)   # the Auto data set
attach(Auto)
plot(horsepower, mpg)

[Figure: scatterplot of mpg versus horsepower for the Auto data.]

36 / 49

Polynomial regression ATTN: I can't get the polynomial fit onto the regression plot! abline() only draws straight lines, so it warns when handed the quadratic fit:

## Warning in abline(lm.fit.poly2, lwd = 3, col = "brown"): only using the
## first two of 3 regression coefficients

[Figure: scatterplot of mpg versus horsepower; the intended quadratic curve is not drawn.]

37 / 49
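
A hedged fix for the note above: abline() can only draw a line from an intercept and a slope, so a degree-2 fit has to be drawn with lines() over predicted values instead. A sketch (the model name lm.fit.poly2 follows the warning message; everything else is an assumption):

# Overlay a quadratic fit on the scatterplot using predicted values.
library(ISLR)
lm.fit.poly2 <- lm(mpg ~ poly(horsepower, 2), data = Auto)

hp.grid <- seq(min(Auto$horsepower), max(Auto$horsepower), length.out = 200)
preds   <- predict(lm.fit.poly2, newdata = data.frame(horsepower = hp.grid))

plot(Auto$horsepower, Auto$mpg, xlab = "horsepower", ylab = "mpg")
lines(hp.grid, preds, lwd = 3, col = "brown")   # curved fit, no warning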

Linear regression

[Figure: residual diagnostic plots for lm(mpg ~ horsepower): Residuals vs Fitted (observations 334, 323, 330 flagged) and standardized residuals vs fitted values.]

38 / 49

Polynomial regression

[Figure: residual diagnostic plots for lm(mpg ~ poly(horsepower, 2)): Residuals vs Fitted (observations 331, 153, 321 flagged) and standardized residuals vs fitted values.]

39 / 49

Potential Problems Linear regression is a simple method, but a number of problems can arise when it is applied in practice. 40 / 49

Non-linearity of the response-predictor relationships This can be checked using residual plots. That is, plot the residuals $e_i$ versus $x_i$. In the MLR setting, plot the residuals versus the predicted (fitted) values $\hat y_i$. An ideal plot will have no pattern! 41 / 49
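
Continuing with the Boston fit lm.fit from the application slides, this residual-versus-fitted plot is one line (the first built-in diagnostic panel of plot.lm is exactly this plot):

# Residuals against fitted values for the multiple regression fit.
plot(lm.fit, which = 1)                          # built-in diagnostic panel
# or, by hand:
plot(fitted(lm.fit), residuals(lm.fit),
     xlab = "Fitted values", ylab = "Residuals"); abline(h = 0, lty = 2)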

Correlation of Error Terms An important assumption of the linear model is that the error terms are uncorrelated. If there is correlation in the error terms, then the estimated standard errors will tend to underestimate the true standard errors. As a result, confidence and prediction intervals will be narrower than they should be. Why might we have correlated errors? We could have time series data, where observations that are adjacent in time tend to be correlated. 42 / 49

Non-constant Variance of Error Terms Another important assumption of the linear regression model is that the error terms have constant variance ($\sigma^2$). Often, the variance of the error terms is non-constant. One can identify non-constant error variance (heteroscedasticity) from the presence of a funnel shape in the residual plot. One easy solution is transforming the response using a concave function such as log or sqrt; then recheck the residual plot. 43 / 49
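
A minimal sketch of that fix on the Boston model: refit with a log-transformed response and re-examine the residuals.

# Refit with a concave transformation of the response, then recheck residuals.
lm.fit.log <- lm(log(medv) ~ ., data = Boston)
plot(fitted(lm.fit.log), residuals(lm.fit.log),
     xlab = "Fitted values", ylab = "Residuals")   # look for the funnel to shrink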

Outliers An outlier is a point for which y i is far from the value predicted by the model. Outliers can arise for a variety of reasons, such as incorrect recording of an observation during data collection. Residual (studentized) plots can be used to identify outliers. 44 / 49
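
In R, studentized residuals come from rstudent(); values larger than about 3 in absolute value are the usual candidates for a closer look (a rough rule of thumb, not a hard cutoff). A sketch for the Boston fit:

# Studentized residuals for the Boston fit; flag unusually large ones.
stud.res <- rstudent(lm.fit)
plot(fitted(lm.fit), stud.res,
     xlab = "Fitted values", ylab = "Studentized residuals")
which(abs(stud.res) > 3)      # indices of possible outliers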

High Leverage Points We just saw that outliers are observations for which the response $y_i$ is unusual given the predictor $x_i$. In contrast, observations with high leverage have an unusual value of $x_i$. High leverage observations tend to have a sizable impact on the estimated regression line. For this reason, it is important to identify high leverage observations. In order to quantify an observation's leverage, we compute the leverage statistic. A large value of this statistic indicates an observation with high leverage. For SLR,

$h_i = \frac{1}{n} + \frac{(x_i - \bar x)^2}{\sum_{j=1}^{n}(x_j - \bar x)^2}.$

45 / 49
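
In R, the leverage statistics for any lm fit are returned by hatvalues(); a short sketch for the Boston model:

# Leverage statistics; the average leverage is (p + 1)/n, so values well above
# that are worth inspecting.
lev <- hatvalues(lm.fit)
plot(lev, ylab = "Leverage")
which.max(lev)                 # observation with the highest leverage
mean(lev)                      # equals (p + 1)/n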

Collinearity Collinearity refers to the situation in which two or more predictor variables are closely related to one another. 46 / 49

Collinearity Let's consider the Credit data set.

Credit <- read.csv("data/credit.csv", header = TRUE, sep = ",")
pdf(file = "examples/collinear.pdf")
par(mfrow = c(1, 2))
plot(Credit$Limit, Credit$Age, xlab = "Limit", ylab = "Age")
plot(Credit$Limit, Credit$Rating, xlab = "Limit", ylab = "Rating")
dev.off()

## pdf
##   2

47 / 49

Collinearity

[Figure 1: Scatterplots of the observations from the Credit data set. Left: a plot of age versus limit. Right: a plot of rating versus limit, showing a high degree of collinearity between the two variables.]

48 / 49
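
A standard numerical check for collinearity is the variance inflation factor (VIF). The sketch below computes it from the definition VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing X_j on the other predictors, so no extra package is needed (the car package's vif() would give the same numbers). The balance regression on these three predictors is an assumed example, not taken from the slides.

# VIF for each predictor in a regression of Balance on Age, Limit and Rating.
vif_by_hand <- function(fit) {
  X <- model.matrix(fit)[, -1, drop = FALSE]          # drop the intercept column
  sapply(colnames(X), function(v) {
    r2 <- summary(lm(X[, v] ~ X[, colnames(X) != v]))$r.squared
    1 / (1 - r2)                                      # VIF_j = 1 / (1 - R_j^2)
  })
}

credit.fit <- lm(Balance ~ Age + Limit + Rating, data = Credit)
vif_by_hand(credit.fit)     # Limit and Rating should show very large VIFs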

The Marketing Plan Exercise: Now that we have walked through this, see Section 3.4 and answer these questions on your own. 49 / 49