Regression Analysis and Linear Regression Models
University of Trento - FBK
2 March, 2015

Relationship between numerical variables

Investigate a possible linear relationship between two numerical variables.

Pearson's correlation coefficient
- Quantifies the strength and direction of a linear relationship.
- Given two numerical variables X and Y:

  $$\rho = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N \sigma_x \sigma_y}$$

  where $\mu_x$ and $\mu_y$ are the population means of X and Y, $\sigma_x$ and $\sigma_y$ the population standard deviations, and N is the population size.
- It is a number in [-1, 1].
- The stronger the relationship, the closer $|\rho|$ is to 1.
- The sign of $\rho$ indicates the direction of the relationship.

Relationship between numerical variables

We cannot measure $\rho$ directly, because we do not have access to the whole population.

Estimation of $\rho$ from the data
- Given n pairs of observed values $(x_1, y_1), \ldots, (x_n, y_n)$,
- the estimate r of $\rho$ is:

  $$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1) s_x s_y}$$

  where $\bar{x}$ and $\bar{y}$ are the sample means and $s_x$ and $s_y$ the sample standard deviations.

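The estimate r can be checked numerically against R's built-in `cor()`. The sketch below uses simulated data (the vectors `x` and `y` and their parameters are assumptions made for illustration, not the course dataset) and computes r directly from the definition.

```r
# Simulated (hypothetical) data, used only to illustrate the formula
set.seed(1)
x <- rnorm(50, mean = 100, sd = 10)
y <- 0.5 * x + rnorm(50, sd = 5)

n <- length(x)

# Sample correlation computed from the definition:
# sum of cross-deviations divided by (n - 1) * s_x * s_y
r_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

The two printed values should agree up to floating-point error, since `sd()` uses the same n - 1 denominator as the formula.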
Relationship between numerical variables: examples with real data

Example: the bodyweight dataset
- Examine the relationship between percent body fat (response) and abdomen circumference (explanatory variable).
- The dataset can be found at http://lib.stat.cmu.edu/datasets/bodyfat

    cor(bw[, c("abdomen2", "bodyfat")])
    ##          abdomen2 bodyfat
    ## abdomen2  1.00000 0.81343
    ## bodyfat   0.81343 1.00000
    ## [1] 252

- Examine the relationship between height and percent body fat.

    cor(bw[, c("bodyfat", "height")])
    ##           bodyfat    height
    ## bodyfat  1.000000 -0.089495
    ## height  -0.089495  1.000000

Relationship between numerical variables: correlation tests

Recall: when $\rho$ is close to 0, either
- the two variables are not related, or
- they are related BUT the relationship is not linear.
Be cautious about interpreting $\rho$ close to 0 as no relationship!

Evaluate the statistical significance of $\rho$
- Hypotheses: $H_0: \rho = 0$ against $H_1: \rho \neq 0$.
- Test statistic:

  $$T = \frac{R}{\sqrt{(1 - R^2)/(n - 2)}}$$

  where R is the sample correlation coefficient and n the sample size.
- If the null hypothesis is true, T follows the t-distribution with n - 2 degrees of freedom.
- Observed statistic:

  $$t = \frac{r}{\sqrt{(1 - r^2)/(n - 2)}}$$

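As a rough check of the test statistic, the sketch below computes t and the two-sided p-value by hand and compares them with `cor.test()`. The data are simulated; `x`, `y` and the effect size 0.3 are assumptions chosen only for illustration.

```r
# Simulated (hypothetical) data
set.seed(2)
x <- rnorm(40)
y <- 0.3 * x + rnorm(40)

n <- length(x)
r <- cor(x, y)

# Observed statistic t = r / sqrt((1 - r^2) / (n - 2))
t_obs <- r / sqrt((1 - r^2) / (n - 2))

# Two-sided p-value from the t-distribution with n - 2 degrees of freedom;
# -abs() makes the expression valid whatever the sign of t_obs
p_obs <- 2 * pt(-abs(t_obs), df = n - 2)

# Reference values from the built-in test
ct <- cor.test(x, y, alternative = "two.sided")

print(c(t_manual = t_obs, t_cor.test = unname(ct$statistic)))
print(c(p_manual = p_obs, p_cor.test = ct$p.value))
```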
Example on correlation test

Example: the bodyweight dataset
- Examine the relationship between height and percent body fat.
- Compute the t-score from the sample:

    aa <- cor(bw[, c("bodyfat", "height")])
    t <- aa[1, 2] / sqrt((1 - aa[1, 2]^2) / (nrow(bw) - 2))

- Test the alternative hypothesis $H_1: \rho \neq 0$ based on a t-distribution with 252 - 2 = 250 degrees of freedom.
- Compute the p-value as $p_{obs} = 2 P(T \geq |t|) = 2 P(T \geq 1.42)$:

    ## two-sided p-value; -abs(t) keeps it valid for either sign of t
    2 * pt(-abs(t), df = nrow(bw) - 2)
    ## [1] 0.15664

- At the commonly used significance levels (0.01, 0.05, 0.1) we fail to reject the null hypothesis.
- Therefore we cannot conclude that the two variables are linearly correlated.

Example on correlation test

Example: the bodyweight dataset, testing the alternative hypothesis $H_1: \rho \neq 0$
- Examine the relationship between height and percent body fat.

    cor.test(bw$bodyfat, bw$height, alternative = "two.sided")
    ##
    ##  Pearson's product-moment correlation
    ##
    ## data:  bw$bodyfat and bw$height
    ## t = -1.4207, df = 250, p-value = 0.1566
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  -0.210738  0.034459
    ## sample estimates:
    ##       cor
    ## -0.089495

- Examine the relationship between percent body fat and abdomen circumference.

    cor.test(bw$bodyfat, bw$abdomen2, alternative = "two.sided")
    ##
    ##  Pearson's product-moment correlation
    ##
    ## data:  bw$bodyfat and bw$abdomen2
    ## t = 22.112, df = 250, p-value < 2.2e-16
    ## alternative hypothesis: true correlation is not equal to 0
    ## 95 percent confidence interval:
    ##  0.76695 0.85142
    ## sample estimates:
    ##     cor
    ## 0.81343

Linear regression models

Aim: investigate the relationships between numerical variables
- Examine linear relationships between a response variable and one or more explanatory variables.
- Test hypotheses regarding relationships between one or more explanatory variables and a response variable.
- Predict unknown values of the response variable using one or more predictors.

Notation and model
- Denote with X the set of explanatory variables.
- Denote with Y the response variable.
- Try to fit the equation: $Y = f(X) + \epsilon$
- Assuming that f(X) is linear: $Y = X\beta + \epsilon$
- We can then estimate $\beta$ by minimizing the sum of squared prediction errors:

  $$\hat{\beta} = (X^T X)^{-1} X^T y$$

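The closed-form estimate can be sketched in a few lines of R. The example below is a minimal illustration on simulated data (the predictors `x1`, `x2` and the true coefficients are assumptions); it builds the design matrix with a column of ones for the intercept and compares the result with `lm()`.

```r
# Simulated (hypothetical) data with two predictors
set.seed(3)
n  <- 30
x1 <- runif(n, 0, 10)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.7 * x2 + rnorm(n)

# Design matrix: the first column of ones estimates the intercept
X <- cbind(1, x1, x2)

# Solve the normal equations (X^T X) beta = X^T y.
# solve(A, b) avoids forming the inverse explicitly.
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# Reference fit with lm()
fit <- lm(y ~ x1 + x2)

print(cbind(normal_eq = as.numeric(beta_hat), lm = unname(coef(fit))))
```

In practice `lm()` relies on a QR decomposition rather than the normal equations, which is numerically more stable; the two approaches should agree closely on well-conditioned data such as this.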
Linear regression models: one binary explanatory variable

- X is a binary variable taking values in {0, 1}: $X = (0, 0, \ldots, 1, 1)^T$
- Y is a numerical variable.

Example: investigate the relationship between sodium chloride intake and blood pressure among elderly people
- 25 people (n = 25).
- 15 of them (0.6 of our sample) keep a low sodium chloride diet (X = 0).
- 10 of them (0.4 of our sample) keep a high sodium chloride diet (X = 1).
- We measure the systolic blood pressure (Y).
- For each individual i we have a pair of observations $(x_i, y_i)$.

Example

[Figure: dot plot of systolic blood pressure (BP) for each diet group (0 and 1); the group means are marked in red]

- For each group, compute the mean blood pressure (red point in the graph).
- The sample mean of a group provides a reasonable point estimate for a new observation from that group.
- For group X = 0: $\hat{y}_{x=0} = \mathrm{mean}(y_{x=0})$
- For group X = 1: $\hat{y}_{x=1} = \mathrm{mean}(y_{x=1})$

Example

- We can compute $\hat{y}_{x=0}$ and $\hat{y}_{x=1}$ (stored in mm, indexed by group):

    ##      0      1
    ## 133.17 139.43

- Compute the parameters of the line connecting the two points:

    a <- mm["0"]                     ## intercept: mean of group X = 0
    b <- (mm["1"] - mm["0"]) / 1     ## slope: difference of means over a unit change in X
    ## [1] 133.17
    ## [1] 6.2563

- We can draw the black line connecting the two means.

In general
- The regression line is defined as $\hat{y} = a + bx$; it captures the linear relationship between the response variable and the explanatory variable.
- The slope b is interpreted as our estimate of the expected (average) change in the response variable associated with a unit increase in the value of the explanatory variable.

Linear regression models: prediction and errors

Given the regression line:
- Define the prediction for each sample: $\hat{y}_i = a + b x_i$
- Define the residual for each sample: $e_i = y_i - \hat{y}_i$
- Thus the real value $y_i$ is: $y_i = \hat{y}_i + e_i = a + b x_i + e_i$

Linear regression models: prediction and errors

Example: with the same blood pressure data, compute the prediction for each group.

Predictions
- $x_i = 0$: $\hat{y}_i = a = 133.17$
- $x_i = 1$: $\hat{y}_i = a + b = 139.43$

Errors
- $x_4 = 0$: the true value is $y_4 = 135.08$, so the error is $e_4 = y_4 - \hat{y}_4 = 135.08 - 133.17 = 1.91$
- $x_{25} = 1$: the true value is $y_{25} = 134.84$, so the error is $e_{25} = y_{25} - \hat{y}_{25} = 134.84 - 139.43 = -4.59$

Linear regression models: measuring discrepancy

Measure the discrepancy: Residual Sum of Squares (RSS)
- Measures the distance between predicted values and true values.
- Depends on the residuals and on the sample size n.
- For the mean as predictor: $\sum_i e_i = 0$

  $$RSS = \sum_{i=1}^{n} e_i^2$$

- We decided to draw the line connecting the means of the two groups.
- We could draw almost any line between the two groups.
- The line connecting the means is the one that gives the minimum RSS; it is called the least-squares regression line.

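The claim that the line through the two group means minimizes the RSS can be illustrated numerically. The sketch below uses simulated data that only mimic the structure of the diet example (15 observations with X = 0, 10 with X = 1; the values themselves and the helper `rss()` are assumptions).

```r
# Simulated (hypothetical) data: 15 low-salt (0) and 10 high-salt (1) subjects
set.seed(4)
x <- rep(c(0, 1), times = c(15, 10))
y <- 133 + 6 * x + rnorm(25, sd = 4)

# Line through the two group means
a <- mean(y[x == 0])          # intercept: mean of group 0
b <- mean(y[x == 1]) - a      # slope: difference between group means

# Helper (hypothetical name): RSS of the line y = a + b * x
rss <- function(a, b) sum((y - (a + b * x))^2)

# The least-squares fit returned by lm() has the same coefficients
fit <- lm(y ~ x)
print(rbind(means = c(a, b), lm = unname(coef(fit))))

# Any other line has a larger RSS, e.g. after shifting or tilting the line
print(c(means = rss(a, b), shifted = rss(a + 1, b), steeper = rss(a, b + 1)))

# The residuals of the line through the means sum to zero
print(sum(y - (a + b * x)))
```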
Generalization

Generalizing to the whole population
- The linear relationship between Y and X in the entire population: $Y = \alpha + \beta X + \epsilon$
- This is defined as the linear regression model.
- $\alpha$ and $\beta$ are the regression parameters.
- $\beta$ is the regression coefficient.
- Fitting is the process of estimating the regression parameters.

Confidence interval for the regression coefficient
- Standard error:

  $$SE_b = \frac{\sqrt{RSS/(n - 2)}}{\sqrt{\sum_i (x_i - \bar{x})^2}}$$

- Confidence interval:

  $$[\, b - t_{crit} \, SE_b, \; b + t_{crit} \, SE_b \,]$$

  where $t_{crit}$ depends on the confidence level c and on the n - 2 degrees of freedom (for c = 0.95 it is approximately 1.96 when n is large).

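A possible way to compute this interval by hand is sketched below and compared with `confint()`. The data are simulated (the vectors `x`, `y` and their parameters are assumptions), and the critical value is taken from the t-distribution with n - 2 degrees of freedom rather than the large-sample value 1.96.

```r
# Simulated (hypothetical) data
set.seed(5)
n <- 25
x <- runif(n, 1, 12)
y <- 128 + 1.2 * x + rnorm(n, sd = 3)

fit <- lm(y ~ x)
b   <- unname(coef(fit)["x"])

# Standard error of the slope: sqrt(RSS / (n - 2)) / sqrt(sum((x - mean(x))^2))
RSS  <- sum(resid(fit)^2)
SE_b <- sqrt(RSS / (n - 2)) / sqrt(sum((x - mean(x))^2))

# Critical value for confidence level conf, with n - 2 degrees of freedom
conf   <- 0.95
t_crit <- qt(1 - (1 - conf) / 2, df = n - 2)

ci_manual <- c(b - t_crit * SE_b, b + t_crit * SE_b)
print(ci_manual)

# Reference interval from R
print(confint(fit, "x", level = conf))
```

With only 23 degrees of freedom the critical value is somewhat larger than 1.96, so the interval is slightly wider than the normal approximation would suggest.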
Hypothesis testing

Linear regression models can be used to test hypotheses regarding possible relationships between the response variable and the explanatory variable.
- Null hypothesis $H_0: \beta = 0$ (no linear relationship).
- Alternative hypothesis $H_1: \beta \neq 0$.
- Test statistic and observed p-value:

  $$t = \frac{b}{SE_b}, \qquad p_{obs} = 2 P(T \geq |t|)$$

Example: $SE_b = 1.593$ for $b = 6.25$

    t <- b / 1.593
    ## two-sided p-value with n - 2 degrees of freedom
    p.value <- 2 * pt(abs(t), df = nrow(saltbp) - 2, lower.tail = FALSE)

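The test can be reproduced end to end on simulated data. In the sketch below the data only imitate the structure of the diet example (the values and variable names are assumptions); the t statistic and two-sided p-value are computed by hand and compared with the coefficient table returned by `summary()`.

```r
# Simulated (hypothetical) data with a binary explanatory variable
set.seed(6)
n <- 25
x <- rep(c(0, 1), times = c(15, 10))
y <- 133 + 6 * x + rnorm(n, sd = 4)

fit <- lm(y ~ x)
b   <- unname(coef(fit)["x"])

# Standard error of the slope
SE_b <- sqrt(sum(resid(fit)^2) / (n - 2)) / sqrt(sum((x - mean(x))^2))

# Observed statistic and two-sided p-value
t_obs <- b / SE_b
p_obs <- 2 * pt(abs(t_obs), df = n - 2, lower.tail = FALSE)

# Same quantities as reported by summary()
tab <- summary(fit)$coefficients
print(c(t_manual = t_obs, t_lm = tab["x", "t value"]))
print(c(p_manual = p_obs, p_lm = tab["x", "Pr(>|t|)"]))
```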
Exercise I

- With the previous dataset saltbp, estimate the coefficients $\beta_0$ and $\beta_1$ of the least-squares regression line from the matrix X and the vector y. Recall how the X matrix is defined when $\beta_0$ has to be estimated.
- For each sample compute the prediction $\hat{y}_i$ and the error $e_i$.
- Compute also the RSS, the SE for this model and the confidence interval at the 90% confidence level.

Linear regression models: example

Use the lm function to fit the least-squares regression line.

    aa <- lm(BP ~ saltlevel, data = saltbp)
    summary(aa)
    ##
    ## Call:
    ## lm(formula = BP ~ saltlevel, data = saltbp)
    ##
    ## Residuals:
    ##    Min     1Q Median     3Q    Max
    ## -8.299 -3.563  0.687  3.211  5.591
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)   133.17       1.01  132.22  < 2e-16 ***
    ## saltlevel1      6.26       1.59    3.93  0.00067 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 3.9 on 23 degrees of freedom
    ## Multiple R-squared:  0.402, Adjusted R-squared:  0.376
    ## F-statistic: 15.4 on 1 and 23 DF,  p-value: 0.000672

Linear regression models: one numerical explanatory variable

- X is a numerical variable.
- Y is a numerical variable.

Example: investigate the relationship between sodium chloride intake and blood pressure among elderly people
- X: daily salt intake (numerical values)
- Y: blood pressure (numerical values)

Explore the data first I

Look at the scatter plot of the data.

[Figure: scatter plot of BP (y-axis) against salt (x-axis)]

Explore the data first II

[Figure: scatter plot of BP (y-axis) against salt (x-axis), continued]

Model on one numerical variable

Model definition
- Model: $\hat{y}_i = a + b x_i$
- Error: $e_i = y_i - \hat{y}_i$
- RSS: $\sum_{i=1}^{n} e_i^2$

We can estimate:
- the slope b from the correlation coefficient r:

  $$b = r \frac{s_y}{s_x}$$

  where $s_x$ and $s_y$ are the sample standard deviations;
- the intercept a:

  $$a = \bar{y} - b \bar{x}$$

  where $\bar{x}$ and $\bar{y}$ are the sample means.

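These two formulas can be verified on any dataset. The sketch below uses simulated data (the vectors `x` and `y` are assumptions made for illustration) and checks that the slope and intercept obtained from r, the standard deviations and the means coincide with the coefficients returned by `lm()`.

```r
# Simulated (hypothetical) data
set.seed(7)
x <- runif(25, 1, 12)
y <- 128 + 1.2 * x + rnorm(25, sd = 3)

# Slope from the correlation coefficient and the sample standard deviations
b <- cor(x, y) * sd(y) / sd(x)

# Intercept from the sample means
a <- mean(y) - b * mean(x)

# Least-squares fit for comparison
fit <- lm(y ~ x)
print(rbind(manual = c(a, b), lm = unname(coef(fit))))
```

Using the sample variances instead of the standard deviations in the ratio would give a different, incorrect slope.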
Example on the blood pressure data set

Compute the regression model manually:

    sy <- sd(saltbp$BP)                              ## sd of y
    sx <- sd(saltbp$salt)                            ## sd of x
    r <- cor(saltbp$BP, saltbp$salt)                 ## Correlation coefficient
    b <- r * (sy / sx)                               ## The slope
    a <- mean(saltbp$BP) - b * mean(saltbp$salt)     ## The intercept
    sy; sx; r; b; a
    ## [1] 4.9364
    ## [1] 3.4595
    ## [1] 0.8388
    ## [1] 1.1969
    ## [1] 128.62

Example on the blood pressure data set

Compute the predicted value for a sample in the dataset:

    xi <- saltbp$salt[10]     ## Extract a sample
    yi <- saltbp$BP[10]
    yhi <- a + b * xi         ## Compute the prediction for the sample
    ei <- yi - yhi            ## Compute the error
    yhi; ei
    ## [1] 133.02
    ## [1] -4.721

Compute the predictions for all samples, the RSS and the standard error of the slope:

    yhi <- a + b * saltbp$salt
    ei <- saltbp$BP - yhi
    RSS <- sum(ei^2)
    SE <- sqrt(RSS / (25 - 2)) / sqrt(sum((saltbp$salt - mean(saltbp$salt))^2))
    sqrt(RSS / (25 - 2)); SE     ## residual standard error and SE of the slope
    ## [1] 2.7454
    ## [1] 0.16199

Let R work for us!

Compute the model using least-squares regression in R:

    mymod <- lm(BP ~ salt, data = saltbp)
    summary(mymod)
    ##
    ## Call:
    ## lm(formula = BP ~ salt, data = saltbp)
    ##
    ## Residuals:
    ##    Min     1Q Median     3Q    Max
    ## -5.039 -1.675  0.366  1.882  5.344
    ##
    ## Coefficients:
    ##             Estimate Std. Error t value Pr(>|t|)
    ## (Intercept)  128.616      1.102  116.72  < 2e-16 ***
    ## salt           1.197      0.162    7.39  1.6e-07 ***
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ##
    ## Residual standard error: 2.75 on 23 degrees of freedom
    ## Multiple R-squared:  0.704, Adjusted R-squared:  0.691
    ## F-statistic: 54.6 on 1 and 23 DF,  p-value: 1.63e-07
    ##
    ## Manually Computed:
    ## Residual Std Error 2.745
    ## SE 0.162
    ## pvalue 1.63120748106763e-07

Analyze the output

    mymod$coefficients     ## Parameters of the linear model
    ## (Intercept)        salt
    ##    128.6164      1.1969

    mymod$residuals        ## Error for each sample
    ##        1        2        3        4        5        6        7        8
    ##  1.71842  1.87111 -0.59724  1.54437 -0.62158  2.61017  3.89831 -1.67547
    ##        9       10       11       12       13       14       15       16
    ## -2.27817 -4.72097 -2.12713  0.36618  2.19296  0.79778  2.28898 -3.77798
    ##       17       18       19       20       21       22       23       24
    ##  0.61068  1.88238 -5.03880 -0.23622  2.13233 -0.99761  5.34430 -1.02136
    ##       25
    ## -4.16544

    mymod$fitted.values    ## Predicted values for the response variable
    ##      1      2      3      4      5      6      7      8      9     10
    ## 130.47 129.97 134.46 133.54 130.47 134.23 131.20 131.29 131.79 133.02
    ##     11     12     13     14     15     16     17     18     19     20
    ## 131.42 135.77 135.31 134.85 134.54 139.57 137.51 142.79 136.17 141.02
    ##     21     22     23     24     25
    ## 142.00 138.17 139.68 143.66 139.01

Plots I

    plot(mymod, which = 1:2)

[Figure: "Residuals vs Fitted" plot for lm(BP ~ salt); residuals (y-axis) against fitted values (x-axis), with observations 10, 19 and 23 labelled]

Plots II

[Figure: "Normal Q-Q" plot for lm(BP ~ salt); standardized residuals (y-axis) against theoretical quantiles (x-axis), with observations 10, 19 and 23 labelled]

Histogram of the residuals

    hist(mymod$residuals, col = "grey")

[Figure: "Histogram of mymod$residuals"; frequency (y-axis) against mymod$residuals (x-axis)]

Fitted values

True values vs predicted values:

    plot(BP ~ salt, data = saltbp)                          ## observed values
    points(saltbp$salt, mymod$fitted.values, pch = 20)      ## fitted values as filled points

[Figure: scatter plot of BP (y-axis) against salt (x-axis) with the fitted values overlaid as filled points]

Goodness of fit

Definition: $R^2$
- Measures how well the regression model fits the observed data.
- It depends on the RSS, which quantifies the discrepancies between the observed data and the regression line.
- The higher the RSS, the higher the discrepancy.
- RSS (lack of fit): $RSS = \sum_{i=1}^{n} e_i^2$
- TSS (total variation in the response variable): $TSS = \sum_{i=1}^{n} (y_i - \bar{y})^2$
- $R^2$ (fraction of the total variation explained by the model):

  $$R^2 = 1 - \frac{RSS}{TSS}$$

- For a simple regression line with one explanatory variable, $R^2 = r^2$, the square of Pearson's correlation coefficient.

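The relation between RSS, TSS, $R^2$ and Pearson's r can be checked numerically. The sketch below works on simulated data (`x`, `y` and their parameters are assumptions) and compares the hand-computed value with the one reported by `summary()` and with the squared sample correlation.

```r
# Simulated (hypothetical) data
set.seed(8)
x <- runif(25, 1, 12)
y <- 128 + 1.2 * x + rnorm(25, sd = 3)

fit <- lm(y ~ x)

# Lack of fit and total variation
RSS <- sum(resid(fit)^2)
TSS <- sum((y - mean(y))^2)

# Fraction of the total variation explained by the model
R2 <- 1 - RSS / TSS

print(c(manual     = R2,
        summary_lm = summary(fit)$r.squared,
        r_squared  = cor(x, y)^2))
```

All three numbers should coincide for a simple regression with an intercept; note that it is r squared, not r itself, that equals $R^2$.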
Assumptions

Linear regression model assumptions
1. Linearity: we assume the relationship between X and Y is linear!
2. Independence: observations should be independent (random sampling).
3. Constant variance and normality: Y should be normally distributed. In general we check the normality of $\epsilon$, given the relationship between Y and $\epsilon$. In particular, $\epsilon \sim N(0, \sigma^2)$.

Exercise I

1. We want to examine the relationship between body temperature Y and heart rate X. Further, we would like to use heart rate to predict body temperature.
   1. Use the BodyTemperature.txt data set to build a simple linear regression model for body temperature using heart rate as the predictor.
   2. Interpret the estimate of the regression coefficient and examine its statistical significance.
   3. Find the 95% confidence interval for the regression coefficient.
   4. Find the value of $R^2$ and show that it is equal to the square of the sample correlation coefficient.
   5. Create simple diagnostic plots for your model and identify possible outliers.
   6. If someone's heart rate is 75, what would be your estimate of this person's body temperature?
2. We would like to predict a baby's birthweight (bwt) before she is born using her mother's weight at last menstrual period (lwt).
   1. Use the birthwt data set to build a simple linear regression model, where bwt is the response variable and lwt is the predictor.
   2. Interpret your estimate of the regression coefficient and examine its statistical significance.
   3. Find the 90% confidence interval for the regression coefficient.
   4. If the mother's weight at last menstrual period is 170 pounds, what would be your estimate for the birthweight of her baby?
3. We want to predict percent body fat using the measurement for neck circumference.
   1. Use the bodyfat data set to build a simple linear regression model for percent body fat (bodyfat), where neck circumference (neck) is the predictor. In this data set, neck is measured in centimeters.
   2. What is the expected (mean) increase in percent body fat corresponding to a one unit increase in neck circumference?
   3. Create a new variable, neck.in, whose values are neck circumference in inches. Rebuild the regression model for percent body fat using neck.in as the predictor.