Statistical Reasoning 2, Lecture 1: Simple Regression: An Overview, and Simple Linear Regression


Lecture 1: Simple Regression: An Overview, and Simple Linear Regression

Learning Objectives. In this set of lectures we will develop a framework for simple linear, logistic, and Cox proportional hazards regression in the first section. The remaining sections will focus on simple linear regression, a general framework for estimating the mean of a continuous outcome based on a single predictor (which may be binary, categorical, or continuous).

Section A: Simple Regression: An Overview

Learning Objectives. Re-familiarize yourself with the properties of a linear equation. Identify the group comparison(s) being made by a simple regression coefficient, regardless of the outcome variable type (continuous, binary, or time-to-event).

Link to Methods From Statistical Reasoning 1. Regression provides a general framework for the estimation and testing procedures that we covered in the first term. All methods we covered in term 1 can be done as simple regression models. Additionally, these models can be extended to allow for analyses beyond the scope of comparing outcomes across levels of a single predictor (adjustment, prediction with multiple predictors).

Link to Methods From Statistical Reasoning 1. For example: Comparing means between two or more groups (t-test, ANOVA) can be done via a simple linear regression model. Comparing proportions between two or more groups (chi-square) can be done via a simple logistic regression model. Comparing incidence rates between two or more groups (log-rank) can be done via a simple Cox proportional hazards regression model.

Basic Structure. The basic structure of these regression models will be a linear equation: β₀ + β₁x₁ (intercept + slope × x₁), where x₁ is the predictor of interest.

Basic Structure: The Left Hand Side. The left hand side depends on the variable type of the outcome of interest. For continuous outcomes, the left hand side is the mean of the outcome, y. For binary outcomes, the left hand side is the ln(odds) of the binary outcome, i.e., ln(p/(1−p)). For time-to-event outcomes, the left hand side is the ln(hazard rate).

Basic Structure: The Right Hand Side. The right hand side, β₀ + β₁x₁, includes the predictor of interest, x₁. This predictor, x₁, can be binary, categorical, or continuous.

Interpretations When x₁ is Binary. Suppose x₁ is a binary predictor, such as sex (1 = female, 0 = male): β₀ + β₁x₁.

Interpretations When x₁ is Categorical (Nominal). How to code x when the predictor of interest is nominal categorical, for example clinic site (Hopkins, U of Maryland, U of Michigan)? For handling multiple nominal categories, the approach is to designate one of the groups as the reference category, and create binary x's for each of the other groups. For example, if we make Hopkins the reference, we will need two additional variables:

Interpretations When x₁ is Categorical (Nominal). The equation will be as follows: β₀ + β₁x₁ + β₂x₂.

Interpretations When x₁ is Continuous. The beauty of regression is that it allows for continuous predictors, unlike the methods we learned in Statistical Reasoning 1. This is an efficient way to handle measurements that are made on a continuum (age, height, etc.) without having to arbitrarily categorize them (if the outcome/predictor association is well characterized by a line). For example, suppose x₁ is age in years: β₀ + β₁x₁.

The Intercept, β₀. The intercept β₀ is the value of the left hand side when x₁ = 0. It is the point on the graph where the line crosses the y (vertical) axis, at the coordinate (0, β₀).

The Slope, β₁. The slope β₁ is the change in the left hand side corresponding to a one-unit increase in x₁.

The Slope, β₁. The slope β₁ is the change in the left hand side corresponding to a one-unit increase in x₁. (Slide graph: the line rises by β₁ for each one-unit step in x₁.)

The Slope, β₁. The slope β₁ is the change in the left hand side corresponding to a one-unit increase in x₁. Another interpretation: β₁ is the difference in the left hand side for x₁ + 1 compared to x₁. This change/difference is the same across the entire line.

The Slope, β₁. The slope β₁ is the change in the left hand side corresponding to a one-unit increase in x₁. (Slide graph: successive one-unit steps in x₁ each raise the line by β₁.)

The Slope, β₁. The slope β₁ is the change in the left hand side corresponding to a one-unit increase in x₁: β₁ is the difference in y-values for x₁ + 1 compared to x₁. All information about the difference in the left hand side for two differing values of x₁ is contained in the slope! For example: two values of x₁ three units apart will have a difference in left hand side values of 3β₁.

The Slope, β₁. For example: two values of x₁ three units apart will have a difference in left hand side values of 3β₁. (Slide graph: three consecutive one-unit steps, each of height β₁.)

The Slope, β₁. For example: two values of x₁ three units apart will have a difference in left hand side values of 3β₁. (Slide graph: the three steps of height β₁ combine into a total rise of 3β₁.)
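The constant-difference property of the slope can be checked numerically. Below is a minimal sketch in Python; the line and its coefficients are made up for illustration, not taken from the lecture:

```python
# Illustrative coefficients (not from the lecture data).
b0, b1 = 2.0, 0.5

def line(x):
    """Value of the left hand side predicted by the linear equation."""
    return b0 + b1 * x

# A one-unit difference in x1 corresponds to a difference of b1...
diff_one = line(4) - line(3)

# ...and a three-unit difference corresponds to 3 * b1, anywhere on the line.
diff_three = line(10) - line(7)
```

Because the change per unit of x₁ is the same everywhere on a line, `diff_three` equals `3 * b1` regardless of which pair of x₁ values three units apart is chosen.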

Summary. Regression is a general set of methods for relating a function of an outcome variable to a predictor via a linear equation.

Section B: Simple Linear Regression With a Binary (or Nominal Categorical) Predictor

Learning Objectives. Understand that linear regression provides a framework for estimating means and mean differences. Interpret the estimated slope(s) and intercept from a simple linear regression model with a binary predictor, and with a nominal categorical predictor.

The Left Hand Side. For linear regression, the equation is relatively straightforward: the regression models the mean value of a continuous outcome (y) as a function of the predictor x₁: mean of y = β₀ + β₁x₁. As noted in the previous section, x₁ can be binary, nominal categorical, or continuous.

The Left Hand Side. As with everything else we have done thus far, we will only be able to estimate the regression equation from a sample of data. To indicate the estimates, we can write: estimated mean of y = β̂₀ + β̂₁x₁, which is frequently represented as ŷ = β̂₀ + β̂₁x₁.

The Left Hand Side. For a given value of x₁, we can estimate the mean of y via the equation ŷ = β̂₀ + β̂₁x₁. The slope compares the mean value of y for two groups who differ by one unit of x₁, and hence is interpretable as a mean difference.

Example 1: Arm Circumference and Sex. Data on anthropometric measures from a random sample of 150 Nepali children [0, 12) months old. Question: what is the relationship between average arm circumference and sex of a child? Data: Arm circumference: mean 12.4 cm, SD 1.5 cm, range 7.3 cm to 15.6 cm. Sex: 51% female.

Visualizing Arm Circumference and Sex Relationship. Boxplot display.

Visualizing Arm Circumference and Sex Relationship. Scatterplot display.

Example 1: Arm Circumference and Sex. Here, y is arm circumference, a continuous measure; x₁ is not continuous, but binary: male or female. How to handle sex as an x₁ in regression? One possibility: x₁ = 0 for male children, x₁ = 1 for female children. The equation we will estimate: ŷ = β̂₀ + β̂₁x₁.

Example 1: Arm Circumference and Sex. Notice: this equation is only estimating two values: the mean arm circumference for male children, and the mean for female children. For female children: ŷ = β̂₀ + β̂₁(1) = β̂₀ + β̂₁. For male children: ŷ = β̂₀ + β̂₁(0) = β̂₀. So β̂₁ is still a slope, estimating the mean difference in y for a one-unit difference in x₁: but the only possible one-unit difference is 1 (females) to 0 (males).

Example 1: Arm Circumference and Sex. The resulting equation: ŷ = 12.5 − 0.3x₁. β̂₁ = −0.3: the estimated mean difference in arm circumference for female children compared to male children is −0.3 cm; female children have lower arm circumference by 0.3 cm on average. β̂₀ = 12.5: the mean arm circumference for male children (the reference group) is 12.5 cm.
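Plugging the two possible values of x₁ into the fitted equation recovers the two group means. A short sketch, assuming the estimated equation reads ŷ = 12.5 − 0.3x₁ with x₁ = 1 for female children:

```python
def yhat(x1):
    """Fitted equation from the slide: estimated mean arm circumference (cm)."""
    return 12.5 - 0.3 * x1

mean_male = yhat(0)                  # reference group mean: the intercept
mean_female = yhat(1)                # intercept plus slope
mean_diff = mean_female - mean_male  # the slope: -0.3 cm
```

With a binary 0/1 predictor, the fitted line can only produce these two values, which is why the slope is exactly the estimated mean difference between the groups.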

Visualizing Arm Circumference and Sex Relationship. Scatterplot display with regression line.

Question. The coding choice for a binary predictor is completely arbitrary. For this arm circumference and sex analysis, what would the values of β̂₀ and β̂₁ be if sex were coded as 1 for males, and 0 for females?

Example 2: Length of Stay and Age at First Claim. Data on 20 hospitalizations from 2,928 members of Heritage Health. Question: what is the relationship between average length of stay and age at first claim (binary: whether age at first claim is less than 40 years)? Data: Length of stay: mean 4.3 days, SD 4.9 days, range 1 to 14 days. Age at first claim: 29% of claims were for persons less than 40 years old at first claim.

Example 2: Length of Stay and Age at First Claim. Box plot display: Length of Stay by Age at First Claim Category, Heritage Health Plan data (y-axis: Length of Stay in days; groups: ≥ 40 years and < 40 years at first claim).

Example 2: Length of Stay and Age at First Claim. The resulting equation: ŷ = 4.9 − 2.1x₁. β̂₁ = −2.1: the estimated mean difference in length of stay for persons less than 40 at first claim compared to persons over 40 is −2.1 days; the younger group has average lengths of stay 2.1 days shorter. β̂₀ = 4.9: the mean length of stay for persons over 40 at first claim (the reference group) is 4.9 days.

Categorical Predictor. Sometimes regression scenarios include predictors which are not continuous and not binary, but multi-categorical. Examples: Subject's race (White, African-American, Hispanic, Asian, Other). City of residence (Baltimore, Chicago, Tokyo, Madrid).

The Situation. How can this type of situation be handled in a regression framework? We'll explore this using an example based on the academic physician salary analysis results in: Jagsi R, et al. Gender Differences in the Salaries of Physician Researchers. Journal of the American Medical Association (2012); 307(22): 2410-2417.

Example 3: Physician Salaries. Data were collected on 800 U.S. academic physicians, including yearly salary. Additional information on each physician includes the geographical region of the United States where their job is located (West, Northeast, South, Midwest).

Example 3: Physician Salaries. Question: Do average salaries differ by geographical region and, if so, what is the magnitude of these differences?

Example 3: Physician Salaries. Could this analysis be done by a linear regression relating salaries to region? How can we handle a predictor that takes on four categories?

Example 3: Physician Salaries. APPROACH 1: Arbitrarily give each region a numerical value (x₁ = 1 for West, 2 for Midwest, 3 for South, and 4 for Northeast, for example), and fit the SLR ŷ = β̂₀ + β̂₁x₁, where ŷ is the estimated mean salary and x₁ is region as defined above.

Example 3: Physician Salaries. This is not a good idea!!! The coding is arbitrary: we could have assigned x₁ = 1 for Midwest, etc. The estimated coefficient of region will depend on the arbitrary coding. The coding assumes mean salary differences between regions are incremental: for example, the difference in average salaries between physicians in the South (x₁ = 3) and West (x₁ = 1) is forced to be twice the difference between physicians in the Midwest (x₁ = 2) and West (x₁ = 1).

Example 3: Physician Salaries. Alternative approach: designate one region as the reference region, say the West, and make binary indicators for each of the three other regions: x₁ = 1 if Midwest, 0 otherwise; x₂ = 1 if South, 0 otherwise; x₃ = 1 if Northeast, 0 otherwise.

ANOVA as a Regression Model. Here is a table showing the x values for each region:
Region      x₁  x₂  x₃
West         0   0   0
Midwest      1   0   0
South        0   1   0
Northeast    0   0   1
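The indicator coding in the table can be generated mechanically. A minimal sketch (the function name and region list are illustrative, not from the lecture):

```python
REGIONS = ["West", "Midwest", "South", "Northeast"]  # West is the reference

def indicator_code(region, reference="West"):
    """Return (x1, x2, x3): one 0/1 indicator per non-reference region."""
    others = [r for r in REGIONS if r != reference]
    return tuple(1 if region == r else 0 for r in others)
```

For example, `indicator_code("Midwest")` gives `(1, 0, 0)`, and the reference region maps to all zeros, matching the table above.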

Example 3: Physician Salaries. Fit the regression model ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + β̂₃x₃. Here, each coefficient estimates the mean salary difference between the region with the corresponding x value of 1 and the reference region (Western states). The intercept has meaning: it is the estimated mean salary for physicians from the West.

Example 3: Physician Salaries. Example: For physicians in the Midwest (x₁ = 1, x₂ = 0, x₃ = 0), the model predicts ŷ = β̂₀ + β̂₁(1) + β̂₂(0) + β̂₃(0) = β̂₀ + β̂₁. For physicians in the West (x₁ = 0, x₂ = 0, x₃ = 0), the model predicts ŷ = β̂₀ + β̂₁(0) + β̂₂(0) + β̂₃(0) = β̂₀.

Example 3: Physician Salaries. Resulting regression equation: ŷ = β̂₀ + β̂₁x₁ + β̂₂x₂ + β̂₃x₃ = 94,474 + 4,46x₁ + 35x₂ + 2,322x₃.

Summary. Simple linear regression is a method for estimating the relationship between the mean value of an outcome, y, and a predictor x₁, via a linear equation. When x₁ is binary, the slope estimate β̂₁ estimates the mean difference in y for the group with x₁ = 1 compared to the group with x₁ = 0; the intercept estimate β̂₀ is the estimated mean of y for the group with x₁ = 0. When x₁ is nominal categorical (this can also be done with ordinal), designate one category as the reference group, and make separate binary x's for all other categories.

Section C: Simple Linear Regression With a Continuous Predictor

Learning Objectives. Understand why treating a continuous predictor as continuous (as opposed to making it binary or categorical) can be beneficial. Use a scatterplot display to assess whether an outcome/predictor relationship is reasonably described by a line. Interpret the estimated slope and intercept from a simple linear regression model with a continuous x₁.

Example 1: Arm Circumference and Height. Data on anthropometric measures from a random sample of 150 Nepali children [0, 12) months old. Question: what is the relationship between average arm circumference and height? Data: Arm circumference: mean 12.4 cm, SD 1.5 cm, range 7.3 cm to 15.6 cm. Height: mean 61.6 cm, SD 6.3 cm, range 40.9 cm to 73.3 cm.

Approach 1: Arm Circumference and Height. Dichotomize height at the median, compare mean arm circumference with a t-test and 95% CI.

Approach 1: Arm Circumference and Height. Potential Advantages: We know how to do it! Gives a single summary measure (the sample mean difference) for quantifying the arm circumference/height association. Potential Disadvantages: Throws away a lot of information in the height data, which was originally measured as continuous. Only allows for a single comparison between two crudely defined height categories.

Approach 2: Arm Circumference and Height. Categorize height into 4 categories by quartile, compare mean arm circumferences with ANOVA, 95% CIs.

Approach 2: Arm Circumference and Height. Potential Advantages: We know how to do it! Uses a less crude categorization of height than the previous approach of dichotomizing. Potential Disadvantages: Still throws away a lot of information in the height data, which was originally measured as continuous. Requires multiple summary measures (6 sample mean differences, one for each unique pair of height categories) to quantify the arm circumference/height relationship. Does not exploit the structure we see in the previous boxplot: as height increases, so does arm circumference.

Approach 2: Arm Circumference and Height. Categorize height into 4 categories by quartile, compare mean arm circumferences with ANOVA, 95% CIs.

Approach 3: Arm Circumference and Height. What about treating height as continuous when estimating the arm circumference/height relationship? Linear regression is a potential option: it allows us to associate a continuous outcome with a continuous predictor via a line. The line estimates the mean value of the outcome for each continuous value of height in the sample used. This makes a lot of sense: but only if a line reasonably describes the outcome/predictor relationship.

Visualizing Arm Circumference and Height Relationship. A useful visual display for assessing the nature of the association between two continuous variables: a scatterplot.

Visualizing Arm Circumference and Height Relationship. Question 1: does a line reasonably describe the general shape of the relationship between arm circumference and height? We can estimate a line using the computer. The line we estimate will be of the form ŷ = β̂₀ + β̂₁x₁. Here ŷ is the average arm circumference for a group of children all of the same height, x₁.

Example 1: Arm Circumference and Height. Equation of the regression line relating estimated mean arm circumference (cm) to height (cm), from the computer: ŷ = 2.7 + 0.16x₁. Here ŷ = estimated average arm circumference (like what we previously would call ȳ), x₁ = height, β̂₀ = 2.7, and β̂₁ = 0.16. This is the estimated line from the sample of 150 Nepali children.

Example 1: Arm Circumference and Height. Scatterplot with regression line superimposed: ŷ = 2.7 + 0.16x₁.

Example 1: Arm Circumference and Height. Estimated mean arm circumference for children 60 cm in height: ŷ = 2.7 + 0.16x₁; for x₁ = 60 cm, ŷ = 2.7 + 0.16(60) = 12.3 cm.

Example 1: Arm Circumference and Height. Notice, most points don't fall directly on the line: we are estimating the mean arm circumference of children 60 cm tall; the observed points vary about the estimated mean. ŷ = 2.7 + 0.16x₁; for x₁ = 60 cm, ŷ = 2.7 + 0.16(60) = 12.3 cm.

Example 1: Arm Circumference and Height. How to interpret the estimated slope? ŷ = 2.7 + 0.16x₁. Here β̂₁ = 0.16. Two ways to say the same thing: β̂₁ is the average change in arm circumference for a one-unit (1 cm) increase in height; β̂₁ is the mean difference in arm circumference for two groups of children who differ by one unit (1 cm) in height, taller to shorter. This result estimates that the mean difference in arm circumferences for a one cm difference in height is 0.16 cm, with taller children having greater average arm circumference.

Example 1: Arm Circumference and Height. This mean difference estimate is constant across the entire height range in the sample: that is the definition of the slope of a line. ŷ = 2.7 + 0.16x₁.

Example 1: Arm Circumference and Height. What is the estimated mean difference in arm circumference for: Children 60 cm tall versus children 59 cm tall? Children 45 cm tall versus children 44 cm tall? Children 72 cm tall versus children 71 cm tall? Etc.? The answer is the same for all of the above: 0.16 cm.

Example 1: Arm Circumference and Height. What is the estimated mean difference in arm circumference for children 60 cm tall versus children 50 cm tall? ŷ|x₁=60 − ŷ|x₁=50 = (60 − 50)β̂₁ = 10β̂₁ = 10(0.16) = 1.6 cm.
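The same answer comes from differencing the two fitted values directly. A quick check in Python, assuming the reconstructed estimates β̂₀ = 2.7 and β̂₁ = 0.16 from this example:

```python
b0, b1 = 2.7, 0.16  # estimates from the arm circumference/height example

def yhat(height_cm):
    """Estimated mean arm circumference (cm) at a given height."""
    return b0 + b1 * height_cm

# Difference in estimated means for heights 10 cm apart equals 10 * b1.
diff_60_vs_50 = yhat(60) - yhat(50)
```

The intercept cancels in the subtraction, which is why only the slope carries information about mean differences.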

Example 1: Arm Circumference and Height. What is the estimated mean difference in arm circumference for: Children 90 cm tall versus children 89 cm tall? Children 34 cm tall versus children 33 cm tall? Children 110 cm tall versus children 109 cm tall? Etc.? This is a trick question!!!!

Example 1: Arm Circumference and Height. The range of observed heights in the sample is 40.9 cm to 73.3 cm: our regression results only apply to the relationship between arm circumference and height for this height range. ŷ = 2.7 + 0.16x₁.

Example 1: Arm Circumference and Height. How to interpret the estimated intercept? ŷ = 2.7 + 0.16x₁. Here β̂₀ = 2.7 cm. This is the estimated ŷ when x₁ = 0: the estimated mean arm circumference for children 0 cm tall. Does this make sense given our sample? As we noted before, estimates of mean arm circumference only apply to the observed height range. Frequently the intercept is not scientifically meaningful: but it is necessary to fully specify the equation of the line and to make estimates of mean arm circumference for groups of children with heights in the sample range.

Example 1: Arm Circumference and Height. Notice that x₁ = 0 is not even on this graph (the vertical axis is at x₁ = 39). ŷ = 2.7 + 0.16x₁.

Example 2: Hb and PCV. Data on laboratory measurements from a random sample of clinic patients 20-67 years old. Question: what is the relationship between hemoglobin levels (g/dL) and packed cell volume (percent of packed cells)? Data: Hemoglobin (Hb): mean 14.1 g/dL, SD 2.3 g/dL, range 9.6 g/dL to 17.1 g/dL. Packed Cell Volume (PCV): mean 41.1%, SD 8.1%, range 25% to 55%.

Visualizing Hb and PCV Relationship. Scatterplot display.

Example 2: Hb and PCV. Equation of the regression line relating estimated mean hemoglobin (g/dL) to packed cell volume, from the computer: ŷ = 5.77 + 0.20x₁. Here ŷ = estimated average hemoglobin (like what we previously would call ȳ), x₁ = PCV (%), β̂₀ = 5.77, and β̂₁ = 0.20. This is the estimated line from the sample of subjects.

Example 2: Hb and PCV. ŷ = 5.77 + 0.20x₁. β̂₁ = 0.20: what are the units? Well, ŷ is in g/dL and x₁ is in percent, so β̂₁ is in units of g/dL per percent. This result estimates that the mean difference in hemoglobin levels for two groups of subjects who differ by 1% in PCV is 0.20 g/dL: subjects with greater PCV have greater Hb levels on average.

Visualizing Hb and PCV Relationship. Scatterplot display with regression line: ŷ = 5.77 + 0.20x₁.

Example 2: Hb and PCV. What is the average difference in Hb levels for subjects with PCV of 40% compared to subjects with 32%? β̂₁ = 0.20 compares groups of subjects who differ in PCV by 1% (it is positive, so those with the greater PCV have hemoglobin levels 0.20 g/dL greater on average). To compare subjects with PCV of 40% versus subjects with 32%, which is an 8-unit difference in x₁, take 8β̂₁ = 8(0.20) = 1.6 g/dL.

Example 2: Hb and PCV. What is the estimated Hb level for subjects with PCV of 41%? Plugging 41% into the equation ŷ = 5.77 + 0.20x₁ gives ŷ = 5.77 + 0.20(41) = 13.97 g/dL. What is the interpretation of the intercept?

Example 3: Wages and Education Level. Data on hourly wages from a random sample of 534 U.S. workers in 1985. Question: what is the relationship between hourly wage (US$) and years of formal education? Data: Hourly wages: mean $9.04/hr, SD $5.13/hr, range $1.00/hr to $44.50/hr. Years of formal education: mean 13.0 years, SD 2.6 years, range 2 years to 18 years.

Visualizing Wages and Education Level Relationship. Scatterplot display.

Example 3: Wages and Education Level. Equation of the regression line relating estimated mean hourly wages (US$) to years of education, from Stata: ŷ = −0.75 + 0.75x₁. Here ŷ = estimated average hourly wage (like what we previously would call ȳ), x₁ = years of formal education, β̂₀ = −0.75, and β̂₁ = 0.75. This is the estimated line from the sample of 534 subjects.

Visualizing Wages and Education Level Relationship. Scatterplot display with regression line.

Wages and Education Level. What is the interpretation of the slope estimate? What is the interpretation of the intercept?

Summary. Simple linear regression is a method for relating the mean of an outcome y to a predictor x₁. When x₁ is a continuous variable, the estimated slope for x₁, β̂₁, has a mean difference interpretation: the mean difference in y for two groups who differ by one unit of x₁ (the change in mean y per unit change in x₁). The estimated intercept, β̂₀, is the estimated mean of y when x₁ = 0; this is often not a scientifically relevant quantity.

Section D: Simple Linear Regression Model: Estimating the Regression Equation and Accounting for Uncertainty in the Estimates

Learning Objectives. Understand that creating confidence intervals for linear regression slopes means creating confidence intervals for mean differences, and the approach is business as usual. Similarly, creating a confidence interval for an intercept is creating a confidence interval for a single population mean.

Example 1: Arm Circumference and Height. In the last section we showed the results from several simple linear regression models. For example, when relating arm circumference to height using a random sample of 150 Nepali children < 12 months old, the resulting regression equation was ŷ = 2.7 + 0.16x₁. I told you this came from a computer package: but what is the algorithm used to estimate this equation?

Example 1: Arm Circumference and Height. There must be some algorithm that will always yield the same results for the same data set.

Example 1: Arm Circumference and Height. The algorithm used to estimate the equation of the line is called least squares estimation. The idea is to find the line that gets closest to all of the points in the sample. How to define closeness to multiple points? In regression, closeness is defined as the cumulative squared distance between each point's y-value and the corresponding value of ŷ for that point's x₁: in other words, the squared distance between an observed y-value and the estimated mean y-value for all points with the same value of x₁.

Example 1: Arm Circumference and Height. Each distance is y − ŷ = y − (β̂₀ + β̂₁x₁): this is computed for each data point in the sample.

Example 1: Arm Circumference and Height. The algorithm used to estimate the equation of the line is called least squares estimation. The values chosen for β̂₀ and β̂₁ are the values that minimize the cumulative squared distances, i.e., they minimize Σᵢ₌₁ⁿ (yᵢ − (β̂₀ + β̂₁xᵢ))².
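The minimization has a well-known closed-form solution: β̂₁ = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)² and β̂₀ = ȳ − β̂₁x̄. Below is a from-scratch sketch of what the computer package does, run on a small made-up data set rather than the Nepali sample:

```python
def least_squares(xs, ys):
    """Closed-form simple linear regression estimates (b0, b1)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    sxx = sum((x - xbar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

def sse(xs, ys, b0, b1):
    """The cumulative squared distance that least squares minimizes."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

xs = [1, 2, 3, 4, 5]             # toy predictor values
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # toy outcome values
b0, b1 = least_squares(xs, ys)

# Perturbing either estimate can only increase the criterion.
assert sse(xs, ys, b0, b1) <= sse(xs, ys, b0 + 0.1, b1)
assert sse(xs, ys, b0, b1) <= sse(xs, ys, b0, b1 + 0.1)
```

The final assertions illustrate the defining property of the least squares estimates: any other pair of coefficients gives at least as large a cumulative squared distance.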

Example 1: Arm Circumference and Height. The values chosen for β̂₀ and β̂₁ are just estimates based on a single sample. If we were to have a different random sample of 150 Nepali children from the same population of children less than 12 months old, the resulting estimates would likely be different: i.e., the values that minimized the cumulative squared distance for this second sample of points would likely differ. As such, all regression coefficients have an associated standard error that can be used to make statements about the true relationship between the mean of y and x₁ (for example, the true slope β₁) based on a single sample.

Example 1: Arm Circumference and Height

For the estimated regression equation relating arm circumference to height from the random sample of 150 Nepali children less than 12 months old:

ŷ = 2.7 + 0.16x

β̂₀ = 2.70 with SE(β̂₀) = 0.88, and β̂₁ = 0.16 with SE(β̂₁) = 0.014

Example 1: Arm Circumference and Height

The random sampling behavior of estimated regression coefficients is approximately normal for large samples (n > 60), and centered at the true values. As such, we can use the same ideas as before to create 95% CIs and get p-values.

Example 1: Arm Circumference and Height

For the estimated regression equation relating arm circumference to height from the random sample of 150 Nepali children less than 12 months old:

ŷ = 2.7 + 0.16x, with β̂₁ = 0.16 and SE(β̂₁) = 0.014

95% CI for β₁: β̂₁ ± 2·SE(β̂₁) = 0.16 ± 2(0.014) ≈ (0.13, 0.19)
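Numerically, the interval is just the estimate plus or minus two standard errors; a quick check with the example's numbers:

```python
b1_hat, se = 0.16, 0.014                  # slope and SE from the example
lo, hi = b1_hat - 2 * se, b1_hat + 2 * se
# (0.132, 0.188), reported as roughly (0.13, 0.19)
```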

Example 1: Arm Circumference and Height

p-value for testing H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0. Assume the null is true, and calculate the standardized distance of β̂₁ from 0:

t = (β̂₁ − 0) / SE(β̂₁) = 0.16 / 0.014 ≈ 11.4

The p-value is the probability of being 11.4 or more standard errors away from a mean of 0 on a normal curve: very low in this example, p < .001.
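This two-sided p-value computation can be sketched with the standard library, using the example's numbers:

```python
from statistics import NormalDist

b1_hat, se = 0.16, 0.014
t = (b1_hat - 0) / se                      # standardized distance from 0
p = 2 * (1 - NormalDist().cdf(abs(t)))     # two-sided area on the normal curve
print(round(t, 1))                         # 11.4
print(p < 0.001)                           # True
```

Being 11.4 standard errors from the mean leaves essentially no area in the tails, hence the tiny p-value.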

Summarizing Findings: Arm Circumference and Height

This research used simple linear regression to estimate the magnitude of the association between arm circumference and height in Nepali children less than 12 months old, using data on a random sample of 150. A statistically significant positive association was found (p < .001). The results estimate that two groups of such children who differ by 1 cm in height will differ on average by 0.16 cm in arm circumference (95% CI: 0.13 cm to 0.19 cm).

Example 1: Arm Circumference and Height

Give an estimate and 95% CI for the mean difference in arm circumference for children 60 cm tall compared to children 50 cm tall. From the previous results, this estimated mean difference is

(60 − 50)β̂₁ = 10β̂₁ = 10(0.16) = 1.6 cm

How to get the standard error? As it turns out, SE(10β̂₁) = 10·SE(β̂₁) = 10(0.014) = 0.14 cm.

95% CI for the mean difference: 1.6 ± 2(0.14) ≈ (1.3 cm, 1.9 cm)
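The key fact is that a constant multiple of the slope has its standard error multiplied by the same constant; a numerical sketch with the example's values:

```python
b1_hat, se = 0.16, 0.014
diff = (60 - 50) * b1_hat        # estimated mean difference: 1.6 cm
diff_se = (60 - 50) * se         # SE of a constant multiple: 10 * 0.014 = 0.14
lo, hi = diff - 2 * diff_se, diff + 2 * diff_se   # roughly (1.32, 1.88) cm
```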

Example 2: Hemoglobin and Packed Cell Volume

Equation of the regression line relating estimated mean hemoglobin (g/dL) to packed cell volume:

ŷ = 5.77 + 0.20x

β̂₁ = 0.20 and SE(β̂₁) = 0.045

Example 2: Hemoglobin and Packed Cell Volume

Same idea for computing the 95% CI and p-value as we saw before. However, with small samples (n < 60), there is a slight change, analogous to what we did with means and differences in means before: the sampling distribution of regression coefficients is not quite normal, but follows a t-distribution with n − 2 degrees of freedom.

95% CI for β₁: β̂₁ ± t·SE(β̂₁) = 0.20 ± 2.09(0.046) ≈ (0.10, 0.30), where 2.09 is the cutoff from a t-distribution with n − 2 = 19 degrees of freedom.
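The small-sample adjustment only changes the multiplier; a sketch using the example's numbers, with the t cutoff taken from a table since the standard library has no t-distribution:

```python
b1_hat, se, n = 0.20, 0.046, 21
t_crit = 2.09     # t quantile with n - 2 = 19 degrees of freedom (from a t table)
lo, hi = b1_hat - t_crit * se, b1_hat + t_crit * se
# roughly (0.10, 0.30): slightly wider than +/- 2 SEs, reflecting the small sample
```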

Example 2: Hemoglobin and Packed Cell Volume

p-value for testing H₀: β₁ = 0 versus Hₐ: β₁ ≠ 0. Assume the null is true, and calculate the standardized distance of β̂₁ from 0:

t = (β̂₁ − 0) / SE(β̂₁) = 0.20 / 0.046 ≈ 4.35

The p-value is the probability of being 4.35 or more standard errors away from a mean of 0 on a t curve with 19 degrees of freedom: in this example, p < .001.

Example 2: Interpreting the Result of the 95% CI

So, the estimated slope is 0.20 with 95% CI 0.10 to 0.30. How to interpret the results? Based on a sample of 21 subjects, we estimated that PCV (%) is positively associated with hemoglobin levels. We estimated that a one-percent increase in PCV is associated with a 0.20 g/dL increase in hemoglobin on average. Accounting for sampling variability, this mean increase could be as small as 0.10 g/dL, or as large as 0.30 g/dL, in the population of all such subjects.

Example 2: Interpreting the Result of the 95% CI

In other words: we estimated the average difference in hemoglobin levels for two groups of subjects who differ by one percent in PCV to be 0.20 g/dL (higher-PCV group relative to lower). Accounting for sampling variability, this mean difference could be as small as 0.10 g/dL, or as large as 0.30 g/dL, in the population of all such subjects.

What About Intercepts?

In this section, all examples have shown confidence intervals for the slope, and for multiples of the slope. We can also create confidence intervals and p-values for the intercept in the same manner (and Stata presents them in the output). When x is a continuous predictor, the intercept is often just a placeholder and does not describe a useful quantity: as such, 95% CIs and p-values for it are not always relevant. However, when x is a binary or categorical predictor, the intercept may have a substantive interpretation, and a 95% CI may be of interest.

Example 3: Length of Stay and Age at First Claim

[Box plot: Length of Stay (days) by age at first claim category (< 40 years vs. >= 40 years), Heritage Health Plan data]

Example 3: Length of Stay and Age at First Claim

The resulting equation:

ŷ = 4.9 − 2.1x

β̂₁ = −2.1: the estimated mean difference in length of stay for persons less than 40 at first claim compared to persons 40 and over is −2.1 days; the younger group has an average length of stay 2.1 days shorter.

β̂₀ = 4.9: the estimated mean length of stay for persons 40 and over at first claim (the reference group, x = 0) is 4.9 days.
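This interpretation (intercept equals the reference-group mean, slope equals the difference in group means) is a general property of regression on a 0/1 predictor, and can be checked numerically on a hypothetical mini-dataset (not the Heritage data):

```python
# Hypothetical data: x = 1 if age < 40 at first claim, x = 0 otherwise;
# y = length of stay in days
x = [0, 0, 0, 1, 1, 1]
y = [5.0, 4.6, 5.1, 2.9, 2.7, 3.1]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
b1 = sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
     sum((a - xbar) ** 2 for a in x)
b0 = ybar - b1 * xbar

mean_ref = sum(b for a, b in zip(x, y) if a == 0) / x.count(0)  # x = 0 group
mean_alt = sum(b for a, b in zip(x, y) if a == 1) / x.count(1)  # x = 1 group
# b0 equals mean_ref, and b1 equals mean_alt - mean_ref
```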

Example 3: Length of Stay and Age at First Claim

Confidence intervals and p-values:

β̂₁ = −2.1, 95% CI (−2.3, −1.9), p < 0.001

β̂₀ = 4.9, 95% CI (4.8, 5.0)

Summary

The construction of confidence intervals for linear regression slopes and intercepts is "business as usual": take the estimate and add/subtract 2 estimated standard errors (or slightly more in smaller samples). Confidence intervals for slopes are confidence intervals for mean differences. Confidence intervals for intercepts are confidence intervals for the mean of y for a specific group (x = 0): not always relevant when x is continuous.

Section E: Measuring the Strength of a Linear Association

Strength of Association

The slope of a regression line estimates the magnitude and direction of the relationship between y and x: it encapsulates how much y differs, on average, with differences in x. The slope estimate and its standard error can be used to address the uncertainty in this estimate with regard to the true magnitude and direction of the association in the population from which the sample was taken. Slopes do not impart any information about how well the regression line fits the data in the sample; the slope gives no indication of how close the points get to the estimated regression line.

Example 1: Arm Circumference and Height

The slope depends on the units of both y and x.

Example 1: Arm Circumference and Height

For example, when height (x) is measured in cm, ŷ = 2.7 + 0.16x. How about if height were recorded in inches? Then ŷ = 2.7 + 0.41x: the slope is 2.54 times larger, because a one-inch difference in height is a 2.54 cm difference.

Strength of Association

Another quantity that can be estimated via linear regression is the coefficient of determination, R²: this is a number that ranges from 0 to 1, with larger values indicating a closer fit between the data points and the regression line. R² measures the strength of association by comparing the variability of points around the regression line to the variability in the y-values ignoring x.

Example 1: Arm Circumference and Height

How close do the points get to the line, and can we quantify this?

Example 1: Arm Circumference and Height (SR1 Flashback)

The sample standard deviation of the y-values, ignoring the corresponding potential information in x, is

$s = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$

This measures how far, on average, each of the sample y-values falls from the overall mean of all y-values. In this example, s = 1.48 cm.

Example 1: Arm Circumference and Height

Visualization on the scatterplot.

Example 1: Arm Circumference and Height

The standard deviation of the regression, referred to as the root mean squared error, is the average distance of the points from the line: how far, on average, each y falls from its mean as predicted by its corresponding x-value:

$s_{y|x} = \sqrt{\dfrac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}}$

In this example, s_{y|x} = 1.09.

Example 1: Arm Circumference and Height

Each distance is y − ŷ = y − (β̂₀ + β̂₁x): this is computed for each data point in the sample.

Example 1: Arm Circumference and Height

If s = s_{y|x}, then knowing x does not yield a better guess for the mean of y than using the overall mean ȳ (a flat regression line). The smaller s_{y|x} is relative to s, the closer the points are to the regression line. R² functionally measures how much smaller s_{y|x} is than s: as such, it is an estimate of the amount of variability in y explained by taking x into account.
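Numerically, this comparison of the two spreads can be sketched with the example's values (ignoring the small difference between the n − 1 and n − 2 denominators, so this is an approximation of the package-reported R²):

```python
s, s_yx = 1.48, 1.09            # SD of y ignoring x, and the residual SD
r_squared = 1 - (s_yx / s) ** 2  # fraction of y-variability explained by x
print(round(r_squared, 2))       # 0.46
```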

Example 1: Arm Circumference and Height

The R² from this regression of arm circumference on height is 0.46 (46%); a child's height explains (an estimated) 46% of the variation in arm circumferences.

Example 1: R² and r

r is the properly signed square root of R²; the proper sign is the same as the sign of the slope in the regression. r is called the correlation coefficient (not to be confused with the regression coefficients: great names, huh?).

Allowable values: 0 ≤ R² ≤ 1. If the relationship between y and x is positive, 0 ≤ r ≤ 1; if the relationship is negative, −1 ≤ r ≤ 0.

In this example, r = √R² = √0.46 ≈ 0.68.
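The "properly signed square root" rule can be written as a one-line helper (the function name is hypothetical, for illustration only):

```python
import math

def corr_from_r2(r2, slope):
    # r is the square root of R^2, carrying the sign of the estimated slope
    return math.copysign(math.sqrt(r2), slope)

# Arm circumference example: R^2 = 0.46 with a positive slope (0.16)
r_arm = corr_from_r2(0.46, 0.16)   # about 0.68
```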

Example 1: Arm Circumference and Height

So, from the example: a child's height explains (an estimated) 46% of the variation in arm circumferences. This is just an estimate based on the sample; a 95% CI can be computed, but it's not easy to do, and the procedure for estimating it is not so good. This means an estimated 54% of the variability in arm circumference is not explained by a child's height. Some of this unexplained variability may be explained by factors other than height. Multiple linear regression will allow us to estimate the relationship between arm circumference, height, and other child characteristics in one analysis.

Example 2: Hemoglobin and Packed Cell Volume

R² = 0.51: PCV explains (an estimated) 51% of the variation in hemoglobin levels. The corresponding correlation coefficient is r = √0.51 ≈ 0.71.

Example 3: Wages and Years of Education

R² = 0.15: years of education explains (an estimated) 15% of the variation in hourly wages. The corresponding correlation coefficient is r = √0.15 ≈ 0.39.

Example 4: Wages and Sex

R² = 0.042: sex (female = 1) explains (an estimated) 4.2% of the variation in hourly wages. The corresponding correlation coefficient is r = √0.042 ≈ 0.20 in magnitude, with the sign matching the slope's sign.

What's a Good R²?

There are a couple of important things to keep in mind about R² and r:
- These quantities are both estimates based on the sample of data; they are frequently reported without any recognition of sampling variability, for example a 95% confidence interval.
- A low R² or r is not necessarily bad: many outcomes cannot and will not be fully (or close to fully) explained, in terms of variability, by any single predictor.

What's a Good R²?

The higher the R² value, the better x predicts y for individuals in a sample/population, as individual y-values vary less about their estimated means based on x.

What's a Good R²?

However, there may be important overall associations between the mean of y and x even when there is still a lot of individual variability in y-values about their means as estimated by x. In the wages example, years of education explained an estimated 15% of the variability in hourly wages. The association was statistically significant, showing that average wages were greater for persons with more years of education. However, for any single education level (year), there is still a lot of variation in wages for individual workers.

Slope versus R²

The slope estimates the magnitude and direction of the relationship between y and x: it estimates a mean difference in y for two groups who differ by one unit in x. The slope will change if the units change for y and/or x. Larger slopes are not indicative of stronger linear association, and smaller slopes are not indicative of weaker linear association.

R² measures the strength of linear association; r measures strength and direction. Neither R² nor r measures magnitude. Neither R² nor r changes with changes in units.
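This contrast can be verified numerically on made-up data: converting x from cm to inches rescales the slope by 2.54 but leaves the correlation (and hence R²) unchanged.

```python
import math

# Made-up illustrative data
x_cm = [50.0, 55.0, 60.0, 65.0, 70.0]
y = [10.7, 11.5, 12.3, 13.6, 13.9]

def slope(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((a - xbar) * (b - ybar) for a, b in zip(x, y)) / \
           sum((a - xbar) ** 2 for a in x)

def corr(x, y):
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

x_in = [v / 2.54 for v in x_cm]   # same heights, recorded in inches
# slope(x_in, y) equals 2.54 * slope(x_cm, y), but corr(x_in, y) equals corr(x_cm, y)
```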

R² vs. r

If you have r, you can compute R². If you have R², you can almost compute r: the square root gives the magnitude, but you also need the direction of the association (the sign of the slope) to choose the sign.

r As a Quick Summary Measure

Table of correlations:

             age      weight   height   armcirc  sex
age          1.0000
weight       0.768    1.0000
height       0.8673   0.9247   1.0000
armcirc      0.464    0.8373   0.6756   1.0000
sex          0.06    -0.076   -0.0254  -0.0432   1.0000

Summary

R² measures the strength of association by comparing the variability of points around the regression line to the variability in y-values ignoring x. The correlation coefficient r is the properly signed square root of R², and hence provides information about the direction of the association estimated by the regression.