Lecture 1: Statistical Reasoning 2. Lecture 1. Simple Regression, An Overview, and Simple Linear Regression


1 Lecture 1: Simple Regression, An Overview, and Simple Linear Regression

2 Learning Objectives In this set of lectures, the first section develops a framework for simple linear, logistic, and Cox proportional hazards regression. The remaining sections focus on simple linear regression, a general framework for estimating the mean of a continuous outcome based on a single predictor (which may be binary, categorical, or continuous).

3 Section A Simple Regression: An Overview

4 Learning Objectives Re-familiarize yourself with the properties of a linear equation. Identify the group comparison(s) being made by a simple regression coefficient, regardless of the outcome variable type (continuous, binary, or time-to-event).

5 Link to Methods From Statistical Reasoning 1 Regression provides a general framework for the estimation and testing procedures that we covered in the first term. All methods we covered in term 1 can be done as simple regression models. Additionally, these models can be extended to allow for analyses beyond the scope of comparing outcomes across levels of a single predictor (adjustment, prediction with multiple predictors).

6 Link to Methods From Statistical Reasoning 1 For example: comparing means between two or more groups (t-test, ANOVA) can be done via a simple linear regression model; comparing proportions between two or more groups (chi-square) can be done via a simple logistic regression model; comparing incidence rates between two or more groups (log-rank) can be done via a simple Cox proportional hazards regression model.

7 Basic Structure The basic structure of these regression models will be a linear equation, $\beta_0 + \beta_1 x$, where $\beta_0$ is the intercept, $\beta_1$ is the slope, and $x$ is the predictor of interest.

8 Basic Structure: The Left Hand Side The left hand side depends on the variable type of the outcome of interest. For continuous outcomes, the left hand side is the mean of the outcome, $y$. For binary outcomes, the left hand side is the ln(odds) of the binary outcome, i.e. $\ln\left(\frac{p}{1-p}\right)$. For time-to-event outcomes, the left hand side is the ln(hazard rate).

9 Basic Structure: The Right Hand Side The right hand side, $\beta_0 + \beta_1 x$, includes the predictor of interest, $x$. This predictor, $x$, can be binary, categorical, or continuous.

10 Interpretations When x is Binary Suppose $x$ is a binary predictor, such as sex (1 = female, 0 = male): $\beta_0 + \beta_1 x$

11 Interpretations When x is Categorical (Nominal) How should $x$ be coded when the predictor of interest is nominal categorical, for example clinic site (Hopkins, U of Maryland, U of Michigan)? For handling multiple nominal categories, the approach is to designate one of the groups as the reference category and create binary $x$'s for each of the other groups. For example, if we make Hopkins the reference, we will need two additional variables, one binary indicator for each of the other two sites.

12 Interpretations When x is Categorical (Nominal) The equation will be as follows: $\beta_0 + \beta_1 x_1 + \beta_2 x_2$

13 Interpretations When x is Continuous The beauty of regression is that it allows for continuous predictors, unlike the methods we learned in Statistical Reasoning 1. This is an efficient way to handle measurements that are made on a continuum (age, height, etc.) without having to arbitrarily categorize them (if the outcome/predictor association is well characterized by a line). For example, suppose $x$ is age in years: $\beta_0 + \beta_1 x$

14 The Intercept, $\beta_0$ The intercept $\beta_0$ is the value of the left hand side when $x$ is 0. It is the point on the graph where the line crosses the y (vertical) axis, at the coordinate $(0, \beta_0)$.

15 The Slope, $\beta_1$ The slope $\beta_1$ is the change in the left hand side corresponding to a one-unit increase in $x$.

17 The Slope, $\beta_1$ Another interpretation: $\beta_1$ is the difference in the left hand side for $x + 1$ compared to $x$. This change/difference is the same across the entire line.

19 The Slope, $\beta_1$ All information about the difference in the left hand side for two differing values of $x$ is contained in the slope! For example, two values of $x$ three units apart will have a difference in left hand side values of $3\beta_1$, since $\beta_1 + \beta_1 + \beta_1 = 3\beta_1$.

22 Summary Regression is a general set of methods for relating a function of an outcome variable to a predictor via a linear equation.

23 Section B Simple Linear Regression With a Binary (or Nominal Categorical) Predictor

24 Learning Objectives Understand that linear regression provides a framework for estimating means and mean differences. Interpret the estimated slope(s) and intercept from a simple linear regression model with a binary predictor, and with a nominal categorical predictor.

25 The Left Hand Side For linear regression, the equation is relatively straightforward: the regression models the mean value of a continuous outcome ($y$) as a function of the predictor $x$: $\text{mean}(y) = \beta_0 + \beta_1 x$. As noted in the previous section, $x$ can be binary, nominal categorical, or continuous.

26 The Left Hand Side As with everything else we have done thus far, we will only be able to estimate the regression equation from a sample of data. To indicate the estimates, we can write $\widehat{\text{mean}}(y) = \hat\beta_0 + \hat\beta_1 x$, which is frequently represented as $\hat y = \hat\beta_0 + \hat\beta_1 x$.

27 The Left Hand Side For a given value of $x$, we can estimate the mean of $y$ via the equation $\hat y = \hat\beta_0 + \hat\beta_1 x$. The slope compares the mean value of $y$ for two groups who differ by one unit of $x$, and hence is interpretable as a mean difference.

28 Example 1: Arm Circumference and Sex Data on anthropometric measures from a random sample of 150 Nepali children [0, 12) months old. Question: what is the relationship between average arm circumference and sex of a child? Data: Arm circumference: mean 12.4 cm, SD 1.5 cm, range 7.3 cm to 15.6 cm. Sex: 51% female.

29 Visualizing Arm Circumference and Sex Relationship Boxplot display

30 Visualizing Arm Circumference and Sex Relationship Scatterplot display

31 Example 1: Arm Circumference and Sex Here, $y$ is arm circumference, a continuous measure; $x$ is not continuous, but binary (male or female). How to handle sex as an $x$ in regression? One possibility: $x = 0$ for male children, $x = 1$ for female children. The equation we will estimate: $\hat y = \hat\beta_0 + \hat\beta_1 x$

32 Example 1: Arm Circumference and Sex Notice: this equation is only estimating two values, the mean arm circumference for male children and the mean for female children. For female children: $\hat y = \hat\beta_0 + \hat\beta_1 \cdot 1 = \hat\beta_0 + \hat\beta_1$. For male children: $\hat y = \hat\beta_0 + \hat\beta_1 \cdot 0 = \hat\beta_0$. So $\hat\beta_1$ is still a slope estimating the mean difference in $y$ for a one-unit difference in $x$, but the only possible one-unit difference is 1 (females) to 0 (males).

33 Example 1: Arm Circumference and Sex The resulting equation: $\hat y = 12.5 - 0.3x$. $\hat\beta_1 = -0.3$: the estimated mean difference in arm circumference for female children compared to male children is -0.3 cm; female children have lower arm circumference by 0.3 cm on average. $\hat\beta_0 = 12.5$: the mean arm circumference for male children (reference group) is 12.5 cm.
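
The point that a regression on a 0/1 predictor only estimates two group means can be checked numerically. The sketch below is illustrative only: the eight data values and the helper name `fit_simple_ols` are hypothetical (they are not the Nepali sample), and the fit uses the standard closed-form least squares formulas rather than any particular package's output.

```python
from statistics import mean

def fit_simple_ols(x, y):
    """Return (intercept, slope) of the least squares line for y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical arm circumferences (cm); x = 1 for female, 0 for male.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [12.9, 12.1, 13.0, 12.4, 12.2, 12.6, 11.9, 12.5]

intercept, slope = fit_simple_ols(x, y)
mean_male = mean([yi for xi, yi in zip(x, y) if xi == 0])
mean_female = mean([yi for xi, yi in zip(x, y) if xi == 1])

# The intercept reproduces the reference-group (male) mean and the slope
# reproduces the female-minus-male mean difference.
print(intercept, mean_male)
print(slope, mean_female - mean_male)
```

Swapping the coding (1 for males, 0 for females) in this sketch flips the sign of the slope and moves the intercept to the female mean.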

34 Visualizing Arm Circumference and Sex Relationship Scatterplot display with regression line

35 Question The coding choice for a binary predictor is completely arbitrary. For this arm circumference and sex analysis, what would the values of $\hat\beta_0$ and $\hat\beta_1$ be if sex were coded as 1 for males and 0 for females?

36 Example 2: Length of Stay and Age of First Claim Data on 20 hospitalizations from 2,928,members of Heritage Health. Question: what is the relationship between average length of stay and age at first claim (binary: whether age at first claim is less than 40 years)? Data: Length of stay: mean 4.3 days, SD 4.9 days, range -4 days. Age at first claim: 29% of claims for persons less than 40 years at first claim.

37 Example 2: Length of Stay and Age of First Claim Box plot display: Length of Stay by Age at First Claim Category, Heritage Health Plan data (vertical axis: length of stay in days; groups: >= 40 years, < 40 years)

38 Example 2: Length of Stay and Age of First Claim The resulting equation: $\hat y = 4.9 - 2.1x$. $\hat\beta_1 = -2.1$: the estimated mean difference in length of stay for persons less than 40 at first claim compared to persons 40 or older is -2.1 days; the younger group has an average length of stay 2.1 days shorter. $\hat\beta_0 = 4.9$: the mean length of stay for persons 40 or older at first claim (reference group) is 4.9 days.

39 Categorical Predictor Sometimes, regression scenarios include predictors which are not continuous, not binary, but multi-categorical. Examples: subject's race (White, African-American, Hispanic, Asian, Other); city of residence (Baltimore, Chicago, Tokyo, Madrid).

40 The Situation How can this type of situation be handled in a regression framework? We'll explore this using an example based on the academic physician salary analysis results: Jagsi R, et al. Gender Differences in the Salaries of Physician Researchers. Journal of the American Medical Association (2012); 307(22);

41 Example 3: Physician Salaries Data were collected on 800 U.S. academic physicians, including yearly salary. Additional information on each physician includes the geographical region of the United States where their job is located (West, Northeast, South, Midwest).

42 Example 3: Physician Salaries Question: Do average salaries differ by geographical region and, if so, what is the magnitude of these differences?

43 Example 3: Physician Salaries Could this analysis be done by a linear regression relating salaries to region? How can we handle a predictor that takes on four categories?

44 Example 3: Physician Salaries APPROACH 1: Arbitrarily give each region a numerical value ($x = 1$ for West, 2 for Midwest, 3 for South, and 4 for Northeast, for example), and fit the SLR $\hat y = \hat\beta_0 + \hat\beta_1 x$, where $\hat y$ is estimated mean salary and $x$ is region as defined above.

45 Example 3: Physician Salaries This is not a good idea! The coding is arbitrary: we could have assigned $x = 1$ for Midwest, etc. The estimated coefficient of region will depend on the arbitrary coding. The coding also assumes that mean salary differences between regions are incremental. For example, the difference in average salaries between physicians in the South ($x = 3$) and West ($x = 1$) is forced to be twice the difference between physicians in the Midwest ($x = 2$) and West ($x = 1$).

46 Example 3: Physician Salaries Alternative approach: designate one region as the reference region, say the West, and make binary indicators for each of the three other regions: $x_1 = 1$ if Midwest, 0 otherwise; $x_2 = 1$ if South, 0 otherwise; $x_3 = 1$ if Northeast, 0 otherwise.

47 ANOVA as a Regression Model Here is a table showing the x values for each region:
Region      x1   x2   x3
West         0    0    0
Midwest      1    0    0
South        0    1    0
Northeast    0    0    1

48 Example 3: Physician Salaries Fit the regression model $\hat y = \hat\beta_0 + \hat\beta_1 x_1 + \hat\beta_2 x_2 + \hat\beta_3 x_3$. Here, each coefficient estimates the mean salary difference between the region that has a corresponding $x$ value of 1 and the reference region (Western states). The intercept has meaning: it is the estimated mean salary for physicians from the West.

49 Example 3: Physician Salaries Example: for physicians in the Midwest ($x_1 = 1$, $x_2 = 0$, $x_3 = 0$), the model predicts $\hat y = \hat\beta_0 + \hat\beta_1 \cdot 1 + \hat\beta_2 \cdot 0 + \hat\beta_3 \cdot 0 = \hat\beta_0 + \hat\beta_1$. For physicians in the West ($x_1 = 0$, $x_2 = 0$, $x_3 = 0$), the model predicts $\hat y = \hat\beta_0 + \hat\beta_1 \cdot 0 + \hat\beta_2 \cdot 0 + \hat\beta_3 \cdot 0 = \hat\beta_0$.
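
The reference-group coding can be sketched in a few lines. Everything numeric here is hypothetical: the salaries are made up and are not the values from the physician study, and the names `indicators` and `predict` are illustrative. The sketch relies on a standard fact about this coding, namely that with one 0/1 indicator per non-reference group the least squares estimates equal the reference-group mean (intercept) and the mean differences from the reference group (slopes), so no fitting routine is needed.

```python
from statistics import mean

REGIONS = ["West", "Midwest", "South", "Northeast"]  # first entry is the reference
REFERENCE = REGIONS[0]

def indicators(region):
    """Return (x1, x2, x3): one 0/1 indicator per non-reference region."""
    return tuple(1 if region == other else 0 for other in REGIONS[1:])

# Hypothetical yearly salaries (US$), NOT the study data.
salaries = {
    "West": [190000, 200000],
    "Midwest": [185000, 195000],
    "South": [180000, 184000],
    "Northeast": [205000, 215000],
}

# Intercept: mean for the reference region.
b0 = mean(salaries[REFERENCE])
# Slopes: mean difference of each other region versus the reference region.
slopes = [mean(salaries[r]) - b0 for r in REGIONS[1:]]

def predict(region):
    """Estimated mean salary: b0 + b1*x1 + b2*x2 + b3*x3."""
    return b0 + sum(b * x for b, x in zip(slopes, indicators(region)))

for region in REGIONS:
    print(region, indicators(region), predict(region))
```

Each prediction collapses to the intercept plus at most one slope, which is why the model reproduces the four regional means.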

50 Example 3: Physician Salaries Resulting regression equation yˆ yˆ ˆ ˆ x 0 94, 474 4,46x ˆ x 2 2 ˆ x x 2 2,322x

51 Summary Simple linear regression is a method for estimating the relationship between the mean value of an outcome, $y$, and a predictor $x$, via a linear equation. When $x$ is binary, the slope estimate $\hat\beta_1$ estimates the mean difference in $y$ for the group with $x = 1$ compared to the group with $x = 0$; the intercept estimate $\hat\beta_0$ is the estimated mean of $y$ for the group with $x = 0$. When $x$ is nominal categorical (this can also be done with ordinal), designate one category as the reference group, and make separate binary $x$'s for all other categories.

52 Section C Simple Linear Regression With a Continuous Predictor

53 Learning Objectives Understand why treating a continuous predictor as continuous (as opposed to making it binary or categorical) can be beneficial. Use a scatterplot display to assess whether an outcome/predictor relationship is reasonably described by a line. Interpret the estimated slope and intercept from a simple linear regression model with a continuous $x$.

54 Example 1: Arm Circumference and Height Data on anthropometric measures from a random sample of 150 Nepali children [0, 12) months old. Question: what is the relationship between average arm circumference and height? Data: Arm circumference: mean 12.4 cm, SD 1.5 cm, range 7.3 cm to 15.6 cm. Height: mean 61.6 cm, SD 6.3 cm, range 40.9 cm to 73.3 cm.

55 Approach 1: Arm Circumference and Height Dichotomize height at the median, compare mean arm circumference with a t-test and 95% CI.

56 Approach 1: Arm Circumference and Height Potential advantages: we know how to do it! It gives a single summary measure (sample mean difference) for quantifying the arm circumference/height association. Potential disadvantages: it throws away a lot of information in the height data, which was originally measured as continuous, and it only allows for a single comparison between two crudely defined height categories.

57 Approach 2: Arm Circumference and Height Categorize height into 4 categories by quartile, compare mean arm circumference with ANOVA and 95% CIs.

58 Approach 2: Arm Circumference and Height Potential advantages: we know how to do it! It uses a less crude categorization of height than the previous approach of dichotomizing. Potential disadvantages: it still throws away a lot of information in the height data, which was originally measured as continuous. It requires multiple summary measures (6 sample mean differences, one for each unique pair of height categories) to quantify the arm circumference/height relationship. It does not exploit the structure we see in the previous boxplot: as height increases, so does arm circumference.

60 Approach 3: Arm Circumference and Height What about treating height as continuous when estimating the arm circumference/height relationship? Linear regression is a potential option: it allows us to associate a continuous outcome with a continuous predictor via a line. The line estimates the mean value of the outcome for each continuous value of height in the sample used. This makes a lot of sense, but only if a line reasonably describes the outcome/predictor relationship.

61 Visualizing Arm Circumference and Height Relationship A useful visual display for assessing the nature of the association between two continuous variables: a scatterplot.

62 Visualizing Arm Circumference and Height Relationship Question: does a line reasonably describe the general shape of the relationship between arm circumference and height? We can estimate a line using the computer. The line we estimate will be of the form $\hat y = \hat\beta_0 + \hat\beta_1 x$. Here, $\hat y$ is the average arm circumference for a group of children all of the same height, $x$.

63 Example 1: Arm Circumference and Height Equation of the regression line relating estimated mean arm circumference (cm) to height (cm), from the computer: $\hat y = 2.7 + 0.16x$. Here, $\hat y$ is the estimated average arm circumference (like what we previously would call $\bar y$), $x$ = height, $\hat\beta_0 = 2.7$ and $\hat\beta_1 = 0.16$. This is the estimated line from the sample of 150 Nepali children.

64 Example 1: Arm Circumference and Height Scatterplot with regression line superimposed: $\hat y = 2.7 + 0.16x$

65 Example 1: Arm Circumference and Height Estimated mean arm circumference for children 60 cm in height: for $x = 60$ cm, $\hat y = 2.7 + 0.16(60) = 12.3$ cm.

66 Example 1: Arm Circumference and Height Notice that most points don't fall directly on the line: we are estimating the mean arm circumference of children 60 cm tall ($\hat y = 12.3$ cm for $x = 60$ cm), and observed points vary about the estimated mean.

67 Example 1: Arm Circumference and Height How to interpret the estimated slope? $\hat y = 2.7 + 0.16x$. Here, $\hat\beta_1 = 0.16$. Two ways to say the same thing: $\hat\beta_1$ is the average change in arm circumference for a one-unit (1 cm) increase in height; $\hat\beta_1$ is the mean difference in arm circumference for two groups of children who differ by one unit (1 cm) in height, taller to shorter. This result estimates that the mean difference in arm circumference for a 1 cm difference in height is 0.16 cm, with taller children having greater average arm circumference.

68 Example 1: Arm Circumference and Height This mean difference estimate is constant across the entire height range in the sample: this is the definition of the slope of a line.

69 Example 1: Arm Circumference and Height What is the estimated mean difference in arm circumference for: children 60 cm tall versus children 59 cm tall? Children 45 cm tall versus children 44 cm tall? Children 72 cm tall versus children 71 cm tall? Etc.? The answer is the same for all of the above: 0.16 cm.

70 Example 1: Arm Circumference and Height What is the estimated mean difference in arm circumference for children 60 cm tall versus children 50 cm tall? $\hat y_{x=60} - \hat y_{x=50} = 10\hat\beta_1 = 10 \times 0.16$ cm $= 1.6$ cm.

71 Example 1: Arm Circumference and Height What is the estimated mean difference in arm circumference for: children 90 cm tall versus children 89 cm tall? Children 34 cm tall versus children 33 cm tall? Children 110 cm tall versus children 109 cm tall? Etc.? This is a trick question!

72 Example 1: Arm Circumference and Height The range of observed heights in the sample is 40.9 cm to 73.3 cm: our regression results only apply to the relationship between arm circumference and height for this height range.
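
The predictions, slope-based differences, and the warning about extrapolation can be combined in a small sketch. It assumes the fitted line is $\hat y = 2.7 + 0.16x$ (the slope 0.16 is read from the slides' 1.6 cm difference per 10 cm of height) and uses the observed height range of 40.9 cm to 73.3 cm; the function names are hypothetical.

```python
B0_HAT = 2.7    # estimated intercept (cm)
B1_HAT = 0.16   # estimated slope (cm of arm circumference per cm of height); assumed reading
HEIGHT_RANGE = (40.9, 73.3)  # observed heights in the sample (cm)

def check_in_range(height_cm):
    """Refuse heights outside the observed range instead of extrapolating."""
    low, high = HEIGHT_RANGE
    if not (low <= height_cm <= high):
        raise ValueError(f"{height_cm} cm is outside the observed range {low} to {high} cm")

def estimated_mean_arm_circ(height_cm):
    """Estimated mean arm circumference (cm) for children of a given height."""
    check_in_range(height_cm)
    return B0_HAT + B1_HAT * height_cm

def estimated_mean_difference(taller_cm, shorter_cm):
    """Estimated mean difference: the slope times the difference in height."""
    check_in_range(taller_cm)
    check_in_range(shorter_cm)
    return B1_HAT * (taller_cm - shorter_cm)

print(round(estimated_mean_arm_circ(60), 2))         # children 60 cm tall
print(round(estimated_mean_difference(60, 50), 2))   # 60 cm versus 50 cm
```

Raising an error for out-of-range heights is one possible design choice; it makes the "trick question" comparisons (for example 90 cm versus 89 cm) fail loudly rather than return a number the data cannot support.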

73 Example 1: Arm Circumference and Height How to interpret the estimated intercept? $\hat y = 2.7 + 0.16x$. Here, $\hat\beta_0 = 2.7$ cm. This is the estimated $\hat y$ when $x = 0$: the estimated mean arm circumference for children 0 cm tall. Does this make sense given our sample? As we noted before, estimates of mean arm circumference only apply to the observed height range. Frequently, the intercept is scientifically meaningless, but it is necessary to fully specify the equation of the line and to make estimates of mean arm circumference for groups of children with heights in the sample range.

74 Example 1: Arm Circumference and Height Notice that $x = 0$ is not even on this graph (the vertical axis is at $x = 39$).

76 Example 2: Hb and PCV Data on laboratory measurements on a random sample of 21 clinic patients. Question: what is the relationship between hemoglobin levels (g/dl) and packed cell volume (percent of packed cells)? Data: Hemoglobin (Hb): mean 14.1 g/dl, SD 2.3 g/dl, range 9.6 g/dl to 17.1 g/dl. Packed Cell Volume (PCV): mean 41.1%, SD 8.1%, range 25% to 55%.

77 Visualizing Hb and PCV Relationship Scatterplot display

78 Example 2: Hb and PCV Equation of the regression line relating estimated mean hemoglobin (g/dl) to packed cell volume, from the computer: $\hat y = \hat\beta_0 + \hat\beta_1 x$. Here, $\hat y$ is the estimated average hemoglobin (like what we previously would call $\bar y$), $x$ = PCV (%), and the estimated slope is $\hat\beta_1 = 0.20$. This is the estimated line from the sample of 21 subjects.

79 Example 2: Hb and PCV $\hat\beta_1 = 0.20$: what are the units? Well, $\hat y$ is in g/dl and $x$ is in percent, so $\hat\beta_1$ is in units of g/dl per percent. This result estimates that the mean difference in hemoglobin levels for two groups of subjects who differ by 1% in PCV is 0.20 g/dl: subjects with greater PCV have greater Hb levels on average.

80 Visualizing Hb and PCV Relationship Scatterplot display with regression line

81 Example 2: Hb and PCV What is the average difference in Hb levels for subjects with PCV of 40% compared to subjects with 32%? $\hat\beta_1 = 0.20$ compares groups of subjects who differ in PCV by 1% (it is positive, so those with the greater PCV have hemoglobin levels 0.20 g/dl greater on average). To compare subjects with PCV of 40% versus subjects with 32%, which is an 8-unit difference in $x$, take $8\hat\beta_1 = 8 \times 0.20 = 1.6$ g/dl.

82 Example 2: Hb and PCV What is the estimated Hb level for subjects with PCV of 41%? Plug $x = 41$ into the equation $\hat y = \hat\beta_0 + \hat\beta_1 x$. What is the interpretation of the intercept?

83 Example 3: Wages and Education Level Data on hourly wages from a random sample of 534 U.S. workers in 1985. Question: what is the relationship between hourly wage (US$) and years of formal education? Data: Hourly wages: mean $9.04/hr, SD $5.3/hr, range $1.00/hr to $44.50/hr. Years of formal education: mean 13.0 years, SD 2.6 years, range 2 years to 18 years.

84 Visualizing Wages and Education Level Relationship Scatterplot display

85 Example 3: Wages and Education Level Equation of the regression line relating estimated mean hourly wages (US$) to years of education, from Stata: $\hat y = \hat\beta_0 + \hat\beta_1 x$. Here, $\hat y$ is the estimated average hourly wage (like what we previously would call $\bar y$), $x$ = years of formal education, and the estimated slope is $\hat\beta_1 = 0.75$. This is the estimated line from the sample of 534 subjects.

86 Visualizing Wages and Education Level Relationship Scatterplot display with regression line

87 Wages and Education Level What is the interpretation of the slope estimate? What is the interpretation of the intercept?

88 Summary Simple linear regression is a method for relating the mean of an outcome $y$ to a predictor $x$. When $x$ is a continuous variable, the estimated slope for $x$, $\hat\beta_1$, has a mean difference interpretation: the mean difference in $y$ for two groups who differ by one unit of $x$ (the change in mean $y$ per unit change in $x$). The estimated intercept, $\hat\beta_0$, is the estimated mean of $y$ when $x = 0$; this is often not a scientifically relevant quantity.

89 Section D Simple Linear Regression Model: Estimating the Regression Equation, Accounting for Uncertainty in the Estimates

90 Learning Objectives Creating confidence intervals for linear regression slopes means creating confidence intervals for mean differences, and the approach is business as usual. Similarly, creating a confidence interval for an intercept is creating a confidence interval for a single population mean.

91 Example 1: Arm Circumference and Height In the last section, we showed the results from several simple linear regression models. For example, when relating arm circumference to height using a random sample of 150 Nepali children < 12 months old, the resulting regression equation was $\hat y = 2.7 + 0.16x$. I told you this came from a computer package, but what is the algorithm to estimate this equation?

92 Example 1: Arm Circumference and Height There must be some algorithm that will always yield the same results for the same data set.

93 Example 1: Arm Circumference and Height The algorithm to estimate the equation of the line is called least squares estimation. The idea is to find the line that gets closest to all of the points in the sample. How to define closeness to multiple points? In regression, closeness is defined as the cumulative squared distance between each point's y-value and the corresponding value of $\hat y$ for that point's $x$: in other words, the squared distance between an observed y-value and the estimated mean y-value for all points with the same value of $x$.

94 Example 1: Arm Circumference and Height Each distance is $y - \hat y = y - (\hat\beta_0 + \hat\beta_1 x)$: this is computed for each data point in the sample.

95 Example 1: Arm Circumference and Height The values chosen for $\hat\beta_0$ and $\hat\beta_1$ are the values that minimize the cumulative squared distances, i.e. $\min \sum_{i=1}^{n} \left[ y_i - (\hat\beta_0 + \hat\beta_1 x_i) \right]^2$
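
A minimal sketch of least squares estimation for one predictor follows. The six data points are hypothetical (not the Nepali sample) and the function names are illustrative; the closed-form expressions for the slope and intercept are the standard solution of the minimization above, not necessarily the numerical routine a given statistics package runs. The loop at the end nudges the estimates to show that the cumulative squared distance only goes up.

```python
def sum_squared_distances(b0, b1, xs, ys):
    """Cumulative squared vertical distance between each y and the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Closed-form intercept and slope minimizing the cumulative squared distance."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical heights (cm) and arm circumferences (cm).
heights = [45.0, 52.0, 58.0, 61.0, 66.0, 72.0]
arm_circ = [9.8, 11.2, 11.9, 12.6, 13.1, 14.3]

b0_hat, b1_hat = least_squares(heights, arm_circ)
best = sum_squared_distances(b0_hat, b1_hat, heights, arm_circ)

# Any other line, e.g. one with a shifted intercept or slope, is farther away.
perturbations = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.01), (0.0, -0.01)]
for d0, d1 in perturbations:
    worse = sum_squared_distances(b0_hat + d0, b1_hat + d1, heights, arm_circ)
    print(f"shift ({d0:+}, {d1:+}): {worse:.4f} vs minimum {best:.4f}")
```

Because the same formulas are applied to the same data, the sketch always returns the same line, which is the property the preceding slides ask of the algorithm.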

96 Example 1: Arm Circumference and Height The values chosen for $\hat\beta_0$ and $\hat\beta_1$ are just estimates based on a single sample. If we were to have a different random sample of 150 Nepali children from the same population of < 12 month olds, the resulting estimates would likely be different: i.e. the values that minimized the cumulative squared distance for this second sample of points would likely be different. As such, all regression coefficients have an associated standard error that can be used to make statements about the true relationship between mean $y$ and $x$ (for example, the true slope $\beta_1$) based on a single sample.

97 Example 1: Arm Circumference and Height For the estimated regression equation relating arm circumference to height using a random sample of 150 Nepali children < 12 months old, $\hat y = 2.7 + 0.16x$, each estimate has an estimated standard error: $\widehat{SE}(\hat\beta_0)$ for the intercept $\hat\beta_0 = 2.7$, and $\widehat{SE}(\hat\beta_1)$ for the slope $\hat\beta_1 = 0.16$.

98 Example 1: Arm Circumference and Height The random sampling behavior of estimated regression coefficients is approximately normal for large samples (n > 60), and centered at the true values. As such, we can use the same ideas to create 95% CIs and get p-values.

99 Example 1: Arm Circumference and Height For the estimated regression equation relating arm circumference to height using a random sample of 150 Nepali children < 12 months old, $\hat y = 2.7 + 0.16x$, we have $\hat\beta_1 = 0.16$ and $\widehat{SE}(\hat\beta_1) = 0.014$. 95% CI for $\beta_1$: $\hat\beta_1 \pm 2\,\widehat{SE}(\hat\beta_1) = 0.16 \pm 2(0.014)$, i.e. (0.13, 0.19).

100 Example 1: Arm Circumference and Height p-value for testing $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$. Assume the null is true, and calculate the standardized distance of $\hat\beta_1$ from 0: $t = \hat\beta_1 / \widehat{SE}(\hat\beta_1) = 0.16 / 0.014 = 11.4$. The p-value is the probability of being 11.4 or more standard errors away from the mean of 0 on a normal curve: very low in this example, p < .001.
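
The large-sample interval and test can be reproduced with a few lines of arithmetic. The sketch takes the slope estimate as 0.16 and its standard error as 0.014 (assumed readings of the slide values, chosen because they reproduce the reported standardized distance of about 11.4), and uses the normal curve for the two-sided p-value via `math.erfc`.

```python
import math

beta1_hat = 0.16   # estimated slope (cm per cm); assumed reading of the slide
se_beta1 = 0.014   # estimated standard error of the slope; assumed reading

# 95% CI: estimate plus/minus 2 estimated standard errors.
ci_low = beta1_hat - 2 * se_beta1
ci_high = beta1_hat + 2 * se_beta1

# Standardized distance of the estimate from the null value of 0.
t_stat = beta1_hat / se_beta1

# Two-sided normal p-value: P(|Z| >= |t|) = erfc(|t| / sqrt(2)).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"95% CI: ({ci_low:.2f}, {ci_high:.2f})")
print(f"t = {t_stat:.1f}, p < .001: {p_value < 0.001}")
```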

101 Summarizing findings: Arm Circumference and Height This research used simple linear regression to estimate the magnitude of the association between arm circumference and height in Nepali children less than 12 months old, using data on a random sample of 150. A statistically significant positive association was found (p < .001). The results estimate that two groups of such children who differ by 1 cm in height will differ on average by 0.16 cm in arm circumference (95% CI 0.13 cm to 0.19 cm).

102 Example 1: Arm Circumference and Height Give an estimate and 95% CI for the mean difference in arm circumference for children 60 cm tall compared to children 50 cm tall. From the previous section we know this estimated mean difference is $10\hat\beta_1 = 10(0.16) = 1.6$ cm. How to get the standard error? As it turns out, $\widehat{SE}(10\hat\beta_1) = 10\,\widehat{SE}(\hat\beta_1) = 10(0.014) = 0.14$ cm. 95% CI for the mean difference: $1.6 \pm 2(0.14)$, i.e. 1.32 cm to 1.88 cm.

103 Example 2: Hemoglobin and Packed Cell Volume Equation of the regression line relating estimated mean hemoglobin (g/dl) to packed cell volume: $\hat y = \hat\beta_0 + \hat\beta_1 x$, with $\hat\beta_1 = 0.20$ and $\widehat{SE}(\hat\beta_1) = 0.046$.

104 Example 2: Hemoglobin and Packed Cell Volume Same idea with the computation of the 95% CI and p-value as we saw before. However, with small (n < 60) samples, a slight change is needed, analogous to what we did with means and differences in means before. The sampling distribution of regression coefficients is not quite normal, but follows a t-distribution with n - 2 degrees of freedom. 95% CI for $\beta_1$: $\hat\beta_1 \pm t_{.95,\,n-2}\,\widehat{SE}(\hat\beta_1)$; here, $\hat\beta_1 \pm t_{.95,\,19}\,\widehat{SE}(\hat\beta_1)$, i.e. (0.10, 0.30).
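
The small-sample interval differs from the large-sample one only in the multiplier. The sketch below assumes a slope estimate of 0.20 with estimated standard error 0.046 and 19 degrees of freedom, and hard-codes the multiplier 2.093, the standard t-table value leaving 95% of the t distribution with 19 degrees of freedom between its negative and positive value (written $t_{.95,19}$ in the slides). A package function could supply the same constant, but none is assumed here.

```python
beta1_hat = 0.20     # estimated slope (g/dl per percent PCV); assumed reading
se_beta1 = 0.046     # estimated standard error of the slope; assumed reading
T_CRIT_19_DF = 2.093 # t-table multiplier for a 95% CI with 19 degrees of freedom

# Small-sample 95% CI: estimate plus/minus t multiplier times standard error.
t_low = beta1_hat - T_CRIT_19_DF * se_beta1
t_high = beta1_hat + T_CRIT_19_DF * se_beta1

# For comparison, the large-sample rule of plus/minus 2 standard errors.
z_low = beta1_hat - 2 * se_beta1
z_high = beta1_hat + 2 * se_beta1

print(f"t-based 95% CI: ({t_low:.2f}, {t_high:.2f})")
print(f"2-SE interval:  ({z_low:.3f}, {z_high:.3f})")
```

The t-based interval is slightly wider, reflecting the extra uncertainty in a small sample.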

105 Example 2: Hemoglobin and Packed Cell Volume p-value for testing $H_0: \beta_1 = 0$ versus $H_A: \beta_1 \neq 0$. Assume the null is true, and calculate the standardized distance of $\hat\beta_1$ from 0: $t = \hat\beta_1 / \widehat{SE}(\hat\beta_1) = 0.20 / 0.046 = 4.35$. The p-value is the probability of being 4.35 or more standard errors away from the mean of 0 on a t curve with 19 degrees of freedom: in this example, p < .001.

106 Example 2: Interpreting Result of 95% CI So, the estimated slope is 0.20 with 95% CI 0.10 to 0.30. How to interpret the results? Based on a sample of 21 subjects, we estimated that PCV (%) is positively associated with hemoglobin levels. We estimated that a one-percent increase in PCV is associated with a 0.20 g/dl increase in hemoglobin on average. Accounting for sampling variability, this mean increase could be as small as 0.10 g/dl or as large as 0.30 g/dl in the population of all such subjects.

107 Example 2: Interpreting Result of 95% CI In other words: we estimated the average difference in hemoglobin levels for two groups of subjects who differ by one percent in PCV to be 0.20 g/dl (higher PCV group relative to lower). Accounting for sampling variability, the mean difference could be as small as 0.10 g/dl or as large as 0.30 g/dl in the population of all subjects.

108 What about Intercepts? In this section, all examples have confidence intervals for the slope and multiples of the slope. We can also create confidence intervals/p-values for the intercept in the same manner (and Stata presents them in the output). When $x$ is a continuous predictor, many times the intercept is just a placeholder and does not describe a useful quantity: as such, 95% CIs and p-values are not always relevant. However, when $x$ is a binary or categorical predictor, the intercept may have a substantive interpretation, and a 95% CI may be of interest.

109 Example 3: Length of Stay and Age of First Claim Box plot display: Length of Stay by Age at First Claim Category, Heritage Health Plan data (vertical axis: length of stay in days; groups: >= 40 years, < 40 years)

110 Example 3: Length of Stay and Age of First Claim The resulting equation: $\hat y = 4.9 - 2.1x$. $\hat\beta_1 = -2.1$: the estimated mean difference in length of stay for persons less than 40 at first claim compared to persons 40 or older is -2.1 days; the younger group has an average length of stay 2.1 days shorter. $\hat\beta_0 = 4.9$: the mean length of stay for persons 40 or older at first claim (reference group) is 4.9 days.

111 Example 3: Length of Stay and Age of First Claim Confidence intervals and p-values: $\hat\beta_1 = -2.1$, 95% CI (-2.3, -1.9), p < 0.001; $\hat\beta_0 = 4.9$, 95% CI (4.8, 5.0).

112 Summary The construction of confidence intervals for linear regression slopes and intercepts is business as usual: take the estimate and add/subtract 2 estimated standard errors (or slightly more in smaller samples). Confidence intervals for slopes are confidence intervals for mean differences. Confidence intervals for intercepts are confidence intervals for the mean of $y$ for a specific group ($x = 0$); these are not always relevant when $x$ is continuous.

113 Section E Measuring the Strength of a Linear Association

114 Strength of Association The slope of a regression line estimates the magnitude and direction of the relationship between $y$ and $x$: it encapsulates how much $y$ differs on average with differences in $x$. The slope estimate and standard error can be used to address the uncertainty in this estimate with regard to the true magnitude and direction of the association in the population from which the sample was taken. Slopes do not impart any information about how well the regression line fits the data in the sample; the slope gives no indication of how close the points get to the estimated regression line.

115 Example 1: Arm Circumference and Height The slope depends on the units of both $y$ and $x$.

116 Example 1: Arm Circumference and Height For example, when height ($x$) is measured in cm: $\hat y = 2.7 + 0.16x$. How about if height was recorded in inches?
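
One way to see the dependence on units is to re-express the same line with height in inches. The worked step below assumes the fitted line in centimeters is $\hat y = 2.7 + 0.16x$ and uses the exact conversion 1 inch = 2.54 cm; the subscripted symbols $x_{\text{cm}}$ and $x_{\text{in}}$ are introduced here only for illustration.

```latex
\hat{y} = 2.7 + 0.16\,x_{\text{cm}}, \qquad x_{\text{cm}} = 2.54\,x_{\text{in}}
\;\Longrightarrow\;
\hat{y} = 2.7 + (0.16 \times 2.54)\,x_{\text{in}} \approx 2.7 + 0.41\,x_{\text{in}}
```

The slope changes from 0.16 cm per cm to about 0.41 cm per inch, while the intercept and every estimated mean are unchanged, so a larger slope here does not indicate a stronger association.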

117 Strength of Association Another quantity that can be estimated via linear regression is the coefficient of determination, R 2 : this is a number that ranges from 0 to, with larger values indicate closer fits of the data points and regression line R 2 measures strength of association by comparing variability of points around the regression line to variability in y-values ignoring x 7 7

118 Example : Arm Circumference and Height How close do the points get to the line can we quantify? 8 8

119 Example 1: Arm Circumference and Height (SR Flashback). The sample standard deviation of the y-values, ignoring the corresponding potential information in x, is $s = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}$; this measures how far, on average, each of the sample y-values falls from the overall mean of all y-values. In this example, s = .48 cm.

120 Example 1: Arm Circumference and Height. Visualization on the scatterplot.

121 Example 1: Arm Circumference and Height. The standard deviation of regression, referred to as the root mean square error, is the average distance of points from the line: how far, on average, each y falls from its mean predicted by its corresponding x-value, $s_{y|x} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-2}}$. In this example, $s_{y|x}$ = …

122 Example 1: Arm Circumference and Height. Each distance is $y - \hat{y} = y - (\hat{\beta}_0 + \hat{\beta}_1 x)$: this is computed for each data point in the sample.

123 Example 1: Arm Circumference and Height. If $s = s_{y|x}$, then knowing x does not yield a better guess for the mean of y than using the overall mean $\bar{y}$ (flat regression line). The smaller $s_{y|x}$ is relative to $s$, the closer the points are to the regression line. $R^2$ functionally measures how much smaller $s_{y|x}$ is than $s$: as such, it is an estimate of the amount of variability in y explained by taking x into account.
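The comparison of $s$ and $s_{y|x}$ can be made concrete with a small computation. The following is a sketch on synthetic data (the simulated coefficients and sample size are assumptions, not the lecture's data). It computes both standard deviations, the usual $R^2 = 1 - \text{SSE}/\text{SST}$, and the closely related quantity $1 - (s_{y|x}/s)^2$, which differs from $R^2$ only through the $n-1$ versus $n-2$ divisors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 150

# Synthetic predictor and outcome (illustrative values only).
x = rng.uniform(60, 120, size=n)
y = 3.0 + 0.15 * x + rng.normal(0, 1.0, size=n)

# Least-squares line and fitted values.
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sst = np.sum((y - y.mean()) ** 2)   # variability in y ignoring x
sse = np.sum((y - y_hat) ** 2)      # variability of points around the line

s = np.sqrt(sst / (n - 1))          # sample SD of y
s_yx = np.sqrt(sse / (n - 2))       # SD of regression (root mean square error)

r_squared = 1 - sse / sst
approx = 1 - (s_yx / s) ** 2        # "how much smaller s_yx is than s"

print(f"s = {s:.3f}, s_yx = {s_yx:.3f}")
print(f"R^2 = {r_squared:.3f}, 1 - (s_yx/s)^2 = {approx:.3f}")
```

When the line is flat, `s_yx` is about as large as `s` and both quantities are near 0; as the points tighten around the line, `s_yx` shrinks and both approach 1.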

124 Example 1: Arm Circumference and Height. The $R^2$ from this regression of arm circumference on height is 0.46 (46%); child's height explains (an estimated) 46% of the variation in arm circumference.

125 Example 1: $R^2$ and r. r is the properly signed square root of $R^2$; the proper sign is the same sign as the slope in the regression. r is called the correlation coefficient (not to be confused with the regression coefficients; great names, huh?). Allowable values: $0 \le R^2 \le 1$. If the relationship between y and x is positive, $0 \le r \le 1$. If the relationship between y and x is negative, $-1 \le r \le 0$. In this example, r is the signed square root of $R^2 = 0.46$, with magnitude $\sqrt{0.46} \approx 0.68$.
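The "properly signed square root" rule can be written as a small helper. This is a sketch only: the function name `r_from_r_squared` is hypothetical, and the slope values passed in below are placeholders whose sign is all that matters.

```python
import math


def r_from_r_squared(r_squared, slope):
    """Return r as the square root of R^2 carrying the sign of the slope."""
    if not 0.0 <= r_squared <= 1.0:
        raise ValueError("R^2 must lie between 0 and 1")
    return math.copysign(math.sqrt(r_squared), slope)


# R^2 = 0.46 is the value reported above; the slopes are placeholders.
r_if_positive = r_from_r_squared(0.46, slope=+0.2)
r_if_negative = r_from_r_squared(0.46, slope=-0.2)
print(round(r_if_positive, 2), round(r_if_negative, 2))
```

Going the other way needs no extra information: squaring r always recovers $R^2$.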

126 Example 1: Arm Circumference and Height. So from the example: child height explains (an estimated) 46% of the variation in arm circumference. This is just an estimate based on the sample; a 95% CI can be computed, but it is not easy to do, and the procedure for estimating the 95% CI is not so good. This means an estimated 54% of the variability in arm circumference is not explained by child's height. Some of this unexplained variability may be explained by factors other than height. Multiple linear regression will allow us to estimate the relationship between arm circumference, height, and other child characteristics in one analysis.

127 Example 2: Hemoglobin and Packed Cell Volume. $R^2$ = 0.5: PCV explains (an estimated) 5% of the variation in hemoglobin levels. The corresponding correlation coefficient is $r = \pm\sqrt{R^2}$, signed to match the slope.

128 Example 3: Wages and Years of Education. $R^2$ = 0.5: years of education explains (an estimated) 5% of the variation in hourly wages. The corresponding correlation coefficient is $r = \pm\sqrt{R^2}$, signed to match the slope.

129 Example 4: Wages and Sex. $R^2$ = 0.042: sex (female = 1) explains (an estimated) 4.2% of the variation in hourly wages. The corresponding correlation coefficient is $r = \pm\sqrt{R^2}$, signed to match the slope, with magnitude $\sqrt{0.042} \approx 0.20$.

130 What's a Good $R^2$? There are a couple of important things to keep in mind about $R^2$ and r. These quantities are both estimates based on the sample of data; they are frequently reported without any recognition of sampling variability, for example a 95% confidence interval. A low $R^2$ or r is not necessarily bad: many outcomes cannot or will not be fully, or close to fully, explained, in terms of variability, by any single predictor.

131 What's a Good $R^2$? The higher the $R^2$ value, the better x predicts y for individuals in a sample/population, as individual y-values vary less about their estimated means based on x.

132 What's a Good $R^2$? However, there may be important overall associations between the mean of y and x even though there is still a lot of individual variability in y-values about their means estimated by x. In the wages example, years of education explained an estimated 5% of the variability in hourly wages. The association was statistically significant, showing that average wages were greater for persons with more years of education. However, for any single education level (year), there is still a lot of variation in wages for individual workers.

133 Slope versus $R^2$. The slope estimates the magnitude and direction of the relationship between y and x. It estimates a mean difference in y for two groups who differ by one unit in x. The slope will change if the units change for y and/or for x. Larger slopes are not indicative of a stronger linear association; smaller slopes are not indicative of a weaker linear association. $R^2$ measures the strength of linear association; r measures strength and direction. Neither $R^2$ nor r measures magnitude. Neither $R^2$ nor r changes with changes in units.
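The distinction between slope and $R^2$ can be illustrated by simulating two datasets that share the same underlying slope but differ in how tightly the points cluster around the line. The data, noise levels, and the helper name `fit_summary` below are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 200)


def fit_summary(x, y):
    """Return (estimated slope, R^2) for a simple linear regression of y on x."""
    slope, _intercept = np.polyfit(x, y, 1)
    r = np.corrcoef(x, y)[0, 1]
    return slope, r ** 2


# Same true slope (2.0), very different scatter around the line.
y_tight = 1.0 + 2.0 * x + rng.normal(0, 0.5, x.size)
y_loose = 1.0 + 2.0 * x + rng.normal(0, 8.0, x.size)

slope_tight, r2_tight = fit_summary(x, y_tight)
slope_loose, r2_loose = fit_summary(x, y_loose)

print(f"tight scatter: slope = {slope_tight:.2f}, R^2 = {r2_tight:.2f}")
print(f"loose scatter: slope = {slope_loose:.2f}, R^2 = {r2_loose:.2f}")
```

Both fits estimate roughly the same mean difference in y per unit of x, yet the second explains far less of the individual variability in y, which is the sense in which slope and $R^2$ answer different questions.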

134 $R^2$ vs. r. If you have r, you can compute $R^2$. If you have $R^2$, you can almost compute r (the sign must come from the slope).

135 r as a Quick Summary Measure. Table of correlations among age, weight, height, armcirc, and sex.

136 Summary. $R^2$ measures strength of association by comparing the variability of points around the regression line to the variability in y-values ignoring x. The correlation coefficient r is the properly signed square root of $R^2$, and hence provides information about the direction of the association estimated by the regression.

Section E. Measuring the Strength of A Linear Association

Section E. Measuring the Strength of A Linear Association This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

The Normal Distribution. John McGready, PhD Johns Hopkins University

The Normal Distribution. John McGready, PhD Johns Hopkins University The Normal Distribution John McGready, PhD Johns Hopkins University General Properties of The Normal Distribution The material in this video is subject to the copyright of the owners of the material and

More information

IQR = number. summary: largest. = 2. Upper half: Q3 =

IQR = number. summary: largest. = 2. Upper half: Q3 = Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number

More information

STA 570 Spring Lecture 5 Tuesday, Feb 1

STA 570 Spring Lecture 5 Tuesday, Feb 1 STA 570 Spring 2011 Lecture 5 Tuesday, Feb 1 Descriptive Statistics Summarizing Univariate Data o Standard Deviation, Empirical Rule, IQR o Boxplots Summarizing Bivariate Data o Contingency Tables o Row

More information

3. Data Analysis and Statistics

3. Data Analysis and Statistics 3. Data Analysis and Statistics 3.1 Visual Analysis of Data 3.2.1 Basic Statistics Examples 3.2.2 Basic Statistical Theory 3.3 Normal Distributions 3.4 Bivariate Data 3.1 Visual Analysis of Data Visual

More information

Chapter 6: DESCRIPTIVE STATISTICS

Chapter 6: DESCRIPTIVE STATISTICS Chapter 6: DESCRIPTIVE STATISTICS Random Sampling Numerical Summaries Stem-n-Leaf plots Histograms, and Box plots Time Sequence Plots Normal Probability Plots Sections 6-1 to 6-5, and 6-7 Random Sampling

More information

Multiple Linear Regression

Multiple Linear Regression Multiple Linear Regression Rebecca C. Steorts, Duke University STA 325, Chapter 3 ISL 1 / 49 Agenda How to extend beyond a SLR Multiple Linear Regression (MLR) Relationship Between the Response and Predictors

More information

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression

Your Name: Section: INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Your Name: Section: 36-201 INTRODUCTION TO STATISTICAL REASONING Computer Lab #4 Scatterplots and Regression Objectives: 1. To learn how to interpret scatterplots. Specifically you will investigate, using

More information

BIO 360: Vertebrate Physiology Lab 9: Graphing in Excel. Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26

BIO 360: Vertebrate Physiology Lab 9: Graphing in Excel. Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26 Lab 9: Graphing: how, why, when, and what does it mean? Due 3/26 INTRODUCTION Graphs are one of the most important aspects of data analysis and presentation of your of data. They are visual representations

More information

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student

The basic arrangement of numeric data is called an ARRAY. Array is the derived data from fundamental data Example :- To store marks of 50 student Organizing data Learning Outcome 1. make an array 2. divide the array into class intervals 3. describe the characteristics of a table 4. construct a frequency distribution table 5. constructing a composite

More information

BIOSTATISTICS LABORATORY PART 1: INTRODUCTION TO DATA ANALYIS WITH STATA: EXPLORING AND SUMMARIZING DATA

BIOSTATISTICS LABORATORY PART 1: INTRODUCTION TO DATA ANALYIS WITH STATA: EXPLORING AND SUMMARIZING DATA BIOSTATISTICS LABORATORY PART 1: INTRODUCTION TO DATA ANALYIS WITH STATA: EXPLORING AND SUMMARIZING DATA Learning objectives: Getting data ready for analysis: 1) Learn several methods of exploring the

More information

CHAPTER 2 DESCRIPTIVE STATISTICS

CHAPTER 2 DESCRIPTIVE STATISTICS CHAPTER 2 DESCRIPTIVE STATISTICS 1. Stem-and-Leaf Graphs, Line Graphs, and Bar Graphs The distribution of data is how the data is spread or distributed over the range of the data values. This is one of

More information

IAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram

IAT 355 Visual Analytics. Data and Statistical Models. Lyn Bartram IAT 355 Visual Analytics Data and Statistical Models Lyn Bartram Exploring data Example: US Census People # of people in group Year # 1850 2000 (every decade) Age # 0 90+ Sex (Gender) # Male, female Marital

More information

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs.

Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs. 1 2 Things you ll know (or know better to watch out for!) when you leave in December: 1. What you can and cannot infer from graphs. 2. How to construct (in your head!) and interpret confidence intervals.

More information

A straight line is the graph of a linear equation. These equations come in several forms, for example: change in x = y 1 y 0

A straight line is the graph of a linear equation. These equations come in several forms, for example: change in x = y 1 y 0 Lines and linear functions: a refresher A straight line is the graph of a linear equation. These equations come in several forms, for example: (i) ax + by = c, (ii) y = y 0 + m(x x 0 ), (iii) y = mx +

More information

Correctly Compute Complex Samples Statistics

Correctly Compute Complex Samples Statistics SPSS Complex Samples 15.0 Specifications Correctly Compute Complex Samples Statistics When you conduct sample surveys, use a statistics package dedicated to producing correct estimates for complex sample

More information

Minitab 17 commands Prepared by Jeffrey S. Simonoff

Minitab 17 commands Prepared by Jeffrey S. Simonoff Minitab 17 commands Prepared by Jeffrey S. Simonoff Data entry and manipulation To enter data by hand, click on the Worksheet window, and enter the values in as you would in any spreadsheet. To then save

More information

Multiple Regression White paper

Multiple Regression White paper +44 (0) 333 666 7366 Multiple Regression White paper A tool to determine the impact in analysing the effectiveness of advertising spend. Multiple Regression In order to establish if the advertising mechanisms

More information

Chapter Two: Descriptive Methods 1/50

Chapter Two: Descriptive Methods 1/50 Chapter Two: Descriptive Methods 1/50 2.1 Introduction 2/50 2.1 Introduction We previously said that descriptive statistics is made up of various techniques used to summarize the information contained

More information

The results section of a clinicaltrials.gov file is divided into discrete parts, each of which includes nested series of data entry screens.

The results section of a clinicaltrials.gov file is divided into discrete parts, each of which includes nested series of data entry screens. OVERVIEW The ClinicalTrials.gov Protocol Registration System (PRS) is a web-based tool developed for submitting clinical trials information to ClinicalTrials.gov. This document provides step-by-step instructions

More information

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency

Math 120 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency Math 1 Introduction to Statistics Mr. Toner s Lecture Notes 3.1 Measures of Central Tendency lowest value + highest value midrange The word average: is very ambiguous and can actually refer to the mean,

More information

Summarising Data. Mark Lunt 09/10/2018. Arthritis Research UK Epidemiology Unit University of Manchester

Summarising Data. Mark Lunt 09/10/2018. Arthritis Research UK Epidemiology Unit University of Manchester Summarising Data Mark Lunt Arthritis Research UK Epidemiology Unit University of Manchester 09/10/2018 Summarising Data Today we will consider Different types of data Appropriate ways to summarise these

More information

Section 2.3: Simple Linear Regression: Predictions and Inference

Section 2.3: Simple Linear Regression: Predictions and Inference Section 2.3: Simple Linear Regression: Predictions and Inference Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.4 1 Simple

More information

MAT 110 WORKSHOP. Updated Fall 2018

MAT 110 WORKSHOP. Updated Fall 2018 MAT 110 WORKSHOP Updated Fall 2018 UNIT 3: STATISTICS Introduction Choosing a Sample Simple Random Sample: a set of individuals from the population chosen in a way that every individual has an equal chance

More information

An Introduction to Growth Curve Analysis using Structural Equation Modeling

An Introduction to Growth Curve Analysis using Structural Equation Modeling An Introduction to Growth Curve Analysis using Structural Equation Modeling James Jaccard New York University 1 Overview Will introduce the basics of growth curve analysis (GCA) and the fundamental questions

More information

Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding

Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding Psychology 282 Lecture #21 Outline Categorical IVs in MLR: Effects Coding and Contrast Coding In the previous lecture we learned how to incorporate a categorical research factor into a MLR model by using

More information

TYPES OF VARIABLES, STRUCTURE OF DATASETS, AND BASIC STATA LAYOUT

TYPES OF VARIABLES, STRUCTURE OF DATASETS, AND BASIC STATA LAYOUT PRIMER FOR ACS OUTCOMES RESEARCH COURSE: TYPES OF VARIABLES, STRUCTURE OF DATASETS, AND BASIC STATA LAYOUT STEP 1: Install STATA statistical software. STEP 2: Read through this primer and complete the

More information

WELCOME! Lecture 3 Thommy Perlinger

WELCOME! Lecture 3 Thommy Perlinger Quantitative Methods II WELCOME! Lecture 3 Thommy Perlinger Program Lecture 3 Cleaning and transforming data Graphical examination of the data Missing Values Graphical examination of the data It is important

More information

Week 4: Simple Linear Regression III

Week 4: Simple Linear Regression III Week 4: Simple Linear Regression III Marcelo Coca Perraillon University of Colorado Anschutz Medical Campus Health Services Research Methods I HSMP 7607 2017 c 2017 PERRAILLON ARR 1 Outline Goodness of

More information

Modelling Proportions and Count Data

Modelling Proportions and Count Data Modelling Proportions and Count Data Rick White May 4, 2016 Outline Analysis of Count Data Binary Data Analysis Categorical Data Analysis Generalized Linear Models Questions Types of Data Continuous data:

More information

Linear Regression. Problem: There are many observations with the same x-value but different y-values... Can t predict one y-value from x. d j.

Linear Regression. Problem: There are many observations with the same x-value but different y-values... Can t predict one y-value from x. d j. Linear Regression (*) Given a set of paired data, {(x 1, y 1 ), (x 2, y 2 ),..., (x n, y n )}, we want a method (formula) for predicting the (approximate) y-value of an observation with a given x-value.

More information

8. MINITAB COMMANDS WEEK-BY-WEEK

8. MINITAB COMMANDS WEEK-BY-WEEK 8. MINITAB COMMANDS WEEK-BY-WEEK In this section of the Study Guide, we give brief information about the Minitab commands that are needed to apply the statistical methods in each week s study. They are

More information

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes.

Resources for statistical assistance. Quantitative covariates and regression analysis. Methods for predicting continuous outcomes. Resources for statistical assistance Quantitative covariates and regression analysis Carolyn Taylor Applied Statistics and Data Science Group (ASDa) Department of Statistics, UBC January 24, 2017 Department

More information

Statistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975.

Statistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975. Statistical Package for the Social Sciences INTRODUCTION TO SPSS SPSS for Windows Version 16.0: Its first version in 1968 In 1975. SPSS Statistics were designed INTRODUCTION TO SPSS Objective About the

More information

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures Part I, Chapters 4 & 5 Data Tables and Data Analysis Statistics and Figures Descriptive Statistics 1 Are data points clumped? (order variable / exp. variable) Concentrated around one value? Concentrated

More information

Lab #9: ANOVA and TUKEY tests

Lab #9: ANOVA and TUKEY tests Lab #9: ANOVA and TUKEY tests Objectives: 1. Column manipulation in SAS 2. Analysis of variance 3. Tukey test 4. Least Significant Difference test 5. Analysis of variance with PROC GLM 6. Levene test for

More information

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a

in this course) ˆ Y =time to event, follow-up curtailed: covered under ˆ Missing at random (MAR) a Chapter 3 Missing Data 3.1 Types of Missing Data ˆ Missing completely at random (MCAR) ˆ Missing at random (MAR) a ˆ Informative missing (non-ignorable non-response) See 1, 38, 59 for an introduction to

More information

Modelling Proportions and Count Data

Modelling Proportions and Count Data Modelling Proportions and Count Data Rick White May 5, 2015 Outline Analysis of Count Data Binary Data Analysis Categorical Data Analysis Generalized Linear Models Questions Types of Data Continuous data:

More information

SLStats.notebook. January 12, Statistics:

SLStats.notebook. January 12, Statistics: Statistics: 1 2 3 Ways to display data: 4 generic arithmetic mean sample 14A: Opener, #3,4 (Vocabulary, histograms, frequency tables, stem and leaf) 14B.1: #3,5,8,9,11,12,14,15,16 (Mean, median, mode,

More information

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things.

Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. + What is Data? Data is a collection of facts. Data can be in the form of numbers, words, measurements, observations or even just descriptions of things. In most cases, data needs to be interpreted and

More information

STANDARDS OF LEARNING CONTENT REVIEW NOTES. ALGEBRA I Part I. 4 th Nine Weeks,

STANDARDS OF LEARNING CONTENT REVIEW NOTES. ALGEBRA I Part I. 4 th Nine Weeks, STANDARDS OF LEARNING CONTENT REVIEW NOTES ALGEBRA I Part I 4 th Nine Weeks, 2016-2017 1 OVERVIEW Algebra I Content Review Notes are designed by the High School Mathematics Steering Committee as a resource

More information

Chapter 1. Looking at Data-Distribution

Chapter 1. Looking at Data-Distribution Chapter 1. Looking at Data-Distribution Statistics is the scientific discipline that provides methods to draw right conclusions: 1)Collecting the data 2)Describing the data 3)Drawing the conclusions Raw

More information

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display CURRICULUM MAP TEMPLATE Priority Standards = Approximately 70% Supporting Standards = Approximately 20% Additional Standards = Approximately 10% HONORS PROBABILITY AND STATISTICS Essential Questions &

More information

8: Statistics. Populations and Samples. Histograms and Frequency Polygons. Page 1 of 10

8: Statistics. Populations and Samples. Histograms and Frequency Polygons. Page 1 of 10 8: Statistics Statistics: Method of collecting, organizing, analyzing, and interpreting data, as well as drawing conclusions based on the data. Methodology is divided into two main areas. Descriptive Statistics:

More information

Using Large Data Sets Workbook Version A (MEI)

Using Large Data Sets Workbook Version A (MEI) Using Large Data Sets Workbook Version A (MEI) 1 Index Key Skills Page 3 Becoming familiar with the dataset Page 3 Sorting and filtering the dataset Page 4 Producing a table of summary statistics with

More information

. predict mod1. graph mod1 ed, connect(l) xlabel ylabel l1(model1 predicted income) b1(years of education)

. predict mod1. graph mod1 ed, connect(l) xlabel ylabel l1(model1 predicted income) b1(years of education) DUMMY VARIABLES AND INTERACTIONS Let's start with an example in which we are interested in discrimination in income. We have a dataset that includes information for about 16 people on their income, their

More information

Research Methods for Business and Management. Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel

Research Methods for Business and Management. Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel Research Methods for Business and Management Session 8a- Analyzing Quantitative Data- using SPSS 16 Andre Samuel A Simple Example- Gym Purpose of Questionnaire- to determine the participants involvement

More information

To make sense of data, you can start by answering the following questions:

To make sense of data, you can start by answering the following questions: Taken from the Introductory Biology 1, 181 lab manual, Biological Sciences, Copyright NCSU (with appreciation to Dr. Miriam Ferzli--author of this appendix of the lab manual). Appendix : Understanding

More information

Coding Categorical Variables in Regression: Indicator or Dummy Variables. Professor George S. Easton

Coding Categorical Variables in Regression: Indicator or Dummy Variables. Professor George S. Easton Coding Categorical Variables in Regression: Indicator or Dummy Variables Professor George S. Easton DataScienceSource.com This video is embedded on the following web page at DataScienceSource.com: DataScienceSource.com/DummyVariables

More information

Statistical Good Practice Guidelines. 1. Introduction. Contents. SSC home Using Excel for Statistics - Tips and Warnings

Statistical Good Practice Guidelines. 1. Introduction. Contents. SSC home Using Excel for Statistics - Tips and Warnings Statistical Good Practice Guidelines SSC home Using Excel for Statistics - Tips and Warnings On-line version 2 - March 2001 This is one in a series of guides for research and support staff involved in

More information

CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1. Daphne Skipper, Augusta University (2016)

CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1. Daphne Skipper, Augusta University (2016) CHAPTER 2: DESCRIPTIVE STATISTICS Lecture Notes for Introductory Statistics 1 Daphne Skipper, Augusta University (2016) 1. Stem-and-Leaf Graphs, Line Graphs, and Bar Graphs The distribution of data is

More information

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation 10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode

More information

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation 10.4 Measures of Central Tendency and Variation Mode-->The number that occurs most frequently; there can be more than one mode ; if each number appears equally often, then there is no mode at all. (mode

More information

Scatterplot: The Bridge from Correlation to Regression

Scatterplot: The Bridge from Correlation to Regression Scatterplot: The Bridge from Correlation to Regression We have already seen how a histogram is a useful technique for graphing the distribution of one variable. Here is the histogram depicting the distribution

More information

Table Of Contents. Table Of Contents

Table Of Contents. Table Of Contents Statistics Table Of Contents Table Of Contents Basic Statistics... 7 Basic Statistics Overview... 7 Descriptive Statistics Available for Display or Storage... 8 Display Descriptive Statistics... 9 Store

More information

STANDARDS OF LEARNING CONTENT REVIEW NOTES ALGEBRA I. 4 th Nine Weeks,

STANDARDS OF LEARNING CONTENT REVIEW NOTES ALGEBRA I. 4 th Nine Weeks, STANDARDS OF LEARNING CONTENT REVIEW NOTES ALGEBRA I 4 th Nine Weeks, 2016-2017 1 OVERVIEW Algebra I Content Review Notes are designed by the High School Mathematics Steering Committee as a resource for

More information

Using Excel for Graphical Analysis of Data

Using Excel for Graphical Analysis of Data Using Excel for Graphical Analysis of Data Introduction In several upcoming labs, a primary goal will be to determine the mathematical relationship between two variable physical parameters. Graphs are

More information

Tabular & Graphical Presentation of data

Tabular & Graphical Presentation of data Tabular & Graphical Presentation of data bjectives: To know how to make frequency distributions and its importance To know different terminology in frequency distribution table To learn different graphs/diagrams

More information

One Factor Experiments

One Factor Experiments One Factor Experiments 20-1 Overview Computation of Effects Estimating Experimental Errors Allocation of Variation ANOVA Table and F-Test Visual Diagnostic Tests Confidence Intervals For Effects Unequal

More information

Averages and Variation

Averages and Variation Averages and Variation 3 Copyright Cengage Learning. All rights reserved. 3.1-1 Section 3.1 Measures of Central Tendency: Mode, Median, and Mean Copyright Cengage Learning. All rights reserved. 3.1-2 Focus

More information

1. Assumptions. 1. Introduction. 2. Terminology

1. Assumptions. 1. Introduction. 2. Terminology 4. Process Modeling 4. Process Modeling The goal for this chapter is to present the background and specific analysis techniques needed to construct a statistical model that describes a particular scientific

More information

Regression Analysis and Linear Regression Models

Regression Analysis and Linear Regression Models Regression Analysis and Linear Regression Models University of Trento - FBK 2 March, 2015 (UNITN-FBK) Regression Analysis and Linear Regression Models 2 March, 2015 1 / 33 Relationship between numerical

More information

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010

THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL. STOR 455 Midterm 1 September 28, 2010 THIS IS NOT REPRESNTATIVE OF CURRENT CLASS MATERIAL STOR 455 Midterm September 8, INSTRUCTIONS: BOTH THE EXAM AND THE BUBBLE SHEET WILL BE COLLECTED. YOU MUST PRINT YOUR NAME AND SIGN THE HONOR PLEDGE

More information

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques

Slide Copyright 2005 Pearson Education, Inc. SEVENTH EDITION and EXPANDED SEVENTH EDITION. Chapter 13. Statistics Sampling Techniques SEVENTH EDITION and EXPANDED SEVENTH EDITION Slide - Chapter Statistics. Sampling Techniques Statistics Statistics is the art and science of gathering, analyzing, and making inferences from numerical information

More information

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value.

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value. Calibration OVERVIEW... 2 INTRODUCTION... 2 CALIBRATION... 3 ANOTHER REASON FOR CALIBRATION... 4 CHECKING THE CALIBRATION OF A REGRESSION... 5 CALIBRATION IN SIMPLE REGRESSION (DISPLAY.JMP)... 5 TESTING

More information

Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University

Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University Maximizing Statistical Interactions Part II: Database Issues Provided by: The Biostatistics Collaboration Center (BCC) at Northwestern University While your data tables or spreadsheets may look good to

More information

Performance Estimation and Regularization. Kasthuri Kannan, PhD. Machine Learning, Spring 2018
