
DUMMY VARIABLES AND INTERACTIONS

Let's start with an example in which we are interested in discrimination in income. We have a dataset that includes information for about 1600 people on their income, their education, and their race-ethnic group (as well as additional variables that we shall, for the present, ignore and that were eliminated from this data subset).

. use discrim
. des

Contains data from discrim.dta
  obs:         1,606
 vars:             6
 size:        44,968
-------------------------------
  1. ed       float  %9.0g
  2. income   float  %9.0g
  3. female   float  %9.0g
  4. black    float  %9.0g
  5. hisp     float  %9.0g
  6. white    float  %9.0g
-------------------------------
Sorted by:

MODEL 1: The first model includes only education as a predictor.

. regress income ed

[Regression output garbled in this copy. Recoverable: F(1, 1604); Adj R-squared = .1438. Coefficient table (ed, _cons) lost.]

. predict mod1
. graph mod1 ed, connect(l) xlabel ylabel l1(model1 predicted income) b1(years of education)

The graph shows, as expected, that education is related to increased income. In fact, it shows a linear relationship between education and income: for every additional year of education, income is predicted to increase by the same amount, the coefficient on ed (the dollar value is not legible in this copy).

[Figure: model1 predicted income (vertical axis) against ed, years of education (horizontal axis)]
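The graph command above uses the syntax of older Stata releases. The following is a minimal sketch of the same Model 1 plot in current syntax, assuming Stata 8 or later and that discrim.dta sits in the working directory. The axis titles are copied from the handout; the choice of twoway line is an assumption about what connect(l) was meant to draw, not the author's own code.

```stata
* Sketch only: modern equivalent of the handout's Model 1 plot.
* Run from a do-file (the /// line continuation is not accepted interactively).
use discrim, clear
regress income ed
predict mod1                      // fitted values; xb is predict's default
* sort orders the points by ed so the connecting line does not zigzag
twoway line mod1 ed, sort ///
    ytitle("model1 predicted income") xtitle("years of education")
```

In those later releases the original old-style command can usually still be run by typing graph7 in place of graph.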

MODEL 2: Next, however, we introduce a dummy variable. A dummy variable has only two categories, 0 and 1. In this case, the dummy variable white = 1 if the individual is white, and 0 if he or she is nonwhite (in this particular dataset, black or hispanic). The coefficient on this new variable asks whether there is a constant difference in income between whites and nonwhites who have the same education.

. regress income ed white

[Regression output garbled in this copy. Recoverable: F(2, 1603); Adj R-squared = .1491. Coefficient table (ed, white, _cons) lost.]

We want, again, to look at the predicted values, but this time we can plot them separately for whites and nonwhites. To do so, we set up two new variables: mod2w and mod2n contain the predicted values of income for whites and nonwhites respectively.

. predict mod2
. gen mod2w=mod2 if white==1
(294 missing values generated)
. gen mod2n=mod2 if white==0
(1312 missing values generated)
. graph mod2w mod2n ed, connect(ll) xlabel ylabel l1(model2 predicted income) b1(years of education)

When we graph these values, we find two parallel lines: the lines for whites and nonwhites differ only in their intercept. We can see how this happens by writing out the prediction equations for whites and nonwhites.

FOR NONWHITES: since white = 0, $b_0 + b_1\,ed + b_2\,white$ becomes $b_0 + b_1\,ed$.

FOR WHITES: since white = 1, $b_0 + b_1\,ed + b_2\,white$ becomes $(b_0 + b_2) + b_1\,ed$.

What the dummy variable has done is to allow us separate intercepts: $b_0$ for nonwhites and $b_0 + b_2$ for whites.

[Figure: model2 predicted income (mod2w, mod2n) against ed, years of education; two parallel lines]
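The two prediction equations for Model 2 can be checked directly from the stored coefficients. This is a minimal sketch, assuming discrim.dta is available and a Stata release with twoway graphs (8 or later); the legend labels are illustrative, not taken from the handout.

```stata
* Sketch only: recover the two intercepts implied by Model 2.
use discrim, clear
regress income ed white
display "intercept, nonwhites (b0):    " _b[_cons]
display "intercept, whites (b0 + b2):  " _b[_cons] + _b[white]
display "common slope on ed (b1):      " _b[ed]
* The vertical gap between the two fitted lines equals _b[white] at every ed.
predict mod2
twoway (line mod2 ed if white==1, sort) ///
       (line mod2 ed if white==0, sort), ///
       legend(order(1 "whites" 2 "nonwhites")) ///
       ytitle("model2 predicted income") xtitle("years of education")
```

Because both groups share the single coefficient on ed, the two plotted lines are parallel by construction rather than as an empirical finding.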

This model allows the intercepts to differ by race, BUT assumes the increase in income for each additional year of education is the same for whites and nonwhites alike.

MODEL 3: But suppose we want to ask whether or not the slope is the same. To do so, we can use an interaction term that is the product of the variable white and the variable education. This variable is 0 for nonwhites, but for whites it is equal to their education.

. gen edw = white*ed
. regress income ed white edw

[Regression output garbled in this copy. Recoverable: F(3, 1602) = 103.80; Adj R-squared = .1612. Coefficient table (ed, white, edw, _cons) lost.]

. predict mod3
. gen mod3w = mod3 if white==1
(294 missing values generated)
. gen mod3n = mod3 if white==0
(1312 missing values generated)
. graph mod3w mod3n ed, connect(ll) xlabel ylabel l1(model3 predicted income) b1(years of education)

[Figure: model3 predicted income (mod3w, mod3n) against ed, years of education]

What we see is that this method has allowed us to ask whether both the slope and the intercept differ for whites compared to nonwhites.

FOR NONWHITES: $b_0 + b_1\,ed + b_2\,white + b_3\,edw$ becomes, since white = 0 and edw = 0, $b_0 + b_1\,ed$.

FOR WHITES: $b_0 + b_1\,ed + b_2\,white + b_3\,edw$ becomes, since white = 1 and edw = ed, $(b_0 + b_2) + (b_1 + b_3)\,ed$.

What the dummy variable for white and its interaction with ed have done is to allow us to estimate separate intercepts and separate slopes for the relationship between education and income for whites and nonwhites. These analyses can also be done separately for whites and nonwhites.

MODEL 4: Whites only

. regress income ed if white==1

[Regression output garbled in this copy. Recoverable: F(1, 1310). Coefficient table (ed, _cons) lost.]

MODEL 5: Nonwhites only

. regress income ed if white==0

[Regression output garbled in this copy. Recoverable: F(1, 292). Coefficient table (ed, _cons) lost.]

Please note that these separate regressions give the same coefficient estimates as the single analysis in Model 3 (the standard errors differ, because Model 3 pools the residual variance across the two groups). The intercept and education coefficients for nonwhites in Model 5 are the same as in Model 3. The intercept in Model 4 is the sum of the intercept and the coefficient for white in Model 3. The coefficient for education in Model 4 is the sum of the coefficient for education and that for edw in Model 3.
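The correspondence between Model 3 and the separate regressions can be verified with lincom, which estimates linear combinations of coefficients after a regression. A minimal sketch, assuming discrim.dta is available; the variable edw is built exactly as in the handout.

```stata
* Sketch only: Model 3's implied white intercept and slope
* should equal the whites-only estimates from Model 4.
use discrim, clear
gen edw = white*ed
regress income ed white edw
lincom _cons + white      // intercept for whites: b0 + b2
lincom ed + edw           // slope for whites:     b1 + b3
* Compare the two point estimates above with this regression:
regress income ed if white==1
```

The point estimates should agree; the standard errors reported by lincom will generally not match those of the whites-only regression, since the pooled model uses one residual variance for everyone.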

MORE THAN TWO CATEGORIES

MODEL 6: We can extend the analysis to look at blacks and hispanics separately, so that now we have three categories: white, black, hispanic. To carry out this analysis, we need 2 dummy variables. In this case, I choose to use black = 1 if black, hisp = 1 if hispanic, and zero otherwise. Whites are zero on both these variables.

. regress income ed black hisp

[Regression output garbled in this copy. Recoverable: F(3, 1602); Adj R-squared = .1488. Coefficient table (ed, black, hisp, _cons) lost.]

. predict mod6
. gen mod6w = mod6 if (black+hisp==0)
(294 missing values generated)
. gen mod6b = mod6 if black==1
(1440 missing values generated)
. gen mod6h = mod6 if hisp==1
(1478 missing values generated)
. graph mod6w mod6b mod6h ed, connect(lll) xlabel ylabel l1(model6 predicted income) b1(years of education)

[Figure: model6 predicted income (mod6w, mod6b, mod6h) against ed, years of education]

Even though the coefficient on hisp is not significantly different from zero, I plotted all three groups anyway. In this case, we get three parallel lines, one for each race-ethnic group.
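The same three-category model can be written with a single categorical variable and Stata's factor-variable notation. This is a sketch under stated assumptions: it needs Stata 11 or later, and the variable race3 with its coding (1 = white, 2 = black, 3 = hispanic) and the label name race3lbl are hypothetical, introduced here for illustration only.

```stata
* Sketch only: Model 6 with factor-variable notation.
use discrim, clear
* hypothetical coding: 1 = white, 2 = black, 3 = hispanic
gen byte race3 = 1 + black + 2*hisp
label define race3lbl 1 "white" 2 "black" 3 "hispanic"
label values race3 race3lbl
* i.race3 creates the two dummies on the fly; the lowest code (white)
* is the omitted reference category, as in the handout
regress income ed i.race3
* joint test that the black and hispanic coefficients are both zero
testparm i.race3
```

The coefficients on 2.race3 and 3.race3 should match those on black and hisp in the handout's regression, since the two specifications encode the same contrasts against whites.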

MODEL 7: We can, further, allow the slopes to vary by creating the same kind of interaction variable as before:

. gen edb = ed*black
. gen edh = ed*hisp
. regress income ed black edb hisp edh

[Regression output garbled in this copy. Recoverable: F(5, 1600). Coefficient table (ed, black, edb, hisp, edh, _cons) lost.]

. predict mod7
. gen mod7w = mod7 if black+hisp==0
(294 missing values generated)
. gen mod7b = mod7 if black==1
(1440 missing values generated)
. gen mod7h = mod7 if hisp==1
(1478 missing values generated)
. graph mod7w mod7b mod7h ed, connect(lll) symbol(iii)

[Figure: model7 predicted income (mod7w, mod7b, mod7h) against ed, years of education]

In this case, since all coefficients are significant, we see that both the intercepts and the slopes differ across groups. The starting values (intercepts) are higher for blacks and hispanics, i.e., at low levels of education their predicted income is higher. BUT the increase with education is lower (the edb and edh variables have negative coefficients). What this means is that the lines cross, and as education increases, whites outstrip the other groups.

WHAT OTHER VARIABLES MIGHT YOU WANT TO INCLUDE TO HAVE A FULLY DEVELOPED MODEL?

This part is up to you, the investigator. Statistics can't define issues for you: using statistics, we can only say whether a particular model describes our data well or poorly.

CONVERTING AN INTERVAL VARIABLE TO DUMMY VARIABLES

Many of you asked whether the relationship of income and education was really linear. There are a number of ways of looking at that question. The one I want to introduce today is the use of dummy variables, that is, turning education into a categorical variable. We know that education is measured in years:

. sum ed

[Summary output for ed (Obs, Mean, Std. Dev., Min, Max) lost in this copy.]

. * treat ed as a categorical variable: categories lths, HS, somecl, col
. * need 3 dummy variables
. gen lths=0
. replace lths=1 if ed<12
(370 real changes made)
. gen HS=0
. replace HS=1 if ed==12
(583 real changes made)
. gen somecl=0
. replace somecl=1 if ed>12 & ed<16
(295 real changes made)
. gen col=0
. replace col=1 if ed>=16
(358 real changes made)
. regress income HS somecl col

[Regression output garbled in this copy. Recoverable: F(3, 1602) = 100.71; Adj R-squared = .1571. Coefficient table (HS, somecl, col, _cons) lost.]
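The four education dummies can also be created in one step each, because a logical expression in Stata evaluates to 1 when true and 0 when false. A minimal sketch, assuming discrim.dta is available; the if !missing(ed) qualifier is an added safeguard, not part of the handout's commands.

```stata
* Sketch only: education dummies from logical expressions.
use discrim, clear
* Stata treats a missing value as larger than any number, so without the
* qualifier a missing ed would satisfy ed >= 16 and be coded col = 1.
gen byte lths   = (ed < 12)           if !missing(ed)
gen byte HS     = (ed == 12)          if !missing(ed)
gen byte somecl = (ed > 12 & ed < 16) if !missing(ed)
gen byte col    = (ed >= 16)          if !missing(ed)
* every person with a known ed should fall in exactly one category
assert lths + HS + somecl + col == 1 if !missing(ed)
* lths is left out of the regression as the reference category
regress income HS somecl col
```

If ed has no missing values, these dummies are identical to the ones built with gen and replace, and the category counts should add up to the number of observations.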

Our model:

$$Y_i = \beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2} + \beta_3 X_{i3} + \varepsilon_i$$

or

$$\mathrm{INCOME} = \beta_0 + \beta_1\,\mathrm{HS} + \beta_2\,\mathrm{somecl} + \beta_3\,\mathrm{col} + \varepsilon$$

The estimated model is:

$$\widehat{\mathrm{INCOME}} = b_0 + 4445\,\mathrm{HS} + 9627\,\mathrm{somecl} + 21231\,\mathrm{col}$$

where the estimated intercept $b_0$ is not legible in this copy and the three slopes are the values quoted in the next paragraph.

All of the coefficients are significant. What the results say is that, compared to those with less than a high school education, income for those with a high school education is, on average, $4445 higher; for those who attended college for less than 4 years, $9627 higher; and for those who have 4+ years of college, $21231 higher. This indicates that the increase is not likely to be linear. To see this more clearly, we could have constructed 18 dummy variables (since ed takes 19 values, 0-18) and tested the difference for each year.

I next added in other variables and will call this the small model. It can also be referred to as a main effects model, since it contains no interaction terms.

. regress income HS somecl col black hisp female

[Regression output garbled in this copy. Recoverable: F(6, 1599); Residual SS = 4.339e+11 (marked "Small" in the handout); Adj R-squared = .2699. Coefficient table lost.]

The next step is to consider again interactions between education and EACH of the race-ethnic and gender variables. We have to create interactions with EACH dummy variable representing a category of education. I'll refer to the resulting model as the large model.

. gen HSb = HS*black
. gen HSh = HS*hisp
. gen HSf = HS*female
. gen someclf = somecl*female
. gen someclb = somecl*black
. gen someclh = somecl*hisp
. gen colf = col*female
. gen colh = col*hisp
. gen colb = col*black

. regress income HS somecl col black hisp female HSb HSh HSf someclf someclb someclh colf colh colb

[Regression output garbled in this copy. Recoverable: F(15, 1590); Residual SS = 4.182e+11 (marked "Large" in the handout); Adj R-squared = .2913. Coefficient table lost.]

Lots of the variables are now non-significant. Can we DROP all of them? Is it really the case that the coefficients for HSb, HSh, HSf, someclf, someclb, someclh, and colb are ALL not significantly different from zero?

THE F-TEST

Here's where we use the F-test to our advantage. Remember, the F-test asks whether $R^2$ for the model that includes these variables is significantly greater than the $R^2$ for the model that omits them, i.e., it asks whether the difference in the two $R^2$ values is significantly different from zero. In our case, the value for the large model is .298; the value for the small model is not legible in this copy. Equivalently, it asks whether the residual sum of squares (RSS) in the large model (here 4.182E+11) is significantly smaller than the RSS for the model with fewer variables (here 4.339E+11). This asks whether the estimated Y's are closer to the observed ones when we include these additional variables (even though each appears, alone, not to be significant).

The test statistic is

$$F = \frac{(RSS_{\text{small}} - RSS_{\text{large}})\,/\,(df_{\text{small}} - df_{\text{large}})}{RSS_{\text{large}}\,/\,df_{\text{large}}}$$

The denominator also appears on our printout as the Residual MS, or mean square residual. Please note that the difference in degrees of freedom for the two models is equal to the number of new variables introduced when we expand from the small model to the large one.

One such test is done automatically for you in every regression output where you see an F value: this is the particular test in which the small model estimates all Y's to have the same value, which we call Model C:

Model C: $\hat{Y} = \text{constant}$

For example, for our large model, F(15, 1590) = 44.9 and Prob > F = 0.0000. We have 15 variables (the X's) more than the model with only a constant, and we have 1606 observations and 16 parameters, or 1590 df for the large model.

Please note that we can calculate the F statistic ourselves from the output when we are comparing to the model with only a constant. Recall that $RSS_C$ for Model C is the Total SS in the output. In this case, the F statistic for comparing our model to Model C is

$$F = \frac{(\text{Total SS} - \text{Residual SS})\,/\,\text{Model df}}{\text{Residual MS}} = \frac{\text{Model SS}\,/\,\text{Model df}}{\text{Residual MS}} = \frac{\text{Model MS}}{\text{Residual MS}}$$

When we want to compare a large model to a small one that still has some predictors, we have to use the more complicated expression given above, or ask STATA to do it for us by issuing the test command right after the regression:

. regress income HS somecl col black hisp female HSb HSh HSf someclf someclb someclh colf colh colb
. test HSb HSh HSf someclf someclb someclh colb

 ( 1)  HSb = 0.0
 ( 2)  HSh = 0.0
 ( 3)  HSf = 0.0
 ( 4)  someclf = 0.0
 ( 5)  someclb = 0.0
 ( 6)  someclh = 0.0
 ( 7)  colb = 0.0

       F(  7,  1590) =    0.52
            Prob > F =    0.822

Here we do not reject the joint hypothesis that these coefficients are all zero. We can then estimate the model with them omitted:

. regress income HS somecl col black hisp female colf colh

[Regression output garbled in this copy. Recoverable: F(8, 1597) = 84.08; Adj R-squared = .2928. Coefficient table (HS, somecl, col, black, hisp, female, colf, colh, _cons) lost.]
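The F statistic reported by test can be reproduced by hand from the two residual sums of squares that regress stores in e(rss) and e(df_r). A minimal sketch, assuming the education dummies and the nine interaction terms created earlier are still in memory; the scalar names (rss_l, df_l, rss_s, df_s, q, F) are illustrative.

```stata
* Sketch only: incremental F test computed from stored results.
* large model: all 15 predictors
regress income HS somecl col black hisp female ///
    HSb HSh HSf someclf someclb someclh colf colh colb
scalar rss_l = e(rss)             // residual sum of squares
scalar df_l  = e(df_r)            // residual degrees of freedom
* restricted model: the 7 tested interaction terms omitted
regress income HS somecl col black hisp female colf colh
scalar rss_s = e(rss)
scalar df_s  = e(df_r)
scalar q = df_s - df_l            // number of restrictions (7 here)
scalar F = ((rss_s - rss_l)/q) / (rss_l/df_l)
display "F(" q ", " df_l ") = " F
display "Prob > F = " Ftail(q, df_l, F)
```

With both regressions fit on the same observations, this should give the same F and p-value as the test command above, since test on a set of exclusion restrictions is algebraically the same comparison.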

[Figures: top row, histograms (fraction) of income and of loginc; bottom row, residuals against fitted values for the model predicting income and for the model predicting log income]

The graphs above were created using income itself and then the log of income. The top graphs show that income is not at all normally distributed, while loginc = log(income) is reasonably close to normal. Why is this important? The same predictors were used in two regressions, but the outcomes are quite different. Before we go too far in interpreting the results, however, we should use a non-linear education variable, either by using the dummy variables generated earlier or by introducing an ed^2 term. We'll do this as part of the next homework.

. use discrim
. * the smallest value of income is [value lost in this copy]
. * add 9392 to every value so that all values are > 0

MODEL 1: INCOME is the dependent variable

. regress inc ed black hisp female edb edh edf

[Regression output garbled in this copy. Recoverable: F(7, 1598); Adj R-squared = .2793. Coefficient table lost.]

. predict inchat
. predict incres, resid
. graph incres inchat, yline(0) xlabel ylabel b1(model predicting income)

MODEL 2: log income is the dependent variable

. gen loginc = inc
. replace loginc = log(loginc)
. regress loginc ed black hisp female edb edh edf

[Regression output garbled in this copy. Recoverable: F(7, 1598) = 73.02; Adj R-squared = .2390; Root MSE = .5628. Coefficient table (ed, black, hisp, female, edb, edh, edf, _cons) lost.]

. predict loghat
. predict logres, resid
. graph logres loghat, yline(0) xlabel ylabel b1(model predicting log income)
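The four diagnostic graphs can be redrawn in current Stata with histogram and rvfplot. This is a sketch under stated assumptions: the handout does not show how inc and edf were created, so the lines gen inc = income + 9392 and gen edf = ed*female are guesses based on the comment and on the pattern of the other interaction terms, and the graph names are hypothetical.

```stata
* Sketch only: histograms and residual-versus-fitted plots, modern syntax.
use discrim, clear
* assumption: inc is income shifted by 9392 so that every value is positive
gen inc    = income + 9392
gen loginc = ln(inc)
gen edb = ed*black
gen edh = ed*hisp
gen edf = ed*female               // assumption: built like edb and edh
histogram inc,    fraction name(h_inc, replace)
histogram loginc, fraction name(h_loginc, replace)
regress inc ed black hisp female edb edh edf
rvfplot, yline(0) name(r_inc, replace)      // residuals against fitted values
regress loginc ed black hisp female edb edh edf
rvfplot, yline(0) name(r_loginc, replace)
```

A fan shape in the first residual plot and a more even band in the second would be the pattern the handout's discussion of income versus log income describes.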


Lecture 13: Model selection and regularization Lecture 13: Model selection and regularization Reading: Sections 6.1-6.2.1 STATS 202: Data mining and analysis October 23, 2017 1 / 17 What do we know so far In linear regression, adding predictors always

More information

texdoc 2.0 An update on creating LaTeX documents from within Stata Example 2

texdoc 2.0 An update on creating LaTeX documents from within Stata Example 2 texdoc 20 An update on creating LaTeX documents from within Stata Contents Example 2 Ben Jann University of Bern, benjann@sozunibech 2016 German Stata Users Group Meeting GESIS, Cologne, June 10, 2016

More information

IQR = number. summary: largest. = 2. Upper half: Q3 =

IQR = number. summary: largest. = 2. Upper half: Q3 = Step by step box plot Height in centimeters of players on the 003 Women s Worldd Cup soccer team. 157 1611 163 163 164 165 165 165 168 168 168 170 170 170 171 173 173 175 180 180 Determine the 5 number

More information



More information

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value.

( ) = Y ˆ. Calibration Definition A model is calibrated if its predictions are right on average: ave(response Predicted value) = Predicted value. Calibration OVERVIEW... 2 INTRODUCTION... 2 CALIBRATION... 3 ANOTHER REASON FOR CALIBRATION... 4 CHECKING THE CALIBRATION OF A REGRESSION... 5 CALIBRATION IN SIMPLE REGRESSION (DISPLAY.JMP)... 5 TESTING

More information

Section 2.3: Simple Linear Regression: Predictions and Inference

Section 2.3: Simple Linear Regression: Predictions and Inference Section 2.3: Simple Linear Regression: Predictions and Inference Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.4 1 Simple

More information

Gelman-Hill Chapter 3

Gelman-Hill Chapter 3 Gelman-Hill Chapter 3 Linear Regression Basics In linear regression with a single independent variable, as we have seen, the fundamental equation is where ŷ bx 1 b0 b b b y 1 yx, 0 y 1 x x Bivariate Normal

More information

Creating LaTeX and HTML documents from within Stata using texdoc and webdoc. Example 2

Creating LaTeX and HTML documents from within Stata using texdoc and webdoc. Example 2 Creating LaTeX and HTML documents from within Stata using texdoc and webdoc Contents Example 2 Ben Jann University of Bern, benjann@sozunibech Nordic and Baltic Stata Users Group meeting Oslo, September

More information

Frequency Tables. Chapter 500. Introduction. Frequency Tables. Types of Categorical Variables. Data Structure. Missing Values

Frequency Tables. Chapter 500. Introduction. Frequency Tables. Types of Categorical Variables. Data Structure. Missing Values Chapter 500 Introduction This procedure produces tables of frequency counts and percentages for categorical and continuous variables. This procedure serves as a summary reporting tool and is often used

More information

1 Downloading files and accessing SAS. 2 Sorting, scatterplots, correlation and regression

1 Downloading files and accessing SAS. 2 Sorting, scatterplots, correlation and regression Statistical Methods and Computing, 22S:30/105 Instructor: Cowles Lab 2 Feb. 6, 2015 1 Downloading files and accessing SAS. We will be using the billion.dat dataset again today, as well as the OECD dataset

More information

ST512. Fall Quarter, Exam 1. Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false.

ST512. Fall Quarter, Exam 1. Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false. ST512 Fall Quarter, 2005 Exam 1 Name: Directions: Answer questions as directed. Please show work. For true/false questions, circle either true or false. 1. (42 points) A random sample of n = 30 NBA basketball

More information

Example 1 of panel data : Data for 6 airlines (groups) over 15 years (time periods) Example 1

Example 1 of panel data : Data for 6 airlines (groups) over 15 years (time periods) Example 1 Panel data set Consists of n entities or subjects (e.g., firms and states), each of which includes T observations measured at 1 through t time period. total number of observations : nt Panel data have

More information

Model Selection and Inference

Model Selection and Inference Model Selection and Inference Merlise Clyde January 29, 2017 Last Class Model for brain weight as a function of body weight In the model with both response and predictor log transformed, are dinosaurs

More information

PSY 9556B (Feb 5) Latent Growth Modeling

PSY 9556B (Feb 5) Latent Growth Modeling PSY 9556B (Feb 5) Latent Growth Modeling Fixed and random word confusion Simplest LGM knowing how to calculate dfs How many time points needed? Power, sample size Nonlinear growth quadratic Nonlinear growth

More information

piecewise ginireg 1 Piecewise Gini Regressions in Stata Jan Ditzen 1 Shlomo Yitzhaki 2 September 8, 2017

piecewise ginireg 1 Piecewise Gini Regressions in Stata Jan Ditzen 1 Shlomo Yitzhaki 2 September 8, 2017 piecewise ginireg 1 Piecewise Gini Regressions in Stata Jan Ditzen 1 Shlomo Yitzhaki 2 1 Heriot-Watt University, Edinburgh, UK Center for Energy Economics Research and Policy (CEERP) 2 The Hebrew University

More information



More information


SPSS INSTRUCTION CHAPTER 9 SPSS INSTRUCTION CHAPTER 9 Chapter 9 does no more than introduce the repeated-measures ANOVA, the MANOVA, and the ANCOVA, and discriminant analysis. But, you can likely envision how complicated it can

More information

Section 2.1: Intro to Simple Linear Regression & Least Squares

Section 2.1: Intro to Simple Linear Regression & Least Squares Section 2.1: Intro to Simple Linear Regression & Least Squares Jared S. Murray The University of Texas at Austin McCombs School of Business Suggested reading: OpenIntro Statistics, Chapter 7.1, 7.2 1 Regression:

More information

Labor Economics with STATA. Estimating the Human Capital Model Using Artificial Data

Labor Economics with STATA. Estimating the Human Capital Model Using Artificial Data Labor Economics with STATA Liyousew G. Borga December 2, 2015 Estimating the Human Capital Model Using Artificial Data Liyou Borga Labor Economics with STATA December 2, 2015 84 / 105 Outline 1 The Human

More information

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100.

Linear Model Selection and Regularization. especially usefull in high dimensions p>>100. Linear Model Selection and Regularization especially usefull in high dimensions p>>100. 1 Why Linear Model Regularization? Linear models are simple, BUT consider p>>n, we have more features than data records

More information

Statistics Lab #7 ANOVA Part 2 & ANCOVA

Statistics Lab #7 ANOVA Part 2 & ANCOVA Statistics Lab #7 ANOVA Part 2 & ANCOVA PSYCH 710 7 Initialize R Initialize R by entering the following commands at the prompt. You must type the commands exactly as shown. options(contrasts=c("contr.sum","contr.poly")

More information

3-1 Writing Linear Equations

3-1 Writing Linear Equations 3-1 Writing Linear Equations Suppose you have a job working on a monthly salary of $2,000 plus commission at a car lot. Your commission is 5%. What would be your pay for selling the following in monthly

More information

Orange Juice data. Emanuele Taufer. 4/12/2018 Orange Juice data (1)

Orange Juice data. Emanuele Taufer. 4/12/2018 Orange Juice data (1) Orange Juice data Emanuele Taufer file:///c:/users/emanuele.taufer/google%20drive/2%20corsi/5%20qmma%20-%20mim/0%20labs/l10-oj-data.html#(1) 1/31 Orange Juice Data The data contain weekly sales of refrigerated

More information

Bivariate Linear Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 04, 2017

Bivariate Linear Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 04, 2017 Bivariate Linear Regression James M. Murray, Ph.D. University of Wisconsin - La Crosse Updated: October 4, 217 PDF file location: http://www.murraylax.org/rtutorials/regression_intro.pdf HTML file location:

More information

Lecture 3: The basic of programming- do file and macro

Lecture 3: The basic of programming- do file and macro Introduction to Stata- A. Chevalier Lecture 3: The basic of programming- do file and macro Content of Lecture 3: -Entering and executing programs do file program ado file -macros 1 A] Entering and executing

More information

Multiple-imputation analysis using Stata s mi command

Multiple-imputation analysis using Stata s mi command Multiple-imputation analysis using Stata s mi command Yulia Marchenko Senior Statistician StataCorp LP 2009 UK Stata Users Group Meeting Yulia Marchenko (StataCorp) Multiple-imputation analysis using mi

More information

Non-Linear Regression. Business Analytics Practice Winter Term 2015/16 Stefan Feuerriegel

Non-Linear Regression. Business Analytics Practice Winter Term 2015/16 Stefan Feuerriegel Non-Linear Regression Business Analytics Practice Winter Term 2015/16 Stefan Feuerriegel Today s Lecture Objectives 1 Understanding the need for non-parametric regressions 2 Familiarizing with two common

More information

An Introduction to Growth Curve Analysis using Structural Equation Modeling

An Introduction to Growth Curve Analysis using Structural Equation Modeling An Introduction to Growth Curve Analysis using Structural Equation Modeling James Jaccard New York University 1 Overview Will introduce the basics of growth curve analysis (GCA) and the fundamental questions

More information

The x-intercept can be found by setting y = 0 and solving for x: 16 3, 0

The x-intercept can be found by setting y = 0 and solving for x: 16 3, 0 y=-3/4x+4 and y=2 x I need to graph the functions so I can clearly describe the graphs Specifically mention any key points on the graphs, including intercepts, vertex, or start/end points. What is the

More information

Reproducible Research: Weaving with Stata

Reproducible Research: Weaving with Stata StataCorp LP Italian Stata Users Group Meeting October, 2008 Outline I Introduction 1 Introduction Goals Reproducible Research and Weaving 2 3 What We ve Seen Goals Reproducible Research and Weaving Goals

More information

Introduction to Programming in Stata

Introduction to Programming in Stata Introduction to in Stata Laron K. University of Missouri Goals Goals Replicability! Goals Replicability! Simplicity/efficiency Goals Replicability! Simplicity/efficiency Take a peek under the hood! Data

More information

1 Introducing Stata sample session

1 Introducing Stata sample session 1 Introducing Stata sample session Introducing Stata This chapter will run through a sample work session, introducing you to a few of the basic tasks that can be done in Stata, such as opening a dataset,

More information

Compare Linear Regression Lines for the HP-67

Compare Linear Regression Lines for the HP-67 Compare Linear Regression Lines for the HP-67 by Namir Shammas This article presents an HP-67 program that calculates the linear regression statistics for two data sets and then compares their slopes and

More information

Math 182. Assignment #4: Least Squares

Math 182. Assignment #4: Least Squares Introduction Math 182 Assignment #4: Least Squares In any investigation that involves data collection and analysis, it is often the goal to create a mathematical function that fits the data. That is, a

More information

Section 1.5: Point-Slope Form

Section 1.5: Point-Slope Form Section 1.: Point-Slope Form Objective: Give the equation of a line with a known slope and point. The slope-intercept form has the advantage of being simple to remember and use, however, it has one major

More information

Generalized least squares (GLS) estimates of the level-2 coefficients,

Generalized least squares (GLS) estimates of the level-2 coefficients, Contents 1 Conceptual and Statistical Background for Two-Level Models...7 1.1 The general two-level model... 7 1.1.1 Level-1 model... 8 1.1.2 Level-2 model... 8 1.2 Parameter estimation... 9 1.3 Empirical

More information

Factorial ANOVA with SAS

Factorial ANOVA with SAS Factorial ANOVA with SAS /* potato305.sas */ options linesize=79 noovp formdlim='_' ; title 'Rotten potatoes'; title2 ''; proc format; value tfmt 1 = 'Cool' 2 = 'Warm'; data spud; infile 'potato2.data'

More information

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools

Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools Regression on SAT Scores of 374 High Schools and K-means on Clustering Schools Abstract In this project, we study 374 public high schools in New York City. The project seeks to use regression techniques

More information

Regression. Page 1. Notes. Output Created Comments Data. 26-Mar :31:18. Input. C:\Documents and Settings\BuroK\Desktop\Data Sets\Prestige.

Regression. Page 1. Notes. Output Created Comments Data. 26-Mar :31:18. Input. C:\Documents and Settings\BuroK\Desktop\Data Sets\Prestige. GET FILE='C:\Documents and Settings\BuroK\Desktop\DataSets\Prestige.sav'. GET FILE='E:\MacEwan\Teaching\Stat252\Data\SPSS_data\MENTALID.sav'. DATASET ACTIVATE DataSet1. DATASET CLOSE DataSet2. GET FILE='E:\MacEwan\Teaching\Stat252\Data\SPSS_data\survey_part.sav'.

More information

Stat 401 B Lecture 26

Stat 401 B Lecture 26 Stat B Lecture 6 Forward Selection The Forward selection rocedure looks to add variables to the model. Once added, those variables stay in the model even if they become insignificant at a later ste. Backward

More information

22s:152 Applied Linear Regression DeCook Fall 2011 Lab 3 Monday October 3

22s:152 Applied Linear Regression DeCook Fall 2011 Lab 3 Monday October 3 s:5 Applied Linear Regression DeCook all 0 Lab onday October The data Set In 004, a study was done to examine if gender, after controlling for other variables, was a significant predictor of salary for

More information