Birkbeck College
Department of Economics, Mathematics and Statistics

Graduate Certificates and Diplomas
Economics, Finance, Financial Engineering
2012

Applied Statistics and Econometrics

INTRODUCTION TO STATA

Elisa Cavatorta
ecavatorta@ems.bbk.ac.uk
CONTENTS

1. THE BASICS OF WORKING WITH STATA
   1.1. A note to start
   1.2. The Stata Windows
   1.3. Knowing where you are
   1.4. Creating a do-file
   1.5. Creating a log file
   1.6. Importing the data
   1.7. Labelling and renaming
   1.8. Preliminary steps and general terminology
   1.9. Connectors
2. TODAY'S RESEARCH PROJECT
   2.1. Looking at the data
   2.2. Descriptive statistics
   2.3. Generating new variables
   2.4. Linear regression
   2.5. Post-estimation: predicted values and diagnostics
        2.5.1 Misspecification
        2.5.2 Heteroskedasticity
   2.6. Comparing competing models: measures of fit
   2.7. Hypothesis testing
   2.8. Marginal effects
   2.9. Presenting regression results
A.1 A note on Stata for time series
A.2 Sources and References
A.3 List of regression commands
1. THE BASICS OF WORKING WITH STATA

1.1 A note to start

These notes aim to introduce you to the basics of working with Stata. Stata is a powerful software package for data analysis that implements a huge range of techniques. These notes are based on Stata 12, which is available in the Birkbeck College labs. A word of warning: using Stata is a learning process, so do not be discouraged by error messages!

1.2 The Stata Windows

The window labelled Command is where you type your commands. Stata then shows the results in the larger black window above. Your command is added to a list in the window labelled Review on the left, so you can keep track of the commands you have used. The window labelled Variables, on the top right, lists the variables in your dataset. The Properties window immediately below that, new in version 12, displays properties of your variables and dataset.

1.3 Knowing where you are

The command cd shows the directory in which Stata is working and saving files (pwd does the same). You can change it by typing cd followed by a new location:

    cd "C:\Users\ELISA\MyASEProjects\"

1.4 Creating a do-file

Always create a do-file to keep track of what you did. A do-file is just a set of Stata commands typed in a plain text file. You can use Stata's own built-in Do-file Editor, which has the great advantage that you can run your program directly from the editor by clicking on the run icon.

1.5 Creating a log file

A do-file records your commands but not their output. To keep a permanent record of your results, you should log your session. When you open a log, Stata writes all results to both the Results window and the file you specify. To open a log file use the command

    log using filename, text replace

where filename is the name of your log file. Note the use of two recommended options, text and replace. The text option creates the log in plain text (ASCII) format, which can be viewed in any editor. The replace option overwrites the old version of the file. Close the log with log close when you have finished.

If you use the menu entry Log => Begin, by default the log is written in SMCL, the Stata Markup and Control Language (pronounced "smicle"). You then need to use the translate command to convert it to plain text.
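The pieces of sections 1.3 to 1.5 can be combined in a single do-file. The following is only a minimal sketch: the log name session1 is a placeholder chosen for illustration, and it assumes that houseprice.dta (the dataset used later in these notes) sits in the current working directory.

```stata
* session1.do: a minimal do-file skeleton (session1 is a placeholder name)
capture log close                  // close any log left open by an earlier run
log using session1, text replace   // plain-text log, overwritten on every run
pwd                                // show the current working directory
use houseprice.dta, clear          // assumes the file is in that directory
describe                           // everything from here on is written to the log
log close

* If a log was saved in SMCL format instead, convert it afterwards with:
* translate session1.smcl session1.log
```

Starting with capture log close is a common convention: it prevents the do-file from stopping with an error when a log is still open from a previous, interrupted run.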
1.6 Importing the data

If the data are in Stata format (.dta) you can open them directly. Go to File => Open and browse to the data location. This is equivalent to typing:

    use houseprice.dta, clear

If the data are in another format you need to import them differently. Go to File => Import and choose the data type you have: Excel spreadsheet (.xls) or text data created by a spreadsheet (.csv). This is equivalent to typing, respectively:

    import excel "HousePrice.xls", sheet("houseprice") firstrow
    insheet using "HousePrice.csv", comma

You can inspect your data with the Data Editor button.

1.7 Labelling and renaming

A variable label documents what a variable measures; rename changes the name of a variable.

    label var price "median price of single-family home"
    rename room rooms

1.8 Preliminary steps and general terminology

Stata needs to know which type of data you are using.

    Simple cross-sectional data do not need to be declared.
    Time-series data: tsset year, yearly
    Survey data with strata and primary sampling units (PSUs): declare the design with svyset, then prefix estimation commands with svy:
    Panel data: tsset panelvar timevar

A few additional useful things. If you need more memory you can ask for it. Here is a typical set-up:

    set mem 10m     (to set the memory size; needed only in Stata 11 and earlier, since Stata 12 manages memory automatically)
    set more off    (to let the output scroll on the screen until the end of the command)

Options: everything that follows the comma (,) in a command is an option.

Help: typing help followed by a command name gives you an explanation of that command. Let's try with

    help use

1.9 Connectors

    &     and
    |     or
    >     strictly greater than
    <     strictly smaller than
    ==    equal to
    >=    greater than or equal to
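As an illustration of how the connectors are used, they typically appear in an if qualifier at the end of a command. The sketch below assumes houseprice.dta is in the working directory and contains price, rooms and stratio as described in these notes; the cut-off values are arbitrary and chosen only for illustration.

```stata
use houseprice.dta, clear

* & (and): both conditions must hold
summarize price if rooms > 6 & stratio < 18

* | (or): at least one of the conditions must hold
count if rooms >= 7 | stratio < 15

* == tests equality; a single = is used only for assignment
count if stratio == 20

* Missing values (.) count as larger than any number,
* so exclude them explicitly when using > or >=
count if rooms > 6 & rooms < .
```

The last line matters in practice: without the second condition, observations with a missing value for rooms would be counted as having more than six rooms.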
2. TODAY'S RESEARCH PROJECT: single-family housing prices

We want to analyse the influence of several external factors on house prices. We illustrate this with data on housing prices in 506 Boston communities. The response variable is the logarithm of the median price of a single-family home in each community. The external factors under consideration include a measure of air pollution (lnox, the log of nitrous oxide in parts per 100m), the distance from the community to employment centres (ldist, the log of the weighted distance to five employment centres), and the average student-teacher ratio in local schools (stratio).

2.1 Looking at the data

Be aware of what is in your dataset and which types of variables it contains. You can describe the data with

    describe

Always plot your data: graphs contain a lot of information. Explore the range of available graphs under Graphics in the menu bar.

To create a histogram overlaid with a normal density:

    histogram price, bin(30) normal

To create a two-way scatter plot of house prices against the number of rooms:

    twoway (scatter price rooms, sort)

scatter is the plot type (with time series you would normally use line). sort is the option that sorts the observations on the x variable. What can you say about the relationship? Which correlation do you expect?

    twoway (scatter price dist, sort)

What can you say about the relationship?

2.2 Descriptive statistics

Use the summarize command followed by the names of the variables (omit the names to summarize every variable). For more detailed statistics, use summarize [varlist], detail.

    summarize
    summarize price, det
    summarize price if rooms > 6.28

A note: in conditions Stata wants > (strictly greater), < (strictly smaller) or == (equal).

    histogram price, bin(22) normal

Is the variable normally distributed? You can test this formally with the skewness/kurtosis test for normality:

    sktest price

How do the variables correlate, and at which level of significance? Are there collinear variables?
    pwcorr price rooms nox dist stratio, sig

2.3 Generating new variables: generate, egen, replace

To compute a new variable, use the generate command with a new variable name and an arithmetic expression. Choose variable names that are short and remind you what the variable is about. Remember that Stata is case sensitive. Let's generate the log of the housing price:

    generate lprice = log(price)

Taking logs may help with heteroskedasticity and non-normality. Check that lprice is closer to a normal distribution than price, e.g. with histogram lprice, bin(22) normal.

A useful way to create a new variable that indicates whether a condition is satisfied is

    generate newvariable = cond(x == a, 1, 0)

which says that if the condition x == a is satisfied the new variable takes the value 1, otherwise it takes the value 0.

A useful extension to generate is egen. Type help egen for a full list of possibilities.

2.4 Linear regression

Stata can do a lot of fancy regressions, and the syntax for most of them is very similar. We will focus on regress, the most basic form of linear regression. regress fits a model of depvar on varlist using linear regression. By default it includes a constant term. Typing help regress brings up the full instructions for using regress.

    regress lprice rooms lnox ldist stratio

* The top-left corner of the output gives the ANOVA decomposition of the sum of squares of the dependent variable (Total) into the explained (Model) and unexplained (Residual) parts.
* The top-right corner reports the statistical significance results for the model as a whole.
* The bottom section gives the results for the individual explanatory variables.

Useful options

The regress command can be used with the robust option, which estimates the standard errors using the Huber-White sandwich estimator (to make the standard errors robust to heteroskedasticity).

2.5 Post-estimation: predicted values and diagnostics

A number of predicted values can be obtained after all the estimation commands listed in Appendix A.3.
The most important are the predicted values of the dependent variable and the residuals.

    regress lprice rooms lnox ldist stratio
    predict lpricehat, xb
    label var lpricehat "Predicted log price"
    predict uhat, residuals
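To see what predict actually stores, one can rebuild the residuals by hand as the actual value minus the fitted value. This is a minimal sketch, assuming houseprice.dta already contains lprice, rooms, lnox, ldist and stratio; the names fit, res and res_check are hypothetical and chosen so that they do not clash with lpricehat and uhat above.

```stata
use houseprice.dta, clear
regress lprice rooms lnox ldist stratio

predict double fit, xb           // fitted values, stored in double precision
predict double res, residuals    // residuals as computed by Stata

* residual = actual value - fitted value
generate double res_check = lprice - fit

* the two versions agree for every observation used in the regression
assert abs(res - res_check) < 1e-8 if e(sample)

* with a constant in the model the residuals average to zero (up to rounding)
summarize res
```

Storing the predictions as double avoids small rounding differences that would otherwise appear when the comparison is made with the default float storage type.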
Before looking at the coefficients you need to make sure your regression is sufficiently healthy. There are a number of diagnostic tests available in Stata. Type help regress postestimation for a list of available tests.

A plot of predicted against actual values, with a 45-degree line, and a plot of residuals against fitted values are good starting points:

    twoway (scatter lpricehat lprice) (line lprice lprice if lprice < ., clwidth(thin)), ///
        ytitle("Predicted log median housing price") ///
        xtitle("Actual log median housing price") legend(off)

    rvfplot, yline(0)

2.5.1 Misspecification

Misspecification may arise, for example, because the true model specifies a nonlinear relationship and we omit a squared term. One way of testing for this is the RESET test. The RESET test runs an augmented regression that includes the original regressors plus powers of the predicted values (the default) or, with the rhs option, powers of the original regressors. The null hypothesis is no misspecification: under the null, the coefficients of the additional regressors are zero.

    estat ovtest

A plot of the residuals against a single regressor is also informative:

    rvpplot ldist, ms(Oh) yline(0)

The residuals are more variable at low levels of log distance, which casts doubt on the hypothesis of homoskedasticity.

2.5.2 Heteroskedasticity

The Breusch-Pagan test of the null hypothesis of homoskedasticity is implemented by

    estat hettest

2.6 Comparing competing models: measures of fit

You should be able to comment on the R-squared, the adjusted R-squared and the standard error of the regression (SER). You can also check the information criteria:

    estat ic

estat ic displays the log likelihood of the null model (only a constant term), the log likelihood of the fitted model, and the AIC and BIC statistics. Lower values of AIC and BIC indicate a better fit.

For example, try to adjust the previous model by adding the square of log distance. Any improvements? Compare the measures of fit.
    gen ldist2 = ldist^2
    label var ldist2 "Log distance squared"
    regress lprice rooms lnox ldist ldist2 stratio

    gen rooms2 = rooms^2
    regress lprice rooms rooms2 lnox ldist ldist2 stratio lproptax

2.7 Hypothesis testing

For each independent variable, the regression output automatically includes a two-sided t-test of the null hypothesis that the true coefficient is equal to zero. The same hypothesis can be tested explicitly after estimation. Two equivalent formulations:

    test _b[rooms] = 0
    test rooms

Let's suppose theory suggests that the coefficient on the variable rooms should be 0.33. This is testable by

    test rooms = 0.33

You can test arbitrary linear restrictions, such as that the sum of three coefficients equals zero:

    lincom rooms + ldist + stratio

(To test jointly that each of the three coefficients equals zero, use test rooms ldist stratio instead.)

You can test the equality of two coefficients by

    test ldist = stratio

2.8 Marginal effects

The command mfx computes marginal effects or elasticities after estimation, evaluated by default at the sample means. (From Stata 11 onwards the margins command supersedes mfx, which continues to work.) The option eyex computes the elasticity of y with respect to x, which is comparable to the coefficient in a log-log specification.

    regress price rooms nox dist stratio
    mfx, eyex

You will find price to be elastic with respect to rooms: a given proportional change in rooms is associated with almost twice as large a proportional change in price. nox and dist are inelastic, with estimated elasticities smaller than one in absolute value.

2.9 Presenting regression results

It is generally good practice to present competing models to support your analysis. In the text of your project you need to justify which model you consider the best-fitting model. You need to estimate all models first, save the estimation results (estimates store) and then create a table. Here is an example, written for a do-file (estout is a user-written command; if it is not installed, type ssc install estout):

    quietly regress lprice rooms
    est store m1
    quietly regress lprice rooms lnox ldist stratio
    est store m2
    quietly regress lprice rooms lnox ldist ldist2 stratio lproptax
    est store m3
    quietly regress lprice rooms rooms2 lnox ldist ldist2 stratio lproptax
    est store m4
    estout m1 m2 m3 m4, stats(r2_a rmse aic) cells(b(star fmt(%8.3f)) ///
        se(par fmt(%6.3f))) starlevels(* 0.1 ** 0.05 *** 0.01)
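If estout cannot be installed on the machine you are using, a similar comparison table can be produced with the built-in estimates table command. The following is a sketch rather than a replacement for the example above: it assumes houseprice.dta contains the variables used in these notes, and it stores only the first two models to stay short.

```stata
use houseprice.dta, clear

quietly regress lprice rooms
estimates store m1
quietly regress lprice rooms lnox ldist stratio
estimates store m2

* coefficients with standard errors underneath, plus fit statistics
estimates table m1 m2, b(%8.3f) se(%6.3f) stats(N r2_a rmse aic bic)

* significance stars instead of standard errors
estimates table m1 m2, b(%8.3f) star stats(N r2_a rmse aic bic)
```

estimates table does not allow standard errors and stars in the same table, which is one reason user-written commands such as estout are popular for final presentation.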
A.1 A note on Stata for time series

Stata has many built-in commands for analysing time-series data. First, you need to tell Stata that you are using time-series data. You do this by typing

    tsset timevariable     (e.g. tsset year)

You can find tests for univariate time series, such as the ADF test, in
    Statistics => Time series => Tests

Diagnostic tests after regression commands, such as the Durbin-Watson test, the Breusch-Godfrey LM test and heteroskedasticity tests, can be found in
    Statistics => Time series => Tests => Time-series specification tests after regress

Line plots, correlograms and autocorrelation graphs can be found in
    Statistics => Time series => Graphs

More complex analyses for multivariate time series, such as VAR, VECM and cointegration tests, can be found in
    Statistics => Multivariate time series
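The menu entries above correspond to ordinary commands that can be placed in a do-file. Since these notes do not come with a time-series dataset, the sketch below simulates one; the variable names, the seed and the number of observations are arbitrary assumptions made only so that the example runs on its own.

```stata
clear
set seed 12345
set obs 60
generate year = 1950 + _n           // annual time index, 1951 to 2010
tsset year, yearly                  // declare the data to be time series

generate x = rnormal()
generate y = 1 + 0.5*x + rnormal()  // simulated dependent variable

tsline y                            // line plot of the series
ac y, lags(10)                      // autocorrelation graph (correlogram)
dfuller y, lags(1)                  // augmented Dickey-Fuller unit-root test

regress y x
estat dwatson                       // Durbin-Watson statistic
estat bgodfrey, lags(1)             // Breusch-Godfrey LM test for serial correlation
estat hettest                       // Breusch-Pagan test for heteroskedasticity
```

Once the data have been declared with tsset, time-series operators such as L. (lag) and D. (difference) can also be used directly in variable lists, for example regress y x L.y.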
A.2 Sources and References

The Stata website is at http://www.stata.com. Among other things you will find that they make available online all datasets used in the official documentation, that they publish a journal called the Stata Journal, and that they have an excellent bookstore with texts on Stata and related statistical subjects.

Stata also offers email and web-based training courses called NetCourses, see http://www.stata.com/info/products/netcourse/.

There is an independent listserv where you can post questions and receive prompt and knowledgeable answers from other users. To join the list see http://www.stata.com/support/statalist/ and follow the link to subscribe.

Stata also maintains a list of frequently asked questions (FAQ) classified by topic, see http://www.stata.com/support/faqs/.

UCLA maintains an excellent Stata portal at http://www.ats.ucla.edu/stat/stata/

There are also textbooks, such as An Introduction to Modern Econometrics Using Stata by C. Baum.

A.3 List of regression commands

    anova       analysis of variance and covariance
    cnreg       censored-normal regression
    gmm         generalized method of moments estimator
    heckman     Heckman selection model
    intreg      interval regression
    ivregress   instrumental variables (2SLS) regression
    newey       regression with Newey-West standard errors
    prais       Prais-Winsten, Cochrane-Orcutt, or Hildreth-Lu regression
    qreg        quantile (including median) regression
    reg         ordinary least squares regression
    reg3        three-stage least squares regression
    rreg        robust regression (NOT robust standard errors)
    sureg       seemingly unrelated regression
    tobit       tobit regression
    treatreg    treatment effects model
    truncreg    truncated regression
    xtabond     Arellano-Bond linear, dynamic panel-data estimator
    xtintreg    panel data interval regression models
    xtreg       fixed- and random-effects linear models
    xtregar     fixed- and random-effects linear models with an AR(1) disturbance
    xttobit     panel data tobit models