Birkbeck College Department of Economics, Mathematics and Statistics.

Graduate Certificates and Diplomas
Economics, Finance, Financial Engineering
2012
Applied Statistics and Econometrics

INTRODUCTION TO STATA

Elisa Cavatorta
ecavatorta@ems.bbk.ac.uk

CONTENTS

1. THE BASICS OF WORKING WITH STATA
    1.1. A note to start
    1.2. The Stata Windows
    1.3. Knowing where you are
    1.4. Creating a do-file
    1.5. Creating a log-file
    1.6. Importing the data
    1.7. Labelling and rename
    1.8. Preliminary steps and general terminology
    1.9. Connectors
2. TODAY'S RESEARCH PROJECT
    2.1. Looking at the data
    2.2. Descriptive statistics
    2.3. Generating new variables
    2.4. Linear regression
    2.5. Post-estimation: predicted values and diagnostics
        2.5.1 Misspecification
        2.5.2 Heteroskedasticity
    2.6. Comparing competing models: measures of fit
    2.7. Hypothesis testing
    2.8. Marginal effects
    2.9. Presenting regression results
A.1 A note on Stata for time series
A.2 Sources and References
A.3 List of regression commands

1 THE BASICS OF WORKING WITH STATA

1.1 A note to start

These notes introduce the basics of working with Stata. Stata is a powerful software package for data analysis that implements a huge range of techniques. These notes are based on Stata 12, available in the Birkbeck College labs. A word of warning: using Stata is a learning process, so do not be discouraged by error messages!

1.2 The Stata Windows

The window labeled Command is where you type your commands. Stata then shows the results in the larger black window above. Each command is added to a list in the window labeled Review on the left, so you can keep track of the commands you have used. The window labeled Variables, on the top right, lists the variables in your dataset. The Properties window immediately below it, new in version 12, displays properties of your variables and dataset.

1.3 Knowing where you are

The command cd shows the directory in which Stata is working and saving files (pwd does the same). You can change it by typing cd followed by a new location:

    cd "C:\Users\ELISA\MyASEProjects\"

1.4 Creating a do-file

Always create a do-file to track what you did. A do-file is just a set of Stata commands typed in a plain text file. You can use Stata's own built-in Do-file Editor, which has the great advantage that you can run your program directly from the editor by clicking on the run icon.

1.5 A log file

A do-file records your commands but not your output. To keep a permanent record of your results, you should log your session. When you open a log, Stata writes all results both to the Results window and to the file you specify. To open a log file use the command

    log using filename, text replace

where filename is the name of your log file. Note the two recommended options, text and replace. The text option creates the log in plain text (ASCII) format, which can be viewed in any editor. The replace option overwrites an existing log of the same name. Close the log with log close when you are done.

If you open the log from the menu (Log => Begin), by default it is written in SMCL, the Stata Markup and Control Language (pronounced "smicle"). In that case you need the translate command to convert it to plain text.
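The logging workflow of section 1.5 can be sketched as a short do-file fragment. This is only an illustration: the log names mysession and mysession2 are made up, and auto is an example dataset shipped with Stata, not the housing data used in these notes.

```stata
* Close any log that is still open (capture suppresses the error if none is)
capture log close

* Plain-text log: writes mysession.log in the current working directory
log using mysession, text replace
sysuse auto, clear
summarize price
log close

* SMCL log (the menu default), converted to plain text afterwards
log using mysession2, smcl replace
summarize mpg
log close
translate mysession2.smcl mysession2.log, replace
```

Both mysession.log and mysession2.log can then be opened in any text editor.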

1.6 Importing the data

If the data are in Stata format (.dta) you can open them directly. Go to File => Open and browse to the data location. This is equivalent to typing:

    use houseprice.dta, clear

If the data are in another format you need to import them differently. Go to File => Import and choose the data type you have: Excel spreadsheet [.xls] or text data created by a spreadsheet [.csv]. This is equivalent to typing, for a .csv file and an Excel file respectively:

    insheet using "HousePrice.csv", comma
    import excel "HousePrice.xls", sheet("houseprice") firstrow

You can view your data with the Data Editor button.

1.7 Labelling and rename

    label var price "median price of single-family home"
    rename room rooms

1.8 Preliminary steps and general terminology

Stata needs to know what type of data you are using.

* Simple cross-sectional data do not need to be declared.
* Time-series data: tsset year, yearly
* Survey data with a complex design (strata, PSUs): declare the design with svyset, then prefix estimation commands with svy:
* Panel data: tsset panelvar timevar

A few additional useful things. Here is a typical setup:

    set mem 10m     (sets the memory size; needed in Stata 11 and earlier, whereas Stata 12 manages memory automatically)
    set more off    (lets the output scroll to the end of the command without pausing)

Options: everything that follows the comma (,) in a command is an option.

Help: typing help followed by a command name gives you an explanation of that command. Let's try with

    help use

1.9 Connectors

    &     and
    |     or
    >     strictly greater than
    <     strictly smaller than
    ==    equal to
    >=    greater than or equal to
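As a small illustration of the connectors in section 1.9, the lines below combine conditions in if qualifiers. They use auto, an example dataset shipped with Stata, so the variable names (price, foreign, mpg, weight, rep78) belong to that dataset and not to the housing data.

```stata
sysuse auto, clear

* & (and): cars that are both expensive and foreign
count if price > 5000 & foreign == 1

* | (or): cars that are either very economical or very light
count if mpg >= 30 | weight < 2000

* Stata treats a missing numeric value (.) as larger than any number,
* so "< ." restricts a command to non-missing observations
summarize price if rep78 < .
```

Note the double equals sign in conditions: a single = is used only for assignment, as in generate.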

2. TODAY'S RESEARCH PROJECT: single-family housing prices

We want to analyse the influence of several external factors on house prices. We illustrate this with data on housing prices in 506 Boston communities. The response variable is the logarithm of the median price of a single-family home in each community. The external factors under consideration include a measure of air pollution (lnox, the log of nitrous oxide in parts per 100m), the distance from the community to employment centers (ldist, the log of the weighted distance to five employment centers), and the average student-teacher ratio in local schools (stratio).

2.1 Looking at the data

Be aware of what is in your dataset and what types of variables it contains. You can describe the data with

    describe

Always plot your data: graphs contain a lot of information. Explore the range of available graphs under Graphics in the menu.

To create a histogram overlaid with a normal density:

    histogram price, bin(30) normal

To create a two-way scatter plot of house prices against the number of rooms:

    twoway (scatter price rooms, sort)

Here scatter is the plot type (with time series you would use line) and sort is the option that sorts on the x variable. What can you say about the relationship? Which correlation do you expect?

    twoway (scatter price dist, sort)

What can you say about the relationship?

2.2 Descriptive statistics

Use the summarize command followed by the names of the variables (which can be omitted to summarize everything). For more detailed statistics, use summarize [varlist], detail.

    summarize
    summarize price, det
    summarize price if rooms > 6.28

Note: in conditions Stata uses > (strictly greater), < (strictly smaller) and == (equal).

    histogram price, bin(22) normal

Is the variable normally distributed? You can test this formally with the skewness/kurtosis test for normality:

    sktest price

How do the variables correlate, and at which level of significance? Are there collinear variables?

    pwcorr price rooms nox dist stratio, sig

2.3 Generating new variables: generate, egen, replace

To compute a new variable use the generate command with a new variable name and an arithmetic expression. Choose variable names that are short and remind you what the variable is about. Remember that Stata is case sensitive. Let's generate the log of the housing price:

    generate lprice = log(price)

Taking logs may help with heteroskedasticity and non-normality. Check whether lprice is closer to a normal distribution than price, e.g. with histogram lprice, bin(22) normal.

A useful way to create a new variable that flags a certain condition is

    generate newvar = cond(x == a, 1, 0)

which says that if the condition x == a is satisfied the new variable takes the value 1, and otherwise 0.

A useful extension to generate is egen. Type help egen for a full list of possibilities.

2.4 Linear regression

Stata can do a lot of fancy regressions, and the syntax for most of them is very similar. We will focus on regress, the most basic form of linear regression. regress fits a model of depvar on varlist using linear regression. By default it includes a constant term. Typing help regress brings up the instructions for using regress.

    regress lprice rooms lnox ldist stratio

* The top-left corner gives the ANOVA decomposition of the sum of squares in the dependent variable (Total) into the explained (Model) and unexplained (Residual) parts.
* The top-right corner reports the statistical significance results for the model as a whole.
* The bottom section gives the results for the individual explanatory variables.

Useful options

The regress command can be used with the robust option, which estimates the standard errors with the Huber-White sandwich estimator (to correct the standard errors for heteroskedasticity).

2.5 Post-estimation: predicted values and diagnostics

A number of predicted values can be obtained after an estimation command. The most important are the predicted values of the dependent variable and the residuals.

    regress lprice rooms lnox ldist stratio
    predict lpricehat, xb
    label var lpricehat "Predicted log price"
    predict uhat, residual
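The steps of sections 2.3 to 2.5 can be tried end to end on auto, an example dataset shipped with Stata. The sketch below is only an illustration under that assumption: the model (log price on mpg, a weight dummy and foreign) and the variable name heavy are made up for the example and are not the housing model of these notes.

```stata
sysuse auto, clear

* New variables: a log transform and a 0/1 dummy built with cond()
generate lprice = log(price)
generate heavy = cond(weight > 3000, 1, 0)

* OLS with conventional standard errors
regress lprice mpg heavy foreign

* Fitted values and residuals from the regression just estimated
predict lpricehat, xb
label var lpricehat "Predicted log price"
predict uhat, residuals
summarize uhat

* Same coefficients, Huber-White (robust) standard errors
regress lprice mpg heavy foreign, robust
```

Because the model includes a constant, the residuals average to zero, which summarize uhat confirms up to rounding. predict always refers to the most recent estimation command, so it should be run before another model is fitted.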

Before looking at the coefficients you need to make sure your regression is sufficiently healthy. There are a number of diagnostic tests available in Stata. Type help regress postestimation for a list of available tests.

    twoway (scatter lpricehat lprice) (line lprice lprice if lprice < ., clwidth(thin)), ///
        ytitle("Predicted log median housing price") ///
        xtitle("Actual log median housing price") legend(off)
    rvfplot, yline(0)

2.5.1 Misspecification

Misspecification may arise, for example, because the true model specifies a nonlinear relationship and we omit a squared term. One way of testing for this is the RESET test. The RESET test runs an augmented regression that includes the original regressors plus powers of the predicted values (or, with the rhs option, powers of the original regressors). The null hypothesis is no misspecification: under the null, the coefficients of the additional regressors are zero.

    estat ovtest
    rvpplot ldist, ms(Oh) yline(0)

The residuals are more variable for low levels of log distance. Hence, the hypothesis of homoskedasticity is untenable.

2.5.2 Heteroskedasticity

The Breusch-Pagan test of the null hypothesis of homoskedasticity is implemented by

    estat hettest

2.6 Comparing competing models: measures of fit

You should be able to comment on the R-squared, the adjusted R-squared and the SER (standard error of the regression, which Stata reports as Root MSE). You can also check the information criteria:

    estat ic

estat ic displays the log likelihood of the null model (only a constant term), the log likelihood of the fitted model, and the AIC and BIC statistics. Lower values of AIC and BIC indicate a better fit.

For example, try to adjust the previous model by adding the square of log distance. Any improvement? Compare the measures of fit.

    gen ldist2 = ldist^2
    label var ldist2 "Log Distance squared"
    regress lprice rooms lnox ldist ldist2 stratio
    gen rooms2 = rooms^2
    regress lprice rooms rooms2 lnox ldist ldist2 stratio lproptax

2.7 Hypothesis testing

For each independent variable, the regression output automatically includes a two-sided t-test of the null hypothesis that the true coefficient is equal to zero. The same hypothesis can be tested explicitly in two equivalent ways:

    test _b[rooms] = 0

    test rooms

Let's suppose the theory suggests that the coefficient on the variable rooms should be 0.33. This is testable with

    test rooms = 0.33

You can test arbitrary linear restrictions, such as that the sum of three coefficients equals zero:

    lincom rooms + ldist + stratio

You can test the equality of two coefficients with

    test ldist = stratio

2.8 Marginal effects

The command mfx computes marginal effects or elasticities after estimation. The option eyex computes the elasticity of y with respect to x, which is equivalent to the marginal effect in a log-log specification.

    regress price rooms nox dist stratio
    mfx, eyex

You will find rooms to be elastic, having almost twice as large an effect on price in proportional terms. nox and dist are inelastic, with estimated elasticities within the unit interval.

2.9 Presenting regression results

It is generally good practice to present competing models to support your analysis. In the text of your project you need to justify which model you consider the best fitting. You need to estimate all models first, save the estimation results (estimates store) and then create a table. The example below uses estout, a user-written command that has to be installed once with ssc install estout.

    quietly regress lprice rooms
    est store m1
    quietly regress lprice rooms lnox ldist stratio
    est store m2
    quietly regress lprice rooms lnox ldist ldist2 stratio lproptax
    est store m3
    quietly regress lprice rooms rooms2 lnox ldist ldist2 stratio lproptax
    est store m4
    estout m1 m2 m3 m4, stats(r2_a rmse aic) cells(b(star fmt(%8.3f)) ///
        se(par fmt(%6.3f))) starlevels(* 0.1 ** 0.05 *** 0.01)
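The testing and reporting steps of sections 2.7 to 2.9 can be sketched on auto, an example dataset shipped with Stata. Several things here are assumptions that go beyond the notes: the models and the tested restrictions are made up for illustration, the built-in estimates table stands in for estout so that nothing needs to be installed, and margins (available from Stata 11) is shown as an alternative to mfx.

```stata
sysuse auto, clear
generate lprice = log(price)

* Fit and store two competing models
quietly regress lprice mpg
estimates store m1
quietly regress lprice mpg weight foreign
estimates store m2

* Tests refer to the most recent model (m2)
test mpg                    // single coefficient equal to zero
test mpg weight             // both coefficients jointly equal to zero
lincom mpg + weight         // sum of the two coefficients equal to zero

* Side-by-side table of coefficients, standard errors and fit measures
estimates table m1 m2, b(%8.3f) se(%6.3f) stats(r2_a rmse aic)

* Elasticities evaluated at the sample means, after a model in levels
quietly regress price mpg weight
margins, eyex(mpg weight) atmeans
```

Note the difference between test mpg weight, which tests that each coefficient is zero, and lincom mpg + weight, which tests only that their sum is zero.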

A.1 A Note on Stata for time-series

Stata has many built-in commands for analysing time-series data. First, you need to tell Stata that you are using time-series data. You do this by typing tsset timevariable (e.g. tsset year).

* Tests for univariate time series, such as the ADF test, can be found in Statistics => Time series => Tests.
* Diagnostic tests after regression commands, such as the Durbin-Watson test, the Godfrey LM test and heteroskedasticity tests, can be found in Statistics => Time series => Tests => Time Series specification test after regress.
* Line plots, correlograms and autocorrelation graphs can be found in Statistics => Time series => Graphs.
* More complex analyses for multivariate time series, such as VAR, VECM and cointegration tests, can be found in Statistics => Multivariate time series.
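The menu items above also have command equivalents. The sketch below is an assumption-laden illustration: it simulates an artificial yearly AR(1) series (the names year, e, y and x and the coefficient 0.7 are invented) so that it runs without any external data, and then applies the ADF, Durbin-Watson and Breusch-Godfrey tests.

```stata
clear
set seed 12345
set obs 100

* Declare a yearly time variable
generate year = 1900 + _n
tsset year, yearly

* Simulate y(t) = 0.7*y(t-1) + e(t); replace fills the series observation by observation
generate e = rnormal()
generate y = e
replace y = 0.7*y[_n-1] + e in 2/l
generate x = rnormal()

* Augmented Dickey-Fuller unit-root test with one lagged difference
dfuller y, lags(1)

* Tabular correlogram (autocorrelations and partial autocorrelations)
corrgram y, lags(10)

* Serial-correlation diagnostics after an OLS regression
regress y x
estat dwatson
estat bgodfrey, lags(1)
```

With real data, only the tsset line and the variable names need to change.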

A.2 Sources and References

* The Stata website is at http://www.stata.com. Among other things you will find that they make available online all datasets used in the official documentation, that they publish a journal called the Stata Journal, and that they have an excellent bookstore with texts on Stata and related statistical subjects.
* Stata also offers email and web-based training courses called NetCourses, see http://www.stata.com/info/products/netcourse/.
* There is an independent listserv where you can post questions and receive prompt and knowledgeable answers from other users. To join the list see http://www.stata.com/support/statalist/ and follow the link to subscribe.
* Stata also maintains a list of frequently asked questions (FAQ) classified by topic, see http://www.stata.com/support/faqs/.
* UCLA maintains an excellent Stata portal at http://www.ats.ucla.edu/stat/stata/
* There are also textbooks, such as An Introduction to Modern Econometrics Using Stata by C. Baum.

A.3 List of regression commands

    anova       analysis of variance and covariance
    cnreg       censored-normal regression
    gmm         generalized method of moments estimator
    heckman     Heckman selection model
    intreg      interval regression
    ivregress   instrumental variables (2SLS) regression
    newey       regression with Newey-West standard errors
    prais       Prais-Winsten, Cochrane-Orcutt, or Hildreth-Lu regression
    qreg        quantile (including median) regression
    reg         ordinary least squares regression
    reg3        three-stage least squares regression
    rreg        robust regression (NOT robust standard errors)
    sureg       seemingly unrelated regression
    tobit       tobit regression
    treatreg    treatment effects model
    truncreg    truncated regression
    xtabond     Arellano-Bond linear, dynamic panel-data estimator
    xtintreg    panel data interval regression models
    xtreg       fixed- and random-effects linear models
    xtregar     fixed- and random-effects linear models with an AR(1) disturbance
    xttobit     panel data tobit models