An Introduction to Stata Part II: Data Analysis Kerry L. Papps

1. Overview
  Do-files
  Sorting a dataset
  Combining datasets
  Creating a dataset of means or medians etc.
  Weights
  Panel data capabilities
  Dummy variables
  Lags

2. Overview (cont.)
  Summary statistics
  Basic regression commands

3. Do-files
Do-files allow commands to be saved and executed in batch form.
We will use the Stata do-file editor to write do-files.
To open the do-file editor, click Window > Do-File Editor or click the do-file editor button on the toolbar.
You could also use WordPad or Notepad: save the file as a Text Document with the extension .do (instead of .txt).

4. Do-files (cont.)
To run a do-file from within the do-file editor, either select Tools > Do or click the corresponding toolbar button.
If you highlight certain lines of code, only those commands will run.
To run a do-file from the main Stata window, either select File > Do or type:
    do dofilename
You can comment out lines by preceding them with * or by enclosing text within /* and */.

5. Do-files (cont.)
You can save the contents of the Review window as a do-file by right-clicking on the window and selecting "Save All...".
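As a rough illustration of the points above, a do-file might look like the following minimal sketch. The file name example.do is hypothetical, and the auto dataset is used only because it ships with Stata; neither comes from the original slides.

```stata
* example.do: a minimal do-file skeleton (hypothetical file name)
clear all
set more off

/* A block comment: load a dataset shipped with Stata
   and print summary statistics for two variables. */
sysuse auto, clear
summarize price mpg
```

Saved as example.do in the working directory, it could then be run with: do example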

6. Sorting a dataset
sort puts the observations in a specific order.
To sort by alphabetic or numeric order of varname, use:
    sort varname
You can sort a file on more than one variable:
    sort varlist
Example:
    sort country year
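A small sketch of sorting on two variables, assuming the auto dataset that ships with Stata (not part of the original slides):

```stata
* Sort by origin first, then by price within each origin
sysuse auto, clear
sort foreign price
list make foreign price in 1/5
```

The listing shows the five cheapest domestic cars, because foreign==0 sorts before foreign==1.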

7. Appending datasets
To add another Stata dataset below the end of the dataset in memory, type:
    append using filename
The dataset in memory is called the master dataset.
The dataset filename is called the using dataset.
Variables present in both datasets (i.e. with the same name) will be combined.
Variables in only one dataset will have missing values for observations from the other dataset.
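The behaviour described above can be sketched with two tiny made-up datasets. All values and the variable names gdp and pop are hypothetical, and the example is meant to be run as a whole from a do-file so that the temporary file name stays defined.

```stata
* Using dataset: two observations with gdp but no pop
clear
input str10 country year gdp
"France" 2000 100
"France" 2001 104
end
tempfile part1
save `part1'

* Master dataset: one observation with pop but no gdp
clear
input str10 country year pop
"Spain" 2000 40
end

* Add the using dataset below the master dataset
append using `part1'
list
```

After the append, gdp is missing for the Spanish observation and pop is missing for the French ones.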

8. Merging datasets
To join corresponding observations from a Stata dataset with those in the dataset in memory, type:
    merge 1:1 varlist using filename
Stata will join observations with common values of varlist, which must be present in both datasets.
If more than one observation has the same value(s) for varlist in the master dataset, use:
    merge m:1 varlist using filename
If more than one observation has the same value(s) for varlist in the using dataset, use:
    merge 1:m varlist using filename

9. Merging datasets (cont.)
The variable _merge is automatically added to the dataset, containing:
    _merge==1   Observation from master data only
    _merge==2   Observation from using data only
    _merge==3   Observation from both master and using data
Stata reports the number of observations with each value of _merge.
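To see how _merge is filled in, the following sketch merges two small made-up datasets one-to-one on a hypothetical identifier id. The data and variable names are invented for illustration; run the block as a whole from a do-file.

```stata
* Using dataset: wages for persons 2, 3 and 4
clear
input id wage
2 15
3 20
4 12
end
tempfile wages
save `wages'

* Master dataset: ages for persons 1, 2 and 3
clear
input id age
1 30
2 41
3 25
end

* id is unique in both datasets, so a 1:1 merge is appropriate
merge 1:1 id using `wages'
list id age wage _merge
```

Person 1 appears only in the master data (_merge==1), person 4 only in the using data (_merge==2), and persons 2 and 3 in both (_merge==3).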

10. EXERCISE 1: Merging
Open the do-file editor in Stata. Run all your solutions to the exercises from here.
Use cd to change the working directory to your preferred folder.
Redo all the commands from Part I and recreate the two Stata files, "Economic data.dta" and "EU data.dta":
    do http://people.bath.ac.uk/klp33/stata_part_one_solutions.do

11. EXERCISE 1 (cont.): Merging
Open "Economic data.dta" (master dataset) and merge it with "EU data.dta" (using dataset), using country as the match variable.
Should you use merge 1:1, merge m:1 or merge 1:m?
Look at the values that _merge takes: what does this indicate?

12. EXERCISE 1 (cont.): Merging
Remove those observations that do not contain data from both files:
    drop if _merge==1
Create a dummy variable called eu for whether a country was a member of the EU in a given year.

13. Collapsing datasets
To create a dataset of means, sums etc., type:
    collapse (stat) varlist1 (stat) ... [[weight]], by(varlist2)
stat can be mean, sd, sum, median or other statistics; further (stat) varlist pairs may follow the first.
by(varlist2) specifies the groups over which the means etc. are to be calculated.

14. Collapsing datasets (cont.)
Be sure to save the data before attempting collapse, as there is no undo facility.
Example:
    collapse (mean) age educ (median) income, by(country)

15. Collapsing datasets (cont.)
Four types of weight can be used in Stata:
    fweight (frequency weights): weights indicate the number of duplicated observations.
    pweight (sampling weights): weights denote the inverse of the probability that an observation is included in the sample.

16. Collapsing datasets (cont.)
    aweight (analytic weights): weights are inversely proportional to the variance of an observation, to correct for heteroskedasticity. Often, observations represent averages and the weights are the number of elements that gave rise to the average.
    iweight (importance weights): weights have no other interpretation.

17. Collapsing datasets (cont.)
Example:
    collapse (mean) unemplrate [aweight=labforce], by(country)
Weights may be used in many other Stata commands, e.g. correlate, regress.
Note that the square brackets around the weight must be typed.
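The weighted collapse above can be tried on a small made-up dataset. The regional figures below are invented for illustration, and preserve/restore is used here as one way of guarding against the lack of an undo facility (the slides themselves only recommend saving first).

```stata
* Hypothetical regional unemployment rates and labour force sizes
clear
input str10 country str10 region unemplrate labforce
"A" "north" 4 100
"A" "south" 10 300
"B" "east" 6 50
"B" "west" 6 150
end

preserve                // keep a copy of the data in case the collapse is not wanted
collapse (mean) unemplrate [aweight=labforce], by(country)
list                    // country A: (4*100 + 10*300)/400 = 8.5; country B: 6
restore                 // bring back the original four observations
```

Without the weight, country A's mean would be the simple average of 4 and 10, which ignores that the southern region has three times the labour force.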

18. EXERCISE 2: Collapsing
Collapse the merged dataset from Exercise 1 by year to produce a dataset containing the sums of pop and area and the means of gdpcap, lfpr, unemplrate and secondary across the entire EU.
Use aweight=pop so that the variables take into account the changing populations of the countries.
In what years did the EU have the highest unemployment rate and the highest GDP per capita?

19. EXERCISE 2 (cont.): Collapsing
Looking at the data editor, can you spot a problem with the collapsed data?
(A more appropriate collapse step would replace sum with rawsum, which computes the unweighted sum.)
Do not save the new dataset.

20. Panel data manipulation
Panel data (also known as longitudinal data) generally refer to the repeated observation of a fixed set of entities at fixed intervals of time.
Stata is particularly good at arranging and analysing panel data.
Stata refers to two panel display formats:
    Wide form: useful for display purposes and often the form in which data are obtained.
    Long form: needed for regressions etc.

21. Panel data manipulation (cont.)
Example of wide form:
     i                 x_ij
    id  sex  inc2008  inc2009  inc2010
     1    0     5000     5500     6000
     2    1     2000     2200     3300
     3    0     3000     2000     1000
Note the naming convention for inc.

22. Panel data manipulation (cont.)
Example of long form:
     i     j       x_ij
    id  year  sex   inc
     1  2008    0  5000
     1  2009    0  5500
     1  2010    0  6000
     2  2008    1  2000
     2  2009    1  2200
     2  2010    1  3300
     3  2008    0  3000
     3  2009    0  2000
     3  2010    0  1000

23. Panel data manipulation (cont.)
To change from long to wide form, type:
    reshape wide varlist, i(ivarname) j(jvarname)
varlist is the list of variables to be converted from long to wide form.
i(ivarname) specifies the variable(s) whose unique values denote the spatial unit.
j(jvarname) specifies the variable whose unique values denote the time period.

24. Panel data manipulation (cont.)
To change from wide to long form, type:
    reshape long stublist, i(ivarname) j(jvarname)
stublist is the word part of the names of the variables to be converted from wide to long form, e.g. inc above.
It is important to name variables in this format, i.e. a word description followed by the year.

25. Panel data manipulation (cont.)
To move between the above example datasets, use:
    reshape long inc, i(id) j(year)
    reshape wide inc, i(id) j(year)
These steps undo each other.
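The round trip between the two example datasets can be reproduced with the sketch below, which types in the wide-form example from the slides using input and then applies the two reshape commands.

```stata
* Enter the wide-form example data
clear
input id sex inc2008 inc2009 inc2010
1 0 5000 5500 6000
2 1 2000 2200 3300
3 0 3000 2000 1000
end

* Wide to long: one row per person and year, with a new variable year
reshape long inc, i(id) j(year)
list, sepby(id)

* Long to wide: back to one row per person
reshape wide inc, i(id) j(year)
list
```

Note that sex is not listed in either command: it is constant within id, so reshape carries it along unchanged.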

26. Lags
You can declare the data to be in panel form with the tsset command:
    tsset panelvar timevar
For example:
    tsset countryid year
After using tsset, a lag can be created with:
    gen lagname = L.varname
Similarly, L2.varname gives the second lag.
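A minimal sketch of lag creation on a made-up two-country panel follows; the variable gdp and all values are hypothetical.

```stata
* Hypothetical panel: two countries observed in three years
clear
input countryid year gdp
1 2000 100
1 2001 104
1 2002 110
2 2000 50
2 2001 55
2 2002 53
end

tsset countryid year
gen lag1gdp = L.gdp      // value of gdp one year earlier, within country
gen lag2gdp = L2.gdp     // value of gdp two years earlier, within country
list, sepby(countryid)
```

The lags are missing in the first year (or first two years) of each country, because L. never reaches across panel units.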

27. EXERCISE 3: Manipulating a panel
Open nlswork.dta from the internet using:
    webuse nlswork
Declare the data to be a panel using tsset, noting that idcode is the panel variable and year is the time variable.
Generate a new variable lagwage equal to the lag of ln_wage and confirm that this contains the correct values by listing some data (use the break button to stop the listing):
    list idcode year ln_wage lagwage

28. EXERCISE 3 (cont.): Manipulating a panel
Drop all variables other than idcode, year and ln_wage using the keep command (quicker than using drop).
Use reshape wide to rearrange the data so that the first column represents each person (idcode) and the other columns contain ln_wage for a particular year.
Return the data to long form (change wide to long in the command).

29. Univariate summary statistics
tabstat produces a table of summary statistics:
    tabstat varlist [, statistics(statlist)]
Example:
    tabstat age educ, stats(mean sd semean n)
summarize displays a variety of univariate summary statistics (number of non-missing observations, mean, standard deviation, minimum, maximum):
    summarize [varlist]
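Both commands can be tried on the auto dataset that ships with Stata (an assumption for illustration, not the data used in the slides). The by() option of tabstat shown in the last line is not mentioned in the slides; it is included as one way of splitting the table by group.

```stata
sysuse auto, clear

* Default summary statistics for two variables
summarize price mpg

* Chosen statistics: mean, standard deviation, standard error of the mean, count
tabstat price mpg, statistics(mean sd semean n)

* The same idea, reported separately for domestic and foreign cars
tabstat price mpg, statistics(mean median) by(foreign)
```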

30. Multivariate summary statistics
table displays a table of statistics:
    table rowvar [colvar] [, contents(clist varname)]
clist can be freq, mean, sum etc.
rowvar and colvar may be numeric or string variables.
Example:
    table sex educ, c(mean age median inc)
(This is the syntax of Stata 16 and earlier; the table command was redesigned in Stata 17.)

31. Multivariate summary statistics (cont.)
One super-column and up to 4 super-rows are also allowed.

32. Sets of dummy variables
Dummy variables take the values 0 and 1 only.
Large sets of dummy variables can be created with:
    tab varname, gen(dummyname)
When using large numbers of dummies in regressions, it is useful to name them with a pattern, e.g. id1, id2, ...
Then id* can be used to refer to all variables beginning with id.
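The following sketch creates a set of dummies from the rep78 variable in the auto dataset that ships with Stata. The stub name repdum is a hypothetical choice, picked so that the wildcard repdum* does not also match the original variable rep78.

```stata
sysuse auto, clear

* rep78 takes the values 1 to 5, so this creates repdum1 ... repdum5
tab rep78, gen(repdum)

* The wildcard refers to all five dummies at once
describe repdum*

* Use four of the five dummies as regressors (repdum1 is the omitted category)
regress price mpg repdum2-repdum5
```

Observations with a missing value of rep78 have missing values for every dummy, so they drop out of the regression.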

33. Linear regression
To perform a linear regression of depvar on varlist, type:
    regress depvar varlist [[weight]] [if exp] [, noconstant robust]
depvar is the dependent variable.
varlist is the set of independent variables (regressors).
By default Stata includes a constant. The noconstant option excludes it.
The robust option reports heteroskedasticity-robust standard errors.

34. Linear regression (cont.)
Weights may be used, e.g. when data are group averages, as in:
    regress inflation unemplrate year [aweight=pop]
Note that here year allows for a linear time trend.
Other estimation commands (e.g. probit) follow the same general syntax.
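A short sketch of the regress options discussed above, assuming the auto dataset that ships with Stata rather than the data used in the slides:

```stata
sysuse auto, clear

* Basic regression with a constant
regress price mpg weight

* Same model with heteroskedasticity-robust standard errors
regress price mpg weight, robust

* Restrict the sample to domestic cars and drop the constant
regress price mpg weight if foreign == 0, noconstant
```

The coefficient estimates in the first two regressions are identical; only the standard errors differ.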

35. Post-estimation commands
After any estimation command, several predicted values can be computed using predict.
predict refers to the most recent model estimated.
    predict yhat, xb
creates a new variable yhat equal to the predicted values of the dependent variable.
    predict res, residual
creates a new variable res equal to the residuals.

36. Post-estimation commands (cont.)
Linear hypotheses can be tested (e.g. by a t-test or F-test) after estimating a model by using test.
    test varlist
tests that the coefficients corresponding to every element in varlist jointly equal zero.
    test eqlist
tests the restrictions in eqlist, e.g.:
    test sex==3
The option accumulate allows a hypothesis to be tested jointly with the previously tested hypotheses.

37. Post-estimation commands (cont.)
Example:
    regress lnw sex race school age
    test sex race
    test school == age, accum
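The post-estimation steps can be sketched end to end as follows, again assuming the auto dataset that ships with Stata. The model itself is arbitrary and serves only to illustrate predict and test.

```stata
sysuse auto, clear
regress price mpg weight length

* Fitted values and residuals from the model just estimated
predict yhat, xb
predict res, residuals
summarize price yhat res

* Joint test that the coefficients on mpg and weight are both zero
test mpg weight

* Test one restriction, then add a second and test the two jointly
test weight = length
test mpg = 0, accumulate
```

Because the model includes a constant, the residuals have mean zero and yhat has the same mean as price, which the summarize output should confirm.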

38. EXERCISE 4: Statistics and regression
Clear the memory and open nlswork.dta again:
    webuse nlswork
Type summarize to look at the summary statistics for all variables in the dataset.
Restrict summarize to hours and ln_wage and perform it separately for non-married and married (i.e. msp==0 and 1).
Use tabstat to report the mean, median, minimum and maximum for hours and ln_wage.

39. EXERCISE 4 (cont.): Statistics and regression
Report the mean and median of ln_wage by age (along the rows) and race (across the columns):
    table age race, c(mean ln_wage median ln_wage)
Generate a variable called age2 that is equal to the square of age (the power operator in Stata is ^, so the square of age is age^2).
Create a set of race dummies with:
    tab race, gen(race)

40. EXERCISE 4 (cont.): Statistics and regression
Regress ln_wage on: age, age2, race2, race3, msp, grade, tenure, c_city.

41. Graphs
To obtain a basic histogram of varname, type:
    histogram varname, discrete freq
Here discrete treats varname as a discrete variable (one bar per value) and freq plots frequencies rather than densities.
To display a scatterplot of two (or more) variables, type:
    scatter varlist [[weight]]
weight determines the size of the markers used in the scatterplot.

42. Graphs (cont.)
There are options for (among other things):
    Adding a title (title)
    Altering the scale of the axes (xscale, yscale)
    Specifying what axis labels to use (xlabel, ylabel)
    Changing the markers used (msymbol)
    Changing the connecting lines (connect)

43. Graphs (cont.)
Particularly useful is mlabel(varname), which labels each marker in the scatterplot with the value of varname.
Example:
    scatter gdp unemplrate, mlabel(country)

44. Graphs (cont.)
Graphs are not saved in log files, because they are displayed in separate windows. To save a graph, select File > Save Graph.
To insert a graph in a Word document etc., select Edit > Copy and then paste it into the document.
The pasted graph can be resized but is not interactive (unlike Excel charts etc.).
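The graph commands from the preceding slides can be tried with the sketch below, which assumes the auto dataset that ships with Stata. The graph title and the file name price_mpg.gph are hypothetical, and graph save is used here as a command-line counterpart to the File > Save Graph menu item.

```stata
sysuse auto, clear

* Histogram of a discrete variable, showing frequencies
histogram rep78, discrete freq

* Scatterplot with a title and hollow circular markers
scatter price mpg, title("Price against mileage") msymbol(Oh)

* Marker size scaled by an analytic weight
scatter price mpg [aweight=weight], msymbol(Oh)

* Label each marker with the car's make (foreign cars only, to limit clutter)
scatter price mpg if foreign == 1, mlabel(make)

* Save the most recent graph in Stata's own format
graph save "price_mpg.gph", replace
```

Each graph command replaces the contents of the graph window, so only the last scatterplot is saved by graph save.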