
University of California, Santa Cruz
Department of Economics
ECON 294A (Fall 2014) - Stata Lab
Instructor: Manuel Barron

Data Management 2

Note: Comments welcome. Please let me know if you find typos or other mistakes, or if the explanations are unclear. Contact me at mbarron4 [at] ucsc

1 Introduction

Today we are going to introduce do-files, together with some good practices for writing do-files (and scripts in general). We are also going to go a bit deeper into Stata, focusing on joining datasets. This is important because real-world data come from different sources, and you need to know how to combine those sources to construct the dataset you need for analysis. For instance, say we want to analyze how climate change is affecting agricultural output. We may find data on a country's agricultural output from the World Bank, and data on climate variables from NASA. Or we may be looking for a few years of a given company's financial data, and these data are released in yearly reports. We will cover mmerge (a user-written variant of Stata's merge) and append, the main commands for joining datasets.

2 Do-files

A Stata do-file is a script that stores a list of Stata commands. It has the .do extension. The most common way of running the commands in a do-file (or "running a do-file") is from the do-file editor. To open the do-file editor, click on its icon, or type doedit in the command window. From the do-file editor, you can highlight the part that you want to execute and then click on the do icon. If you want to execute all the commands in the do-file, just click on do without highlighting anything.

2.1 Ado-files and Do-files

Please do not confuse do-files with ado-files (.ado). Ado-files are similar to do-files, but their purpose is to define new commands, and they are more oriented towards programmers. We will not cover ado-files in this course.

2.2 Writing Tidy Do-files

It is good practice to write neat do-files. It not only makes it easier for you to understand what you did in the past, but it also facilitates working with others. Also, it is easier to spot mistakes in a tidy do-file than in a messy one. I'll give a few more tips in the programming section of the course, but for now the two main things you can do are to include comments and line breaks.

2.2.1 Comments

It is good practice to insert comments in your do-file. They help remind you what you did, and they also help others understand it. Comments are not Stata commands, so we need to tell Stata that we are about to write a comment. There are three main ways to include comments.

1. The easiest way is to start the line with an asterisk *. Stata will ignore the whole line.

    * This is a comment. Stata will not read this line
    sysuse auto.dta, clear

2. It is also possible to write comments on the same line as a Stata command. To do so, type two slashes // after the Stata command.

    sysuse auto.dta, clear // This is another comment

3. If you want to write a comment that covers multiple lines, type /* at the beginning of the comment and */ at the end.

    sysuse auto.dta, clear
    /* This is a comment that spans
    over three lines. Note that I did not need
    to use the * at the beginning of each line */
    summarize mpg

2.2.2 Line Breaks

Long lines are very difficult to read, especially if the command doesn't fit on the screen. Splitting a lengthy command over several lines improves readability. There are two ways to do it:

1. Place a /* symbol at the point where you wish to break a line and a */ symbol at the beginning of the next line.

2. Place three slashes /// at the point where you wish to break a command line.

    twoway (scatter price mpg) (lfit price mpg if foreign==1) /*
    */ (lfit price mpg if foreign==0)

You can make that even easier to read by placing one graph per line:

    twoway (scatter price mpg) /*
    */ (lfit price mpg if foreign==1) /*
    */ (lfit price mpg if foreign==0)

You can also use the line break symbols to insert comments, and you may use different line break symbols even in the same command (although that is not very common).

    twoway (scatter price mpg) /* You can write comments here
    */ (lfit price mpg if foreign==1) /// and here.
    (lfit price mpg if foreign==0)

2.2.3 Avoid too many comments

Especially when one starts learning new software, it is easy to overdo it with the comments. In the end you will find the balance that works best for you, but I would advise against writing too many comments, like this:

    *** begin do-file ***
    use auto.dta, clear /* I'm opening the auto.dta file */
    /* Now, I'll generate a variable called expensive, that takes the value
    of 1 if price is higher than $3000 and the value of 0 otherwise.
    In our last meeting we agreed to try 3,000 as cutoff value, but we
    should also try $3,500 */
    gen expensive = 1 if price>=3000 & price!=.
    replace expensive = 0 if price<3000
    *** end do-file ***

Most of these comments just repeat what the commands already say, and they clutter the do-file, so they do not help the reader understand it. Compare it to this do-file:

    *** begin do-file ***
    use auto.dta, clear
    * Consider changing the cutoff for expensive to $3,500
    gen expensive = 1 if price>=3000 & price!=.
    replace expensive = 0 if price<3000
    *** end do-file ***
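As a rough illustration of the advice in this section, the sketch below combines a few short comments with /// line breaks. It uses the auto dataset that ships with Stata. The local macro name cutoff and the value 5000 are assumptions chosen only for this illustration; they are not part of the lab materials.

```stata
* Flag expensive cars in the auto dataset shipped with Stata
sysuse auto, clear

* Illustrative cutoff (hypothetical value); change it here and rerun
local cutoff 5000

* 1 if price is at or above the cutoff, 0 otherwise, missing if price is missing
generate expensive = (price >= `cutoff') if !missing(price)
tabulate expensive

* One plot per line keeps a long graph command readable
twoway (scatter price mpg)                    /// raw data
       (lfit price mpg if foreign == 1)       /// fit for foreign cars
       (lfit price mpg if foreign == 0)       //  fit for domestic cars
```

Storing the cutoff in one place means the comment about trying a different value only has to be acted on in a single line.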

2.3 Setting the Working Directory

A project involves working with multiple files. You have the original ("raw") dataset or datasets, do-files, and one or more datasets that result from applying your do-files to the raw data. It is usually good to have all the files related to a project in a single folder. I usually work with three subfolders: original data, working data, and do-files. Any do-file related to the project goes in the do-files folder. I keep the original data in a folder separate from the working data to make sure I do not overwrite the original data.

When we want to open or save a file, we want to avoid typing long paths. The cd command (which stands for "change directory") helps us set a working directory. The working directory is the directory where Stata will look for files unless we write the whole path to the file. Let's see how it works. Instead of typing

    *** begin do-file ***
    use /Users/manuel/Econ294A/lab/week1/data/originaldata.dta, clear
    [...stata commands...]
    save /Users/manuel/Econ294A/lab/week1/data/modifieddata.dta, replace
    *** end do-file ***

you may set the working directory at the beginning of the do-file, and then just call the files by their names:

    *** begin do-file ***
    cd "/Users/manuel/Econ294A/lab/week1/data/"
    use originaldata.dta, clear
    [...stata commands...]
    save modifieddata.dta, replace
    *** end do-file ***

Setting a working directory will prove especially useful when we combine datasets later in the lecture. To find the path to use as your working directory, you may open a dataset in that folder using the interactive menu (clicking on the open icon, or clicking on File/Open). Stata will print the command, including the full path, that it used to open the file. You may copy the path from that output and paste it into your do-file.
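The pattern above can be tried without any project files. The following is a minimal sketch that assumes nothing beyond a standard Stata installation: it uses the current directory as a stand-in for the project data folder and the built-in auto dataset as a stand-in for the raw data. The macro names startdir and datadir and the file name auto_working.dta are hypothetical.

```stata
* Remember where Stata currently is, so that we can return at the end
local startdir "`c(pwd)'"

* Hypothetical data folder: replace with the path to your own project
local datadir "`c(pwd)'"

cd "`datadir'"
pwd                                   // print the working directory

sysuse auto, clear                    // stand-in for the raw data
save "auto_working.dta", replace      // saved in the working directory
use "auto_working.dta", clear         // opened by name, no path needed

cd "`startdir'"
```

Keeping the path in a single macro at the top of the do-file means that a collaborator only has to edit one line to run it on another computer.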

3 Installing Packages

Stata has many built-in commands, but there are quite a few user-written commands available from the web. One of them is mmerge (with two m's). To install it, type

    . findit mmerge

    Search of official help files, FAQs, Examples, SJs, and STBs

    Web resources from Stata and other users
    (contacting http://www.stata.com)

    6 packages found (Stata Journal and STB listed first)
    -----------------------------------------------------
    [...output omitted...]
    mmerge from http://fmwww.bc.edu/repec/bocode/m
        'MMERGE': module: Safer and easier to use variant of merge. / mmerge is
        an extension of merge that automatically sorts the / master and slave
        data sets, allows selection of variables, and / provides more readable
        output describing the result of a merge. / This version (2.5.0) is an
        update of
    [...output omitted...]

If you click on mmerge (in blue), you will see a description of the command and a "click here to install" option. Please click to install it. I'm not sure if you can download commands onto campus computers, but if you can't, there's a way around it. Stata saves the commands you download in folders. To see these folders, type in the command window:

    . sysdir
       STATA:  /Applications/Stata/
        BASE:  /Applications/Stata/ado/base/
        SITE:  /Applications/Stata/ado/site/
        PLUS:  ~/Library/Application Support/Stata/ado/plus/
    PERSONAL:  ~/Library/Application Support/Stata/ado/personal/
    OLDPLACE:  ~/ado/

Stata stores its commands in these folders (there is no need for further detail now). Downloaded commands are installed in the PLUS folder. You can change the location of the PLUS folder by typing:

    . sysdir set PLUS "C:\\...[working directory]"

Now Stata will download commands to that directory and, more importantly, it knows to look for those commands in that directory.
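The same installation can be scripted, so that a do-file installs mmerge only when it is missing. This is a minimal sketch, assuming the computer has internet access and that mmerge is available from the SSC archive (the fmwww.bc.edu address in the findit output above is the SSC archive).

```stata
* Install mmerge from SSC only if Stata cannot already find it
capture which mmerge
if _rc != 0 {
    ssc install mmerge
}

which mmerge      // show where the installed file lives
adopath           // list the folders Stata searches for commands
```

The capture prefix suppresses the error that which would raise when the command is not installed, and _rc holds the return code that the if condition checks.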

4 Merging Datasets: mmerge

Say we have two datasets. In one, we have data on people's education. In the other, we have data on their wage. mmerge (with two m's) allows us to merge those two datasets. In Cameron and Trivedi's words, the dataset becomes "wider": new variables from the second dataset are added to the existing variables of the first dataset.

Merging means adding information from a dataset on disk (one that has not been opened) to the dataset in memory (the one you already have open). The dataset that is already open is known as the master dataset. The dataset on disk is known as the using dataset.

education.dta and wage.dta are two datasets with (simulated) information on years of schooling and wage for 1500 individuals. Let's see what they look like.

    *** begin do-file ***
    use "education.dta", clear
    describe
    summarize
    use "wage.dta", clear
    describe
    summarize
    *** end do-file ***

Let's assume we have documentation for these files that says that the id variable identifies observations in both datasets. So, we know that the person with ID=766 is the same in both datasets. That person has 8 years of schooling and earns 9.52 per hour. We can use mmerge to bring together the schooling and wage data for all the people in our sample.

    . mmerge id using education

    -------------------------------------------------------------------------------
    merge specs          |
           matching type | auto
      mv's on match vars | none
      unmatched obs from | both
    ---------------------+---------------------------------------------------------
      master        file | wage.dta
                     obs |   1350
                    vars |      2
              match vars | id  (key)
      -------------------+---------------------------------------------------------
      using         file | education.dta
                     obs |   1400
                    vars |      2
              match vars | id  (key)
    ---------------------+---------------------------------------------------------
    result          file | wage.dta
                     obs |   1500
                    vars |      5  (including _merge)
             ------------+---------------------------------------------------------
                  _merge |    100  obs only in master data                (code==1)
                         |    150  obs only in using data                 (code==2)
                         |   1250  obs both in master and using data      (code==3)
    -------------------------------------------------------------------------------

Let's look at the result for a minute. It says that 100 observations appear only in the master data. The master data is the dataset we have open, in this case the wage data. This means that there are 100 observations that appear in the wage data but not in the education data. Next, there are 150 observations that appear only in the education data and not in the wage data. This is the converse problem. Finally, there are 1250 observations that appear in both datasets. This means that for those 1250 people we have information on both their education and their wage. Note that mmerge created a _merge variable, which records the merge result (code 1, 2, or 3) for each observation.

Now that we have the data put together, we can analyze the relation between education and wage. For instance, we can generate a scatterplot and a linear fit, with the commands we learned in the previous lecture:

    . twoway (scatter lnwage education) (lfit lnwage education)

Simple analysis shows that people with no schooling earn about $9.00 an hour, which is consistent with the minimum wage in California. In addition, going from 0 to 10 years of schooling increases the hourly wage from $9 to almost $10, which implies returns to education of the order of 10% per year, as usually found in the literature.
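To see the three _merge codes on data small enough to check by eye, here is a minimal sketch with two toy datasets. Note the assumptions: it uses Stata's built-in merge 1:1 command as a stand-in for mmerge, so that it runs without installing anything, and the ids, values, and the temporary file name education are made up for illustration.

```stata
* Toy "education" data: ids 1, 2, 3 (made-up values)
clear
input id educ
1 8
2 12
3 16
end
tempfile education
save `education'

* Toy "wage" data: ids 2, 3, 4 (made-up values); this is the master dataset
clear
input id wage
2 10.5
3 15.0
4 9.0
end

* Built-in merge, used here in place of mmerge
merge 1:1 id using `education'

* id 4 is only in master (1), id 1 only in using (2), ids 2 and 3 match (3)
tabulate _merge
list, sepby(_merge)
```

After checking the unmatched cases, a common next step is to keep only the matched observations with keep if _merge == 3.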

[Figure: scatterplot of log wage against years of schooling, with fitted values.]

Let's have a look at some of the most important features of the help file for mmerge.

    ------------------------------------------------------------------------------
    help for mmerge                                           [jw]  Feb 26, 2002
    ------------------------------------------------------------------------------

    Easy and safe merging of datasets

    Basic syntax

        mmerge match-variable(s) using filename
            [, {simple | table} umatch(varlist) ukeep(varlist) ]

    Full Syntax

        mmerge match-variable(s) using filename
            [, { type(type_value) unmatched(unmatched_value) | simple | table }
               missing(m_value) nolabel replace update _merge(varname) noshow
               { ukeep(varlist) | udrop(varlist) } uif(exp) umatch(varlist)
               { uname(stub) | urename(rename_specs) } ulabel(stub) ]

    [...output omitted...]

    Options for manipulating the using data ("u"-options)

    ukeep(varlist) | udrop(varlist) specifies a varlist in the using data that
        has to be kept (dropped) before being merged into the master data. It
        is not valid to specify both ukeep and udrop. If neither is specified,
        all variables of the using data are used. The match variable(s) need
        not be specified in ukeep; they are automatically included in ukeep
        (excluded from udrop).

    [...output omitted...]

    umatch(varlist) specifies the names of the match variables in the using
        data. The umatch variables are associated with the match variables in
        the specified order. Clearly, the number of match variables in umatch
        should be the same as the number of matching variables in the master.
        mmerge renames the umatch variables to the master match variable names
        after ukeep/udrop have been processed, but before urename is
        processed. An error occurs if there are naming conflicts.

    [...output omitted...]

5 Appending Datasets: append

append creates a longer dataset, with the observations from the second dataset appended after all the observations from the first dataset. If the same variable has different names in the two datasets, the variable names should be changed so that they match.

Say we are interested in the relationship between SAT scores and undergraduate GPA. We have contacted two universities requesting data on their students, and both sent us their files. EasternUniv.dta and WesternUniv.dta are two datasets with (simulated) information on SAT scores and undergraduate GPA for 700 students each. Let's see what they look like:

    use "EasternUniv.dta", clear
    describe
    summarize
    use "WesternUniv.dta", clear
    describe
    summarize

As you can see, they seem to have the same variables, but the variable names are different. In the Eastern University dataset the variable names are gpa and sat, in lower case letters, while in the Western University dataset the variable names are SAT and GPA, in block capitals. Stata's variable names are case sensitive, so we need to make them match. We can change the name of a variable with the rename command. The syntax is rename oldname newname, meaning that after the command we type the variable's current name followed by its new name.
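The rename-then-append pattern described above can be tried on toy data first. This is a minimal sketch: the four observations are made up, the temporary files western and eastern stand in for the two university files, and the variable name source is hypothetical. The generate() option is the one listed in append's help file; it marks which dataset each observation came from.

```stata
* Toy stand-in for the Western file: variable names in capitals
clear
input SAT GPA
1200 3.1
1350 3.6
end
tempfile western
save `western'

* Toy stand-in for the Eastern file: variable names in lower case
clear
input sat gpa
1100 2.9
1400 3.8
end
tempfile eastern
save `eastern'

* Make the names match, then stack the two datasets
use `western', clear
rename SAT sat
rename GPA gpa
append using `eastern', generate(source)   // source: 0 = master, 1 = using

list
```

Without the two rename lines, append would keep SAT, GPA, sat, and gpa as four separate variables, each missing for half of the observations.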

    use "WesternUniv.dta", clear
    ren SAT sat
    ren GPA gpa
    append using EasternUniv

Let's see a portion of append's help file:

    Title

        [D] append -- Append datasets

    Syntax

        append using filename [filename ...] [, options]

        You may enclose filename in double quotes and must do so if filename
        contains blanks or other special characters.

        options            Description
        -----------------------------------------------------------------------
        generate(newvar)   newvar marks source of resulting observations
        keep(varlist)      keep specified variables from appending dataset(s)
        nolabel            do not copy value-label definitions from dataset(s)
                           on disk
        nonotes            do not copy notes from dataset(s) on disk
        force              append string to numeric or numeric to string
                           without error
        -----------------------------------------------------------------------

    [...output omitted...]

We could do a couple of graphs like we've been doing up to now, but instead let's run some regressions. The easiest way of running a regression with Stata is by using the regress command.

    . regress gpa sat

          Source |       SS       df       MS              Number of obs =    1400
    -------------+------------------------------           F(  1,  1398) =41573.23
           Model |   1359.315     1    1359.315            Prob > F      =  0.0000
        Residual | 45.7102453  1398  .032696885            R-squared     =  0.9675
    -------------+------------------------------           Adj R-squared =  0.9674
           Total | 1405.02524  1399  1.00430682            Root MSE      =  .18082

    ------------------------------------------------------------------------------
             gpa |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
             sat |   .0170599   .0000837   203.90   0.000     .0168958    .0172241
           _cons |  -9.799888   .0586972  -166.96   0.000    -9.915032   -9.684744
    ------------------------------------------------------------------------------

I will not spend a lot of time explaining the output, since this is something that you will see in your econometrics course. Instead, I'll show you how to send these results directly to a spreadsheet or a word processor.

6 Sending Stata Output to Word or Excel

Stata has a number of commands that allow exporting results to other programs, like Word, Excel, or LaTeX. In this course we will use two of those commands: outreg2 and esttab. I rarely use outreg2, but it is the easier of the two, so we will start with that one.

6.1 outreg2

outreg2 provides a simple way of outputting results to word processors or spreadsheets. It may be enough in some cases. The help file is long and maybe a bit confusing, but I'll show you some of the most important options. This is the most basic use of outreg2. After running a regression, we can send the output to Word like this:

    . regress gpa sat
    [...output omitted...]
    . outreg2 using lab2.doc, replace
    lab2.doc
    dir : seeout

If you click on dir, Stata will open the directory where your table was saved. If you click on seeout, Stata will show your table in its browser. If you click on lab2.doc (from a PC only), Stata will open the .doc document you just saved (this may not work in Mac OS X). The table will look something like this:

                    (1)
    VARIABLES       gpa

    sat             0.0159***
                    (0.000100)
    Constant        -9.061***
                    (0.0705)

    Observations    1,400
    R-squared       0.947

    Standard errors in parentheses
    *** p<0.01, ** p<0.05, * p<0.1

We can add the results from other regressions to compare them by appending them to each other. Don't confuse the append command with the append option of outreg2. In outreg2, append acts like the mmerge command, making tables wider (adding columns to the existing table). Let's see how it works:

    *** begin do-file ***
    regress gpa sat
    outreg2 using lab2.doc, replace
    regress gpa sat if university==1
    outreg2 using lab2.doc, append
    regress gpa sat if university==2
    outreg2 using lab2.doc, append
    *** end do-file ***

Your table may look like this:

                    (1)          (2)          (3)
    VARIABLES       gpa          gpa          gpa

    sat             0.0159***    0.0171***    0.0145***
                    (0.000100)   (0.000115)   (0.000118)
    Constant        -9.061***    -9.826***    -8.197***
                    (0.0705)     (0.0808)     (0.0824)

    Observations    1,400        700          700
    R-squared       0.947        0.969        0.956

    Standard errors in parentheses
    *** p<0.01, ** p<0.05, * p<0.1

This is pretty neat. If you wanted to get this by hand, you would need to copy and paste, format columns, and then send that to Word. In addition, if you, like me, are not familiar with Excel, you would need to sort out how to deal with parentheses in Excel (Excel handles numbers in parentheses as negative numbers, so if you type a standard error like (0.0705), Excel will change it to -0.0705 and erase the parentheses).

However, there is some room for improvement. We could add a label to sat, name the models in each column, and format the coefficients. Note that some numbers have three decimals, while others have six. We would also like to know what each column is (having "gpa" at the top of each column isn't very helpful). Finally, we would like our table to have a title. We can use some of the basic options in outreg2 to get the desired results. This is also an example of how useful line breaks are.

    *** begin do-file ***
    regress gpa sat
    outreg2 using lab2-table2.doc, ///
        word replace label bdec(4) sdec(4) ctitle(Whole Sample) ///
        title("GPA and SAT scores, OLS regression")
    regress gpa sat if university==1
    outreg2 using lab2-table2.doc, ///
        word append label bdec(4) sdec(4) ctitle(Eastern Univ)
    regress gpa sat if university==2
    outreg2 using lab2-table2.doc, ///
        word append label bdec(4) sdec(4) ctitle(Western Univ)
    *** end do-file ***

These are the options that I used for the last table:

    label: use the variable label
    bdec: number of decimals for the estimated coefficients
    sdec: number of decimals for their standard errors
    ctitle: column title
    title: table title

    GPA and SAT scores, OLS regression

                    (1)            (2)            (3)
    VARIABLES       Whole Sample   Eastern Univ   Western Univ

    SAT scores      0.0159***      0.0171***      0.0145***
                    (0.0001)       (0.0001)       (0.0001)
    Constant        -9.0606***     -9.8256***     -8.1974***
                    (0.0705)       (0.0808)       (0.0824)

    Observations    1,400          700            700
    R-squared       0.947          0.969          0.956

    Standard errors in parentheses
    *** p<0.01, ** p<0.05, * p<0.1
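When the same regression is run on several subsamples, the repeated outreg2 calls can be written as a loop. The following is only a sketch under stated assumptions: it uses the auto dataset shipped with Stata instead of the university data, it assumes outreg2 has been installed (for instance with ssc install outreg2), and the file name auto_table.doc and the column titles are made up for illustration. It uses only options that appear earlier in this section.

```stata
* Requires the user-written outreg2 command (ssc install outreg2)
sysuse auto, clear

* First column: whole sample; replace starts a new table
regress price mpg
outreg2 using auto_table.doc, ///
    replace label bdec(4) sdec(4) ctitle(All cars) ///
    title("Price and mileage, OLS regression")

* One extra column per group (foreign is 0 for domestic, 1 for foreign)
forvalues f = 0/1 {
    regress price mpg if foreign == `f'
    outreg2 using auto_table.doc, ///
        append label bdec(4) sdec(4) ctitle(Foreign `f')
}
```

The loop keeps the formatting options in one place, so changing the number of decimals for every subsample column takes a single edit.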