Introduction to STATA 6.0 ECONOMICS 626


Bill Evans
Fall 2001

This handout gives a very brief introduction to STATA 6.0 on the Economics Department Network. In a few short years, STATA has become one of the leading programs used by researchers in applied microeconomics. STATA was written by a labor economist and it contains many econometric procedures (fixed effects, two-stage least squares, sample selection correction, quantile regression, probit/logit, etc.) used in the analysis of cross-sectional data. It is fast and relatively easy to use.

STATA's speed advantage comes from the fact that all data is loaded into RAM. Consequently, the amount of available memory restricts the size of the problem. Given the size of the data sets we will use in class and the available memory on department machines, this should not prove to be much of a constraint. Some projects are, however, just too big for STATA. For example, one of our graduate students is using the Natality Detail data for several years. The Detail data contain a census of all births in the US, or about 4 million observations per year. With a data set of 32 million observations, this project is only feasible in STATA if the host machine has about 1 gigabyte of RAM. Most machines do not have this much RAM, so for problems of this magnitude other programs like SAS may be preferred.

All the STATA data files, sample programs, data dictionaries and results for this class can be found on the network drive h:\econ626\stata. These programs can be accessed once you log onto the Economics Department Network. If you have access to a department machine but no permanent network account, you can log onto the network as a guest using temp as your login ID and economics as the password. Sample programs and data files are also located on the class web page.

The output from the sample programs in this handout will be written to an external file. The files are written by default to the active subdirectory.
I will illustrate below how to change the subdirectory once in STATA. First, however, you must construct a local subdirectory for the results. Using WINDOWS EXPLORER or a DOS window, create a subdirectory on d: called ECON626. Please do not write files to the class subdirectory or a permanent network drive. Use either the local drive or the temporary drive t:\ for results. If you use a local drive, please delete results when you have finished using them.

Once you construct the subdirectory for temporary output, load h:\econ626\stata\cps87.do and h:\econ626\stata\cps87.dct into your favorite ASCII text editor or, going to the web, call these programs up in your favorite browser. We will look at these programs in a moment.

Getting onto STATA

STATA can be accessed through the Department's network. From START, go to APPLICATIONS NETWORK MENU, STATISTICS PROGRAMS, STATA 6.0. Once in STATA, you will notice four windows. The small window at the bottom is where you type commands that execute programs, generate descriptive statistics, etc. The box labeled Review contains a history of the programming statements executed by STATA in the current session. The box at the bottom left contains a list of all the variables currently loaded into memory. Once you execute a program, the results from the exercise will

scroll through the large black box in the middle of the page. STATA can be run interactively or programs can be submitted in batch format. In this handout, we will use a linear combination of these two procedures.

A Basic STATA program

From the command line, you can write individual lines of code that read data, get means, etc. In this example, we will instead execute an existing program. Executable STATA programs are called DO files. They must be written in ASCII format and their file names must end with the .do extension. I have written a sample program that illustrates many of the basic features of STATA. The program is called h:\econ626\stata\cps87.do. Click on your ASCII text editor and bring up the program on the screen. Alternatively, a copy of the program is included in Appendix D. This program illustrates 5 basic features of STATA. How to:

- read raw data into STATA format
- construct new variables
- generate descriptive statistics
- run a simple regression
- output results to a LOG file

Throughout the program, there are two types of lines. The first is a comment, a non-executable statement that begins with a star (*). These lines are programming directions that supply the program reader with a road map of what the author is trying to accomplish. The second type of line is an executable STATA command.

At the top of the program are a few commands that configure STATA for our use. The first line tells STATA that the semicolon (;) will be used to indicate the end of a line. Because we delimit lines with (;), commands can be written over more than one line. The next statement indicates the amount of RAM to use for data. In this case, we are using 10 megabytes of RAM. For larger problems you will need to increase RAM use.

Results generated from the command line will be printed in the black box in the middle of the screen. If you want to save the results from your current STATA session, you must open a LOG file.
The line

    log using c:\econ626\cps87,replace;

opens a file called cps87.log in the c:\econ626 subdirectory (change the drive letter to match the subdirectory you created for results). If no directory is specified, the file will be written to the active subdirectory. The replace option tells STATA to overwrite any previous version of the file. At the end of the program, I CLOSE this file.

Reading raw data into STATA

For most empirical projects, you will receive data in some format like ASCII, and you must put the data into a STATA data file. This is accomplished through the use of data dictionaries. A dictionary defines where the raw data is stored, the variables in the data set, the type of each variable (integer, scientific notation, etc.), and a short variable description called a label. The syntax for data dictionaries is illustrated in the file h:\econ626\stata\cps87.dct and a copy of the dictionary is listed in Appendix C. If you have not

already done so, please load this file into an ASCII text editor. The first line in the data dictionary defines the name of the raw data file and where it is located. If no subdirectory is given, STATA assumes the raw file is located in the active subdirectory. The raw data file is called h:\econ626\stata\cps87.raw. The data file is a matrix with 7 columns and roughly 19,000 rows. Each row holds the data for one observation (person) and each column is a variable. The first 10 observations from this data are listed in Appendix A and the variables are defined in detail in Appendix B.

The raw data is stored in space-delimited format, which means variables are separated by spaces. By listing 7 variables in the data dictionary, we tell STATA that there are 7 unique variables per observation. With space-delimited text, the computer reads strings of numbers until it finds a space. The first number is variable 1, observation 1, the second is variable 2, observation 1, etc. In this instance, you must list the same number of variables in the data dictionary as in the data set. If there are 7 variables (A, B, ..., G) in the data set but only 4 variables (A, B, C, D) are listed in the data dictionary, STATA will read the first 4 values into observation 1 (A through D), then treat E, F and G from the first row, plus A from the next row, as the four variables for observation 2.

Variable names must be 8 or fewer characters, names must begin with a letter or an underscore (_), and numbers can be used in all but the first position. The data dictionary is referenced in the STATA DO program by the line

    infile using h:\econ626\stata\cps87;

Just a note about the data set. The data for this project comes from the 1987 Current Population Survey (CPS). The CPS is a monthly survey of about 60,000 US households and it is the Federal Government's source of basic labor market data such as the monthly unemployment rate. Households are in the survey for the same four calendar months in each of two consecutive years.
One quarter of the survey leaves the sample temporarily (month 4) or permanently (month 8). This group is called the out-going rotation sample. These individuals are asked a set of detailed questions about their job, such as the amount and method of payment (salary or hourly), usual weekly hours and whether they belong to a union. From these responses, one can construct usual weekly earnings. The sample we will use consists of a 25 percent random sample of male full-time (30 or more hours per week) workers.

Space-delimited data is the easiest type of text to read into a STATA data file. However, most data sets are packed, which means that variables fill specified columns and there are no blank spaces between variables. For example, consider the following 7 lines from a data set.

58FB MW MB FW MW FW MW

In this case, columns 1-2 measure age, column 3 is sex, column 4 is race, and columns 8-9 are weeks worked in the previous year. To read this type of data into STATA, use column pointers that tell STATA to look in columns 1-2 for age, column 3 for sex, etc. Using pointers is beyond this simple example, but such topics are discussed in the STATA manuals.

STATA automatically stores all variables as single-precision floating-point numbers (4 bytes each) unless otherwise specified. For example, a variable that equals 1 or 0 will still be stored as a 4-byte floating-point number. Single precision is overkill for many variables, since their values need far less space than that. Once raw data has been read into STATA, it is good programming practice to COMPRESS the data, which means STATA stores the information in the most efficient format. For example, UNIONM is a one-character variable (1 or 2). COMPRESS will store the variable as a 1-byte integer, freeing up 3 bytes of storage per observation.

GENerating new variables in STATA

Additional variables can easily be created with the GEN command. The syntax for GEN is

    gen newvar=expression;

where newvar is the name of the new variable and expression is a mathematical expression. Below are six lines that construct new variables.

    gen age2=age*age;
    gen earnwkl=ln(earnwke);
    gen union=unionm==1;
    gen topcode=earnwke==999;
    gen nonwhite=((race==2)|(race==3));
    gen big_ne=((region==1)&(smsa==1));

The first two lines use standard mathematical operators to construct new variables. One of the most common variables in applied work is a dummy variable that equals 1 or 0, separating people into two groups (male or female, black or white, etc.). These variables are easy to construct with the use of logical operators. Logical statements take the form GEN Y=(logical statement), which constructs a new variable Y that equals 1 when the logical statement is true and zero otherwise. The next four lines demonstrate how to use logical operators. In the first two, we construct a variable that equals 1 for union members and a variable that equals 1 for top-coded wages.
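As a quick check on these first two dummy variables, the lines below are a minimal sketch that can be typed at the command line once the program has run. No semicolons are needed there, since the semicolon delimiter applies only inside the DO file. The sketch assumes UNION and TOPCODE have already been created as above; it is an added illustration and not part of cps87.do.

```stata
* cross-tabulate the original code against the new dummy:
* every unionm==1 observation should fall in the union==1 column
tabulate unionm union
* count the observations flagged as top-coded
count if topcode==1
```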
Notice that two equal signs must be used when exact equality is indicated in a logical statement. Combinations of logical statements can be used to construct dummy variables. The vertical line (|) represents OR and the & sign represents AND. The variable NONWHITE equals 1 if race equals 2 OR 3, and BIG_NE equals 1 if a respondent lives in a big SMSA in the Northeast census region.

After the variables are constructed, I add a set of variable LABELs. The syntax for labels is illustrated in the next six lines.

    label var age2 "age squared";

    label var earnwkl "log earnings per week";
    label var topcode "=1 if earnwkl is topcoded";
    label var union "1=in union, 0 otherwise";
    label var nonwhite "1=nonwhite, 0=white" ;
    label var big_ne "1= live in big smsa from northeast, 0=otherwise";
    compress;

Once the new variables are constructed, I compress the data set again. You will notice that throughout the program there are numerous MORE; commands. Results from the program scroll on the screen. The MORE; command pauses the output until a hard return is issued. This allows you to examine output on the fly.

Getting descriptive statistics

Once you have the correct collection of variables in your STATA data file, you may want to construct some simple descriptive statistics. For example, you can ask STATA to DESCribe the data in your STATA file. Summary statistics (mean, min, max and standard deviation) are produced with the SUM; command. If you want more detailed information on a particular variable (quantiles, medians, skewness, kurtosis, etc.), use the SUM command, list the variables, and ask for DETAILed calculations. You can obtain complete distributions for discrete variables by using the TABULATE command. You can construct two-way contingency tables by listing two variables in the TABULATE command. For example, in the line

    tabulate region smsa, row column;

STATA will count the number of observations in all 12 unique groups of region and SMSA. The row and column options tell STATA to report row and column percentages in each cell.

Running a simple OLS regression

The most often estimated model in labor economics, if not all of economics, is the standard human capital earnings function. Log weekly wages have been shown to be roughly linear in education and quadratic in age. In the next few lines, we run a simple OLS regression. The syntax of a regression is simple: the first variable after REG is the dependent variable and all other variables are independent variables. In this example, there are five covariates.
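One way to read the age and age-squared terms in this regression: if the coefficient on AGE2 is negative, predicted log earnings peak at the age where the derivative with respect to age is zero, that is at -b(age)/(2*b(age2)). The sketch below, typed at the command line with the data in memory, computes that age from the stored coefficients. It is an added illustration under that assumption and is not part of cps87.do.

```stata
* human capital earnings function with a quadratic in age
reg earnwkl age age2 educ nonwhite union
* age at which predicted log earnings peak
* (meaningful only if the coefficient on age2 is negative)
display -_b[age]/(2*_b[age2])
```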
STATA automatically adds a constant to every model unless otherwise specified.

In many empirical models, observations can be grouped into discrete categories. Sometimes the number of categories is small (e.g., race and sex). Sometimes the categories are numerous (states and countries). In a sample with people from 50 states, adding state dummy variables requires the construction of 49 variables. STATA has an automated procedure that will construct the dummy variables and add them to a model. Placing the XI: prefix before the REG command signals to STATA that each variable written as i.name should be expanded into a set of dummy variables, one for each category but the first.

Submitting the sample program

We can execute the sample DO program from the command line by typing

    do h:\econ626\stata\cps87

and hitting RETURN. Notice that as the program executes, the results pause until a hard return is issued. Once the program completes, the log is closed and the results are written to an external file. The variables are, however, still in active memory, so typing SUM and hitting return sends sample means to the results window. Typing TABULATE EDUC and hitting return generates the distribution of education. Note, however, that because the log is closed, these results are not written to disk.

The data set constructed by this program is still active. Looking at the VARIABLES window, you see a list of variables that are active in the data set. Scroll down to the bottom of the window, then return to the STATA command line. Type in the following command, hitting return afterwards.

    gen bigcity=smsa==1

Notice that the variable BIGCITY has now been added to the data file. Type in the next line, hitting return afterwards

    label var bigcity "live in top 19 smsa"

and you will notice this label is added to the variable list. Now that you have a new variable, you can use it in your work. For example, you can sort the data by bigcity by typing the following line

    sort bigcity

then hitting return. Once we have the data sorted by bigcity, we can get mean log weekly earnings by city size by typing

    by bigcity: sum earnwkl

You see that mean log earnings are 6.00 outside of big SMSAs and 6.18 in the top 19 SMSAs. As another example, we can sort the data by race by typing

    sort race

and hitting return, then run human capital earnings regressions by race

    by race: reg earnwkl educ age age2

You see that the rate of return to education is 6.9% for whites, 7.5% for blacks, and 5.58% for Hispanics.

Handling errors

If your program has errors, open an ASCII editor, then edit and save the program. You will need to close any open log from the command line by typing LOG CLOSE, and CLEAR any active variables in memory. You are then ready to re-run your program. If you hit the page up key, you will notice that previously entered commands appear in the command line. This is a quick way of recalling lines of code.

Getting online help

This program illustrates a small fraction of the procedures and capabilities of STATA. To learn more, from the toolbar at the top of the window, click on HELP then SEARCH. Type OLS then hit return. The output from this search lists all procedures that utilize an OLS model. Go down to about the seventh entry and click on the red REGRESS. Here is the syntax for the regression procedure. In this section you learn that to run a regression for only whites, we type

    reg earnwkl educ age age2 union if race==1

or, if you want robust standard errors, we type

    reg earnwkl educ age age2 union, robust

Exiting STATA

To exit STATA, please go to the command line, type CLEAR and hit return, which clears all variables from memory, then type EXIT and hit return. Please delete the d:\econ626 subdirectory from the local drive.
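The restart steps described under Handling errors can be sketched as the following command-line sequence. The CAPTURE prefix is an addition not used in the handout: it suppresses the error message STATA would otherwise give if no log happens to be open.

```stata
* close the log left open by the failed run, if there is one
capture log close
* drop all variables from memory
clear
* re-run the corrected program
do h:\econ626\stata\cps87
```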

Appendix A
First 10 observations from h:\econ626\stata\cps87.raw

Variable positions in h:\econ626\stata\cps87.raw, from left to right:
age, race, education, union, smsa, region, weekly earnings

Appendix B
Detailed Variable Definitions
h:\econ626\stata\cps87.raw

Variable    Definition
AGE         Age in years
RACE        =1 if white, non-Hispanic, =2 if black, non-Hispanic, =3 if Hispanic
EDUC        Years of completed education, top-coded at 18
UNIONM      =1 if a union member, =2 otherwise
SMSA        =1 if live in one of 19 largest Standard Metropolitan Statistical Areas (SMSA), =2 if live in other SMSA, =3 if live in non-SMSA
REGION      =1 if live in Northeast, =2 if live in Midwest, =3 if live in South, =4 if live in West
EARNWKE     Usual weekly earnings, nominal 1987 dollars, top-coded at $999

Appendix C
Contents of h:\econ626\stata\cps87.dct
STATA Data Dictionary

    dictionary using h:\econ626\stata\cps87.raw {
        age      "age in years"
        race     "1=white, non-hisp, 2=black, n.h, 3=hisp"
        educ     "years of education"
        unionm   "1=union member, 2=otherwise"
        smsa     "1=live in 19 largest smsa, 2=other smsa, 3=non smsa"
        region   "1=east, 2=midwest, 3=south, 4=west"
        earnwke  "usual weekly earnings"
    }
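For comparison, the packed (fixed-column) data described in the section on reading raw data would need a dictionary with column pointers. The sketch below shows roughly what that might look like. The file name packed.raw and the variable names SEX and WEEKS are hypothetical, the column positions are the ones given in that example, and the storage types and input formats are assumptions; consult the STATA manual entry for INFILE before relying on it.

```stata
dictionary using packed.raw {
    * hypothetical file and variable names; columns follow the packed-data example
    _column(1)  byte  age    %2f  "age in years"
    _column(3)  str1  sex    %1s  "sex"
    _column(4)  str1  race   %1s  "race"
    _column(8)  byte  weeks  %2f  "weeks worked in previous year"
}
```

Each _column() pointer moves STATA to the stated column before the variable is read, so columns that are not listed (here 5-7) are simply skipped.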

Appendix D
h:\econ626\stata\cps87.do

    * this line defines the semicolon as the line delimiter;
    # delimit ;

    * set memory for 10 meg;
    set memory 10m;

    * write results to a log file;
    * please do not write files to permanent;
    * network drives or the class subdirectory;
    * write results to a subdirectory on t:\;
    * or local drives;
    log using c:\econ626\cps87,replace;

    *read in raw data;
    infile using h:\econ626\stata\cps87;

    * compress saves data in efficient manner;
    * turning double precision values into integer, etc;
    compress;

    * generate new variables;
    * lines 1-2 illustrate basic math functions;
    * lines 3-4 illustrate logical operators;
    * line 5 illustrates the OR statement;
    * line 6 illustrates the AND statement;
    * after you construct new variables, compress the data again;
    gen age2=age*age;
    gen earnwkl=ln(earnwke);
    gen union=unionm==1;
    gen topcode=earnwke==999;
    gen nonwhite=((race==2)|(race==3));
    gen big_ne=((region==1)&(smsa==1));

    label var age2 "age squared";
    label var earnwkl "log earnings per week";
    label var topcode "=1 if earnwkl is topcoded";
    label var union "1=in union, 0 otherwise";
    label var nonwhite "1=nonwhite, 0=white" ;
    label var big_ne "1= live in big smsa from northeast, 0=otherwise";
    compress;

    * the more command pauses the screen until a hard return is issued;
    more;

    * list variables and labels in data set;
    desc;
    more;

    * get descriptive statistics;

    sum;
    more;

    * get detailed statistics for continuous variables;
    sum earnwke, detail;
    more;

    * get frequencies of discrete variables;
    tabulate unionm;
    tabulate race;
    more;

    * get two-way table of frequencies;
    tabulate region smsa, row column;
    more;

    *run simple regression;
    reg earnwkl age age2 educ nonwhite union;
    more;

    * run regression adding smsa, region and race fixed-effects;
    xi: reg earnwkl age age2 educ union i.race i.region i.smsa;
    more;

    * close log file;
    log close;

Results
h:\econ626\stata\cps87.log

    . *read in raw data;
    . infile using h:\econ626\stata\cps87;

    dictionary using h:\econ626\stata\cps87.raw{
        age "age in years"
        race "1=white, non-hisp, 2=black, n.h, 3=hisp"
        educ "years of education"
        unionm "1=union member, 2=otherwise"
        smsa "1=live in 19 largest smsa, 2=other smsa, 3=non smsa"
        region "1=east, 2=midwest, 3=south, 4=west"
        earnwke "usual weekly earnings"
    }
    (19906 observations read)

    . * compress saves data in efficient manner;
    . * turning double precision values into integer, etc;
    . compress;
      age was float now byte
      race was float now byte
      educ was float now byte
      unionm was float now byte
      smsa was float now byte
      region was float now byte
      earnwke was float now int

    . * generate new variables;
    . * lines 1-2 illustrate basic math functions;
    . * lines 3-4 illustrate logical operators;
    . * line 5 illustrates the OR statement;
    . * line 6 illustrates the AND statement;
    . * after you construct new variables, compress the data again;
    . gen age2=age*age;
    . gen earnwkl=ln(earnwke);
    . gen union=unionm==1;
    . gen topcode=earnwke==999;
    . gen nonwhite=((race==2)|(race==3));
    . gen big_ne=((region==1)&(smsa==1));
    . label var age2 "age squared";
    . label var earnwkl "log earnings per week";
    . label var topcode "=1 if earnwkl is topcoded";
    . label var union "1=in union, 0 otherwise";
    . label var nonwhite "1=nonwhite, 0=white" ;
    . label var big_ne "1= live in big smsa from northeast, 0=otherwise";
    . compress;
      age2 was float now int
      union was float now byte
      topcode was float now byte
      nonwhite was float now byte
      big_ne was float now byte

    . * the more command pauses the screen until a hard return is issued;
    . more;
    . * list variables and labels in data set;
    . desc;

    Contains data
      obs:        19,906
     vars:            13
     size:       437,932 (95.8% of memory free)

      1. age       byte   %9.0g   age in years
      2. race      byte   %9.0g   1=white, non-hisp, 2=black, n.h, 3=hisp
      3. educ      byte   %9.0g   years of education
      4. unionm    byte   %9.0g   1=union member, 2=otherwise
      5. smsa      byte   %9.0g   1=live in 19 largest smsa, 2=other smsa, 3=non smsa
      6. region    byte   %9.0g   1=east, 2=midwest, 3=south, 4=west
      7. earnwke   int    %9.0g   usual weekly earnings
      8. age2      int    %9.0g   age squared
      9. earnwkl   float  %9.0g   log earnings per week
     10. union     byte   %9.0g   1=in union, 0 otherwise
     11. topcode   byte   %9.0g   =1 if earnwkl is topcoded
     12. nonwhite  byte   %9.0g   1=nonwhite, 0=white
     13. big_ne    byte   %9.0g   1= live in big smsa from northeast, 0=otherwise

    Sorted by:
         Note: dataset has changed since last saved

    . more;
    . * get descriptive statistics;
    . sum;

    [Table of Obs, Mean, Std. Dev., Min and Max for age, race, educ, unionm, smsa, region, earnwke, age2, earnwkl, union, topcode, nonwhite and big_ne; numeric values not preserved]

    . more;

    . * get detailed statistics for continuous variables;
    . sum earnwke, detail;

    [Detailed summary of usual weekly earnings: percentiles, smallest and largest values, Obs, Sum of Wgt., Mean, Std. Dev., Variance, Skewness and Kurtosis; only the 50th percentile, 449, is preserved]

    . more;
    . * get frequencies of discrete variables;
    . tabulate unionm;

    [Frequency table (Freq., Percent, Cum.) for unionm, "1=union member, 2=otherwise"; numeric values not preserved]

    . tabulate race;

    [Frequency table (Freq., Percent, Cum.) for race, "1=white, non-hisp, 2=black, n.h, 3=hisp"; numeric values not preserved]

    . more;
    . * get two-way table of frequencies;
    . tabulate region smsa, row column;

    [Two-way table of region ("1=east, 2=midwest, 3=south, 4=west") by smsa ("1=live in 19 largest smsa, 2=other smsa, 3=non smsa") with totals; numeric values not preserved]

    . more;
    . *run simple regression;
    . reg earnwkl age age2 educ nonwhite union;

    [Regression output for earnwkl with F( 5, 19900) and rows for age, age2, educ, nonwhite, union and _cons (Coef., Std. Err., t, P>|t|, 95% Conf. Interval); numeric values not preserved]

    . more;
    . * run regression adding smsa, region and race fixed-effects;
    . xi: reg earnwkl age age2 educ union i.race i.region i.smsa;
    i.race      Irace_1-3    (naturally coded; Irace_1 omitted)
    i.region    Iregio_1-4   (naturally coded; Iregio_1 omitted)
    i.smsa      Ismsa_1-3    (naturally coded; Ismsa_1 omitted)

    [Regression summary for earnwkl with F( 11, 19894): Source, SS, df, MS, Number of obs, Prob > F, R-squared, Adj R-squared and Root MSE; numeric values not preserved]

    [Coefficient table for earnwkl with rows for age, age2, educ, union, Irace_2, Irace_3, Iregio_2, Iregio_3, Iregio_4, Ismsa_2, Ismsa_3 and _cons (Coef., Std. Err., t, P>|t|, 95% Conf. Interval); numeric values not preserved]

    . more;
    . * close log file;
    . log close
