
Introduction to STATA 6.0
ECONOMICS 626
Bill Evans
Fall 2001

This handout gives a very brief introduction to STATA 6.0 on the Economics Department Network. In a few short years, STATA has become one of the leading programs used by researchers in applied microeconomics. STATA was written by a labor economist and it contains many econometric procedures (fixed-effects, two-stage least squares, sample selection correction, quantile regressions, probit/logit, etc.) used in the analysis of cross-sectional data. It is fast and relatively easy to use.

STATA's speed advantage comes from the fact that all data is loaded into RAM. Consequently, the amount of available memory restricts the size of the problem. Given the size of the data sets we will use in class and the available memory on department machines, this should not prove to be much of a constraint. Some projects are, however, just too big for STATA. For example, one of our graduate students is using the Natality Detail data for the years 1989-1996. The Detail data contain a census of all births in the US, or about 4 million observations per year. With a data set of 32 million observations, this project is only feasible in STATA if the host machine has about 1 GIG of RAM. Most machines do not have this much RAM, so for problems of this magnitude other programs like SAS may be preferred.

All the STATA data files, sample programs, data dictionaries and results for this class can be found on the network drive h:\econ626\stata. These programs can be accessed once you log onto the Economics Department Network. For those with access to a department machine but no permanent network account, you can log onto the network as a guest using temp as your login ID and economics as the password. Sample programs and data files are also located on the class web page www.bsos.umd.edu/econ/evans/econ626.htm.

The output from the sample programs in this handout will be written to an external file.
The files are written by default to the active subdirectory. I will illustrate below how to change the subdirectory once in STATA. First, however, you must construct a local subdirectory for the results. Using WINDOWS EXPLORER or a DOS window, create a subdirectory on d: called ECON626. Please do not write files to the class subdirectory or a permanent network drive. Use either the local drive or the temporary drive t:\ for results. If you use a local drive, please delete results when you have finished using them.

Once you construct the subdirectory for temporary output, load h:\econ626\stata\cps87.do and h:\econ626\stata\cps87.dct into your favorite ASCII text editor or, going to the web, call these programs up in your favorite browser. We will look at these programs in a moment.

Getting onto STATA

STATA can be accessed through the Department's Network. From START, go to APPLICATIONS NETWORK MENU, STATISTICS PROGRAMS, STATA 6.0. Once in STATA, you will notice 4 windows. The small window at the bottom is where you type commands that execute programs, generate descriptive statistics, etc. The box labeled Review contains a history of the programming statements executed by STATA in the current session. The box at the bottom left contains a list of all the variables currently loaded into memory. Once you execute a program, the results from the exercise will scroll through the large black box in the middle of the page.

STATA can be run interactively or programs can be submitted in batch format. In this handout, we will use a linear combination of these two procedures.

A Basic STATA program

From the command line, you can write individual lines of code that read data, get means, etc. In this example, we will instead execute an existing program. Executable STATA programs are called DO files. They must be written in ASCII format and their file names must end with the .do extension. I have written a sample program that illustrates many of the basic features of STATA. The program is called h:\econ626\stata\cps87.do. Click on your ASCII text editor and bring up the program on the screen. Alternatively, a copy of the program is included in Appendix D.

This program illustrates 5 basic features of STATA. How to:

- read raw data into STATA format
- construct new variables
- generate descriptive statistics
- run a simple regression
- output results to a LOG file

Throughout the program, there will be two types of lines. The first is a comment, a non-executable statement that begins with a star (*). These lines are programming directions that supply the program reader with a road map of what the author is trying to accomplish. The second type of line is an executable STATA command.

At the top of the program are a few commands that configure STATA for our use. The first line tells STATA that the semicolon (;) will be used to indicate the end of a line. Because we delimit lines with (;), commands can be written over more than one line. The next statement indicates the amount of RAM to use for data. In this case, we are using 10 meg of RAM. For larger problems you will need to increase RAM use.

Results generated from the command line will be printed in the black box in the middle of the screen. If you want to save the results from your current STATA session, you must open a LOG file.
The line

    log using c:\econ626\cps87,replace;

opens a file called cps87.log in the econ626 subdirectory named in the command. Note that the sample program writes to c:\econ626 while the instructions above create the subdirectory on d:, so edit the drive letter in the program to match the directory you created. If no directory is specified, the file will be written to the active subdirectory. The replace option tells STATA to overwrite any previous version of the file. At the end of the program, I CLOSE this file.

Reading raw data into STATA

For most empirical projects, you will receive data in some format like ASCII, and you must put the data into a STATA data file. This is accomplished through the use of data dictionaries. A dictionary defines where the raw data is stored, the variables in the data set, the type of variable (integer, scientific notation, etc.), and a short variable description called a label. The syntax for data dictionaries is illustrated in the file h:\econ626\stata\cps87.dct and a copy of the dictionary is listed in Appendix C. If you have not already done so, please load this file into an ASCII text editor.

The first line in the data dictionary defines the name of the file and where the raw data is located. If no subdirectory is given, STATA assumes the raw file is located in the active subdirectory. The raw data file is called h:\econ626\stata\cps87.raw. The data file is a matrix with 7 columns and roughly 19,000 rows. Each row represents data for another observation (person) and each column is a new variable. The first 10 observations from this data are listed in Appendix A and the variables are defined in detail in Appendix B.

The raw data is stored in space-delimited format, which means variables are separated by spaces. By listing 7 variables in the data dictionary, STATA assumes that there are 7 unique variables per observation. With space-delimited text, the computer reads strings of numbers until it finds a space. The first number is variable 1, observation 1, the second is variable 2, observation 1, etc. In this instance, you must list the same number of variables in the data dictionary as in the data set. If there are 7 variables (A, B, ..., G) in the data set but only 4 variables are listed (A, B, C, D) in the data dictionary, STATA will read the first 4 values into observation 1 (A..D), then assume that E, F, G and the A from the next observation are the four variables for observation 2.

Variable names must be 8 or fewer characters, names must begin with a letter or an underscore (_), and numbers can be used in all but the first position.

The data dictionary is referenced in the STATA DO program by the line

    infile using h:\econ626\stata\cps87;

Just a note about the data set. The data for this project comes from the 1987 Current Population Survey (CPS). The CPS is a monthly survey of about 60,000 US households and it is the Federal Government's source of basic labor market data such as the monthly unemployment rate. Households are in the survey for the same 4 months in a two-year cycle.
One quarter of the survey leaves the sample temporarily (month 4) or permanently (month 8). This group is called the out-going rotation sample. These individuals are asked a set of detailed questions about their job such as the amount and method of payment (salary or hourly), usual weekly hours and whether they belong to a union. From these responses, one can construct usual weekly earnings. The sample we will use consists of a 25 percent random sample of male full-time (30 or more hours per week) workers.

Space-delimited data is the easiest type of text to read into a STATA data file. However, most data sets are packed, which means that variables fill specified columns and there are no blank spaces between variables. For example, consider the following 7 lines from a data set.

    58FB261 0 0 0
    33MW2655040 21000
    46MB2715240 28000
    40FW241 0 0 0
    20MW255 440 4000
    46FW2735260 40000
    51MW2745280330659

In this case, columns 1-2 measure age, column 3 is sex, column 4 is race, and columns 8-9 are weeks worked in the previous year. To read this type of data into STATA, use column pointers that tell STATA to look in columns 1-2 for age, column 3 for sex, etc. Using pointers is beyond this simple example but such topics are discussed in the STATA manuals.

STATA automatically stores all variables as single-precision floating point numbers (type float, 4 bytes) unless otherwise specified. For example, a variable that equals 1 or 0 will be stored as 1.000000 or 0.000000. Single precision is overkill for many variables since they need far fewer digits. Once raw data has been read into STATA, it is good programming practice to COMPRESS the data, which means STATA stores the information in the most efficient format. For example, UNIONM is a one-character variable (1 or 2). COMPRESS will store the variable as a 1-byte integer (type byte), freeing up 3 bytes of storage per observation.

GENerating new variables in STATA

Additional variables can easily be created with the GEN command. The syntax for GEN is

    gen newvariablename=mathematical expression;

Below are six lines that construct new variables.

    gen age2=age*age;
    gen earnwkl=ln(earnwke);
    gen union=unionm==1;
    gen topcode=earnwke==999;
    gen nonwhite=((race==2)|(race==3));
    gen big_ne=((region==1)&(smsa==1));

The first two lines use standard mathematical operators to construct new variables. One of the most common variables in applied work is a dummy variable that equals 1 or 0, separating people into two groups (male or female, black or white, etc.). These variables are easy to construct with the use of logical operators. Logical operators are of the form GEN Y=(logical statement), which constructs a new variable Y that equals 1 when the logical statement is true and zero otherwise. The next four lines demonstrate how to use logical operators. In the first two, we construct a variable that equals 1 for union members and a variable that equals 1 for top-coded wages.
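The two dummy variables just described can be checked against the variables they were built from. The sketch below assumes the sample program has already been run, so that unionm, union, earnwke and topcode are in memory. The name union2 is hypothetical and is not part of the original program.

```stata
* Sketch only: assumes cps87.do has been run so these variables exist.
* Rows with unionm==1 should all fall under union==1,
* and rows with unionm==2 under union==0.
tabulate unionm union

* topcode should equal 1 exactly when earnwke sits at the top code of 999.
sum earnwke if topcode==1

* A logical expression evaluates to 0, not missing, when its input is missing.
* The if qualifier below leaves the dummy missing in such rows instead.
* (union2 is a hypothetical name used only for this illustration.)
gen union2 = unionm==1 if unionm < .
```

The sample data contain no missing values, so union and union2 should agree here. The if qualifier matters only in data sets where the source variable can be missing.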
Notice that two equal signs must be used when exact equality is indicated in a logical statement. Combinations of logical statements can be used to construct dummy variables. The vertical line (|) represents or and the & sign represents and. The variable NONWHITE equals 1 if race equals 2 OR 3, and BIG_NE equals 1 if a respondent comes from a big SMSA in the Northeast census region.

After the variables are constructed, I add a set of variable LABELs. The syntax for labels is illustrated in the next six lines.

    label var age2 "age squared";
    label var earnwkl "log earnings per week";
    label var topcode "=1 if earnwkl is topcoded";
    label var union "1=in union, 0 otherwise";
    label var nonwhite "1=nonwhite, 0=white" ;
    label var big_ne "1= live in big smsa from northeast, 0=otherwise";
    compress;

Once the new variables are constructed, I compress the data set again.

You will notice throughout the program there are numerous MORE; commands. Results from the program will scroll on the screen. The MORE; command pauses until a hard return is issued. This allows you to examine output on the fly.

Getting descriptive statistics

Once you have the correct collection of variables in your STATA data file, you may want to construct some simple descriptive statistics. For example, you can ask to DESCribe the data in your STATA file. Summary statistics (mean, min, max and standard deviation) are produced with the SUM; command. If you want more detailed information on a particular variable (quantiles, medians, skewness, kurtosis, etc.), use the SUM command, list the variables, and ask for DETAILed calculations. You can obtain complete distributions for discrete variables by using the TABULATE command. You can construct two-way contingency tables by listing the two variables in the TABULATE command. For example, in the line

    tabulate region smsa, row column;

STATA will count the number of observations for all 12 unique groups of region and SMSA. The row and column options to the command tell STATA to report row and column percentages for each cell.

Running a simple OLS regression

The most-often estimated model in labor economics, if not all of economics, is the standard human capital earnings function. Log weekly wages have been shown to be roughly linear in education and quadratic in age. In the next few lines, we run a simple OLS regression. The syntax of a regression is simple: the first variable after REG is the dependent variable and all other variables are independent variables. In this example, there are five covariates.
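As a minimal sketch of the syntax just described, the regression from the sample program can also be typed interactively, and its stored coefficients can then be used in further calculations. The two display lines are an added illustration and are not part of the original program.

```stata
* Sketch: the regression from the sample program, typed at the command line
* (no trailing semicolon is needed outside a "#delimit ;" do-file).
* earnwkl is the dependent variable; the other five are covariates.
reg earnwkl age age2 educ nonwhite union

* _b[varname] holds the coefficient on varname from the most recent regression.
* Age at which predicted log earnings peak: -b_age / (2*b_age2)
display -_b[age]/(2*_b[age2])

* Percentage union wage premium implied by the log-point coefficient
display 100*(exp(_b[union]) - 1)
```

With the coefficients reported in the log at the end of this handout (.0679808 on age, -.0006778 on age2 and .1301547 on union), the first expression works out to roughly age 50 and the second to roughly 14 percent.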
STATA automatically adds a constant to every model unless otherwise specified.

In many empirical models, observations can be grouped into discrete categories. Sometimes the number of categories is small (e.g., race and sex). Sometimes the categories are numerous (states and countries). In a sample with people from 50 states, adding state dummy variables requires the construction of 49 variables. STATA has an automated procedure that will construct the dummy variables and add them to a model. Placing the XI: prefix before the REG command signals to STATA that each variable written as i.name should be expanded into a set of dummy variables, with one category omitted as the reference group.
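To make concrete what XI does, the sketch below builds the race dummies by hand and then lets XI build them. The stub race_ is an arbitrary choice and is not used in the original program. Both regressions should give the same coefficients on the race dummies, because both omit the first category.

```stata
* Sketch: two routes to the same set of race dummies.

* Route 1: build the dummies by hand. tabulate with gen() creates
* race_1, race_2 and race_3 (the stub "race_" is an arbitrary choice).
tabulate race, gen(race_)
reg earnwkl age age2 educ union race_2 race_3

* Route 2: let xi build them; the lowest category is omitted by default.
xi: reg earnwkl age age2 educ union i.race
```

In more recent releases of STATA (version 11 onward), i.race can be written directly in the REG command without the XI: prefix. The prefix form shown here is the one this handout's version requires.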

Submitting the sample program

We can execute the sample DO program from the command line by typing

    do h:\econ626\stata\cps87

and hitting RETURN. Notice that as the program executes, the results pause until a hard return is issued. Once the program completes, the log is closed and the results are written to an external file. The variables are, however, still in active memory, so by typing SUM and hitting return, sample means return to the results window. Typing TABULATE EDUC and hitting return generates the distribution of education. Note, however, that because the log is closed, the results are not written to disk.

The data set constructed by this program is still active. Looking at the VARIABLES window, you see a list of variables that are active in the data set. Scroll down to the bottom of the window, then return to the STATA command line. Type in the following command, hitting return after you type.

    gen bigcity=smsa==1

Notice that the variable BIGCITY has now been added to the data file. Type in the next line, hitting return afterwards

    label var bigcity "live in top 19 smsa"

and you notice this label is added to the variable list. Now that you have a new variable, you can use it in your work. For example, you can sort the data by bigcity by typing the following line

    sort bigcity

then hitting return. Once we have the data sorted by bigcity, we can get mean log weekly earnings by city size by typing

    by bigcity: sum earnwkl

You see that mean log earnings are 6.00 outside of big SMSAs and 6.18 in the top 19 SMSAs. As another example, we can sort the data by race by typing

    sort race

and hitting return, then run human capital earnings regressions by race

    by race: reg earnwkl educ age age2

You see that the rate of return to education is 6.9% for whites, 7.5% for blacks, and 5.58% for Hispanics.

Handling errors

If your program has errors, enter an ASCII editor, then edit and save the program. You will need to close an open log from the command line by typing LOG CLOSE and CLEAR any active variables in memory. You are then ready to re-run your program. If you hit the page up key, you will notice that previously-entered commands appear in the command line. This is a quick way of recalling lines of code.

Getting online help

This program illustrates a small fraction of the procedures and capabilities of STATA. To find others, use the online help. For example, from the toolbar at the top of the window, click on HELP then SEARCH. Type OLS then hit return. The output from this search lists all procedures that utilize an OLS model. Go down to about the seventh entry and click on the red REGRESS. Here is the syntax for the regression procedure. In this section you learn that to run a regression for only whites, we type

    reg earnwkl educ age age2 union if race==1

or, if you want robust standard errors, we type

    reg earnwkl educ age age2 union, robust

Exiting STATA

To exit STATA, please go to the command line, type CLEAR and hit return, which clears all variables from memory, then type EXIT and hit return. Please delete the d:\econ626 subdirectory from the local drive.
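The recovery steps described under Handling errors can be typed as three commands once the program has been edited and saved. This is a sketch rather than part of the original handout. The capture prefix is an addition that lets the first line succeed whether or not a log is still open.

```stata
* Sketch: reset the session before re-running an edited do-file.
* "capture" suppresses the error that log close raises when no log is open.
capture log close

* Drop the variables left in memory by the failed run.
clear

* Re-run the program from the top.
do h:\econ626\stata\cps87
```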

Appendix A

First 10 observations from h:\econ626\stata\cps87.raw

    55 1 12 2 1 4 750
    57 1 16 2 1 4 690
    30 3 12 2 1 4 240
    34 1 18 2 1 4 800
    31 1 16 2 1 4 999
    32 1 18 2 1 4 750
    39 3 17 2 1 4 240
    55 3 12 1 1 4 440
    39 1 12 2 1 4 999
    52 3 0 2 2 4 420

Variable Positions, h:\econ626\stata\cps87.raw

    age  race  education  union  smsa  region  weekly earnings
     55     1         12      2     1       4              750
     57     1         16      2     1       4              690
     30     3         12      2     1       4              240
     34     1         18      2     1       4              800
     31     1         16      2     1       4              999
     32     1         18      2     1       4              750
     39     3         17      2     1       4              240
     55     3         12      1     1       4              440
     39     1         12      2     1       4              999
     52     3          0      2     2       4              420

Appendix B

Detailed Variable Definitions, h:\econ626\stata\cps87.raw

    Variable   Definition
    AGE        Age in years
    RACE       =1 if white, non-hispanic, =2 if black, non-hispanic, =3 if hispanic
    EDUC       Years of completed education, top-coded at 18
    UNIONM     =1 if a union member, =2 otherwise
    SMSA       =1 if live in one of 19 largest Standard Metropolitan Statistical Areas (SMSA),
               =2 if live in other SMSA, =3 if live in non-SMSA
    REGION     =1 if live in Northeast, =2 if live in Midwest, =3 if live in South, =4 if live in West
    EARNWKE    Usual weekly earnings, nominal 1987 dollars, top-coded at $999

Appendix C

Contents of h:\econ626\stata\cps87.dct, STATA Data Dictionary

    dictionary using h:\econ626\stata\cps87.raw {
        age     "age in years"
        race    "1=white, non-hisp, 2=black, n.h, 3=hisp"
        educ    "years of education"
        unionm  "1=union member, 2=otherwise"
        smsa    "1=live in 19 largest smsa, 2=other smsa, 3=non smsa"
        region  "1=east, 2=midwest, 3=south, 4=west"
        earnwke "usual weekly earnings"
    }
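The main text leaves column pointers to the STATA manuals. For orientation only, a dictionary for the seven packed lines shown there might look like the sketch below. It assumes the columns are laid out exactly as described (age in 1-2, sex in 3, race in 4, weeks worked in 8-9) and reads only those four fields. The file name packed.raw and the variable names sex and weeks are hypothetical.

```stata
dictionary using packed.raw {
    _column(1)   byte  age    %2f  "age in years"
    _column(3)   str1  sex    %1s  "sex"
    _column(4)   str1  race   %1s  "race"
    _column(8)   byte  weeks  %2f  "weeks worked last year"
}
```

Each _column() pointer moves to the stated column before the variable is read, and the format after the variable name (%2f, %1s) says how many columns to read and whether they hold a number or a string. If this were saved as packed.dct, it would be read in the same way as the class dictionary, with infile using packed.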

Appendix D

h:\econ626\stata\cps87.do

* this line defines the semicolon as the line delimiter;
# delimit ;

* set memory for 10 meg;
set memory 10m;

* write results to a log file;
* please do not write files to permanent;
* network drives or the class subdirectory;
* write results to a subdirectory on t:\;
* or local drives;
log using c:\econ626\cps87,replace;

*read in raw data;
infile using h:\econ626\stata\cps87;

* compress saves data in efficient manner;
* turning double precision values into integer, etc;
compress;

* generate new variables;
* lines 1-2 illustrate basic math functions;
* lines 3-4 illustrate logical operators;
* line 5 illustrates the OR statement;
* line 6 illustrates the AND statement;
* after you construct new variables, compress the data again;
gen age2=age*age;
gen earnwkl=ln(earnwke);
gen union=unionm==1;
gen topcode=earnwke==999;
gen nonwhite=((race==2)|(race==3));
gen big_ne=((region==1)&(smsa==1));
label var age2 "age squared";
label var earnwkl "log earnings per week";
label var topcode "=1 if earnwkl is topcoded";
label var union "1=in union, 0 otherwise";
label var nonwhite "1=nonwhite, 0=white" ;
label var big_ne "1= live in big smsa from northeast, 0=otherwise";
compress;

* the more command pauses the screen until a hard return is issued;
more;

* list variables and labels in data set;
desc;
more;

* get descriptive statistics;
sum;
more;

* get detailed statistics for continuous variables;
sum earnwke, detail;
more;

* get frequencies of discrete variables;
tabulate unionm;
tabulate race;
more;

* get two-way table of frequencies;
tabulate region smsa, row column;
more;

*run simple regression;
reg earnwkl age age2 educ nonwhite union;
more;

* run regression adding smsa, region and race fixed-effects;
xi: reg earnwkl age age2 educ union i.race i.region i.smsa;
more;

* close log file;
log close;
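The sample program rebuilds the data set from the raw file on every run and never saves it. One option, not part of the original program, is to save the constructed data in STATA format so that later sessions can skip the infile step. The location t:\cps87 is a hypothetical choice on the temporary drive. The commands are written as they would be typed at the command line after the program finishes.

```stata
* Sketch: save the constructed data set in Stata format.
* t:\cps87 is a hypothetical location; use any directory you may write to.
save t:\cps87, replace

* In a later session, load the saved file directly instead of re-reading
* the raw data. The clear option drops whatever is currently in memory.
use t:\cps87, clear
```

To place the save line inside cps87.do itself, it would need a trailing semicolon like the other commands in that file.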

Results

h:\econ626\stata\cps87.log

. *read in raw data;
. infile using h:\econ626\stata\cps87;

dictionary using h:\econ626\stata\cps87.raw{
age "age in years"
race "1=white, non-hisp, 2=black, n.h, 3=hisp"
educ "years of education"
unionm "1=union member, 2=otherwise"
smsa "1=live in 19 largest smsa, 2=other smsa, 3=non smsa"
region "1=east, 2=midwest, 3=south, 4=west"
earnwke "usual weekly earnings"
}

(19906 observations read)

. * compress saves data in efficient manner;
. * turning double precision values into integer, etc;
. compress;
age was float now byte
race was float now byte
educ was float now byte
unionm was float now byte
smsa was float now byte
region was float now byte
earnwke was float now int

. * generate new variables;
. * lines 1-2 illustrate basic math functoins;
. * lines 3-4 line illustrate logical operators;
. * line 5 illustrate the OR statement;
. * line 6 illustrates the AND statement;
. * after you construct new variables, compress the data again;
. gen age2=age*age;
. gen earnwkl=ln(earnwke);
. gen union=unionm==1;
. gen topcode=earnwke==999;
. gen nonwhite=((race==2)|(race==3));
. gen big_ne=((region==1)&(smsa==1));
. label var age2 "age squared";
. label var earnwkl "log earnings per week";
. label var topcode "=1 if earnwkl is topcoded";
. label var union "1=in union, 0 otherwise";
. label var nonwhite "1=nonwhite, 0=white" ;
. label var big_ne "1= live in big smsa from northeast, 0=otherwsie";
. compress;
age2 was float now int
union was float now byte
topcode was float now byte
nonwhite was float now byte
big_ne was float now byte

. * the more command pauses the screen until a hard return is issued;
. more;

. * list variables and labels in data set;
. desc;

Contains data
  obs:        19,906
 vars:            13
 size:       437,932 (95.8% of memory free)
-------------------------------------------------------------------------------
   1. age       byte   %9.0g    age in years
   2. race      byte   %9.0g    1=white, non-hisp, 2=black, n.h, 3=hisp
   3. educ      byte   %9.0g    years of education
   4. unionm    byte   %9.0g    1=union member, 2=otherwise
   5. smsa      byte   %9.0g    1=live in 19 largest smsa, 2=other smsa, 3=non smsa
   6. region    byte   %9.0g    1=east, 2=midwest, 3=south, 4=west
   7. earnwke   int    %9.0g    usual weekly earnings
   8. age2      int    %9.0g    age squared
   9. earnwkl   float  %9.0g    log earnings per week
  10. union     byte   %9.0g    1=in union, 0 otherwise
  11. topcode   byte   %9.0g    =1 if earnwkl is topcoded
  12. nonwhite  byte   %9.0g    1=nonwhite, 0=white
  13. big_ne    byte   %9.0g    1= live in big smsa from northeast, 0=otherwsie
-------------------------------------------------------------------------------
Sorted by:
     Note: dataset has changed since last saved

. more;

. * get descriptive statistics;
. sum;

Variable |     Obs        Mean   Std. Dev.        Min        Max
---------+------------------------------------------------------
     age |   19906    37.96619   11.15348          21         64
    race |   19906    1.199136    .525493           1          3
    educ |   19906    13.16126   2.795234           0         18
  unionm |   19906    1.769065   .4214418           1          2
    smsa |   19906    1.908369   .7955814           1          3
  region |   19906    2.462373   1.079514           1          4
 earnwke |   19906     488.264   236.4713          60        999
    age2 |   19906    1565.826   912.4383         441       4096
 earnwkl |   19906    6.067307    .513047    4.094345   6.906755
   union |   19906    .2309354   .4214418           0          1
 topcode |   19906    .0719381   .2583919           0          1
nonwhite |   19906    .1408118   .3478361           0          1
  big_ne |   19906    .1409625   .3479916           0          1

. more;

. * get detailed descriptics for continuous variables;
. sum earnwke, detail;

                    usual weekly earnings
-------------------------------------------------------------
      Percentiles      Smallest
 1%          128             60
 5%          178             60
10%          210             60       Obs               19906
25%          300             63       Sum of Wgt.       19906

50%          449                      Mean            488.264
                        Largest       Std. Dev.      236.4713
75%          615            999
90%          865            999       Variance        55918.7
95%          999            999       Skewness        .668646
99%          999            999       Kurtosis       2.632356

. more;

. * get frequencies of discrete variables;
. tabulate unionm;

1=union member, 2=otherwise
            |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |       4597       23.09       23.09
          2 |      15309       76.91      100.00
------------+-----------------------------------
      Total |      19906      100.00

. tabulate race;

1=white, non-hisp, 2=black, n.h, 3=hisp
            |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |      17103       85.92       85.92
          2 |       1642        8.25       94.17
          3 |       1161        5.83      100.00
------------+-----------------------------------
      Total |      19906      100.00

. more;

. * get two-way table of frequencies;
. tabulate region smsa, row column;

   1=east, |
2=midwest, |   1=live in 19 largest smsa,
  3=south, |    2=other smsa, 3=non smsa
    4=west |         1          2          3 |     Total
-----------+---------------------------------+----------
         1 |      2806       1349        842 |      4997
           |     56.15      27.00      16.85 |    100.00
           |     38.46      18.89      15.39 |     25.10
-----------+---------------------------------+----------
         2 |      1501       1742       1592 |      4835
           |     31.04      36.03      32.93 |    100.00
           |     20.58      24.40      29.10 |     24.29
-----------+---------------------------------+----------
         3 |      1501       2542       1904 |      5947
           |     25.24      42.74      32.02 |    100.00
           |     20.58      35.60      34.80 |     29.88
-----------+---------------------------------+----------
         4 |      1487       1507       1133 |      4127
           |     36.03      36.52      27.45 |    100.00
           |     20.38      21.11      20.71 |     20.73
-----------+---------------------------------+----------
     Total |      7295       7140       5471 |     19906
           |     36.65      35.87      27.48 |    100.00
           |    100.00     100.00     100.00 |    100.00

. more;

. *run simple regression;
. reg earnwkl age age2 educ nonwhite union;

  Source |       SS       df       MS              Number of obs =   19906
---------+------------------------------           F(  5, 19900) = 1775.70
   Model |  1616.39963     5  323.279927           Prob > F      =  0.0000
Residual |  3622.93905 19900  .182057239           R-squared     =  0.3085
---------+------------------------------           Adj R-squared =  0.3083
   Total |  5239.33869 19905  .263217216           Root MSE      =  .42668

------------------------------------------------------------------------------
 earnwkl |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |   .0679808   .0020033     33.934   0.000       .0640542    .0719075
    age2 |  -.0006778   .0000245    -27.691   0.000      -.0007258   -.0006299
    educ |    .069219   .0011256     61.496   0.000       .0670127    .0714252
nonwhite |  -.1716133   .0089118    -19.257   0.000      -.1890812   -.1541453
   union |   .1301547   .0072923     17.848   0.000       .1158613    .1444481
   _cons |   3.630805   .0394126     92.123   0.000       3.553553    3.708057
------------------------------------------------------------------------------

. more;

. * run regression addinf smsa, region and race fixed-effects;
. xi: reg earnwkl age age2 educ union i.race i.region i.smsa;

i.race      Irace_1-3     (naturally coded; Irace_1 omitted)
i.region    Iregio_1-4    (naturally coded; Iregio_1 omitted)
i.smsa      Ismsa_1-3     (naturally coded; Ismsa_1 omitted)

  Source |       SS       df       MS              Number of obs =   19906
---------+------------------------------           F( 11, 19894) =  920.86
   Model |  1767.66908    11  160.697189           Prob > F      =  0.0000
Residual |  3471.66961 19894  .174508375           R-squared     =  0.3374
---------+------------------------------           Adj R-squared =  0.3370
   Total |  5239.33869 19905  .263217216           Root MSE      =  .41774

------------------------------------------------------------------------------
 earnwkl |      Coef.   Std. Err.       t     P>|t|       [95% Conf. Interval]
---------+--------------------------------------------------------------------
     age |    .070194   .0019645     35.732   0.000       .0663435    .0740446
    age2 |  -.0007052    .000024    -29.374   0.000      -.0007522   -.0006581
    educ |   .0643064   .0011285     56.983   0.000       .0620944    .0665184
   union |   .1131485    .007257     15.592   0.000       .0989241    .1273729
 Irace_2 |  -.2329794   .0110958    -20.997   0.000       -.254728   -.2112308
 Irace_3 |  -.1795253   .0134073    -13.390   0.000      -.2058047   -.1532458
Iregio_2 |  -.0088962   .0085926     -1.035   0.301      -.0257383     .007946
Iregio_3 |  -.0281747    .008443     -3.337   0.001      -.0447238   -.0116257
Iregio_4 |   .0318053   .0089802      3.542   0.000       .0142034    .0494071
 Ismsa_2 |  -.1225607   .0072078    -17.004   0.000      -.1366886   -.1084328
 Ismsa_3 |  -.2054124   .0078651    -26.117   0.000      -.2208287   -.1899961
   _cons |    3.76812   .0391241     96.312   0.000       3.691434    3.844807
------------------------------------------------------------------------------

. more;

. * close log file;
. log close