
Setting up Stata

We are going to allocate 10 megabytes to the dataset. You do not want to allocate too much memory to the dataset, because the more memory you allocate to it, the less memory is available to run commands. Allocating too much could slow Stata down or even crash it.

    set mem 10m

We can also decide whether Stata pauses with the --more-- separator line when it displays long results:

    set more on
    set more off

Setting up a panel

Now we have to tell Stata that we have a panel dataset. We do it with the command tsset, or with iis and tis:

    iis idcode
    tis year

or

    tsset idcode year

In the previous commands, idcode is the variable that identifies individuals in our dataset and year is the variable that identifies time periods. In tsset, the individual identifier always comes first and the time variable second.

The commands referring to panel data in Stata almost always start with the prefix xt. You can check for these commands by calling the help file for xt.

    help xt

Thierry Warin, 2006-2007

You should describe and summarize the dataset as usual before you perform estimations. Stata has specific commands for describing and summarizing panel datasets.

    xtdes
    xtsum

xtdes lets you observe the pattern of the data, such as the number of individuals with each pattern of observations across time periods. In our case, we have an unbalanced panel because not all individuals have observations for all years.

The xtsum command gives you general descriptive statistics of the variables in the dataset, broken down into the overall, the between and the within variation. Overall refers to the whole dataset. Between refers to the variation of the individual means (each individual's mean across time periods). Within refers to the variation of the deviations from each individual's own mean.

You may be interested in applying the panel data tabulate command to a variable, for instance to the variable south, in order to obtain a one-way table.

    xttab south

As with the previous commands, Stata will report the tabulation for the overall, the between and the within variation.

How to generate variables

Generating variables

    gen age2=age^2
    gen ttl_exp2=ttl_exp^2
    gen tenure2=tenure^2

Now, let's compute the average wage for each individual (across time periods).

    bysort idcode: egen meanw=mean(ln_wage)

Here we did not sort the data first and then use the by prefix. We could have done so, but bysort combines both steps in a single command.

The command egen is an extension of the gen command. The general rule is to use egen when the new variable is created with one of Stata's built-in functions. In our case, we used the function mean.

You can use the command list to list the first 10 observations of the new variable meanw.

    list meanw in 1/10

And then apply the xtsum command to summarize the new variable.

    xtsum meanw

You may want to obtain the average of the logarithm of wages for each year in the panel.

    bysort year: egen meanw1=mean(ln_wage)

And then you can apply the xttab command.

    xttab meanw1

Generating dates

Let's generate dates (recent versions of Stata expect the mask in uppercase, "DMY"):

    gen varname2 = date(varname1, "dmy")

And format:

    format varname2 %d

How to generate dummies

Generating general dummies

Let's generate the dummy variable black, which is not in our dataset.

    gen black=1 if race==2
    replace black=0 if black==.

Suppose you want to generate a new variable called tenure1 that is equal to the variable tenure lagged one period. Then you would use a time series operator (l). First, you would need to sort the dataset according to idcode and year, and then generate the new variable with the "by" prefix on the variable idcode.

    sort idcode year
    by idcode: gen tenure1=l.tenure

If you were interested in generating a new variable tenure3 equal to the first difference of the variable tenure, you would use the time series d operator.

    by idcode: gen tenure3=d.tenure

If you would like to generate a new variable tenure4 equal to the second lag of the variable tenure, you would type:

    by idcode: gen tenure4=l2.tenure

The same principle applies to the operator d.

Let's just save our data file with the changes that we made to it.

    save, replace

Another way to generate dummies is to use the xi command. It takes the items (strings of letters, for instance) of a designated variable (category, for instance) and creates a dummy variable for each item. You can change the omitted base category; the following omits the most prevalent one:

    char _dta[omit] prevalent
    xi i.category
    tabulate category

Generating time dummies

Let's now generate our time dummies. We use the "tabulate" command with the option "gen" in order to generate a time dummy for each year of our dataset. We will name the time dummies "y": we get a first time dummy called "y1", which takes the value 1 if year=1980 and 0 otherwise, a second time dummy "y2", which takes the value 1 if year=1982 and 0 otherwise, and similarly for the remaining years. You could give any other name to your time dummies.

    tab year, g(y)
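The time-dummy step can be sketched as follows. This is only an illustration under assumptions: it supposes the data are Stata's nlswork example dataset (the text does not name the file), and it uses the hypothetical stub yr instead of y so that a wildcard on the dummies does not also pick up the variable year.

```stata
* Sketch only: assumes the nlswork example dataset
webuse nlswork, clear

* One dummy per year, named with the hypothetical stub yr (yr1, yr2, ...)
tabulate year, generate(yr)

* yr* matches only the dummies; y* would also match the variable year
describe yr*

* Each observation should belong to exactly one year
egen nyears = rowtotal(yr*)
assert nyears == 1
```

Choosing a stub that no other variable name starts with keeps later commands such as testparm short and unambiguous.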

Running OLS regressions

Let's now turn to estimation commands for panel data. The first type of regression that you may run is a pooled OLS regression, which is simply an OLS regression applied to the whole dataset. This regression ignores the fact that you observe different individuals across time periods, so it does not account for the panel nature of the dataset.

    reg ln_wage grade age ttl_exp tenure black not_smsa south

In the previous command, age refers only to the variable age, because an exact name match takes precedence over abbreviation. If you also want to include the squared term age2, you do not need to type each name: typing age* instructs Stata to include all the variables whose names start with age.

Suppose you want to observe the internal results saved in Stata after the last estimation. This is valid for any regression that you perform. In order to observe them, you would type:

    ereturn list

If you want to control for some categories:

    xi: reg dependent ind1 ind2 i.category1 i.category2 i.time

Let's perform a regression where only the variation of the means across individuals is considered. This is the between regression.

    xtreg ln_wage grade age ttl_exp tenure black not_smsa south, be
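For reference, the between regression can be written as an OLS regression on individual means. This is a sketch in standard notation rather than a formula taken from the text: a (the intercept), b (the coefficients), v (the error) and T_i (the number of observations for individual i) are notational assumptions.

```latex
\bar{y}_{i\cdot} = a + \bar{x}_{i\cdot}\, b + \bar{v}_{i\cdot},
\qquad
\bar{y}_{i\cdot} = \frac{1}{T_i} \sum_{t=1}^{T_i} y_{it}
```

Each individual contributes a single observation, its own mean over time, so all variation within an individual is discarded.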

Running Panel regressions

In empirical work with panel data, you always face a choice between two alternative regressions: fixed effects (or within, or least squares dummy variables, LSDV) estimation and random effects (or feasible generalized least squares, FGLS) estimation.

In the two-way model, the error term is the sum of three components:
1. a specific individual effect,
2. a specific time effect,
3. and an additional idiosyncratic term.

In the one-way model, the error term is the sum of two components:
1. a specific individual effect,
2. and an idiosyncratic term.

It is absolutely fundamental that the error term is not correlated with the independent variables. If the individual and/or time effects are not correlated with the independent variables, then the random effects model should be used because it is a weighted average of the between and within estimations. But if there is correlation between the individual and/or time effects and the independent variables, then the individual and time effects must be estimated as dummy variables (fixed effects model) in order to solve the endogeneity problem.

The fixed effects (or within) regression is an OLS regression of the form:

    (yit - yi. - y.t + y..) = (xit - xi. - x.t + x..)b + (vit - vi. - v.t + v..)

where yi., xi. and vi. are the means of the respective variables (and the error) within the individual across time, y.t, x.t and v.t are the means of the respective variables (and the error) within each time period across individuals, and y.., x.. and v.. are the overall means of the respective variables (and the error).

Choosing between Fixed effects and Random effects? The Hausman test

The generally accepted way of choosing between fixed and random effects is running a Hausman test.

Statistically, fixed effects are always a reasonable thing to do with panel data (they always give consistent results), but they may not be the most efficient model to run. Random effects is a more efficient estimator and so gives smaller standard errors and lower p-values, so you should run random effects if it is statistically justifiable to do so.

The Hausman test checks a more efficient model against a less efficient but consistent model to make sure that the more efficient model also gives consistent results.

To run a Hausman test comparing fixed with random effects in Stata, you first estimate the fixed effects model, save the coefficients so that you can compare them with the results of the next model, estimate the random effects model, and then do the comparison.

1. xtreg dependentvar independentvar1 independentvar2..., fe
2. estimates store fixed
3. xtreg dependentvar independentvar1 independentvar2..., re
4. estimates store random
5. hausman fixed random

The Hausman test tests the null hypothesis that the coefficients estimated by the efficient random effects estimator are the same as the ones estimated by the consistent fixed effects estimator. If the difference is insignificant (p-value, Prob>chi2, larger than .05), then it is safe to use random effects. If you get a significant p-value, however, you should use fixed effects.

If you want a fixed effects model with robust standard errors, you can use the following command:

    areg ln_wage grade age ttl_exp tenure black not_smsa south, absorb(idcode) robust

You may be interested in running a maximum likelihood estimation in panel data. You would type:

    xtreg ln_wage grade age ttl_exp tenure black not_smsa south, mle

If you qualify for a fixed effects model, should you include time effects?

Another important question when you are doing empirical work with panel data is whether to include time effects (time dummies) in your fixed effects model.

In order to test for the inclusion of time dummies in our fixed effects regression:

1. First, we run fixed effects including the time dummies. In the next fixed effects regression, the time dummies are abbreviated with the wildcard "y*" (see Generating time dummies), but you could type them all if you prefer. Note that y* also matches the variable year; Stata drops the resulting collinear term, but listing the dummies explicitly avoids it.

    xtreg ln_wage grade age ttl_exp tenure black not_smsa south y*, fe

2. Second, we apply the "testparm" command. It is the test for time dummies, whose null hypothesis is that the time dummies are not jointly significant.

    testparm y*

3. We reject the null hypothesis that the time dummies are not jointly significant if the p-value is smaller than 10%, and as a consequence our fixed effects regression should include time effects.

Fixed effects or random effects when time dummies are involved: a test

What if the inclusion of time dummies in our regression permitted us to use a random effects model for the individual effects? (This question is not usually considered in typical empirical work; the purpose here is to show you an additional test for random effects in panel data.)

1. First, we run a random effects regression including our time dummies,

    xtreg ln_wage grade age ttl_exp tenure black not_smsa south y*, re

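2. and then we apply the "xttest0" command, the Breusch and Pagan Lagrange multiplier test for random effects. Its null hypothesis is that the variance of the individual effects is zero, that is, that there are no random effects.

    xttest0

3. If the p-value is smaller than 10%, the null hypothesis is rejected: individual effects are present and pooled OLS is not appropriate. Note that xttest0 compares random effects with pooled OLS; the choice between fixed and random effects still rests on the Hausman test, here applied to the models with time effects.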
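The model-selection steps described above can be strung together in one do-file. The following is a minimal sketch, not a definitive recipe: it assumes the example data are Stata's nlswork dataset loaded with webuse (the text does not name the file), and it leaves the time dummies out to keep the example short.

```stata
* Sketch only: assumes the tutorial data are the nlswork example dataset
webuse nlswork, clear
tsset idcode year

* Dummy for race==2, built as in the text
gen black = 1 if race == 2
replace black = 0 if black == .

* Fixed effects, then store the estimates
xtreg ln_wage grade age ttl_exp tenure black not_smsa south, fe
estimates store fixed

* Random effects, then store the estimates
xtreg ln_wage grade age ttl_exp tenure black not_smsa south, re
estimates store random

* Breusch-Pagan LM test (must follow the random effects regression)
xttest0

* Hausman test: consistent estimator first, efficient estimator second
hausman fixed random
```

Variables that do not vary within an individual (black, for instance) are dropped by the fixed effects regression, so hausman compares only the coefficients that both models estimate.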

GMM estimations

Two additional commands that are very useful in empirical work are the Arellano and Bond estimator (GMM estimator) and the Arellano and Bover estimator (system GMM). Both commands let you deal with dynamic panels (where you want to use lags of the dependent variable as independent variables) as well as with problems of endogeneity. You may want to have a look at them.

The commands are respectively "xtabond" and "xtabond2". "xtabond" is a built-in command in Stata, so in order to check how it works, just type:

    help xtabond

"xtabond2" is not a built-in command in Stata. If you want to look at it, you must first get it from the net (this is another feature of Stata: you can always get additional commands from the net). You type the following:

    findit xtabond2

Then follow the installation links that findit displays.

How does it work?

The xtabond2 command estimates dynamic models either with the GMM estimator in difference or with the GMM estimator in system.

    xtabond2 dep_variable ind_variables [if] [in], noleveleq gmm(list1, options1) iv(list2, options2) two robust small

1. When noleveleq is specified, the GMM estimator in difference is used. Otherwise, if noleveleq is not specified, the GMM estimator in system is used.

2. gmm(list1, options1): list1 is the list of the non-exogenous independent variables, and options1 may take the following values: lag(a b), eq(diff), eq(level), eq(both) and collapse.

   o lag(a b) means that for the equation in difference, the lagged values (in level) of each variable from list1, dated from t-a to t-b, will be used as instruments; whereas for the equation in level, the first differences dated t-a+1 will be used as instruments. If b=., it means b is infinite. By default, a=1 and b=.
     Example: gmm(x y, lag(2 .)): all the lagged values of x and y, lagged by at least two periods, will be used as instruments.
     Example 2: gmm(x, lag(1 2)) gmm(y, lag(2 3)): for variable x, the values lagged one and two periods will be used as instruments, whereas for variable y, the values lagged two and three periods will be used as instruments.

   o Options eq(diff), eq(level) or eq(both) mean that the instruments must be used respectively for the equation in first difference, for the equation in level, or for both. By default, the option is eq(both).

   o Option collapse reduces the size of the instrument matrix and helps to prevent the overestimation bias in small samples when the number of instruments is close to the number of observations. But it reduces the statistical efficiency of the estimator in large samples.

3. iv(list2, options2): list2 is the list of variables that are strictly exogenous, and options2 may take the following values: eq(diff), eq(level), eq(both), pass and mz.

   o eq(diff), eq(level) and eq(both): see above.

   o By default, the exogenous variables are differenced to serve as instruments in the equations in first difference, and are used undifferenced to serve as instruments in the equations in level. The pass option prevents the exogenous variables from being differenced when they serve as instruments in the equations in first difference.
     Example: iv(x, eq(level)) iv(x, eq(diff) pass) uses variable x in level as an instrument in the equation in level as well as in the equation in difference.

   o Option mz replaces the missing values of the exogenous variables by zero, thus allowing the regression to include the observations whose data on exogenous variables are missing. This option impacts the coefficients only if the variables are exogenous.

4. Option two: This option specifies the use of the GMM estimation in two steps. Although this two-step estimation is asymptotically more efficient, it leads to biased results in finite samples. To fix this issue, the xtabond2 command applies a finite-sample correction to the covariance matrix. So far, there is no test to know whether the one-step GMM estimator or the two-step GMM estimator should be used.

5. Option robust: This option corrects the t-tests for heteroscedasticity.

6. Option small: This option replaces the z-statistics by t-statistics.
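To make the options above concrete, here is a minimal sketch of a difference GMM and a system GMM call. Everything specific in it is an assumption for illustration: the abdata example dataset, its variables (n, w, k, id, year), the choice of treating w and k as strictly exogenous, and installing xtabond2 from SSC.

```stata
* Sketch only: abdata and its variables (n, w, k, id, year) are assumptions
* xtabond2 is user-written; install it once with: ssc install xtabond2
webuse abdata, clear
tsset id year

* Difference GMM: n dated t-2 and earlier instruments L.n
xtabond2 n L.n w k, gmm(L.n) iv(w k) noleveleq robust small

* System GMM, two-step, with a restricted and collapsed instrument set
xtabond2 n L.n w k, gmm(L.n, lag(1 2) collapse) iv(w k) twostep robust small
```

The second call drops noleveleq, so the level equation is added, and it combines lag(1 2) with collapse to keep the number of instruments small.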