A Short Introduction to STATA

1) Introduction:

This session links theoretical equations to tangible results using Stata. Stata is a statistical package with a wide variety of capabilities, including data management, statistical and econometric analysis, and graphics. The user's interface includes the following windows (see Figure 1):

Command Window (highlighted in red): the window where we type all the commands.

Results Window (highlighted in blue): the window that displays all the results and output generated by the commands we have typed.

Variables Window (highlighted in orange): the window that shows all the variables currently stored in Stata's memory. We can view these variables as in a spreadsheet by typing browse (or br) in the Command Window, followed by the variables to be displayed (if no variables are specified, Stata shows all of them). To make changes to the data, type edit in the Command Window.

Command History (highlighted in green): the window that keeps a record of all the commands used in the session.

Current Working Directory (highlighted in black): the window that shows the directory on your computer from which Stata will read, and to which it will save, any files. It can be changed by typing cd path_to_the_new_directory in the Command Window (e.g. cd c:\desktop\state11\session1, or cd "c:\desktop\state11\session 1" with double quotes if the path contains a space), or from the Stata menu: File/Change Working Directory.

Figure 1: Stata User's Interface

2) Some Basic Commands:

To clear all the variables saved in Stata's memory from the last session, type clear in the Command Window.

To learn how a command is used, which options it allows, or to see some examples, type help name_of_the_command or findit name_of_the_command in the Command Window. Try help reg and findit reg, and see the differences. If we are not sure about the name of the command we need, we can type search followed by a keyword instead.

Any line that begins with a star (*) is treated as a comment and is not executed by Stata.

Stata can also be used as a calculator with the command display (e.g. display 4+5).

3) Entering Data:

I. Input from .xls or .xlsx files

Suppose your original data source is an Excel file or workbook. Econ526 students may recognize the data set used here: it comes from C. Dougherty's textbook Introduction to Econometrics, with eaef21.xls as its file name. The command to read it into Stata is

    import excel using eaef21, firstrow case(lower)

Here, excel cannot be omitted, because import also reads other formats (such as delimited text files), so Stata has to be told which one to expect. firstrow tells Stata to treat the first row of the Excel file as the variable names. In this file the names are all in upper case, so case(lower) is added to convert them to lower case. Stata is case sensitive: a capital letter and the same lower case letter are different variable names. Likewise, case(preserve) keeps the names exactly as they appear in the Excel file, and case(upper) converts them to upper case.
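The effect of the case() option is easiest to see by importing the file twice and comparing the variable lists. The lines below are a minimal sketch, assuming eaef21.xls sits in the current working directory; the clear option is added so that each import replaces whatever data are already in memory.

```stata
* Sketch only: assumes eaef21.xls is in the current working directory.
* Import with lower-case variable names, replacing any data in memory.
import excel using eaef21, firstrow case(lower) clear
describe

* Import again, this time keeping the names exactly as in the Excel file.
import excel using eaef21, firstrow case(preserve) clear
describe
```

Comparing the two describe listings shows which names changed. Because Stata is case sensitive, later commands must use the names exactly as they appear in the listing.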

II. Input from .csv files

A .csv file differs from an .xls file in that its values are separated by commas. Taking the same data set as an example, once it has been saved as a .csv file it can be loaded with the following command:

    import delimited using eaef21.csv

Here you do not need to specify firstrow or case(lower): the first row of the .csv file is used for the variable names, and they are converted to lower case automatically. Because a .csv file is plain text with explicit separators, Stata can work out the data structure by itself, which keeps the command short.

Another way to load a .csv file is to use the older command insheet:

    insheet using eaef21.csv

These two commands yield the same result. Starting with Stata 13, insheet was superseded by the new command import delimited, so if you are using an older version, use insheet. It still works in up-to-date versions of Stata, but its help file may no longer be updated.

III. Input from .txt files

The data set "earnings" used here is a .txt file taken from R. Davidson and J.G. MacKinnon, Econometric Theory and Methods, New York, Oxford University Press, 2004. The first column is the observation number; columns 2 to 4 are dummy variables for individuals in groups 1, 2 and 3 respectively. The last column is average annual earnings in 1988 and 1989, measured in 1982 US dollars. The file has no variable names in its first row, so you have to supply them yourself. The command for reading such .txt files is infile:

    infile obs d1 d2 d3 earnings using earnings.txt

where obs is the variable name for the observation numbers, and d1, d2, d3 and earnings name the remaining columns in order.
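After reading a raw text file, it is worth checking the result and storing it in Stata's own format so that it does not have to be read with infile again. The lines below are a minimal sketch, assuming earnings.txt is in the current working directory; the replace option is an addition here and overwrites any existing earnings.dta in that directory.

```stata
* Sketch only: assumes earnings.txt is in the current working directory.
clear
infile obs d1 d2 d3 earnings using earnings.txt

* Quick checks that the columns were read as intended.
list in 1/5
summarize earnings

* Store the data as earnings.dta in the current working directory.
* replace overwrites an existing file of the same name.
save earnings, replace
```

The saved file can later be reopened by name from the same directory, without listing the variable names again.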

IV. Miscellaneous

It is also easy to generate the observation number ourselves in any data set:

    gen n = _n

gen is short for generate, n is the name of the new variable, and _n is Stata's built-in counter for the current observation number. For example, let's regress earnings on the two dummies d1 and d2:

    reg earnings d1 d2

If you want to run the regression without the first 500 observations, just add if _n > 500 to the command:

    reg earnings d1 d2 if _n > 500

Since _n lets us refer to specific observations directly, we don't really need the variable obs in our data set. To delete it, use drop:

    drop obs

You can drop variables, and you can also drop observations. Before doing that, let's preserve the data so that we can restore it easily after this destructive trial:

    preserve
    drop if _n <= 1000
    restore

After the second command, Stata reports that 1000 observations have been deleted. Once you have preserved the data you can restore it, but only once for each preserve.

Conversely, the reverse operation of drop is keep:

    keep earnings

is equivalent to

    drop n d1 d2 d3

To help you remember what a particular variable measures, label it:

    label var earnings "Average annual earnings"

var stands for variable, and the text inside the quotation marks is the label.

Stata stores its own data sets on disk as .dta files. To open an existing data set, use the following command:

    use earnings

Again, as in every case above, earnings.dta must be in the current working directory.

Stata also ships with 27 data sets of its own (in version 14). They remain available as long as your Stata installation is intact, and they serve repeatedly as example data in Stata's User Reference Manual, which I highly recommend to anyone who wants to learn more. Type sysuse dir to get an initial impression of these data sets. The command to load any of them is sysuse (e.g. sysuse auto).

4) Exploring the Data:

Several commands help us explore and understand the data better. Type the following command to use the NLSW88 data set (National Longitudinal Survey of Women in 1988):

    webuse nlsw88

or, if you need to clear data already in memory,

    webuse nlsw88, clear

Now try the following commands and see the differences between them:

    describe
    describe wage age
    summarize wage
    sum wage
    summarize wage, detail
    sum wage, de
    list age race married
    list age race married in 1/10
    codebook wage
    inspect wage
    tab race collgrad
    tab race collgrad, nolabel
    tab race collgrad if wage>16.5

Note that when we add if followed by a condition (e.g. wage>16.5), the command is executed only for those observations in the data set that meet this condition.
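The if qualifier works with most commands, and conditions can be combined. The lines below are a small sketch using the same NLSW88 data (webuse needs an internet connection); the particular cut-offs are arbitrary choices for illustration.

```stata
* Sketch only: thresholds are arbitrary and chosen for illustration.
webuse nlsw88, clear

* Count the observations that satisfy a condition.
count if wage > 16.5

* Combine conditions with & (and) or | (or); == tests equality.
summarize wage if wage > 16.5 & collgrad == 1

* Stata treats missing values as larger than any number, so a ">"
* condition also selects them unless they are excluded explicitly.
tab race if tenure > 10 & !missing(tenure)
```

The last line matters for variables such as tenure that contain missing values. Without the !missing() part, observations with missing tenure would be counted as having tenure above 10.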

5) Visualizations

A. Histograms

To see the distribution of a variable graphically, we use the command histogram, or hist for short. For example, type histogram wage in the Command Window, or hist wage, normal if you would like to overlay a normal density curve. You should see the following picture:

[Figure: histogram of hourly wage; horizontal axis: hourly wage (0 to 40), vertical axis: Density]

The picture shows that wage is right skewed.

B. Scatter Graphs

    graph twoway scatter wage tenure
    graph twoway (scatter wage tenure) (lfit wage tenure)

We use lfit to overlay a linear prediction (a fitted line) on the scatter plot.

    scatter wage tenure
    scatter wage tenure, by(race)

Note that in the context of graphs, by is used as an option (after a comma) rather than as a prefix.

C. Matrix Graphs

    graph matrix wage tenure hours

D. Box Graphs

    graph box wage, over(race)

The following picture will be generated:

[Figure: box plots of hourly wage (0 to 40) for the groups white, black and other]

From the picture, it seems that the median wage does not differ much among the three ethnic groups, even though the white group has more high-wage outliers.

6) An OLS regression:

To run an OLS regression we use the command regress or, in short, reg, followed by the dependent variable (the one we want to explain) and the independent variable or variables (the ones that we suspect explain the dependent variable). For example,

    reg wage tenure collgrad married

runs a regression of wage on tenure, collgrad, and married.

After running a regression, Stata temporarily stores some useful items (until another regression is run). For example, we can generate the residuals of the regression with the command predict:

    predict myresids, residuals

The residuals of the regression above are then saved in the variable myresids. Are the residuals correlated with any other variable that is perhaps missing from the regression? Use the command correlate or a scatter graph to check this.

7) Hypothesis Testing

Hypothesis testing is straightforward in Stata. For instance, to test whether the coefficient on tenure equals zero:

    test tenure = 0

which gives the result:

    ( 1) tenure = 0
    F( 1, 2227) = 58.18

    Prob > F = 0.0000

This is a test on a single coefficient. The joint significance test that the coefficients on collgrad and married both equal zero is:

    test collgrad = married = 0

which gives the result:

    ( 1) collgrad - married = 0
    ( 2) collgrad = 0
    F( 2, 2227) = 80.20
    Prob > F = 0.0000

Stata rewrites the joint hypothesis as the two restrictions shown; together they imply that both coefficients are zero.

The following commands give you the fitted values (y-hat) and the residuals (u-hat):

    predict yhat, xb
    predict u, res

The command that extracts them after the regression is predict; yhat and u are the names of the new variables, the option xb tells Stata you want the fitted values, and res is short for residuals. Two more variables will appear in your variable list.

Finally, all the useful information from the regression is stored in the e-class (e stands for estimation) returns. Take a look at them by typing the following command after the regression:

    ereturn list

8) Extra Resources

http://www.stata.com/links/resources-for-learning-stata/
http://www.stata.com/links/video-tutorials/