Introduction to Stata


Workshop Introduction to Stata
MSc Economics / MSc STREEM / AUC
Aug 2017
Zichen Deng
VU University Amsterdam

0 PREFACE
1 GETTING STARTED
    1.1 Stata at VU University and AUC
    1.2 Start
    1.3 Memory
    1.4 Interactive mode and batch mode
    1.5 Log-files
2 DO-FILE INGREDIENTS AND KEY COMMANDS
    2.1 Administrative commands
    2.2 Loading and viewing the data
    2.3 Generating and transforming variables
    2.4 Describing variables
    2.5 Saving the data
    2.6 Executing the do-file
3 LEARNING TO HELP YOURSELF
4 DIFFERENT TYPES OF VARIABLES
    4.1 Storage types
    4.2 Categorical variables (among which: dummy variables)
    4.3 Converting strings
    4.4 Working with dates
    4.5 Different data types in the Data Editor
    4.6 Missing observations/missing values

5 MORE ON SYNTAX
    5.1 Functions
    5.2 If-statements
    5.3 Loops
    5.4 By and bysort
    5.5 Recode
    5.6 Abbreviating variable names
    5.7 Macros
    5.8 Scalars
6 GRAPHS
    6.1 Saving a graph
    6.2 (Overlaid) two-way graphs
7 INSTALLING USER WRITTEN COMMANDS
8 ECONOMETRIC ANALYSIS
    8.1 Correlation coefficient
    8.2 T-test of equal means
    8.3 Linear regression model
    8.4 Post-estimation commands
    8.5 Storing estimation results
    8.6 All estimation results in one table
9 MISCELLANEOUS TOPICS
    9.1 Reading a dataset with a different format
    9.2 Combining datasets
    9.3 System variables and information Stata stores from statistical analysis: _n, _N, and (e)return list

    9.4 Other useful commands: assert, capture, quietly
10 STATA RESOURCES
    10.1 Books available in the VU library
    10.2 Online resources
11 COMMAND OVERVIEW

0 Preface

Stata is a versatile statistical software package that is widely used in the international research community in economics and other social sciences. It combines the features of a software package for data management and statistical/econometric work ("press the button and get results") with those of a programming language ("tell the computer exactly what you want to compute"). Learning to use it will certainly pay off in the long run, and may already have immediate returns in the first course of the MSc curriculum on Advanced Methods.

The author of the present document is Jonneke Bolhaar (VU University Amsterdam and CPB Netherlands Bureau for Economic Policy Analysis, The Hague). It has been slightly revised by Stefan Hochguertel (VU University Amsterdam) and Zlata Tanović (VU University Amsterdam and Amsterdam Institute for International Development). This document may contain errors, for which we apologize. If you encounter any, please let us know at zichen.deng@vu.nl.

1 Getting Started

Stata is available for different computer systems (Windows, which is used in this tutorial, as well as Mac and Unix) and comes in four different types:

    Small Stata: has a maximum of 99 variables and 1200 observations.
    Stata/IC (intercooled Stata): regular version, up to 2047 variables.
    Stata/SE (special edition): for large datasets.
    Stata/MP (multiprocessor): same as Stata/SE, but faster because it can use multiple processors at the same time to perform Stata commands.

Stata comes in versions: StataCorp regularly releases a new version of the program. At the time of writing, the latest version is Stata 14. Newer versions have additional capabilities, but you can use your programs in different versions of Stata. In general, changes between versions are documented, and you can find out about differences between versions by typing

    help whatsnew

1.1 Stata at VU University and AUC

For VU students, Stata/SE 14 is available in the computer rooms.
The number of licenses is limited (university-wide), but it is large enough that this hardly ever causes problems. It is good to know, however, that when you try to access Stata and an error message appears on your screen saying "Stata cannot be opened", the cause may be that the maximum number of licenses is in use at that moment. AUC students have a license for Small Stata.

1.2 Start

Once you have started Stata, you will see a large window containing several smaller windows (Figure 1).

Figure 1: the different windows of Stata (Review, Results, Variables, Command, Properties)

The largest window is called the Results window and shows the results of the analyses you perform. The window at the bottom is the Command window, where you tell Stata what you want it to do. All commands you have given Stata since you started your session are listed in the Review window. After you have loaded your dataset, the Variables window will contain a list of all variables. Stata 12/13/14 looks slightly different from version 11 (Figure 1): a new window has been added to the interface, the Properties window. You can manage the variables in your dataset directly from the Properties window (e.g. add or modify a label, or change the format of a variable).

1.3 Memory

In Stata 12/13/14 you no longer have to set the memory yourself: Stata takes care of it automatically. The rest of this section applies to older versions only.

When you open Stata, it automatically assigns 50 MB of memory (only 10 MB in older versions! How much is assigned in the version you're working with is stated in the Results window when you open Stata). This might be too little if you work with a large dataset or want to create many variables. To assign more memory to Stata (for example 100 MB), type

    set memory 100m

You can only change the amount of memory that is assigned when there is currently nothing in Stata's memory. If you encounter memory problems while working, you first have to save your data and clear the memory with the command clear before changing the amount of memory assigned. The maximum amount of memory that can be assigned depends on the internal memory of the PC you're working on.

1.4 Interactive mode and batch mode

Stata can be operated interactively or in batch mode. When you use Stata interactively, you type a Stata command in the Command window and hit the Return/Enter key on your keyboard. Stata executes the command and the results are displayed in the Results window. Then you enter the next command, Stata executes it, and so forth, until the analysis is complete.
You can see the commands that you entered previously in the Review window and bring them to the command line again with a single click. In the same way you can bring the name of a variable from the list in the Variables window to the command line.

Using one of the pull-down menus is another variant of using Stata in interactive mode. Stata executes the command that you specified in the dialog box. It also appears in the Review pane and you can access it again by single-clicking on it, after which it will appear in the Command window. You can then edit it, like a command that you entered yourself on the command line, before you press Enter. This is a handy way of accessing commands that you are not yet familiar with, but it is rather slow.

When Stata is used in batch mode, all of the commands for the analysis are listed in a file, and Stata is told to read the file and execute all of the commands. Such a file with a series of commands is called a do-file by Stata and is saved using a .do suffix.

Using Stata in batch mode has two important advantages over using Stata interactively. First, the do-file provides an audit trail for your work: it is an exact record of each Stata command. This might not seem that important in this course, where we will see only simple and short command sequences. But for serious applications later, such as your thesis or scientific, consulting, or government work, reproducibility of results is a major issue. At any point in time, you should be able to reproduce your results from the original dataset. The order in which you manipulate variables and run regression commands may be very important.

Second, even the best computer programmers make typing or other errors when using Stata. When a command contains an error, it won't be executed by Stata, or worse, it will be executed but produce the wrong result. Following an error, it is often necessary to start the analysis from the beginning. If you are using Stata interactively, you must retype all of the commands. If you are using a do-file, you only need to correct the command containing the error and rerun the file.

You open a do-file by clicking:

1.5 Log-files

A log-file is a text file containing all the output of your program (which is everything that appears in the Results window). Log-files have the suffix .log.
Storing your results in a log-file is useful when you want to be able to access the results of your analysis without having to run the program again, for example on a computer where Stata is not available (using Notepad to read the log-file), or to take with you to a meeting with your thesis supervisor. In the next section you will learn how to create do-files and log-files.

2 Do-file Ingredients and Key Commands

A Stata do-file has four different kinds of commands or ingredients:

1. Administrative commands that tell Stata where to save results, how to manage computer memory, and so forth.
2. Commands that tell Stata to read and manage datasets.
3. Commands that tell Stata to modify existing variables or to create new variables.
4. Commands that tell Stata to carry out the statistical analysis.

Here is an example:

A Stata .do file is nothing but a plain text file, and hence it can be edited in any editor (such as Notepad). Stata has its own editor (the aptly named "do-file editor"), however, which also offers a few conveniences. To open this editor, type

    doedit <filename>

in the command line. If the file <filename.do> exists, it will be opened (if not, an error message ("file not found") will pop up). Without <filename>, the editor starts with a blank sheet.

Note: If <filename> contains a space, then quotation marks should be added to the name, i.e. "<filename>". It is good practice to always use quotation marks with path and file names.
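The four kinds of ingredients can be sketched in a short do-file. This is only an illustration and not the example program that the text walks through, so its line numbers do not match the ones referred to in the following sections. It assumes the auto dataset that ships with Stata (loaded with sysuse) instead of the workshop dataset, and the variable name price_thousands is made up for the illustration.

```stata
* Sketch of a do-file with the four kinds of ingredients (illustrative only)

* 1. Administrative commands
capture log close                   // close a log-file left open by an earlier run, if any
log using "stata1.log", replace     // write all results to a log-file in the current folder
set more off                        // do not pause after every page of results

* 2. Read and inspect the data
sysuse auto, clear                  // assumption: example dataset shipped with Stata
describe

* 3. Create or modify variables
generate price_thousands = price / 1000            // hypothetical variable name
label variable price_thousands "Price in thousands of dollars"

* 4. Statistical analysis
summarize price_thousands
tabulate foreign

log close                           // administrative: close the log-file again
```

Saving these lines as a .do file and running them should leave a file stata1.log in the current folder with the output of describe, summarize and tabulate.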

The Stata editor automatically assigns different colors to different types of commands. Commands are things that Stata understands and acts on: it executes them. Very useful for any type of computer work is the material that is meant to be understood not by the computer but by a human: comments. Use comments judiciously to document your work (so that you are able to retrace your steps when you look at things again after a while; comments also help others when they want to read or use your do-file). Use // or * (the latter only at the start of a line) to tell Stata that what follows on this line is a comment. If the comment you want to type stretches over more than one line, use /* and */ to denote the start and end of the comment.

We will now go through this small program and discuss its elements step by step.

2.1 Administrative commands

The second line,

    cd "H:\Documents\statafiles\"

tells Stata which folder/directory we will be working from. All files will be used from and saved to this folder. This is convenient, because now in all commands that use or save a file, we only need to type the name and not its full path. Note that the directory is in quotation marks. In this case it is not strictly necessary, but it would be if there were a space in the path. For example, writing H:\myfiles\stata files\ instead of "H:\myfiles\stata files\" would give an error!

The third line is a command that tells Stata where to write the log-file with the results of the analysis. To open a log-file called stata1.log in the current folder, use the command

    log using stata1.log, replace

With replace you instruct Stata to replace any existing file with the same name in the same folder. For the course Advanced Methods, you usually have to hand in the do-file and log-file with your solution to the homework assignment. Don't forget to end your do-file by closing the log-file with the command log close.

The command set more off tells Stata not to pause after displaying every new page of results.
By default, Stata pauses every time the Results window is filled with new results, and -more- is displayed at the bottom of the Results window. Execution of your program will only continue after you have pressed a key on the keyboard. As you are saving all results in a log-file, you may not find it necessary for Stata to pause after every page of results. By including set more off at the beginning of your do-file you can get rid of this. (Why is this important? Sometimes you want Stata to do lengthy calculations while you go for a coffee. If it stops execution with a -more- it will not have made any progress when you come back from your coffee break.)

2.2 Loading and viewing the data

Line 7 in our example do-file loads the dataset with the command

    use dataset_workshop.dta, clear

The name of the dataset contains no spaces, so we don't need to use quotation marks. clear makes sure that if there was a dataset still in memory, it is cleared (note that this clears the memory without saving it first, so all unsaved changes in the dataset in memory will be lost by using clear!).

describe in line 8 tells Stata to describe the dataset. This command produces a list of the variable names and any variable descriptions stored in the dataset. The latter are called variable labels. The list also contains the storage type (more about this in section 4.1), the display format (not important for now), the name of the value label (more about value labels in section 4.2) and the variable label attached to the variable (more about variable labels in section 2.3).

Another command that provides a lot of general information on your data is summarize. It gives a table with the number of non-missing observations and the mean, standard deviation, minimum and maximum of a variable. If you use the summarize command without a list of variables, Stata produces summary statistics for all variables in the dataset.
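As a minimal sketch of loading and inspecting data, the lines below use the auto dataset that ships with Stata; this is an assumption made so that the example runs without the workshop dataset. With your own data you would replace the sysuse line by a use command.

```stata
sysuse auto, clear        // load an example dataset; clear drops any data in memory
describe                  // variable names, storage types, display formats and labels
describe price mpg        // the same information for selected variables only
summarize                 // summary statistics for all variables
summarize price mpg       // ... or for a list of variables
summarize price, detail   // adds percentiles, skewness and kurtosis
```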


A look at the output generated by the summarize command for our dataset shows that this command only works for numerical variables. For the two string variables ("zdate" and "sex") the table is empty, except for indicating that there are zero observations for these variables. In this case, zero observations means zero numerical observations. Both variables do actually have observations, but they are string variables (strings are non-numerical text, although even numbers can be stored as text, e.g. "99"). Keep in mind when using this command that it is not informative for string variables.

The command tabstat is a more advanced version of summarize. For example:

    tabstat yearb, stats(mean N)

will show the average of the yearb variable, as well as the number of observations for which this variable is non-missing. Type help tabstat to see the list of statistics that can be shown, as well as explanations of other useful options such as by().

If you want to have a look at the actual data, for example to see whether everything went fine with loading the data, you can open the Data Editor by clicking one of these buttons:

The left button opens the Data Editor in edit mode, the right button opens it in browse mode. In edit mode you can change the data (by clicking on a cell or column); in browse mode you can't. The Data Editor will open in a new window. [1]

As with most commands in Stata, you can also open the Data Editor with a command in the Command line or in your do-file. The commands to open the Data Editor in browse or edit mode are, respectively, browse and edit. Using a command has the advantage that you can select which variables you want to be displayed in the Data Editor. For example,

    browse id yearb monb dayb

opens the Data Editor with only these four variables. From Stata 12, the new Properties window is also part of the Data Editor. In addition, the Data Editor contains a special version of the Variables window in which you can select which variables are shown in the Editor.

[1] In Stata 9 (and all older versions), you cannot use the Command line or run a do-file while the Data Editor is open (they are blocked automatically). So first close it before you continue (minimizing it is not enough!). In newer versions of Stata, you can leave the Data Editor open while entering commands in the Command line or running a do-file.
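The points made in section 2.2 about string variables, tabstat and the Data Editor can be tried out with the following sketch. It assumes the auto dataset shipped with Stata, in which make is a string variable; the browse command needs Stata's graphical interface.

```stata
sysuse auto, clear

summarize make price          // make is a string: reported with 0 observations

* tabstat: choose the statistics yourself, optionally by group
tabstat mpg, statistics(mean n)
tabstat mpg price, statistics(mean sd min max n) by(foreign)

* Open the Data Editor in browse mode with a selection of variables
browse make price mpg
```

The option is written out here as statistics(); n requests the number of non-missing observations.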

2.3 Generating and transforming variables

In line 12 of our program we generate a new variable with the generate command. Other frequently used commands to transform data are:

replace, which modifies an existing variable:

    replace var1 = 0

replaces the value of var1 by 0 for all observations.

rename, which changes the name of a variable:

    rename oldname newname

changes the name of the variable oldname into newname.

drop, which drops variables:

    drop var1 var2

drops the variables with the names var1 and var2.

keep, which keeps the listed variables:

    keep var1 var2

keeps only the variables var1 and var2 and drops all others.

The name of a variable may not contain blanks and is case-sensitive. Try to keep names short and clear. The maximum number of characters for a variable name is 32, but Stata prints only 12 in the output of many commands (for example regression results). You can attach a label of at most 80 characters to a variable and use it to give a more precise description of the variable. This is how you create a label for the variable intervdate that says "date of interview":

    label variable intervdate "date of interview"

Stata uses this label instead of the variable name whenever you make a table or graph.

2.4 Describing variables

To describe one of the variables in the data in a frequency table, you can use the command tabulate, as we did in line 16 of the do-file. The table will appear in the Results window:

tabulate can also be used to make a cross-tab if you put two variable names after the command. For example

    tabulate children sex

will create a table with two columns: one with the frequencies of the variable children for males and one with the frequencies for females.

2.5 Saving the data

In line 23 of our program we save the altered dataset (altered because we generated a new variable called wagegr_year). For this, we use the command

    save dataset_workshop.dta, replace

The option replace is used to overwrite any existing dataset with this name. If there is no existing file with this name, a new file is created. There is one thing you should take care of, though, if you want to work in more than one version of Stata: while versions 8 and 9 have no problem understanding datasets created by either of these versions, and the same holds for versions 10 to 12, you will encounter problems if you try to load a dataset created in version 10 or above into version 9. The solution is simple: the command saveold saves a copy of a dataset created in version 10 or above that can be read by version 9 or below. In Stata 13, saveold will likewise save a version of the data that can be read by earlier versions.

To save the data in another format, e.g. Excel, use Stata's export command. For example:

    export excel using dataset_workshop.xls, firstrow(variables) replace

The dataset has now been saved as an .xls file, with the variable names (not the variable labels) in the first row. See section 9.1 for an explanation of how Stata can import .xls files.
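The commands of sections 2.3 to 2.5 can be tried together in a short sketch. It assumes the auto dataset shipped with Stata; the variable names price_eur and price_euro, the conversion factor 0.9 and the file names are made up for the illustration.

```stata
sysuse auto, clear

* Generate, modify and rename a variable
generate price_eur = price * 0.9            // 0.9 is a hypothetical exchange rate
replace price_eur = round(price_eur)        // overwrite with rounded values
rename price_eur price_euro
label variable price_euro "Price in euro (illustrative)"

* Keep only the variables needed
keep make price price_euro foreign rep78

* Frequency table and cross-tab
tabulate rep78
tabulate rep78 foreign

* Save as a Stata dataset and as an Excel file in the current folder
save "auto_workshop.dta", replace
export excel using "auto_workshop.xls", firstrow(variables) replace
```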

2.6 Executing the do-file

There are two ways to execute your do-file, and for each of them there is a button in the do-file editor. The right button executes your do-file normally and will show the results in the Results window. Clicking this button is the same as typing in the Command line

    do stata1.do

Alternatively, you can click on the File menu, then Do, and then select the file C:/myfiles/statafiles/stata1.do. This will also run the do-file. The left button executes your do-file quietly and is equivalent to typing

    run stata1.do

in the Command line. Running a do-file quietly implies that the results will not be displayed in the Results window, and will also not appear in the log-file. If you want to execute only some lines of your do-file (for example because you want to add some lines to an existing do-file that you already ran and stored earlier), you can do so by selecting the lines you want to execute and clicking on the do or the run button.

3 Learning to Help Yourself

When using a program like Stata, you will frequently encounter situations where you either don't know exactly what a particular command does or don't know how to perform a particular analysis in Stata. There are several ways of moving on in a situation like this.

If you know the command, it is useful to start with the built-in help of Stata. Stata has detailed help files available for all Stata commands. You can access these by selecting Stata Command from the Help drop-down menu and entering the command in the window that pops up. You can also just type

    help <command_name>

at the command line. Similar pages can be accessed at Stata's website, and a Google search for "stata help <command_name>" usually also gets you there quickly. Stata commands are described in detail in the Stata User's Guide and Reference Manual. In Stata 11-13, the built-in Help also contains a link at the end of each entry (under "Also see") to the relevant pages in these manuals. Clicking it will open a PDF of the manual at the right page. [2]

If you know what you want to do but don't know the exact Stata command, there are two things you can do. First, you can select Search... from the Help drop-down menu and type in a keyword (for example, the name of an estimation method). Second, it is likely that a Google search will also get you to the information you are looking for quickly. Stata is widely used and you are probably not the first one looking for a command to perform a particular action.

[2] A paper version of the manual may be available at the IT help desk.

In the example above you see the help file for the tabulate command. The syntax of a command is always built up in a similar manner. Here, the syntax for a two-way table is

    tabulate varname1 varname2 [if] [in] [weight] [, options]

The Stata command itself (here: tabulate) is always in bold type. The underlined part of the command, tab, shows how the command may be abbreviated. Ingredients of the command are displayed in italics. Required ingredients appear without square brackets, and optional ingredients appear between square brackets. Here, varname1 varname2 tells us that for a two-way table we need to include two variable names. [if] [in] [weight] tells us that, if we want to, we can add an if-statement, an in-statement or weights. These go immediately after the variable names and before the comma. Finally, [, options] indicates that if we want to use any of the options available for this command, we should place them after a comma.

Almost every command has several options to adapt the command. The options are listed and explained in the help file (you find the explanation for each option if you scroll down in the help file). tabulate has, for example, the option row. With this option, the two-way table that is produced contains not only frequencies, but also the relative frequencies within each row. If we want a two-way table that contains only these relative frequencies per row and not the frequencies, we add not only the option row but also the option nofreq.

The blue words in the syntax indicate that a separate help entry is available for them. Clicking, for example, on the word weight will redirect you to the help file on using weights.

Note that the brackets that appear in the command description are not to be included in the command you type! They are merely there to indicate the required and optional parts of the command.
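To see how the syntax diagram translates into actual commands, here is a small sketch using the auto dataset shipped with Stata (an assumption made so that the example runs anywhere; the variables rep78, foreign and price belong to that dataset).

```stata
sysuse auto, clear

tabulate rep78 foreign                          // two variable names: two-way table
tab rep78 foreign, row                          // abbreviated command, option after the comma
tab rep78 foreign if price < 10000, row nofreq  // if-statement before the comma, two options
```

None of the square brackets from the help file are typed; they only mark which parts are optional.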


Here is another example, the entry for the command destring. The syntax for the command destring is

    destring [varlist], {generate(newvarlist) | replace} [destring_options]

Again, the bold words are the part of the syntax that is always required. There is one difference here, however. The boldface words generate and replace are enclosed in curly brackets and separated by a vertical line. This means that the syntax for the command destring requires either generate or replace to be specified. All the available options are listed in the lower part of the screen. They are explained in more detail if you scroll down the help file.
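The either/or requirement can be illustrated with a tiny dataset typed in by hand. The variable names amount_str and amount_num are hypothetical; ignore() is one of the destring options listed in the help file and removes the given characters before converting.

```stata
clear
input str6 amount_str
"12"
"7.5"
"1,200"
end

* Variant 1: keep the string variable and create a new numeric one
destring amount_str, generate(amount_num) ignore(",")

* Variant 2: overwrite the string variable with its numeric content
destring amount_str, replace ignore(",")

describe
list
```

Leaving out both generate() and replace results in a syntax error, which is what the curly brackets with the vertical line indicate.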

4 Different Types of Variables

4.1 Storage types

In section 2.2 we already came across the term storage type when we discussed the output that describe generates. Stata has six different types of data (storage types): five for numbers (numeric data) and one for text (string data). The five numeric data types are called byte, int, long, float and double. What are the differences between these five types? First, byte, int, and long can only hold integers (i.e. no decimals); float and double can also hold non-integers. Second, the range and precision of each type differ. The table below shows the minimum and maximum value a variable of each type can take. Larger or smaller values will result in a missing observation, denoted by . in Stata.

    type      minimum                 maximum                closest to 0, but not 0
    byte      -127                    100                    1 / -1
    int       -32,767                 32,740                 1 / -1
    long      -2,147,483,647          2,147,483,620          1 / -1
    float     -1.70141173319*10^38    1.70141173319*10^38    10^-38 / -10^-38
    double    -8.9884656743*10^307    8.9884656743*10^307    10^-323 / -10^-323

A variable that contains text is stored as a string variable, or str#. In place of the # is the maximum number of characters the string can contain. This can be anything from 1 to 244 (in Stata 12 and earlier; from Stata 13 onward the limit is 2045). If you try to fit more characters than the data type allows, every character beyond # is ignored.

The default data type for numeric data when you generate a variable is float. If you want to generate another type, place the name of the type between the generate command and the name of the variable to generate. The default data type is fine for most purposes, but there are cases where the default is problematic. One case is identification numbers. A float has 7 digits of accuracy, and will therefore round an identification number with 8 or more digits. To preserve the identification number as it is, use long for 8 or 9 digits and double for up to 16 digits of accuracy.

Precision is often important to people doing numerical work, and a little reading on numerical issues will tell you that computers cannot uniformly handle all numbers that humans typically deal with in the same way. For instance, did you know that the decimal number 0.1 has no finite-precision representation in binary floating-point arithmetic?

(See, e.g., Wikipedia's entry on floating point for details.) This property can occasionally lead to unexpected results. Changing the data type from float to double can partly address such issues. However, it is not necessarily a good idea to store everything as double, because double uses a lot of memory. To strike a balance, you can ask Stata to convert numerical storage types to the smallest type that does not lose information by typing

    compress [varlist]

4.2 Categorical variables (among which: dummy variables)

Categorical variables are variables whose value does not have the meaning of a normal number, but where each value stands for a category. An example is the variable za1 in our dataset, which equals 1 if the answer is yes and 2 if the answer is no to the question "Do you work at least 15 hours per week?". Instead of working with 1 and 2, we can attach a value label to the variable. The value label contains the information about the meaning of every number of the categorical variable. When you ask Stata to make, for example, a frequency table, it will use the text you attached to each number instead of the numbers. The value labels are also used in the Data Editor. To create a value label with the name yesno that contains the information that 1 means yes and 2 means no, you type:

    label define yesno 1 "yes" 2 "no"

and to assign this label to the variable za1:

    label values za1 yesno

Use label list to get a list with the names and content of all value labels in the dataset. You can modify an existing label using the command label define with one of the options add, modify or replace after the comma. Although the label instead of the number appears in tables and graphs, you keep using the number in if-statements etc.

Dummy variables are special cases of categorical variables with binary information (i.e., they can take two values). Dummy variables are very often used in econometric work, and there they typically take the values 0 and 1. Often, a categorical variable that has more than 2 discrete values is split into a set of binary dummy variables. For instance, suppose you have a variable color taking values 1, 2, 3 for blue, red, and green. You might want to split this into two dummy variables, one for blue (value 1 for blue and 0 for non-blue), and one for red (value 1 for red and 0 for non-red). The remainder then is green by implication (i.e., observations for which both blue and red have value 0).
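The color example can be written out by hand as follows. The data are typed in for the illustration, the label name colorlbl is made up, and the condition on missing() is an addition that keeps a missing color from being coded as 0.

```stata
clear
input color
1
2
3
.
2
end

* Attach a value label to the categorical variable
label define colorlbl 1 "blue" 2 "red" 3 "green"
label values color colorlbl

* Two dummies; green is the remaining category (both dummies equal 0)
generate byte blue = (color == 1) if !missing(color)
generate byte red  = (color == 2) if !missing(color)

tabulate color blue, missing
list
```

The expression (color == 1) evaluates to 1 when true and 0 when false, which is why it can be assigned directly to the dummy variable.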

One quick way to convert a categorical variable with many discrete values into a set of dummy variables is to use the tabulate command with the option generate, as in

    tabulate color, generate(dumclr)

This will first give you the usual tabulation of color, and then create a number of dummy variables called dumclr1, dumclr2, etc. flagging the different color values. These newly created variables are all of storage type byte and come automatically with variable labels. In addition, this way of creating dummy variables also properly deals with missing values. (More on missing values below.)

4.3 Converting strings

Your dataset may contain string variables that only include numbers. As numerical variables can be included in a regression, but string variables cannot, you may want to convert the data in such a string variable from string to numeric. For this purpose you can use the command destring. The complete syntax to create a new, numeric, variable called new_var1 that contains the converted content of the string variable var1 is

    destring var1, generate(new_var1)

Instead of generating a new variable, we could also have chosen to overwrite the content of the existing variable by using replace instead of generate(new_var1). If you want to perform exactly the opposite action, converting a numerical variable into a string, you can use tostring.

Another possibility is that the dataset contains a string variable with a limited number of different texts. For example, the string variable may contain only "yes" or "no", or it may contain one of the ratings "very good", "good", "reasonable", "bad", "very bad". To create a categorical variable from this string, you can use the command encode. If var2 contains "yes" or "no" for each observation,

    encode var2, generate(new_var2)

creates a new categorical variable named new_var2. The labels created for this variable are stored in a value label with the same name as the new variable. If you want to use an existing label for the new categorical variable, add label(labelname) to the syntax.

4.4 Working with dates

Working with dates is one case where older and newer versions of Stata differ. The method described here applies to version 11 and up.

Stata has a special format to store dates. The advantage of using this format is that Stata understands things like "1feb2012 minus one day is 31jan2012". Usually when a dataset is supplied to you, dates are recorded as a string variable. To create a new variable dob that converts the variable dateofbirth (containing dates stored as text in day-month-year order) to the date format, you type

    generate dob = date(dateofbirth, "DMY")
    format dob %td

The second argument of the date function, "DMY", indicates how the dates are built up. In this case it is day-month-year. The format for daily dates is %td. It is no problem if the original string variable with dates has hyphens or slashes to separate day, month and year. Stata will ignore them.

What date essentially does is create an integer with the number of days since January 1, 1960 (it is a negative number for dates earlier than January 1, 1960); weekly dates are stored analogously as a number of weeks, etc. With format %td you tell Stata how to interpret this number. %td means it should be interpreted as days since January 1, 1960, %tw means it should be interpreted as weeks since the first week of 1960, etc. In the Data Editor and in tables and graphs, Stata will display for example 16jan1983 if %td is specified and 1983w3 if %tw is specified.

You can now easily generate a variable that contains the duration between two dates:

    generate duration = datevar1 - datevar2

will do just that.

4.5 Different data types in the Data Editor

The Data Editor uses different colors for different types of variables. Numerical variables that are not categorical and variables in date format are displayed in black. Categorical numeric variables are displayed in blue. For categorical variables, the Data Editor displays the label assigned to each number. If you want to open the Data Editor showing the numbers instead of the labels assigned to them, use browse, nolabel. String variables are displayed in red.
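Returning to the date conversion of section 4.4, the following sketch types in two dates as text and converts them. The example dates and the variable name day_before are made up for the illustration; the two strings deliberately use different separators.

```stata
clear
input str10 dateofbirth
"16-01-1983"
"01/02/2012"
end

* Convert the strings to Stata dates (days since January 1, 1960)
generate dob = date(dateofbirth, "DMY")
format dob %td              // display as 16jan1983 rather than as a day count

* Date arithmetic works on the underlying day count
generate day_before = dob - 1
format day_before %td

list
```

Without the format commands, list would show the raw numbers of days since January 1, 1960.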
4.6 Missing observations/missing values

Missing values deserve special attention because they can surface in unexpected situations and may lead to unexpected results. Simply put, a missing value is a value that is not or should not be there. Let's start with a simple example of a numerical missing value, which is kept in the data as a dot (.). Consider the following story. An interviewer asks respondents if they own a car, and

the ones that answer yes (value 1) get a follow-up question on the color of their car: blue (value 1), green (value 2), and red (value 3). So our data set may look like

    person   havecar   carcolor
    1        1         1
    2        1         2
    3        0         .
    4        1         3

Person 3 has no car and therefore gets a missing value for carcolor. The symbol used for the missing value (the dot) is a so-called system missing value that tells Stata how to handle it. In real data, one may encounter all kinds of values that are actually missing but not necessarily recognizable as such without further documentation. For instance, we may have values such as -9 or 9999 or 97, or any other number. This matters because, depending on what value is being used, the rules of arithmetic may deliver very different results. Suppose we were to combine the variables havecar and carcolor, using

    generate nonsense = (1-havecar) + 2*carcolor

Then the variable nonsense in our data set would take the values 2, 4, ., 6 for the four persons. That is, adding (or multiplying, etc.) a system missing value to any numerical value delivers a system missing. Had our missing value been coded as 99 instead, we would have obtained 2, 4, 199, 6, and we would have no way of telling that 199 is in fact the result of an operation on a missing value. It is therefore good practice to closely inspect all variables in a data set for special values that may in fact be missings. With some luck, there is documentation on all values of variables in the data set, telling that, e.g., 998 is not actually 1000 minus 2, but rather something else (for instance a code for "not applicable"). With less luck, you need to figure this out yourself.

Remark. Next to the system missing, Stata has other types of missings that are treated arithmetically in the same way. These are extended missing values that have the codes .a, .b, .c, ..., .z. There are not many instances where these extended missing values are actually being used in practice, but they may come in handy to convey different meanings.
For instance, .a may indicate "I don't know", .b may mean "I do not want to say" (refusal), and .n may mean "question not asked to respondent".

So, the way Stata handles missings is quite convenient. However, you need to know that in comparisons system missings are handled as if they represented the value infinity, that is, as larger than any number (indeed, infinity plus or times something else also results in infinity). This can be confusing when you want to use operators on variables that contain missings. Consider the following

example in which the variable nonblue is supposed to flag all cars that are not blue (blue was value 1):

    generate nonblue = carcolor>1

This will result in a new variable that has two values: 1 if the expression carcolor>1 is true, and 0 if it is not true. In the data, nonblue equals 0 for person 1 and 1 for persons 2, 3, and 4. However, person 3 has no car, so carcolor is missing, and yet our definition has assigned value 1 to variable nonblue. We could now go on and do calculations for person 3 using variable nonblue even though the original variable carcolor would not allow us to do calculations for that observation.

Warning: this example illustrates a common mistake; the danger is that missing values disappear from the data. This behavior may be unintended, although it follows from the logic that the missing value is treated as infinity (that is, the expression carcolor>1 is true for person 3 since Stata reads it as "infinity>1"). In order to fix this, we have to be more explicit:

    generate nonblue = carcolor>1 if carcolor!=.

The condition if carcolor!=. means that the expression carcolor>1 is only evaluated if carcolor!=. is true. For person 3 it is false, and the expression is not evaluated. In that case, the new variable gets the value system missing.

The story so far applied to numerical missing values. In the Data Editor, numerical missing observations are also indicated by a dot. Missing string variables have a completely empty field: missing values of a string variable are equal to "" (double quotation marks with nothing in between). You can also use these representations in expressions, e.g.

    generate myvar = ""

would generate a string variable with missing values only.

Important note: In regressions, observations that have a missing value for any one of the specified variables are NOT taken into account in the estimation!
Adding one variable with many missing values to your model may therefore dramatically decrease your sample size and hence affect the results. This is particularly true if your regression model has many variables, each of them having many missing values.
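The car example of this section can be replayed in a few lines. This is a minimal sketch: the four observations are reconstructed from the values quoted in the text (havecar = 0 for person 3 is an assumption), and the names nonblue_bad and nonblue2 are invented for illustration.

```stata
* Four persons; person 3 has no car, so carcolor is system missing
clear
input person havecar carcolor
1 1 1
2 1 2
3 0 .
4 1 3
end

generate nonsense = (1-havecar) + 2*carcolor      // missing propagates to person 3
generate nonblue_bad = carcolor > 1               // person 3 wrongly gets value 1
generate nonblue = carcolor > 1 if carcolor != .  // person 3 stays missing

* A guard that also excludes extended missings (.a, ..., .z),
* which the test carcolor != . would let through
generate nonblue2 = carcolor > 1 if !missing(carcolor)
list
```

The missing() function returns 1 for system and extended missing values alike, so it may be the safer habit when a data set uses codes such as .a or .b.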

5 More on Syntax

5.1 Functions

Stata has many built-in functions. Searching for the keyword functions in the help file gives an overview of the different types of functions. Clicking on one of the words in blue redirects you to a list of all functions Stata has in this category. For example, selecting math functions gives a list of mathematical functions. Here we can find, for example, the syntax needed to create a new variable var2 that equals e to the power var1, rounded to the nearest integer:

    generate var2 = round( exp(var1) )

5.2 If-statements

You can use an if-statement if you want to generate a variable or perform an analysis only for a subset of the data. For example,

    generate x=1 if employment==1

will create a variable x that equals 1 if employment equals 1 and is missing otherwise. If there is more than one condition to be satisfied, use & between the conditions:

    generate x=1 if employment==1 & sex=="F"

Note that with string variables you always have to use quotation marks. The relational and logical operators you can use in Stata are:

    ==   equal to
    !=   not equal to (same as ~=)
    ~=   not equal to (same as !=)
    >    greater than
    >=   greater than or equal to
    <    less than
    <=   less than or equal to
    &    and
    |    or

Note that a condition of the type "some variable is equal to ..." requires a double equality sign! In contrast, in an expression like generate x=1 only a single equality sign is needed. The difference is that the symbol = carries different meanings: generate x=1 should be read as "value 1 is being assigned to the new variable x"; generate y=(x==1) should be read as "the result of the assertion 'x is equal to 1' is being assigned to the new variable y", where "x is equal to 1" can be either true (value 1) or not true (value 0).

5.3 Loops

Sometimes you want to perform almost the same commands many times. For example, you want to generate a separate variable for each category of a categorical variable. Instead of typing the same commands over and over again, you can use a loop. Suppose your categorical variable is called wagecat and has 5 categories, 1 to 5.

    forvalues i=1(1)5 {
        generate x`i' = wagecat==`i'
    }

will generate 5 new variables: x1, x2, x3, x4, and x5. What Stata actually interprets as it goes through the loop in 5 rounds is the following:

    i=1    generate x1=wagecat==1
    i=2    generate x2=wagecat==2
    i=3    generate x3=wagecat==3
    i=4    generate x4=wagecat==4

    i=5    generate x5=wagecat==5

There are a few important remarks to make about this syntax. First, note that the opening quotation mark ` before the i (a backtick) is different from the closing single quotation mark ' after the i. The command will not work if you have these quotation marks incorrect. (i is actually a local macro, on which more below.) Second, the number between round brackets denotes the size of the steps to take when going from 1 to 5. (If we had coded i=0(100)500 we would have stepped through 0, 100, 200, ..., 500.) Third, you can choose any other symbol (or even a word) instead of i. Fourth, the syntax requires that the opening curly brace { is on the same line as forvalues, not followed by anything executable on the same line (line break required, unless only comments follow), and that the closing curly brace } is on a line of its own.

The forvalues command only works for numerical values. To make a loop over a list of text or other objects instead of numbers we use a different command, foreach:

    foreach var in x1 x2 x3 x4 x5 {
        replace `var' = . if `var'==0
    }

The same remarks as before for forvalues apply. So-called while loops and if/else constructs can be programmed as well, see help while and help ifcmd in Stata.

5.4 By and bysort

When you want to perform the same Stata commands on a number of subsets of the data, the by command can be helpful.

    bysort sex sector: summarize hours

will create summary statistics for the variable hours for each combination of the variables sex and sector. To use by, the data must be sorted by the variables that determine the subgroups (here sex and sector). If the data are not yet sorted, you need to specify that by using bysort instead of by. Alternatively, you can sort them first with the command sort sex sector, and subsequently use by. Not all Stata commands can be used in combination with by or bysort. The help file of each command indicates whether the command can be used in combination with bysort.
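The loop and by constructs of sections 5.3 and 5.4 can be tested on a small artificial dataset. The sketch below assumes ten invented observations; the variable names follow the text, but the values are made up purely for illustration.

```stata
* Ten invented observations
clear
set obs 10
generate wagecat = mod(_n, 5) + 1        // categories 1 to 5, each twice
generate sex = cond(_n <= 5, "F", "M")   // string variable
generate hours = 30 + _n

* forvalues: one dummy variable per category of wagecat
forvalues i = 1(1)5 {
    generate x`i' = wagecat == `i'
}

* foreach: turn the zeros of each dummy into missing values
foreach var in x1 x2 x3 x4 x5 {
    replace `var' = . if `var' == 0
}

* if-condition on a numeric and a string variable
count if wagecat >= 4 & sex == "F"

* bysort: summary statistics of hours for each value of sex
bysort sex: summarize hours
```

Writing the loop body on indented lines between the braces is not required by Stata, but it makes the start and end of the loop easier to see in a do-file.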

5.5 Recode

When you want to generate a categorical variable from a discrete or continuous variable, the command recode can save a lot of typing and if-statements. Suppose we have a variable called age that contains the age of the observed individuals and we want to create 5 categories: 0-18, 19-25, 26-40, 41-65, 66+.

    recode age (0/18=1) (19/25=2) (26/40=3) (41/65=4) (66/max=5), generate(agecat)

creates a variable called agecat with the appropriate value for each observation. max and min can be used if you do not know the maximum or minimum value the variable takes. The example above works fine if age is a discrete variable. If age is a continuous variable, it is better to use

    recode age (0/18=1) (18/25=2) (25/40=3) (40/65=4) (65/max=5), generate(agecat)

which refers to the categories 0<=age<=18, 18<age<=25, etc., because a value that matches two rules is assigned by the first rule it matches.

5.6 Abbreviating variable names

In section 3, when we discussed the syntax of tabulate, we already saw that commands can be abbreviated (tabulate could be abbreviated to tab, for example). But the amount of typing can be reduced even further by also abbreviating variable names. These are Stata's rules for abbreviating variable names:

    grinc*   all variables that start with grinc, so in our example grinc* stands for: grinc_wage grinc_sempl grinc_sw grinc_pens
    *b       all variables that end with b, so in our example *b stands for: yearb monb dayb startjob satisf_job
    s*b      all variables that start with s and end with b, with any number of characters in between. So in our example s*b stands for: startjob satisf_job
    y~b      a variable that starts with y and ends with b, with any number of characters in between. However, in contrast to using *, ~ refers to a single variable. If more than one variable matches the description, you will get an error message.
             In our example y~b stands for: yearb
    grinc_wage-grinc_pens   all variables in the variable list from grinc_wage up to and including grinc_pens, so in our example grinc_wage-grinc_pens stands for: grinc_wage mainactivity grinc_sempl grinc_sw grinc_pens

5.7 Macros

Macros are an underappreciated and often misunderstood, yet essential, part of Stata. Basically, they are just bits of text or numbers that can be referred to. But they can also be manipulated and

that makes them extremely versatile and useful. If you are new to Stata and find yourself confronted with macros, it may take a little while to get used to them. Macros come in two guises: local and global macros. Let us start with a local macro. We have seen one already in the code:

    forvalues i=1(1)5 {
        generate x`i' = wagecat==`i'
    }

We mentioned that i is a local macro. It is in some sense like a variable since it holds a value, but it is not listed among the variables, and it does not have observations (but just a single, scalar value). In the code above, local i takes the values 1, 2, ..., 5. The use of single quotation marks around it actually retrieves the current value: `i' results in 1 in the first round, 2 in the second round, etc., and 5 in the fifth round. The first line assigns particular values to the local macro, the second line retrieves the value of the macro (in two places: once in the definition of a variable name, and once in the evaluation of a logical expression).

There are other ways of achieving the same goal when explicitly using the local macro syntax. We can alternatively code

    local i=1
    while `i'<=5 {
        generate x`i' = wagecat==`i'
        local i=`i'+1
    }

to obtain the same result. Note the last line inside the loop: here, the content of the local macro called i is being overwritten. i is reassigned the result (value) of the calculation `i'+1, where `i' is the currently known value; after the assignment has been concluded, the value has been updated (increased by 1).

Next to local macros there are global macros. They pretty much do the same thing, although they are more frequently encountered as holding strings rather than numerical values. The main distinguishing feature is that their syntax looks a little different, in particular as far as retrieval of values is concerned. So, we could say

    global i=1
    generate x$i=wagecat==$i

That is, we prefix a dollar sign to the global's name if we are to retrieve its value.
We can rewrite our while loop using globals, but we cannot rewrite a foreach or forvalues loop using globals, because the loop counter of these commands is always a local macro. Globals are often used to replay text. Here is an application. Suppose you run a large number of regressions, each with different options or on different samples, but

all with the same specification. You can collect your variables in globals and simply refer to those rather than retype all your variable lists. Instead of typing

    regress wage female nkids black south eduhigh edulow if region==1
    regress wage female nkids black south eduhigh edulow if region==2
    regress wage female nkids black south eduhigh edulow if region==3

you could type

    global yvar wage
    global xvars female nkids black south eduhigh edulow
    forvalues i=1(1)3 {
        regress $yvar $xvars if region==`i'
    }

The next time you discover you want to change your specification and replace nkids with nsons ndaughters, you only need to change the line

    global xvars female nsons ndaughters black south eduhigh edulow

The content of macros can be listed and displayed using

    macro list

or

    macro dir

Local macros will show up in the listing with a leading underscore, as in _i. In addition, the listing will also show so-called system macros (defined by Stata, not by the user).

5.8 Scalars

A scalar is an element that has one value. For example, typing

    count
    scalar num = r(N)
    display num

saves the output of the count command, i.e. the number of observations, into num (a scalar). Typing display num then shows you what num is equal to. Note the difference between a scalar and a variable: the latter is a column with one value for every observation, whereas a scalar has one value only. For more information, see help scalar.
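Globals, locals and scalars can be tried together in one short do-file. The sketch below does not use the tutorial's wage data. It assumes the auto dataset that ships with Stata (loaded with sysuse auto), so the variables price, mpg, weight and foreign are taken from that file, and the macro name k is invented.

```stata
sysuse auto, clear

* Globals holding the specification
global yvar price
global xvars mpg weight

* The same regression on two subsamples (foreign is 0 or 1)
forvalues i = 0(1)1 {
    regress $yvar $xvars if foreign == `i'
}

* A local macro holding a number; `k' retrieves its value
local k = 2
display "k plus one is " `k' + 1

* A scalar holding the number of observations returned by count
count
scalar num = r(N)
display num

macro list
```

Note that results stored in r() are case sensitive: r(N) holds the count, whereas r(n) does not exist and would evaluate to missing.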


6 Graphs

For simple graphs (a scatter plot, histogram, regression line, etc. in standard lay-out), using the command line is the easiest way to go. If you want to customize a graph, for example because you want to use it in your thesis, you have two options. First, you can go to Stata's help file and type graph. This gives an extensive description of all possibilities. But, especially if you just started to work with Stata, there is an easier alternative. With a click on Graphics, a dropdown menu will open (see left part of Figure 2) with a list of different types of graphs. If you click on, for example, Twoway Graphs (scatter, line, etc.), a new window opens (see right part of Figure 2) in which you can simply select the type of graph, the variables to be used and many lay-out parameters, like the color of the line (or dots), the symbol used for a scatter diagram, titles of the axes, range of the axes, etc. After you've clicked OK, Stata will write down the correct syntax and create the graph. By looking at the syntax that Stata prints in the main window, you will learn how the syntax for a customized graph is built up. This graph creator is somewhat slower than using the command line.

Figure 2: creating graphs from the menu

Graphs always open in a separate window. This also implies that they will not appear in your log-file! For example,

    histogram hours, start(0) width(2)

produces a histogram with bar width 2 and starting point for the bars equal to 0.

6.1 Saving a graph

If you want to save a graph that you have created, the command

    graph save histogram

will save the graph under the name histogram.gph. Stata's standard format for graphs is .gph. If you want to save your graph in a different format, you can choose from the available formats .ps, .eps, .wmf, .emf, .png and .tif. The command to save a graph in one of these formats is graph export instead of graph save. The command

    graph export histogram.tif, replace

saves my histogram as a .tif file and replaces any existing graph called histogram.tif in the folder I'm working in.

6.2 (Overlaid) two-way graphs

A graph that shows the relation between two variables is called a two-way graph in Stata. An example of such a graph is a scatter plot:

    graph twoway scatter hours grinc_wage

creates a scatter plot of hours worked per week on the y-axis and gross wage on the x-axis (see left panel of Figure 3).
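The graph commands of this section can be run in sequence as follows. This is a sketch under assumptions: it uses the auto dataset shipped with Stata in place of the tutorial's data, the file name hist_mpg is invented, and the .png export assumes a Stata installation that supports that format (as the Windows version used here does).

```stata
sysuse auto, clear

* Histogram with bars of width 2 starting at 0
histogram mpg, start(0) width(2)

* Save in Stata's own format, then export for use in a document
graph save hist_mpg, replace
graph export hist_mpg.png, replace

* Scatter plot with price on the y-axis and mpg on the x-axis
graph twoway scatter price mpg
```

The replace option on both commands overwrites earlier files with the same name, which is convenient when the do-file is run repeatedly.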

Creating one graph that contains the result of merging two two-way graphs is called an overlaid two-way graph in Stata. For example, you might want to make a graph that contains both a scatter plot of two variables X and Y and the regression line of regressing Y on X. The command

    graph twoway (scatter hours grinc_wage) (lfit hours grinc_wage)

creates the scatter plot we made before, and on top of it the regression line of hours on grinc_wage (see right panel of Figure 3). You can customize both layers as much as you want. All commands for the first layer (the scatter plot) are between the first set of brackets, all commands for the second layer (the fitted regression line) are between the second set of brackets.

Figure 3: two-way graphs

7 Installing user written commands

Stata has a large array of built-in commands, but sometimes it is useful to be able to use user written commands that are not standard in Stata. These have been written by other users (in so-called .ado files) and shared. These programs can be installed easily by typing ssc install <commandname> (possibly adding the option replace after a comma if the program has been installed previously) in Stata's command window. There are a lot of user written commands available. To know which user written commands are most widely used by other Stata users, type ssc hot in the command window.

Important: On the university computers it is not permitted to write on the C:\ drive. To be able to install user written commands, you will have to instruct Stata to install files on your personal drive (H:\). This can be done as follows. Before writing ssc install <commandname>, type:


    sysdir set UPDATES "<DIRECTORY>"
    sysdir set PLUS "<DIRECTORY>"

where <DIRECTORY> is the place on your personal drive where Stata will store all installation files, e.g. H:\Documents\Mystatafolder\. After installing the package, the corresponding help file will be available in Stata as well. In the next section you will see a helpful user written program called estout.

8 Econometric analysis

In the end, the reason you are using Stata is that you want to do econometric analysis. There are multiple ways to do econometric analysis with Stata. The first way is by writing a do-file with all the commands for your analysis and running it. Second, you could also use the command line. And third, Stata also offers statistical analysis from a drop-down menu. To use this last option, click on Statistics and choose one of the methods of analysis from the drop-down menu (see Figure 4).

Figure 4: the drop-down menu for Statistics

This will open a separate window where you can specify the exact model. In this tutorial we will focus on the first method, writing a do-file with all necessary commands.

8.1 Correlation coefficient

If you are interested in the correlation coefficient or correlation matrix between two or more variables, you can use the commands correlate or pwcorr. Both commands are similar, but there is for example an extra option available only with correlate that enables you to show the covariance matrix instead of the correlation matrix, namely the option covariance. With the pwcorr command you can also see the p-value that corresponds to the null hypothesis of zero correlation, by specifying the extra option sig, like this:

    pwcorr hours grinc_wage, sig

Note: if this command gives you an error, then the reason is probably that hours is a string variable instead of a numerical one. You will first need to make it into a numerical variable with the command destring hours, replace.

The correlation table shows the correlation between the variables hours and grinc_wage and, directly below it, the p-value that corresponds to the null hypothesis that the correlation is zero.

Important note 1: If you find that two variables are statistically significantly correlated in your dataset, this doesn't mean that their correlation is nonzero in the whole population. Characteristics of small or unrepresentative samples do not necessarily reflect characteristics of the population.

Important note 2: If two variables are statistically significantly correlated, this does not mean that one causes the other, i.e. correlation does not imply causation! Example: the number of ice-creams eaten per month correlates positively with the number of drownings per month. Does eating ice-cream cause drownings, or the other way around? Or is there another reason (a so-called omitted variable) causing both to increase at the same time?
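The two correlation commands can be compared side by side. The following sketch assumes the auto dataset shipped with Stata rather than the tutorial's data, so price, mpg and weight are variables from that file.

```stata
sysuse auto, clear

* Correlation matrix, and the covariance matrix via the covariance option
correlate price mpg weight
correlate price mpg weight, covariance

* Pairwise correlations with p-values printed below each coefficient
pwcorr price mpg weight, sig

* correlate leaves the correlation of the first two variables in r(rho)
correlate price mpg
display r(rho)
```

Retrieving r(rho) directly after correlate is useful when the coefficient is needed in later calculations instead of only on screen.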

8.2 T-test of equal means

If you want to test whether the mean of a variable (statistically significantly) differs between two subgroups, you can use the ttest command. For example:

    ttest grinc_wage, by(sector)

tests whether the mean gross weekly wage (variable grinc_wage) differs between the private and public sector (variable sector). In the resulting Stata output, the null hypothesis is that of equal means (H0: diff = 0). The p-values for three alternative hypotheses (Ha stands for alternative (a) hypothesis (H)) are given in the final two rows. The corresponding t-statistic is reported in the output as well.

Note: The standard assumption of ttest is that the variances of the two groups are equal. If you have reason to believe that the variances are unequal, then add unequal at the end of the command (after the comma).

8.3 Linear regression model

The command for OLS regression in Stata is regress.

    regress hours grinc_wage children

runs a linear regression with hours (hours worked per week) as the dependent variable and grinc_wage (gross weekly wage) and children (number of children) as independent variables. This order, the first variable being the dependent variable followed by the independent variables, is common to all estimation methods. Stata automatically includes a constant. If you do not want to have a constant included in the regression, you need to specify the option noconstant.
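The ttest and regress commands can be tried on data that every Stata installation has. The sketch below assumes the shipped auto dataset instead of the tutorial's wage data. The last regression illustrates the warning of section 4.6, because the variable rep78 is missing for some cars.

```stata
sysuse auto, clear

* Test of equal mean price for domestic and foreign cars
ttest price, by(foreign)
ttest price, by(foreign) unequal   // without assuming equal variances

* OLS with and without a constant
regress price mpg weight
regress price mpg weight, noconstant

* rep78 has missing values, so fewer observations enter the estimation
regress price mpg rep78
display e(N)                       // 69 instead of 74
```

Comparing e(N) with the number of observations in the dataset is a quick check on how many observations a regression has silently dropped.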


In the output that the regression specified above produces in the Results Window, note that the estimation is based on 250 observations, while we have 500 observations in our dataset (see the results from describe in section 2.2). How did we lose half of our observations? The reason is that for 250 observations, one of the three variables used in the regression is missing.

Important note (repeated from section 4.6): In regressions, observations that have a missing value for any one of the specified variables are NOT taken into account in the estimation! Adding one variable with many missing values to your model may therefore dramatically decrease your sample size and hence affect the results. This is particularly true if your regression model has many variables, each of them having many missing values.

To make nice regression tables which can be exported to Excel or LaTeX, install the estout package (see footnote 3):

    ssc install estout

To see how the above regression results look in estout, type:

    eststo clear
    eststo: regress hours grinc_wage children
    esttab

eststo clear clears all previously stored regression results. eststo: <regression> runs and stores the regression. esttab displays the stored regression results in a table.

3 Package documentation for estout can be found at

This type of representation of the regression results is now ready to be included in a report or (with slight adaptation) in an academic paper. To export the table to Excel or LaTeX, type:

    esttab using <filename> [, options]

where the filename extension specifies to which program you would like to export the table, and the options specify how exactly the table should be formatted. See the eststo help file and .pdf documentation for more information on its (many) options. Another popular package for handling regression output is outreg.

8.4 Post-estimation commands

For every estimation method, there are also some post-estimation commands available. These commands use the results of the last estimation in memory (= the last one you performed) to provide additional information on the estimation. Which post-estimation commands are available for a particular estimation method can be found in the help file. The entry of the estimation method always provides a link to a list of available post-estimation commands. For regress, one of the available post-estimation commands is predict.

    predict pred_hours, xb

uses the estimation results to compute a linear prediction, stored under the new variable name that is specified (here pred_hours). The added option xb requests the linear prediction. The command predict can do more than provide a prediction based on the estimation results; it has many more options. For example,

    predict resid, residuals


creates a new variable called resid that contains the residuals of the last estimation.

8.5 Storing estimation results

The estimation results in memory can be stored with the command estimates store. For example,

    estimates store model1

saves the estimation results of the last estimation under the name model1. With

    estimates dir

you get a list of all the estimation results you stored. The names you gave to the estimation results appear in blue. To see the estimation results of model1 on your screen again, click on the blue word model1 in the Results window. If you want to perform post-estimation commands on one of the estimation results, you first have to load them into Stata's memory (make them "active") with

    estimates restore model1

Estimates has some other useful subcommands:

    estimates drop model1   drops the estimation results stored as model1.
    estimates clear         drops all estimation results that are stored.
    estimates query         tells you whether the results currently in memory (the active results) have been stored already, and if so under which name they have been stored.

8.6 All estimation results in one table

Why would you want to store estimation results? For example, if you want to compare two different sets of estimations. A nice feature of stored estimation results is the possibility to create tables containing the results of multiple estimations. The command for this is estimates table.

    estimates table model1 model2

creates a simple table with only the estimated coefficients of model1 and model2.
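Storing, listing, tabulating and restoring estimates can be walked through in one go. This is a sketch that again assumes the shipped auto dataset in place of the tutorial's data; the names model1, model2 and pred_price are chosen only for illustration.

```stata
sysuse auto, clear

* Two specifications, each stored under its own name
regress price mpg
estimates store model1
regress price mpg weight
estimates store model2

estimates dir                      // list the stored results
estimates table model1 model2      // coefficients side by side

* Make model1 the active results again before a post-estimation command
estimates restore model1
predict pred_price, xb             // linear prediction based on model1
estimates query
```

Without the estimates restore line, predict would use the results of the last regression that was run, which here is model2.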


Moreover, there are many options available to customize the table the way you like it. As is common to all commands, options are placed after the comma. To place stars next to the coefficients to denote their significance level, use the option star( ). The significance levels can be chosen freely, but these are the conventional ones in Economics. To include standard errors in the table add se as an option. Note that Stata does not allow adding both standard errors and significance stars. It is also possible to include other statistics than the estimated coefficients. All scalars stored along with the estimation under e( ) (more on this in section 9.3) can be included. To include one or more of these statistics, add the option stats(scalarlist) where scalarlist is a list of the statistics you want to add to the table. For example,

    estimates table model1 model2, star( ) stats(N)

creates a table with significance stars and the number of observations of each model.

The estout command has an even easier way of storing and displaying regression results, see the help file (after installing estout with the command ssc install estout), or the .pdf documentation.

9 Miscellaneous Topics

9.1 Reading a dataset with a different format

When you have a dataset in a format different from .dta, the best way to proceed depends on the format of your data.
