TYPES OF VARIABLES, STRUCTURE OF DATASETS, AND BASIC STATA LAYOUT

PRIMER FOR ACS OUTCOMES RESEARCH COURSE: TYPES OF VARIABLES, STRUCTURE OF DATASETS, AND BASIC STATA LAYOUT

STEP 1: Install STATA statistical software.
STEP 2: Read through this primer and complete the exercises prior to the course.

We will be using STATA statistical software for the biostatistics laboratory component of the ACS outcomes research course. We would like you to be familiar with a few key concepts covered in this primer to maximize your learning during the course.

Objectives:

To understand the following:
1) Different types of variables
2) The basic structure of a dataset

To become familiar with the following:
1) The different windows in STATA
2) How to create a small dataset
3) Two basic STATA commands to analyze data

DIFFERENT TYPES OF VARIABLES:

The table below reviews the most common types of variables you will encounter. It is important to understand the types of variables in your dataset because this guides the choice of a statistical test.

Variable Type   Description                                       Example                         Possible Values
Dichotomous     Can take on only two values (usually Yes or No)   Death after an operation        1 (death), 0 (alive)
Categorical     Can take on more than two values (but still
                confined to a limited range)
  Ordinal       The order of the categories has some inherent    Acuity of hospital admission    Elective, urgent, emergent
                meaning
  Nominal       The order of the categories has no meaning        The payer for a hospital stay   Medicare, Medicaid, Private
Continuous      Can take on any numeric value within a range      Patient age                     18, 35, 66, 75
                (whole numbers or fractions)

It is important to note two things about ordinal variables that distinguish them from continuous variables: 1) they are still confined to a limited number of categories; and 2) the distance between categories is not meaningful.
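As a rough illustration of how these variable types might be stored in STATA, the sketch below builds a tiny five-patient dataset. The variable names (died, acuity, payer, age), the numeric codes, and every patient value are hypothetical and chosen only for illustration; only the category names come from the table above.

```stata
* Hypothetical data: all names, codes, and values below are invented
clear
input byte died byte acuity byte payer age
0 1 1 66
1 3 2 75
0 2 3 35
0 1 3 18
0 3 1 66
end

* Value labels attach readable text to the numeric codes
label define died_lbl 0 "Alive" 1 "Death"
label values died died_lbl
label define acuity_lbl 1 "Elective" 2 "Urgent" 3 "Emergent"
label values acuity acuity_lbl
label define payer_lbl 1 "Medicare" 2 "Medicaid" 3 "Private"
label values payer payer_lbl

* Dichotomous, ordinal, and nominal variables are usually tabulated
tabulate died
tabulate acuity
tabulate payer

* A continuous variable is usually summarized instead
summarize age
```

Note that the ordinal variable (acuity) and the nominal variable (payer) are stored the same way; the difference lies in whether the order of the codes carries meaning, which is something the analyst has to keep track of.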

THE STRUCTURE OF A DATASET:

Most of you have used a dataset at some point. However, the structure can vary depending on the program or purpose of the dataset. We will introduce you to the basic vocabulary necessary to describe a dataset for the purposes of using STATA.

Picture a blank sheet of paper with horizontal and vertical lines intersecting each other, and you will have the basic structure needed to store data.

The vertical lines divide the sheet into columns. Each column represents a different variable, for example Patient ID number, age, gender, or race:

    Patient ID    Age    Gender    Race

The horizontal lines divide the sheet into rows. Each row represents what is called an observation, for example a patient who has surgery:

    Patient 1     (first observation)
    Patient 2
    Patient 3
    Patient 4

Most datasets have the variable names in the first row (or just over the first row as a label). Each additional row (observation) then includes the values of each variable for each patient.

    Patient ID    Age    Gender    Race                (variable names)
    Patient 1     55     Male      White               (variable values)
    Patient 2     44     Female    Black
    Patient 3     63     Male      Native American

As you will learn, STATA has this same data storage structure, and it is maintained behind the scenes in the Data Editor.
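The three-patient example above can be sketched in STATA as follows. This is a minimal illustration that types the data with the input command rather than the Data Editor; the variable names (patient_id, age, gender, race) are assumptions, since STATA variable names cannot contain spaces.

```stata
* Three observations (rows) and four variables (columns)
* Variable names are illustrative; str# sets the maximum text length
clear
input str10 patient_id age str6 gender str15 race
"Patient 1" 55 "Male"   "White"
"Patient 2" 44 "Female" "Black"
"Patient 3" 63 "Male"   "Native American"
end

* Show the variables and how each one is stored
describe

* Show the data row by row
list

* Number of observations and number of variables
display _N
display c(k)
```

Here _N holds the number of observations (rows) and c(k) the number of variables (columns).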

GETTING FAMILIAR WITH STATA LAYOUT:

When you first open the STATA software, it should look like Figure 1. There are four basic windows, which are described in the figure. If your STATA does not look like the figure, go to the menu, click Prefs (for preferences), and scroll down to Default Windowing.

Behind these four windows, there are several other hidden areas of STATA. For instance, click on the Data Editor icon on the toolbar (see Figure 1). This will open a behind-the-scenes spreadsheet where the data is stored (see Figure 2). We will build a small dataset by typing directly into the Data Editor later in this primer.

It is worth noting that commands are usually typed directly into the Command window, but STATA also has drop-down menus for most commands.

Figure 1. The basic layout of STATA.

    Data Editor icon:     Click this icon to open the behind-the-scenes spreadsheet.
    Do-file Editor icon:  Click this icon and the do-file window will open. Here you can write numerous commands, run them, and save them.
    Variables window:     Variables are listed here. Tip: Click on a variable and it will appear in the Command window.
    Review window:        Previous commands appear here. Tip: Click on a command and it will reappear in the Command window.
    Command window:       Type commands here. Hit return and they will run.
    Results window:       Results of your analyses will appear here.

CREATING A SMALL STATA DATASET:

To become more familiar with STATA, we will create a small dataset with information on five of your friends. You will have four variables: first name, last name, age, and whether they voted for Obama or Romney in the 2012 presidential election.

To create this dataset, you will type the data directly into the STATA Data Editor. Open the Data Editor and click in the upper-left box. Start by typing the first name of one of your friends. Hit return and the variable name, var1, will appear above the column. Now click on var1 and you can enter the variable name in the variable window in the bottom right-hand corner. Type first_name as the name of the first variable.

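Continue by adding your friend's last name in the next column. (Remember to hit return after typing the name.) Then add the variable name, last_name, just as you did above. Next add the other two variables, naming them age and vote: the friend's age and whether they voted for Obama or Romney in the election. Type the vote the same way each time (for example, always obama or romney in lowercase), because STATA treats text with different spelling or capitalization as different values. Then you can start on the next observation.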

Repeat the process until you've entered data for all 5 of your friends. Your finished dataset should have five rows (one for each friend) and four columns (first_name, last_name, age, and vote).

Click on the X in the upper right-hand corner of the Data Editor window to close the Data Editor. (But be careful not to close the whole program by clicking the X outside the Data Editor.) Now your STATA should appear like Figure 1, except that the variables listed should be the ones you just created.

In this dataset, each of your friends represents a row, or observation, and each variable has its own column. Also note that age is a continuous variable and the vote is a dichotomous variable. But what type of variables are first_name and last_name? These variables don't fall into our previous classification because they're not numbers. STATA refers to data that contain text or words instead of numbers as string variables.
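For readers who would like to see the same dataset built from commands, here is a minimal sketch using the input command instead of the Data Editor. This is an alternative to the point-and-click steps described in this primer, not a replacement for them. The five friends, their names, and their ages are hypothetical placeholders; substitute your own data.

```stata
* Hypothetical friends: every name and value below is invented
* str10 and str6 declare string (text) variables of up to 10 and 6 characters
clear
input str10 first_name str10 last_name age str6 vote
"Ana"   "Lopez"  33 "obama"
"Ben"   "Carter" 35 "romney"
"Chloe" "Nguyen" 42 "obama"
"David" "Smith"  44 "romney"
"Emma"  "Jones"  55 "obama"
end

* describe reports a storage type for each variable;
* types beginning with str are string variables
describe

* Display all five observations
list
```

In the describe output, first_name, last_name, and vote should appear with str storage types, while age appears with a numeric type.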

STARTING TO ANALYZE YOUR DATASET:

Now we will try a few tricks using your new dataset. Type the following text in your Command window. (In the following text, the box around a command just lets you know that it is STATA syntax; this convention will also be used during the laboratory sessions.)

Command:

    +---------------+
    | summarize age |
    +---------------+

The output should appear as follows:

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
             age |          5        41.8    8.700575         33         55

There is a lot of interesting information here. It tells us there are 5 observations and the mean age is 41.8 years, with a standard deviation of almost 9 years. The range is also given as 33 years to 55 years.

Now we will examine the 2012 vote variable. Because this is a dichotomous variable, we will create a table and see who your friends voted for.

Command:

    +----------+
    | tab vote |
    +----------+

STATA output:

           vote |      Freq.     Percent        Cum.
    ------------+-----------------------------------
          obama |          3       60.00       60.00
         romney |          2       40.00      100.00
    ------------+-----------------------------------
          Total |          5      100.00

Remember that you have entered this dichotomous variable as two text words (string variables). It would be better to enter these as 0 and 1 and then label them, which you will learn to do in the lab.

This concludes the primer. We look forward to seeing you at the course!
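As a preview of the 0 and 1 coding mentioned above, the sketch below shows one possible way to do it; the lab may teach a different approach. The ages are hypothetical values chosen so that the summary matches the output shown earlier, and the names voted_obama and vote_lbl are illustrative assumptions.

```stata
* Hypothetical data: ages picked to give a mean of 41.8 and a range of 33 to 55
clear
input age str6 vote
33 "obama"
35 "romney"
42 "obama"
44 "romney"
55 "obama"
end

* The two commands from this primer
summarize age
tabulate vote

* Create a numeric 0/1 version of the string variable:
* the expression (vote == "obama") is 1 when true and 0 when false
generate byte voted_obama = (vote == "obama")

* Attach readable labels to the numeric codes
label define vote_lbl 0 "Romney" 1 "Obama"
label values voted_obama vote_lbl

* The table now shows the labels, but the stored values are 0 and 1
tabulate voted_obama
```

Storing the vote as a labeled number rather than as text keeps the table readable while letting the variable be used in commands that require numeric data.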