Title

    misstable    Tabulate missing values

    Description   Quick start   Menu   Syntax   Options   Remarks and examples   Stored results   Also see

Description

misstable makes tables that help you understand the pattern of missing values in your data.

Quick start

Tables with counts of missing values

    Missing observations in v1, v2, and v3
        misstable summarize v1 v2 v3

    Missing observations in v1, v2, and v3 for cases where v4 > 10
        misstable summarize v1 v2 v3 if v4>10

    All variables with missing values
        misstable summarize

    Include variables with no missing values in the table
        misstable summarize, all

    Create 3 missing-data indicator variables with prefix m for v1, v2, and v3
        misstable summarize v1 v2 v3, generate(m)

Tables of missing-value patterns

    Missing-value patterns for v1, v2, and v3
        misstable patterns v1 v2 v3

    Same as above, but for all variables with missing values
        misstable patterns

    Show variables in the order listed in the command
        misstable patterns v1 v3 v2, asis

    Show frequencies, rather than percentages, in the output
        misstable patterns v1 v2 v3, frequency

Missing-value patterns displayed as trees

    A tree view of missing-value patterns for v1, v2, and v3
        misstable tree v1 v2 v3

    For all variables with missing values
        misstable tree

Nesting patterns of missing values

    Missing values for v1, v2, and v3
        misstable nested v1 v2 v3

    Same as above, but for all variables with missing values
        misstable nested

    Treat extended missing values (.a, .b, ..., .z) as nonmissing
        misstable nested v1 v2 v3, exok

Menu

Statistics > Summaries, tables, and tests > Other tables > Tabulate missing values

Syntax

Report counts of missing values

    misstable summarize [varlist] [if] [in] [, summarize_options]

Report pattern of missing values

    misstable patterns [varlist] [if] [in] [, patterns_options]

Present a tree view of the pattern of missing values

    misstable tree [varlist] [if] [in] [, tree_options]

List the nesting rules that describe the missing-value pattern

    misstable nested [varlist] [if] [in] [, nested_options]

    summarize_options          Description
    ---------------------------------------------------------------------
    all                        show all variables
    showzeros                  show zeros in table
    generate(stub [, exok])    generate missing-value indicators

    patterns_options           Description
    ---------------------------------------------------------------------
    asis                       use variables in order given
    frequency                  report frequencies instead of percentages
    exok                       treat .a, .b, ..., .z as nonmissing
    replace                    replace data in memory with dataset of patterns
    clear                      okay to replace even if original unsaved
    bypatterns                 list by patterns rather than by frequency

    tree_options               Description
    ---------------------------------------------------------------------
    asis                       use variables in order given
    frequency                  report frequencies instead of percentages
    exok                       treat .a, .b, ..., .z as nonmissing

    nested_options             Description
    ---------------------------------------------------------------------
    exok                       treat .a, .b, ..., .z as nonmissing

In addition, programmer's option nopreserve is allowed with all syntaxes; see [P] nopreserve option.

Options

Options are presented under the following headings:

    Options for misstable summarize
    Options for misstable patterns
    Options for misstable tree
    Option for misstable nested
    Common options

Options for misstable summarize

all specifies that the table should include all the variables specified or all the variables in the dataset. The default is to include only numeric variables that contain missing values.

showzeros specifies that zeros in the table should display as 0 rather than being omitted.

generate(stub [, exok]) requests that a missing-value indicator newvar, a new binary variable containing 0 for complete observations and 1 for incomplete observations, be generated for every numeric variable in varlist containing missing values. If the all option is specified, missing-value indicators are created for all the numeric variables specified or for all the numeric variables in the dataset. If exok is specified within generate(), the extended missing values .a, .b, ..., .z are treated as if they do not designate missing.

    For each variable in varlist, newvar is the corresponding variable name varname prefixed with stub. If the total length of stub and varname exceeds 32 characters, newvar is abbreviated so that its name does not exceed 32 characters.

Options for misstable patterns

asis, frequency, and exok; see Common options below.

replace specifies that the data in memory be replaced with a dataset corresponding to the table just displayed; see misstable patterns under Remarks and examples below.

clear is for use with replace; it specifies that it is okay to change the data in memory even if they have not been saved to disk.

bypatterns specifies that the table be ordered by pattern rather than by frequency. That is, bypatterns specifies that patterns containing one incomplete variable be listed first, followed by those for two incomplete variables, and so on. The default is to list the most frequent pattern first, followed by the next most frequent pattern, etc.

Options for misstable tree

asis, frequency, and exok; see Common options below.

Option for misstable nested

exok; see Common options below.
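The effect of exok inside generate() is easiest to see side by side. The following is a minimal sketch, not taken from the manual: it assumes the auto dataset shipped with Stata, and the recode of some missing rep78 values to .a is purely hypothetical, done only so that the variable holds both kinds of missing value. The stubs m_ and x_ are arbitrary names chosen for illustration.

```stata
* Assumption: the shipped auto dataset stands in for real data
sysuse auto, clear

* Hypothetical recode: turn the missing rep78 values of foreign cars into .a
replace rep78 = .a if missing(rep78) & foreign == 1

* Default: system missing (.) and extended missing (.a) both count as incomplete
misstable summarize rep78, generate(m_)

* With exok: only system missing (.) counts as incomplete
misstable summarize rep78, generate(x_, exok)

* Cross-tabulate the two indicators; observations recoded to .a should be
* flagged 1 by m_rep78 but 0 by x_rep78
tabulate m_rep78 x_rep78
```

Using a different stub for each call avoids a name clash, because generate() creates new variables and will not overwrite existing ones.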

Common options

asis specifies that the order of the variables in the table be the same as the order in which they are specified on the misstable command. The default is to order the variables by the number of missing values, and within that, by the amount of overlap of missing values.

frequency specifies that the table should report frequencies instead of percentages.

exok specifies that the extended missing values .a, .b, ..., .z should be treated as if they do not designate missing. Some users use extended missing values to designate values that are missing for a known and valid reason.

nopreserve is a programmer's option allowed with all misstable commands; see [P] nopreserve option.

Remarks and examples

Remarks are presented under the following headings:

    misstable summarize
    misstable patterns
    misstable tree
    misstable nested
    Execution time of misstable nested

In what follows, we will use data from a 125-observation, fictional, student-satisfaction survey:

    . use http://www.stata-press.com/data/r15/studentsurvey
    (Student Survey)

    . summarize

        Variable |        Obs        Mean    Std. Dev.       Min        Max
    -------------+---------------------------------------------------------
              m1 |        125       2.456    .8376619          1          4
              m2 |        125       2.472    .8089818          1          4
             age |        122    18.97541    .8763477         17         21
          female |        122    .5245902    .5014543          0          1
            dept |        116    2.491379    1.226488          1          4
       offcampus |        125         .36    .4819316          0          1
         comment |          0

The m1 and m2 variables record the student's satisfaction with teaching and with academics. comment is a string variable recording any comments the student might have had.

misstable summarize

Example 1

misstable summarize reports counts of missing values:

    . misstable summarize
                                                         Obs<.
                                              +------------------------------
                   |                          | Unique
          Variable |   Obs=.   Obs>.   Obs<.  | values        Min         Max
      -------------+--------------------------+------------------------------
               age |       3             122  |      5         17          21
            female |       3             122  |      2          0           1
              dept |       9             116  |      4          1           4

Stata provides 27 different missing values, namely, ., .a, .b, ..., .z. The first of those, ., is often called system missing. The remaining missing values are called extended missings. The nonmissing and missing values are ordered nonmissing < . < .a < .b < ... < .z. Thus reported in the column Obs=. are counts of system missing values; in the column Obs>., extended missing values; and in the column Obs<., nonmissing values.

The rightmost portion of the table is included to remind you how the variables are encoded.

Our data contain seven variables and yet misstable reported only three of them. The omitted variables contain no missing values or are string variables. Even if we specified the varlist explicitly, those variables would not appear in the table unless we specified the all option.

We can also create missing-value indicators for each of the variables above using the generate() option:

    . quietly misstable summarize, generate(miss_)

    . describe miss_*

                    storage   display    value
    variable name   type      format     label      variable label
    ----------------------------------------------------------------
    miss_age        byte      %8.0g                 (age>=.)
    miss_female     byte      %8.0g                 (female>=.)
    miss_dept       byte      %8.0g                 (dept>=.)

For each variable containing missing values, the generate() option creates a new binary variable containing 0 for complete observations and 1 for incomplete observations. In our example, three new missing-value indicators are generated, one for each of the incomplete variables age, female, and dept. The naming convention of generate() is to prefix the corresponding variable names with the specified stub, which is miss_ in this example.

Missing-value indicators are useful, for example, for checking whether data are missing completely at random. They are also often used within the multiple-imputation context to identify the observed and imputed data; see [MI] intro substantive for a general introduction to multiple imputation. Within Stata's multiple-imputation commands, an incomplete value is identified by the system missing value, a dot. By default, misstable summarize, generate() marks the extended missing values as incomplete values, as well. You can use exok within generate() to treat extended missing values as complete when creating missing-value indicators.
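As an illustration of how such indicators might be used, the sketch below compares observations with and without a missing dept on variables that are fully observed. This is an informal check under my own assumptions, not a procedure given in the manual and not a formal test of the missing-completely-at-random assumption; the choice of offcampus and m1 as comparison variables is arbitrary.

```stata
use http://www.stata-press.com/data/r15/studentsurvey, clear

* Create miss_age, miss_female, and miss_dept as in the example above
quietly misstable summarize, generate(miss_)

* Is missingness in dept related to living off campus?
tabulate miss_dept offcampus, row

* Does satisfaction with teaching differ by whether dept is missing?
ttest m1, by(miss_dept)
```

With only 9 observations missing dept, any such comparison has little power, so a nonsignificant result here says little either way.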

misstable patterns

Example 2

misstable patterns reports the pattern of missing values:

    . misstable patterns

       Missing-value patterns
         (1 means complete)

                  |   Pattern
        Percent   |  1  2  3
      ------------+-------------
           93%    |  1  1  1
                  |
            5     |  1  1  0
            2     |  0  0  0
      ------------+-------------
          100%    |

      Variables are  (1) age  (2) female  (3) dept

There are three patterns in these data: (1,1,1), (1,1,0), and (0,0,0). By default, the rows of the table are ordered by frequency. In larger tables that have more patterns, it is sometimes useful to order the rows by pattern. We could have obtained that by typing misstable patterns, bypatterns.

In a pattern, 1 indicates that all values of the variable are nonmissing and 0 indicates that all values are missing. Thus pattern (1,1,1) means no missing values, and 93% of our data have that pattern. There are two patterns in which variables are missing, (1,1,0) and (0,0,0). Pattern (1,1,0) means that age is nonmissing, female is nonmissing, and dept is missing. The order of the variables in the patterns appears in the key at the bottom of the table. Five percent of the observations have pattern (1,1,0). The remaining 2% have pattern (0,0,0), meaning that all three variables contain missing.

As with misstable summarize, only numeric variables that contain missing are listed, so had we typed misstable patterns comment age female offcampus dept, we still would have obtained the same table. Variables that are automatically omitted contain no missing values or are string variables.

The variables in the table are ordered from lowest to highest frequency of missing values, although you cannot see that from the information presented in the table. The variables are ordered this way even if you explicitly specify the varlist with a different ordering. Typing misstable patterns dept female age would produce the same table as above. Specify the asis option if you want the variables in the order in which you specify them.

You can obtain a dataset of the patterns by specifying the replace option:

    . misstable patterns, replace clear

       Missing-value patterns
         (1 means complete)

                  |   Pattern
        Percent   |  1  2  3
      ------------+-------------
           93%    |  1  1  1
                  |
            5     |  1  1  0
            2     |  0  0  0
      ------------+-------------
          100%    |

      Variables are  (1) age  (2) female  (3) dept
    (summary data now in memory)

    . list

         +------------------------------+
         | _freq   age   female   dept |
         |------------------------------|
      1. |     3     0        0      0 |
      2. |     6     1        1      0 |
      3. |   116     1        1      1 |
         +------------------------------+

The differences between the dataset and the printed table are that 1) the dataset always records frequency and 2) the rows are reversed.

misstable tree

Example 3

misstable tree presents a tree view of the pattern of missing values:

    . use http://www.stata-press.com/data/r15/studentsurvey, clear
    (Student Survey)

    . misstable tree, frequency

      Nested pattern of missing values
            dept         age      female
      ------------------------------------
               9           3           3
                                       0
                           6           0
                                       6
             116           0           0
                                       0
                         116           0
                                     116
      ------------------------------------
     (number missing listed first)

In this example, we specified the frequency option to see the table in frequency rather than percentage terms. In the table, each column sums to the total number of observations in the data, 125. Variables are ordered from those with the most missing values to those with the least.

Start with the first column. The dept variable is missing in 9 observations and, farther down, the table reports that it is not missing in 116 observations.

Go back to the first row and read across, but only to the second column. The dept variable is missing in 9 observations. Within those 9, age is missing in 3 of them and is not missing in the remaining 6. Reading down the second column, within the 116 observations in which dept is not missing, age is missing in 0 and not missing in 116.

Reading straight across the first row again, dept is missing in 9 observations, and within the 9, age is missing in 3, and within the 3, female is also missing in 3. Skipping down just a little, within the 6 observations for which dept is missing and age is not missing, female is not missing, either.
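The way the tree is read can be checked by hand with count and the missing() function. This is a minimal sketch of that cross-check, assuming the same studentsurvey dataset; the counts noted in the comments are the ones the tree above reports, not output I have reproduced independently.

```stata
use http://www.stata-press.com/data/r15/studentsurvey, clear

* First column of the tree: dept missing, then dept not missing
count if missing(dept)                       // tree reports 9
count if !missing(dept)                      // tree reports 116

* Second column, within the observations where dept is missing
count if missing(dept) & missing(age)        // tree reports 3
count if missing(dept) & !missing(age)       // tree reports 6

* Third column, within dept missing and age missing
count if missing(dept) & missing(age) & missing(female)   // tree reports 3
```

Note that missing() is true for extended missing values as well as for system missing, which matches the default behavior of misstable tree without the exok option.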

misstable nested

Example 4

misstable nested lists the nesting rules that describe the missing-value pattern:

    . misstable nested
      1.  female(3) <-> age(3) -> dept(9)

This line says that in observations in which female is missing, so is age missing, and vice versa, and in observations in which age (or female) is missing, so is dept. The numbers in parentheses are counts of the missing values. The female variable happens to be missing in 3 observations, and the same is true for age; the dept variable is missing in 9 observations. Thus dept is missing in the 3 observations for which age and female are missing, and in 6 more observations, too.

In these data, it turns out that the missing-value pattern can be summarized in one statement. In a larger dataset, you might see something like this:

    . misstable nested
      1.  female(50) <-> age(50) -> dept(120)
      2.  female(50) -> m1(58)
      3.  offcampus(11)

misstable nested accounts for every missing value. In the above, in addition to female <-> age -> dept, we have that female -> m1, and we have offcampus, the last all by itself. The last line says that the 11 missing values in offcampus are not themselves nested in the missing values of any other variable, nor do they imply the missing values in another variable. In some datasets, all the statements will be of this last form.

In our data, however, we have one statement:

    . misstable nested
      1.  female(3) <-> age(3) -> dept(9)

When the missing-value pattern can be summarized in one misstable nested statement, the pattern of missing values in the data is said to be monotone.

Execution time of misstable nested

The execution time of misstable nested is affected little by the number of observations but can grow quickly with the number of variables, depending on the fraction of missing values within each variable. The execution time of the example above, which has 3 variables containing missing, is instant. In worst-case scenarios, with 500 variables, the time might be 25 seconds; with 1,000 variables, the execution time might be closer to an hour.

In situations where misstable nested takes a long time to complete, it will produce thousands of rules that will defy interpretation. A 523-variable dataset we have seen ran in 20 seconds and produced 8,040 rules. Although we spotted a few rules in the output that did not surprise us, such as a missing year of a date implying that the month and the day were also missing, mostly the output was not helpful.

If you have such a dataset, we recommend you run misstable on groups of variables for which you have reason to believe the patterns of missing values might be related.
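A nesting rule can be restated as a logical condition and verified with assert, which may help when reading the arrows. The sketch below is my own illustration, assuming the studentsurvey dataset; it also shows the recommended practice of passing an explicit group of variables rather than the whole dataset.

```stata
use http://www.stata-press.com/data/r15/studentsurvey, clear

* Restrict the search to a group of variables thought to be related
misstable nested age female dept

* female <-> age: the two variables are missing in exactly the same observations
assert missing(female) == missing(age)

* age -> dept: whenever age is missing, dept is missing too
assert missing(dept) if missing(age)
```

assert is silent when the condition holds in every observation and stops with an error otherwise, so a clean run confirms the rule for the data in memory.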

Stored results

misstable summarize stores the following values of the last variable summarized in r():

    Scalars
      r(N_eq_dot)       number of observations containing .
      r(N_gt_dot)       number of observations containing .a, .b, ..., .z
      r(N_lt_dot)       number of observations containing nonmissing
      r(K_uniq)         number of unique, nonmissing values
      r(min)            variable's minimum value
      r(max)            variable's maximum value

    Macros
      r(vartype)        numeric, string, or none

r(K_uniq) contains . if the number of unique, nonmissing values is greater than 500. r(vartype) contains none if no variables are summarized, and in that case, the values of the scalars are all set to missing (.). Programmers intending to access results after misstable summarize should specify the all option.

misstable patterns stores the following in r():

    Scalars
      r(N_complete)     number of complete observations
      r(N_incomplete)   number of incomplete observations
      r(K)              number of patterns

    Macros
      r(vars)           variables used in order presented

r(N_complete) and r(N_incomplete) are defined with respect to the variables specified if variables were specified and otherwise, defined with respect to all the numeric variables in the dataset. r(N_complete) is the number of observations that contain no missing values.

misstable tree stores the following in r():

    Macros
      r(vars)           variables used in order presented

misstable nested stores the following in r():

    Scalars
      r(K)              number of statements

    Macros
      r(stmt1)          first statement
      r(stmt2)          second statement
      ...
      r(stmt`r(K)')     last statement
      r(stmt1wc)        r(stmt1) with missing-value counts
      r(vars)           variables considered

A statement is encoded varname, varname op varname, or varname op varname op varname, and so on; op is either -> or <->.

Also see

    [MI] mi misstable       Tabulate pattern of missing values
    [R] summarize           Summary statistics
    [R] tabulate oneway     One-way table of frequencies
    [R] tabulate twoway     Two-way table of frequencies
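A short sketch of how these stored results might be picked up in a do-file follows. It assumes the studentsurvey dataset used above, and it assumes the result names are spelled with underscores and capital N and K, as is conventional for Stata stored results; the names are garbled in this transcript, so run return list after each command to confirm the exact spelling before relying on them.

```stata
use http://www.stata-press.com/data/r15/studentsurvey, clear

* misstable summarize keeps results for the last variable summarized only;
* the all option ensures the variable is summarized even if it is complete
quietly misstable summarize dept, all
return list
display "dept: system missing = " r(N_eq_dot) ", nonmissing = " r(N_lt_dot)

* Counts of complete and incomplete observations across the incomplete variables
quietly misstable patterns
display "complete = " r(N_complete) ", incomplete = " r(N_incomplete)

* Loop over the nesting statements; copy r(K) to a local first so the loop
* bound does not depend on r() surviving later commands
quietly misstable nested
local k = r(K)
forvalues i = 1/`k' {
    display "rule `i': `r(stmt`i')'"
}
```

Testing whether the local k equals 1 would then be one way to flag a monotone missing-value pattern programmatically, following the definition given under misstable nested.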