Lecture 26: Missing data


Reading: ESL 9.6
STATS 202: Data mining and analysis
December 1, 2017

Missing data is everywhere

- Survey data: nonresponse.
- Longitudinal studies and clinical trials: dropout.
- Recommendation systems: different individuals interact with or express preferences for different items.
- Data integration: different variables collected by different organizations or in different experiments or trials.

Mechanisms for missing data

- Missing completely at random (MCAR): the pattern of missingness is independent of the missing values and of the values of any measured variables.
  Example: We run a taste study for 20 different drinks, and each subject is asked to rate only 4 drinks chosen at random.

- Missing at random (MAR): the pattern of missingness depends on other predictors, but conditional on the observed variables, missingness is independent of the missing value.
  Example: In a survey, poor subjects are less likely to answer a question about drug use than wealthy subjects. Missingness is related to an observed predictor (income) but not to drug use itself.

- Missing not at random (MNAR): the pattern of missingness is related to the missing variable itself, even after accounting for the measured variables.
  Example 1: High earners are less likely to report their income.
  Example 2: We record the time until each subject has an accident, but only follow subjects for three years (censoring).
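To make the three mechanisms concrete, here is a small simulation sketch (my own illustration, not from the slides) in the spirit of the income/drug-use example; all variable names and probabilities are made up. It shows that the complete-case mean stays roughly unbiased under MCAR but not under MAR or MNAR.

import numpy as np

rng = np.random.default_rng(0)
n = 10_000
income = rng.normal(50, 15, n)                          # observed predictor
p_use = 1 / (1 + np.exp((income - 50) / 10))            # drug use more likely at low income
drug_use = (rng.random(n) < p_use).astype(float)        # variable that may go missing

# MCAR: missingness ignores everything.
mcar = rng.random(n) < 0.3
# MAR: nonresponse depends only on the observed predictor (income).
mar = rng.random(n) < np.where(income < 40, 0.5, 0.1)
# MNAR: nonresponse depends on the unobserved value itself.
mnar = rng.random(n) < np.where(drug_use == 1, 0.5, 0.1)

print("true mean:", drug_use.mean())
for name, mask in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    # Complete-case mean: close to the truth under MCAR, biased under MAR and MNAR here.
    print(name, drug_use[~mask].mean())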

Dealing with missing data

- Categorical case: treat "missing" as an additional category (see the sketch below).
- Surrogate variables: tree-based methods like CART can handle missingness by introducing surrogate variables.
- Observation deletion: delete observations with missing values.
  Drawbacks: reduces the dataset size, can bias the input feature space, and doesn't work at test time.
- Variable deletion: delete variables with missing values.
  Drawbacks: may throw away a valuable variable, and can bias the input feature space.
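A minimal pandas sketch (illustrative column names, not from the slides) of the simplest of these options: treating "missing" as its own category, and deleting observations or variables.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "color":  ["red", None, "blue", "red"],
    "height": [1.70, 1.60, np.nan, 1.80],
    "weight": [60.0, np.nan, np.nan, 75.0],
})

# Categorical case: make missingness an explicit category.
df["color"] = df["color"].fillna("missing")

# Observation deletion: drop any row with a missing value.
complete_rows = df.dropna(axis=0)

# Variable deletion: drop any column with a missing value.
complete_cols = df.dropna(axis=1)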

Dealing with missing data

Single imputation: replace each missing value with a single number.

1. Replace with the mean or median of the column.
2. Replace with a random sample from the non-missing values in the column.
3. Replace missing values in X_j with a regression estimate from the other predictors X_{-j}.

Drawbacks: Methods 1 and 2 can give biased coefficients if the data are not missing completely at random. Method 3 does not have this bias if the missing variable is predicted well by X_{-j}. In all three cases, the resulting inferences about estimated parameters or predictions do not account for uncertainty in the missing values.
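A minimal sketch (mine, not the slides') of the three single-imputation options on a toy data frame; the column names and the missingness rate are illustrative.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["x3"] += df["x1"] - df["x2"]                 # make x3 predictable from x1, x2
df.loc[rng.random(200) < 0.2, "x3"] = np.nan    # knock out ~20% of x3
miss = df["x3"].isna()

# 1. Mean (or median) imputation.
mean_imp = df["x3"].fillna(df["x3"].mean())

# 2. Random sample from the observed values of the column.
samp_imp = df["x3"].copy()
samp_imp[miss] = rng.choice(df.loc[~miss, "x3"], size=miss.sum())

# 3. Regression imputation: regress x3 on the other predictors x_{-3}.
reg = LinearRegression().fit(df.loc[~miss, ["x1", "x2"]], df.loc[~miss, "x3"])
reg_imp = df["x3"].copy()
reg_imp[miss] = reg.predict(df.loc[miss, ["x1", "x2"]])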

Dealing with missing data

Multiple imputation: form many imputed datasets by positing a distribution over the unobserved values and repeatedly sampling from that distribution. For example, each imputed dataset can be obtained by replacing every missing value in X_j with a regression estimate from the other predictors X_{-j}, plus some noise; this is repeated several times. The entire analysis is then run on each dataset, and the multiple results are combined to get a better estimate of uncertainty. If the regression fit of X_j onto X_{-j} is good, the standard errors from this method can be unbiased.
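A rough multiple-imputation sketch (my construction, not the slides'): noisy regression imputation repeated 5 times, with the downstream analysis rerun on each completed dataset and the results pooled. The setup mirrors the single-imputation sketch above; all names are illustrative.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
df["x3"] += df["x1"] - df["x2"]
df.loc[rng.random(200) < 0.2, "x3"] = np.nan
miss = df["x3"].isna()

# Regression of x3 on x_{-3}, and the residual spread used to add noise.
reg = LinearRegression().fit(df.loc[~miss, ["x1", "x2"]], df.loc[~miss, "x3"])
resid_sd = np.std(df.loc[~miss, "x3"] - reg.predict(df.loc[~miss, ["x1", "x2"]]))

coefs = []
for m in range(5):                               # 5 imputed datasets
    completed = df.copy()
    completed.loc[miss, "x3"] = (reg.predict(df.loc[miss, ["x1", "x2"]])
                                 + rng.normal(0, resid_sd, miss.sum()))
    # Rerun the whole analysis on each completed dataset (here: regress x1 on x2, x3).
    coefs.append(LinearRegression().fit(completed[["x2", "x3"]], completed["x1"]).coef_)

# Pool: the spread across imputations reflects uncertainty due to the missing values.
print(np.mean(coefs, axis=0), np.var(coefs, axis=0, ddof=1))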

Missing data in more than one variable

Problem: What if some observations have multiple missing values?

Iterative multiple imputation (a sketch follows below): start with a simple imputation, then iterate the following updates:
1. Update the imputation of X_1 given the current values of X_{-1}.
2. Update the imputation of X_2 given the current values of X_{-2}.
   ...
p. Update the imputation of X_p given the current values of X_{-p}.

Model-based imputation: posit a joint model for all variables, fit this model, and infer the best values for all missing entries. Rarely worth the trouble.
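One way to implement this cycling in practice is scikit-learn's IterativeImputer (a sketch under that assumption, not the specific tool named in the slides); with sample_posterior=True each call draws a different imputation, so repeating it yields multiple imputed datasets.

import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (activates the estimator)
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
X[rng.random(X.shape) < 0.15] = np.nan          # missing values scattered across all columns

imputed = []
for m in range(5):                               # several imputed copies of the data
    imp = IterativeImputer(sample_posterior=True, max_iter=10, random_state=m)
    imputed.append(imp.fit_transform(X))         # cycles over the columns X_1, ..., X_p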

Missing data in more than one variable

Problem: What if some observations have multiple missing values?

Low-rank matrix completion:

Motivation: In linear regression, ŷ can be understood as a projection of y onto the space spanned by the columns of X. In a sense, what matters is not X itself but this column space.

Key observation: If the predictor matrix is approximately low-rank (the points lie near a lower-dimensional subspace), then one can approximately recover X and its column space even if many entries are missing.

Low-rank matrix completion algorithms find a matrix \hat{X} which is similar to X in its non-missing entries and has a low-dimensional column space:

    \min_{\hat{X}} \|X - \hat{X}\| \quad \text{subject to} \quad \mathrm{rank}(\hat{X}) = k,

where \|X - \hat{X}\| is the sum of squared differences over the non-missing entries.

This problem can be relaxed to a convex optimization:

    \min_{\hat{X}} \|X - \hat{X}\| + \lambda \sum_{i=1}^{p} \sigma_i,

where \sigma_1, \dots, \sigma_p are the singular values of \hat{X}. The penalty parameter \lambda is inversely related to the rank of the solution and can be used as a tuning parameter.
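A rough soft-impute style sketch of this convex relaxation (my own illustration; the value of lam and the iteration count are arbitrary): alternate between filling the missing entries with the current low-rank estimate and soft-thresholding the singular values, which is what the nuclear-norm penalty amounts to.

import numpy as np

def soft_impute(X, lam, n_iter=100):
    """X has np.nan in the missing entries; lam is the singular-value threshold."""
    obs = ~np.isnan(X)
    X_hat = np.where(obs, X, 0.0)                # start by filling the holes with zeros
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X_hat, full_matrices=False)
        s = np.maximum(s - lam, 0.0)             # shrink singular values -> low rank
        Z = (U * s) @ Vt                         # current low-rank estimate
        X_hat = np.where(obs, X, Z)              # keep observed entries, fill the rest
    return X_hat

# Example: a rank-2 matrix with ~30% of its entries missing.
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 20))
X = M.copy()
X[rng.random(M.shape) < 0.3] = np.nan
M_hat = soft_impute(X, lam=1.0)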

Some practical considerations

- It is important to visualize summaries or plots of the pattern of missingness.
- If the pattern of missingness is informative, include it as a dummy variable (see the sketch below).
- If a variable has too many missing values, you may want to exclude it from your analysis (you can still include a missingness indicator for that variable).
- If the method allows it, consider weighting variables according to their rate of missing data. Example: in nearest neighbors, scale each variable and multiply it by (100 − % missing).
- When imputing, keep in mind that some variables are restricted to be positive or bounded.
- Some variables are well modeled as non-linear functions of other variables.
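A small pandas sketch (illustrative data, not from the slides) of two of these ideas: adding a missingness dummy variable, and down-weighting variables by their missing rate for a distance-based method such as nearest neighbors.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "income": [50.0, np.nan, 72.0, np.nan, 61.0],
    "age":    [34.0, 41.0, np.nan, 29.0, 55.0],
})

# Dummy variable for a potentially informative missingness pattern.
df["income_missing"] = df["income"].isna().astype(int)

# Weighting for a nearest-neighbor style method: standardize each column,
# then multiply by (100 - % missing) so sparsely observed variables count less.
num = df[["income", "age"]]
pct_missing = num.isna().mean() * 100
weighted = (num - num.mean()) / num.std() * (100 - pct_missing)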