Handling Missing Values

Handling Missing Values STAT 133 Gaston Sanchez Department of Statistics, UC Berkeley gastonsanchez.com github.com/gastonstat/stat133 Course web: gastonsanchez.com/stat133

Missing Values

Introduction
Missing values are very common:
- no answer in a questionnaire / survey
- data that are lost or destroyed
- machines that fail
- experiments/samples that are lost
- things not working

Introduction
"The best thing to do about missing values is not to have any."
Gertrude Cox

Missing Values in R
- Missing values in R are denoted with NA
- NA stands for Not Available
- NA is actually a logical value
- Do not confuse NA with "NA" (character)
- Do not confuse NA with NaN (not a number)

Missing Values Functions in R

# NA is a logical value
is.logical(NA)
## [1] TRUE

# NA is not the same as NaN
identical(NA, NaN)
## [1] FALSE

# NA is not the same as "NA"
identical(NA, "NA")
## [1] FALSE
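The distinction between NA, NaN and "NA" can be explored a little further. The following is a small sketch using only base R; the vector `v` is a made-up example, not part of the original slides.

```r
# NaN counts as missing for is.na(), but NA is not NaN
is.na(NaN)             # TRUE
is.nan(NA)             # FALSE

# NA is logical by default, but typed variants exist
typeof(NA)             # "logical"
typeof(NA_real_)       # "double"
typeof(NA_character_)  # "character"

# inside a vector, NA takes the type of the vector
v <- c(1.5, NA)
typeof(v[2])           # "double"

# the string "NA" is ordinary text, not a missing value
is.na("NA")            # FALSE
```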

Function is.na()
- is.na() indicates which elements are missing
- is.na() is a generic function (i.e. it can be used for vectors, factors, matrices, etc.)

x <- c(1, 2, 3, NA, 5)
x
## [1]  1  2  3 NA  5
is.na(x)
## [1] FALSE FALSE FALSE  TRUE FALSE

Function is.na()
is.na() on a factor

g <- factor(c(letters[rep(1:3, 2)], NA))
g
## [1] a    b    c    a    b    c    <NA>
## Levels: a b c
is.na(g)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE

Notice how missing values are denoted in factors

Function is.na()
is.na() on a matrix

m <- matrix(c(1:4, NA, 6:9, NA), 2)
m
##      [,1] [,2] [,3] [,4] [,5]
## [1,]    1    3   NA    7    9
## [2,]    2    4    6    8   NA
is.na(m)
##       [,1]  [,2]  [,3]  [,4]  [,5]
## [1,] FALSE FALSE  TRUE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE  TRUE

Function is.na()
is.na() on a data.frame

d <- data.frame(m)
d
##   X1 X2 X3 X4 X5
## 1  1  3 NA  7  9
## 2  2  4  6  8 NA
is.na(d)
##         X1    X2    X3    X4    X5
## [1,] FALSE FALSE  TRUE FALSE FALSE
## [2,] FALSE FALSE FALSE FALSE  TRUE

Function is.na()
If you're reading a data table in which missing values are coded differently from NA, you can specify the parameter na.strings

url <- "http://www.esapubs.org/archive/ecol/e084/094/momv3.3.txt"
df <- read.table(file = url, header = FALSE, sep = "\t", na.strings = "-999")
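Because the URL above may not always be reachable, here is a self-contained sketch of the same idea. The inline table `raw` and its two missing-value codes (-999 and "missing") are made up for illustration.

```r
# hypothetical tab-separated data: -999 and "missing" both encode missing values
raw <- "id\tweight\tcolor
1\t4.2\tred
2\t-999\tblue
3\t3.9\tmissing"

# na.strings takes a character vector with every code to be read as NA
dat <- read.table(text = raw, header = TRUE, sep = "\t",
                  na.strings = c("-999", "missing"),
                  stringsAsFactors = FALSE)
dat

# number of NA's per column after import
colSums(is.na(dat))
```

Passing several codes at once is convenient when numeric and text columns use different conventions for missing entries.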

Computing with NAs

Computing with NAs
Numerical computations using NA will normally result in NA

2 + NA
## [1] NA
x <- c(1, 2, 3, NA, 5)
x + 1
## [1]  2  3  4 NA  6
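The word "normally" matters: a few operations return a definite value because the result does not depend on what the missing value is. A short base R sketch of these cases:

```r
# the result is known no matter what the missing value is
NA & FALSE   # FALSE
NA | TRUE    # TRUE
NA^0         # 1

# the result does depend on the missing value, so it stays NA
NA & TRUE    # NA
NA * 0       # NA
NA == NA     # NA (use is.na() to test for missingness)
```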

Computing with NAs

sqrt(x)
## [1] 1.000000 1.414214 1.732051       NA 2.236068
mean(x)
## [1] NA
max(x)
## [1] NA

Argument na.rm
Most summarizing functions provide the argument na.rm; setting na.rm = TRUE removes missing values before performing the computation:

mean(x, na.rm = TRUE)
sd(x, na.rm = TRUE)
var(x, na.rm = TRUE)
min(x, na.rm = TRUE)
max(x, na.rm = TRUE)
sum(x, na.rm = TRUE)
etc.

Argument na.rm

x <- c(1, 2, 3, NA, 5)
mean(x, na.rm = TRUE)
## [1] 2.75
sd(x, na.rm = TRUE)
## [1] 1.707825
median(x, na.rm = TRUE)
## [1] 2.5

Argument na.rm

x <- c(1, 2, 3, NA, 5)
y <- c(2, 4, 7, 9, 11)
var(x, y, na.rm = TRUE)
## [1] 6.666667

Correlations with NA

# default correlation
cor(x, y)
## [1] NA

# argument 'use'
cor(x, y, use = 'complete.obs')
## [1] 0.9968896
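With more than two variables, the choice of `use` matters more. The sketch below contrasts "complete.obs" with "pairwise.complete.obs" on a made-up data frame `dat` (not from the slides) in which each column is missing in a different row.

```r
# hypothetical data: each column has its NA in a different row
dat <- data.frame(
  a = c(1, 2, 3, 4, NA, 6),
  b = c(2, NA, 6, 8, 10, 12),
  c = c(5, 3, 4, 1, 2, NA)
)

# rows with no NA in any column
sum(complete.cases(dat))

# every correlation is computed from the same set of complete rows
cor(dat, use = "complete.obs")

# each correlation uses all rows where both variables of the pair are present
cor(dat, use = "pairwise.complete.obs")
```

Pairwise deletion keeps more data, but each entry of the matrix may then be based on a different subset of rows.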

NA Actions

NA Actions
Additional functions for handling missing values:
- anyNA()
- na.omit()
- complete.cases()
- na.fail()
- na.exclude()
- na.pass()

Checking for missing values
A common operation is to check for the presence of missing values in a given object:

x <- c(1, 2, 3, NA, 5)
any(is.na(x))
## [1] TRUE

# alternatively
anyNA(x)
## [1] TRUE

Checking for missing values
Another common operation is to calculate the number of missing values:

y <- c(1, 2, 3, NA, 5, NA)
# how many NA's
sum(is.na(y))
## [1] 2
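The same counting idea extends to proportions and to whole data frames. A small sketch, using the `airquality` data set that ships with R as a stand-in example (it is not used in the slides):

```r
y <- c(1, 2, 3, NA, 5, NA)

# proportion of missing values (the mean of a logical vector)
mean(is.na(y))

# number of NA's in each column of a data frame
colSums(is.na(airquality))

# percentage of NA's in each column
round(100 * colMeans(is.na(airquality)), 1)
```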

Excluding missing values
Sometimes we want to remove missing values from a vector or factor:

x <- c(1, 2, 3, NA, 5, NA)
# excluding NA's
x[!is.na(x)]
## [1] 1 2 3 5

Excluding missing values
Another way to remove missing values from a vector or factor is with na.omit()

x <- c(1, 2, 3, NA, 5, NA)
# removing NA's
na.omit(x)
## [1] 1 2 3 5
## attr(,"na.action")
## [1] 4 6
## attr(,"class")
## [1] "omit"

Excluding missing values
There's also the na.exclude() function that we can use to remove missing values

x <- c(1, 2, 3, NA, 5, NA)
# removing NA's
na.exclude(x)
## [1] 1 2 3 5
## attr(,"na.action")
## [1] 4 6
## attr(,"class")
## [1] "exclude"
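On a plain vector na.omit() and na.exclude() look the same; the difference shows up when they are used as the na.action of a modeling function. The following sketch uses a made-up data frame `dat` and assumes a simple linear model is an acceptable illustration.

```r
# hypothetical data with one missing response
dat <- data.frame(x = c(1, 2, 3, 4, 5),
                  y = c(2.1, NA, 6.2, 7.9, 10.1))

fit_omit    <- lm(y ~ x, data = dat, na.action = na.omit)
fit_exclude <- lm(y ~ x, data = dat, na.action = na.exclude)

# both fits use the same four rows, so the coefficients agree
coef(fit_omit)
coef(fit_exclude)

# na.omit drops the incomplete row from residuals and fitted values
length(residuals(fit_omit))      # 4

# na.exclude pads the dropped row with NA, keeping the original length
length(residuals(fit_exclude))   # 5
residuals(fit_exclude)
```

Keeping the original length is handy when residuals or fitted values need to be attached back to the data frame.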

Excluding rows with missing values
Applying na.omit() on matrices or data frames will exclude the rows containing any missing value

DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
DF
##   x  y
## 1 1  0
## 2 2 10
## 3 3 NA

# remove rows with NA's
na.omit(DF)
##   x  y
## 1 1  0
## 2 2 10

Function complete.cases()
Likewise, we can use complete.cases() to get a logical vector indicating which rows have complete data:

DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
# which rows are complete
complete.cases(DF)
## [1]  TRUE  TRUE FALSE
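The logical vector returned by complete.cases() is typically used for subsetting. A brief sketch with the same DF; the name `ok` is just an illustrative variable.

```r
DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
ok <- complete.cases(DF)

# keep only the complete rows
DF[ok, ]

# inspect the incomplete rows instead
DF[!ok, ]

# check completeness on a subset of columns only
complete.cases(DF[, "x", drop = FALSE])
```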

Function na.fail()
na.fail() returns the object if it does not contain any missing values, and signals an error otherwise

x <- c(1, 2, 3, NA, 5)
na.fail(x) # fails
## Error in na.fail.default(x): missing values in object

y <- c(1, 2, 3, 4, 5)
na.fail(y) # doesn't fail
## [1] 1 2 3 4 5

Handling Missing Values

Dealing with missing values
What to do with missing values?
- Correct them (if possible)
- Deletion
- Imputation
- Leave them as is

Correcting NAs
- Perhaps there is more data now
- Go back to the original source
- Look for additional information

Deleting NAs
- How many NAs (counts, percents)?
- Can you get rid of them?
- What type of consequences?
- How bad is it to delete NAs?

Deleting NAs

x[!is.na(x)]
na.omit(df)
na.exclude(df)

Some functions and methods in R delete NAs by default, e.g. lm()

Imputing NAs
- Try to fill in values
- Several strategies to fill in values
- No magic wand technique

Imputation
Imputing with a measure of centrality
One option is to fill in values with some measure of centrality:
- mean value (quantitative variables)
- median value (quantitative variables)
- most common value (qualitative variables)
These options require inspecting each variable individually

Imputation
If a variable has a symmetric distribution, we can use the mean value

# mean value
mean_x <- mean(x, na.rm = TRUE)
# imputation
x[is.na(x)] <- mean_x

Imputation
If a variable has a skewed distribution, we can use the median value

# median value
median_x <- median(x, na.rm = TRUE)
# imputation
x[is.na(x)] <- median_x

Imputation
For a qualitative variable we can use the mode, i.e. the most common category (if there is one)

# mode: name of the most frequent level
g <- factor(c('a', 'a', 'b', 'c', NA, 'a'))
mode_g <- names(which.max(table(g)))
# imputation
g[is.na(g)] <- mode_g
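The three options above can be wrapped in one function. The helper `impute_central()` below is a hypothetical name introduced for illustration, not a base R function; it is a minimal sketch that assumes the caller picks a method suited to the variable type.

```r
# hypothetical helper: fill NA's with a chosen measure of centrality
impute_central <- function(v, method = c("mean", "median", "mode")) {
  method <- match.arg(method)
  if (!anyNA(v)) return(v)
  fill <- switch(method,
    mean   = mean(v, na.rm = TRUE),
    median = median(v, na.rm = TRUE),
    # table() ignores NA; ties are resolved in favor of the first value
    mode   = names(which.max(table(v)))
  )
  # table() names are text, so convert back for numeric inputs
  if (method == "mode" && is.numeric(v)) fill <- as.numeric(fill)
  v[is.na(v)] <- fill
  v
}

x <- c(1, 2, 3, NA, 5)
impute_central(x, "mean")     # NA replaced by 2.75
impute_central(x, "median")   # NA replaced by 2.5

g <- factor(c("a", "a", "b", "c", NA, "a"))
impute_central(g, "mode")     # NA replaced by level "a"
```

Filling every gap with a single value shrinks the variability of the variable, which is one reason the slides stress inspecting each variable individually.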

Imputation
Imputing with correlations
- Explore correlations between variables and look for high correlations

cor(x, y, use = "complete.obs")

- What is a high correlation?

Highly correlated variables

# subset of 'mtcars'
df <- mtcars[,c('mpg', 'disp', 'hp', 'wt')]
head(df)
##                    mpg disp  hp    wt
## Mazda RX4         21.0  160 110 2.620
## Mazda RX4 Wag     21.0  160 110 2.875
## Datsun 710        22.8  108  93 2.320
## Hornet 4 Drive    21.4  258 110 3.215
## Hornet Sportabout 18.7  360 175 3.440
## Valiant           18.1  225 105 3.460

# missing values in 'mpg'
df$mpg[c(5,20)] <- NA
mpg <- df$mpg

Highly correlated variables

# scatterplot matrix
pairs(df)

[Figure: scatterplot matrix of mpg, disp, hp and wt]

Highly correlated variables

# matrix of correlations
cor(df, use = 'complete.obs')
##             mpg       disp         hp         wt
## mpg   1.0000000 -0.8584325 -0.7729303 -0.8656222
## disp -0.8584325  1.0000000  0.7825073  0.8909678
## hp   -0.7729303  0.7825073  1.0000000  0.6386074
## wt   -0.8656222  0.8909678  0.6386074  1.0000000

mpg is most correlated with wt

Regression analysis with lm()

# regression of mpg on wt
regression <- lm(mpg ~ wt, data = df)
summary(regression)
##
## Call:
## lm(formula = mpg ~ wt, data = df)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.3073 -2.0725 -0.2766  1.5742  7.4297
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   36.000      1.860  19.351  < 2e-16 ***
## wt            -5.013      0.548  -9.148 6.62e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.883 on 28 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.7493, Adjusted R-squared:  0.7403
## F-statistic: 83.69 on 1 and 28 DF,  p-value: 6.615e-10

Regression analysis with lm()

# prediction
predict(regression, newdata = df[c(5,20),-1])
## Hornet Sportabout    Toyota Corolla
##          18.75370          26.80021

# compare with true values
mtcars$mpg[c(5,20)]
## [1] 18.7 33.9
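The slides stop at the prediction; the last step of the imputation is to write the predicted values back into the data. A minimal sketch that repeats the setup so it runs on its own (the names `fit` and `miss` are illustrative):

```r
# same setup as above: subset of mtcars with two mpg values set to NA
df <- mtcars[, c("mpg", "disp", "hp", "wt")]
df$mpg[c(5, 20)] <- NA

# lm() drops the rows with missing mpg by default
fit <- lm(mpg ~ wt, data = df)

# replace each missing mpg with its predicted value from wt
miss <- is.na(df$mpg)
df$mpg[miss] <- predict(fit, newdata = df[miss, ])

df[c(5, 20), ]
```

Imputed values of this kind fall exactly on the regression line, so they understate the natural scatter of mpg around it.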

Nearest Neighbors

Imputation
Imputing with similarities
We can calculate distances or similarities between two or more observations

Nearest Neighbors Idea
[Figure: observations plotted in the (U, V) plane]

Nearest Neighbor Imputation
- observations near each other in (u, v) space will have similar values (circles, crosses)
- find the k = 3 nearest points in (u, v) to the missing value
- let the circles and crosses vote
- if the neighbors have 2/3 circles or all circles, then assign circle (cross otherwise)
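The voting steps above can be sketched by hand in base R. The coordinates in `pts` and `new_pt` are made up for illustration, and Euclidean distance is an assumption, since the slides do not fix a distance measure at this point.

```r
# hypothetical observations in (u, v) space with a known class
pts <- data.frame(
  u = c(1.0, 1.2, 0.8, 3.0, 3.2, 2.9),
  v = c(1.0, 0.9, 1.3, 3.1, 2.8, 3.3),
  class = c("circle", "circle", "circle", "cross", "cross", "cross")
)

# observation whose class is missing
new_pt <- c(u = 1.1, v = 1.4)

# Euclidean distance from the new point to every known point
d <- sqrt((pts$u - new_pt[["u"]])^2 + (pts$v - new_pt[["v"]])^2)

# indices of the k = 3 nearest points
k <- 3
nearest <- order(d)[1:k]

# majority vote among the neighbors
votes <- table(pts$class[nearest])
names(which.max(votes))
```

Here the three nearest points are all circles, so the missing class is imputed as "circle".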

Nearest Neighbor Imputation
Questions
- How to choose k?
- How to choose u, v, ...? (predicting variables)
- What type of distance/similarity measure?
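The choice of distance interacts with the scale of the variables. The sketch below compares nearest neighbors under raw and standardized Euclidean distance, plus Manhattan distance, on three mtcars columns; the helper `nn()` is a hypothetical name introduced here.

```r
# predictors measured on very different scales
X <- mtcars[, c("disp", "hp", "wt")]

# full distance matrices under three choices
d_raw <- as.matrix(dist(X))                               # Euclidean, raw units
d_std <- as.matrix(dist(scale(X)))                        # Euclidean, standardized
d_man <- as.matrix(dist(scale(X), method = "manhattan"))  # Manhattan, standardized

# hypothetical helper: name of the closest other car
nn <- function(d, car) names(sort(d[car, colnames(d) != car]))[1]

nn(d_raw, "Toyota Corolla")
nn(d_std, "Toyota Corolla")
nn(d_man, "Toyota Corolla")
```

With raw units, variables with large ranges such as disp dominate the distance; standardizing gives each predictor a comparable weight, which may or may not change who the nearest neighbor is.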

Nearest Neighbor Imputation
Function knn() from package "class"

knn(train, test, cl, k = 1, l = 0, prob = FALSE, use.all = TRUE)

- train: matrix or data frame of training set cases
- test: matrix or data frame of test set cases
- cl: factor of true classifications of training set
- k: number of neighbors

Function knn()

# subset of 'mtcars'
df <- mtcars[,c('mpg', 'disp', 'hp', 'wt')]
head(df)
##                    mpg disp  hp    wt
## Mazda RX4         21.0  160 110 2.620
## Mazda RX4 Wag     21.0  160 110 2.875
## Datsun 710        22.8  108  93 2.320
## Hornet 4 Drive    21.4  258 110 3.215
## Hornet Sportabout 18.7  360 175 3.440
## Valiant           18.1  225 105 3.460

# missing values in 'mpg'
df$mpg[c(5,20)] <- NA
mpg <- df$mpg

Function knn()

library(class)

df_aux <- df[,-1]               # data without mpg
df_ok <- df_aux[!is.na(mpg), ]  # train set
df_na <- df_aux[is.na(mpg), ]   # test set

# 1 nearest neighbor
nn1 <- knn(train = df_ok, test = df_na, cl = mpg[!is.na(mpg)], k = 1)

Function knn()

# imputed values
nn1
## [1] 19.2 32.4
## 23 Levels: 10.4 13.3 14.3 14.7 15 15.2 15.5 15.8 16.4 17.3 17.

# compared to real values
mtcars$mpg[c(5,20)]
## [1] 18.7 33.9

Function knn()

# 3 nearest neighbors
nn3 <- knn(train = df_ok, test = df_na, cl = mpg[!is.na(mpg)], k = 3)

# imputed values
nn3
## [1] 19.2 30.4
## 23 Levels: 10.4 13.3 14.3 14.7 15 15.2 15.5 15.8 16.4 17.3 17.

# real values
mtcars$mpg[c(5,20)]
## [1] 18.7 33.9
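Note that knn() is a classifier: it treats each distinct mpg value as a class label (hence the "23 Levels" in the output) and returns the winning label. For a quantitative variable, a common alternative is to average the values of the k nearest neighbors. The following base R sketch is a swapped-in technique, not the method used in the slides, and it standardizes the predictors, which the slides do not do.

```r
# same setup as above
df <- mtcars[, c("mpg", "disp", "hp", "wt")]
df$mpg[c(5, 20)] <- NA
miss <- is.na(df$mpg)

# standardized predictors (an assumption; the slides use raw values)
X <- scale(df[, c("disp", "hp", "wt")])

k <- 3
imputed <- sapply(which(miss), function(i) {
  # Euclidean distances from row i to every row with observed mpg
  diffs <- sweep(X[!miss, , drop = FALSE], 2, X[i, ])
  d <- sqrt(rowSums(diffs^2))
  # average mpg of the k nearest complete rows
  mean(df$mpg[!miss][order(d)[1:k]])
})
imputed

# real values, for comparison
mtcars$mpg[c(5, 20)]
```

Averaging can produce values that never occur in the observed data, which is usually acceptable for a continuous variable such as mpg.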

R Packages VIM and missMDA

VIM and missMDA
- package "VIM" by Templ et al
- package "missMDA" by Francois Husson and Julie Josse

install.packages(c("VIM", "missMDA"))
library(VIM)
## Error in library(VIM): there is no package called 'VIM'
library(missMDA)
## Error in library(missMDA): there is no package called 'missMDA'

Data ozone
Data ozone (in "missMDA"): daily measurements of meteorological variables and ozone concentration:

data(ozone)
## Warning in data(ozone): data set 'ozone' not found
head(ozone, n = 5)
## Error in head(ozone, n = 5): object 'ozone' not found

Data ozone
Number of missing values in each variable:

num_na <- sapply(ozone, function(x) sum(is.na(x)))
## Error in lapply(X = X, FUN = FUN, ...): object 'ozone' not found
num_na[1:7]; num_na[8:13]
## Error in eval(expr, envir, enclos): object 'num_na' not found
## Error in eval(expr, envir, enclos): object 'num_na' not found

Looking at missing values

# aggregation for missing values
oz_aggr <- aggr(ozone, prop = TRUE, combined = TRUE, plot = FALSE)
## Error in eval(expr, envir, enclos): could not find function "aggr"

# summary
res <- summary(oz_aggr)
## Error in summary(oz_aggr): error in evaluating the argument 'object' in selecting a method for function 'summary': Error: object 'oz_aggr' not found

Looking at missing values

# variables sorted by number of missings
res$missings[order(res$missings[,2]), ]
## Error in eval(expr, envir, enclos): object 'res' not found

Looking at missing values

# combinations
head(res$combinations, n = 10)
## Error in head(res$combinations, n = 10): object 'res' not found

Looking at missing values

# combinations
tail(res$combinations, n = 10)
## Error in tail(res$combinations, n = 10): object 'res' not found
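The VIM calls in this transcript did not run because the packages were not installed, so no combinations are shown. A rough base R equivalent of the "combinations" idea is sketched below on the built-in `airquality` data, which stands in for the ozone data; it only tabulates which variables are missing together and is not a substitute for the output of aggr().

```r
# logical matrix: TRUE where a value is missing
miss <- is.na(airquality)

# one label per row naming the variables that are missing in that row
pattern <- apply(miss, 1, function(r) {
  if (any(r)) paste(colnames(miss)[r], collapse = "+") else "none"
})

# frequency of each missingness pattern, most common first
sort(table(pattern), decreasing = TRUE)
```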

Looking at missing values

# visualizations
plot(oz_aggr)
## Error in plot(oz_aggr): error in evaluating the argument 'x' in selecting a method for function 'plot': Error: object 'oz_aggr' not found

Looking at missing values

# visualizations
matrixplot(ozone, sortby = 2)
## Error in eval(expr, envir, enclos): could not find function "matrixplot"

Looking at missing values

# visualizations
marginplot(ozone[,c('t9', 'maxo3')])
## Error in eval(expr, envir, enclos): could not find function "marginplot"

More info...
- Is there a pattern of missing values?
- Is there a mechanism leading to missing values?
  - purely random?
  - probability model for missing values?
- There are more sophisticated options:
  - "Missing Data: Our View of the State of the Art" (Schafer & Graham, 2002)
  - Bayesian imputation
  - Multiple imputation
  - etc.