Machine Learning in the Wild: Dealing with Messy Data
Rajmonda S. Caceres, SDS 293, Smith College, October 30, 2017


Analytical Chain: From Data to Actions
Data Collection → Data Cleaning/Preparation → Analysis → Visual Representation → Decision Making

What is this lecture about?
- What are some data quality issues in real applications?
- Why should we care about data quality?
- How can we leverage statistical techniques to improve the quality of real data?

Data Quality Issues in Real Applications
- Missing data
- Outliers
- Noisy data
- Data mismatch
- Curse of dimensionality

Data quality greatly affects the analysis and the insights we draw from data: it introduces bias and causes information loss.

Effects of Missing Data on Regression
Which regression model should we pick? (Figure panels: sufficient data points to perform regression; insufficient data points to perform regression. Figures are from Chris Bishop's book Pattern Recognition and Machine Learning.)

Effects of Outliers & Missing Data on PCA
(Figure panels: Case 1, Case 2.) Poor quality data greatly affects dimensionality reduction methods.

Missing Data
(Figure: a data matrix with samples as rows and features X1, X2, X3, X4, Y as columns. Samples 1-3 show feature-level missingness: some attribute values of an observed subpopulation are missing. Samples 4-5 show sample-level missingness: entire rows from an unobserved subpopulation are missing.)

How much missing data is too much? Rough guideline: if less than 5-10% of the data is missing, it is likely inconsequential; if more, you need to impute.
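
As a quick check against this guideline, the fraction of missing values per column can be computed directly. A minimal sketch in pandas; the DataFrame contents here are hypothetical toy data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with feature-level missingness.
df = pd.DataFrame({
    "X1": [1.0, 2.0, np.nan, 4.0],
    "X2": [0.5, np.nan, np.nan, 1.5],
    "Y":  [10.0, 12.0, 11.0, 13.0],
})

# Fraction of missing values per column.
missing_frac = df.isna().mean()
print(missing_frac)

# Columns above the rough 5-10% guideline are candidates for imputation.
print(missing_frac[missing_frac > 0.05])
```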

Mechanisms of Missing Data (Rubin, 1976)
- Missing Completely At Random (MCAR): missingness depends on neither observed nor unobserved data: P(missing | complete data) = P(missing).
- Missing At Random (MAR): missingness depends only on observed data: P(missing | complete data) = P(missing | observed data).
- Missing Not At Random (MNAR): P(missing | complete data) ≠ P(missing | observed data). Harder to address.
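
A minimal simulation sketch of what these three mechanisms look like, assuming a two-variable dataset where x is always observed and y is subject to missingness; the probabilities and thresholds are illustrative, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x = rng.normal(size=n)                    # always observed
y = x + rng.normal(scale=0.5, size=n)     # subject to missingness

# MCAR: missingness is independent of both x and y.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends only on the observed x.
mar_mask = rng.random(n) < np.where(x > 0, 0.4, 0.05)

# MNAR: missingness depends on the unobserved value y itself.
mnar_mask = rng.random(n) < np.where(y > 0, 0.4, 0.05)

# Under MAR and MNAR the observed mean of y is biased downward.
for name, mask in [("MCAR", mcar_mask), ("MAR", mar_mask), ("MNAR", mnar_mask)]:
    print(name, "mean of observed y:", round(y[~mask].mean(), 3))
```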

Mean Imputation
1. Assume the distribution of the missing data is the same as that of the observed data.
2. Replace missing values with the mean, median, or another point estimate.
Advantages: works well if MCAR; convenient and easy to implement.
Disadvantages: introduces bias by smoothing out the variance; changes the magnitude of correlations between variables.
(Figure: imputed data points. Taken from: https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/)
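
A minimal sketch of mean imputation using scikit-learn's SimpleImputer; the data values are illustrative. The last line shows the variance-smoothing disadvantage directly:

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 6.0]])

# Replace each missing entry with its column mean
# (strategy="median" is the robust alternative).
imputer = SimpleImputer(strategy="mean")
X_imputed = imputer.fit_transform(X)
print(X_imputed)

# Imputed values sit exactly at the mean, so the variance shrinks.
print(np.nanvar(X, axis=0), X_imputed.var(axis=0))
```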

Regression Imputation
Better imputation: leverage attribute relationships (monotone missing patterns). Replace missing values with predicted scores from the regression equation ŷ_i = â·x_i + b̂. Stochastic regression imputation adds an error term to each prediction.
Advantage: uses information from the observed data.
Disadvantage: overestimates model fit and underestimates the variance.
(Figure: imputed data points. Taken from: https://www.iriseekhout.com/missing-data/missing-data-methods/imputation-methods/)
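
A minimal sketch of stochastic regression imputation for one attribute, assuming a single fully observed predictor x; the data and missingness rate are synthetic and illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=200)
y[rng.random(200) < 0.3] = np.nan          # feature-level missingness in y

obs = ~np.isnan(y)
# Fit y ≈ â·x + b̂ on the complete cases (polyfit returns slope, intercept).
a_hat, b_hat = np.polyfit(x[obs], y[obs], deg=1)
resid_sd = np.std(y[obs] - (a_hat * x[obs] + b_hat))

y_imp = y.copy()
# Deterministic version would use the predicted score alone;
# the stochastic variant adds noise drawn from the residual distribution.
y_imp[~obs] = a_hat * x[~obs] + b_hat + rng.normal(scale=resid_sd, size=(~obs).sum())
```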

k-nearest Neighbor (knn) Imputation Leverage similarities among different samples in the data Advantages: Doesn t require a model to predict the missing values Simple to implement Can capture the variation in data due to its locality Disadvantages: Sensitive to how we define what similar means? knn Imputation, k=3

Multiple Imputation
(Diagram: incomplete data → m imputed data sets (Set 1, …, Set m) → m analysis results (R_1, …, R_m) → pooled results.)
Treat each missing value as a random variable and impute it m times. Averaging over the m analyses minimizes the bias introduced by data imputation.
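
A minimal sketch of the impute-analyze-pool loop using scikit-learn's (experimental) IterativeImputer with posterior sampling so each completed data set differs; the data, m = 5, and the column-mean "analysis" are illustrative stand-ins:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.2] = np.nan

# Impute m times with different random draws, analyze each
# completed data set, then pool the resulting estimates.
m = 5
estimates = []
for i in range(m):
    imp = IterativeImputer(sample_posterior=True, random_state=i)
    X_i = imp.fit_transform(X)
    estimates.append(X_i.mean(axis=0))   # stand-in for "analysis result R_i"

pooled = np.mean(estimates, axis=0)
print(pooled)
```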

Outliers
Outlier: "an observation which deviates so much from the other observations as to arouse suspicions that it was generated by a different mechanism", Hawkins (1980).
- Outliers vs. noise
- Outlier detection vs. novelty detection

Outlier Detection Techniques
By the assumptions made about normal data and outliers:
- Statistical
- Proximity-based
- Clustering-based methods
By access to labeled outlier examples:
- Supervised
- Semi-supervised
- Unsupervised

Types of Outliers
- Universal vs. contextual: (Figure: O1 and O2 are local outliers relative to cluster C1; O3 is a global outlier.)
- Singular vs. collective

Statistical Methods
Assume "normal" observations follow some statistical model. Example: assume the normal points come from a Gaussian distribution, and learn the parameters of the model from the data. Data points not following the model are considered outliers.
- Example 1: throw away points that fall in the tails.
- Example 2: throw away low-probability points.
Disadvantage: the method depends on the model assumption.
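
A minimal sketch of the Gaussian tail rule (Example 1); the data, the two planted outliers, and the |z| > 3 threshold are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 500), [8.0, -7.5]])  # two planted outliers

# Learn the Gaussian parameters from the data.
mu, sigma = x.mean(), x.std()

# Flag points in the tails of the fitted distribution (|z| > 3).
z = (x - mu) / sigma
print(x[np.abs(z) > 3])
```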

Classification-Based Methods
If labels for outlier data points exist, we can treat the problem as a classification problem: train a classifier to separate the normal and outlier classes. The data are usually heavily biased toward the normal class (an unbalanced classification problem), the classifier cannot detect unseen outliers, and it is often unrealistic to assume we have labels for outlier points.
One-class classification: learn the boundary of the normal class; points outside the boundary are considered outliers. This can detect new outliers.
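
A minimal sketch of one-class classification using scikit-learn's OneClassSVM; the training set (normal class only), the test points, and nu=0.05 are illustrative choices:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(4)
X_train = rng.normal(size=(200, 2))              # assumed "normal" class only
X_test = np.array([[0.1, -0.2], [6.0, 6.0]])     # one typical point, one far away

# Learn a boundary around the normal class; points outside are outliers.
clf = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale")
clf.fit(X_train)
print(clf.predict(X_test))   # +1 = inside the boundary, -1 = outlier
```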

Proximity-Based Methods
If the nearby points are far away, consider the data point an outlier. No assumptions about labels or a model of the normal data; no free lunch though: we rely on the robustness of the proximity measure.
- Distance-based methods: an observation is an outlier if its neighborhood does not contain enough other observations (see the sketch below).
- Density-based methods: an observation is an outlier if its density is much lower than that of its neighbors.
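
A minimal distance-based sketch: score each point by the distance to its k-th nearest neighbor, so points with sparse neighborhoods get large scores. The data, k = 5, and the planted outlier are illustrative:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(size=(100, 2)), [[5.0, 5.0]]])  # one planted outlier

k = 5
# +1 neighbor because each point is returned as its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
dist, _ = nn.kneighbors(X)
scores = dist[:, -1]                 # distance to the k-th true neighbor
print(np.argsort(scores)[-3:])       # indices of the 3 most outlying points
```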

Density-Based Outlier Detection: Local Outlier Factor (LOF) Algorithm (Breunig et al., 2000)
- For each point i, compute its k nearest neighbors N(i).
- Compute the point density f(i) = k / Σ_{j ∈ N(i)} d(i, j).
- Compute the local outlier score LOF(i) = (1/k) Σ_{j ∈ N(i)} f(j) / f(i).
As the name suggests, LOF is robust in detecting local outliers.
(Figure taken from: https://commons.wikimedia.org/wiki/File:LOF-idea.svg)
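
In practice LOF is available off the shelf; a minimal sketch using scikit-learn's LocalOutlierFactor, with a synthetic dense cluster, a sparse cluster, and one planted local outlier for illustration:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(6)
dense = rng.normal(0, 0.3, size=(100, 2))        # dense cluster
sparse = rng.normal(4, 2.0, size=(50, 2))        # sparse cluster
X = np.vstack([dense, sparse, [[0.0, 1.5]]])     # local outlier near the dense cluster

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)                      # -1 = outlier, +1 = inlier
print(np.where(labels == -1)[0])
print(lof.negative_outlier_factor_[-1])          # score of the planted point
```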

Main Takeaways
- Explore and understand the quality of your data as much as possible.
- Identify appropriate techniques that can mitigate some of the data quality issues.
- Some of the same techniques you are learning in this course can be leveraged to improve the quality of the training data.
- Know when you need more data.
- Understand the biases introduced by data imputation techniques.
- Approach data science as an iterative process: all the components are connected.
- Your analysis is only as good as your data.

References
Rubin, D. B. (1976). Inference and missing data. Biometrika 63(3): 581-592.
Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: J. Wiley & Sons.
Little, R. J. and Rubin, D. B. (2002). Statistical Analysis with Missing Data. Hoboken, NJ: John Wiley & Sons.
Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. (2000). LOF: Identifying Density-Based Local Outliers. In Proceedings of the ACM SIGMOD International Conference on Management of Data.