ActiveClean: Interactive Data Cleaning For Statistical Modeling Safkat Islam Carolyn Zhang CS 590

Outline
- Biggest Takeaways, Strengths, and Weaknesses
- Background
- System Architecture
- Updating the Model
- Detecting Dirty Data
- Selecting Which Records to Clean
- Experimental Setup and Results
- Conclusion
- Demo

Biggest Takeaways ActiveClean is a framework for the iterative cleaning of a dataset for statistical modeling while maintaining convergence for convex loss problems. The authors formulate the iterative cleaning problem as a convex optimization problem; as a result, they can use stochastic gradient descent in ActiveClean. Prioritizing the cleaning of data points that have a higher probability of affecting the results of the model can increase the predictive accuracy of the model while cleaning the same amount of data.

Strengths
- Experimental setup: real applications, compared against four other methods
- Batch cleaning increases efficiency (in a similar manner to SampleClean)
- The Detector works with other commonly used classes of rules: integrity constraints, CFDs, and matching dependencies
- Adaptive error detection

Weaknesses
- Initializing the model on the dirty dataset: the data could first be cleaned of common inconsistencies
- Approach to sampling: more sensitive to outliers in the dataset due to the approach to SGD
- The modification of the stochastic gradient descent algorithm increases the time required for the gradient calculation with each iteration
- Nonlinear gradients: features are not isolated; a Taylor series approximation is used instead

Background
Iterative data cleaning is:
- Cleaning subsets of data
- Evaluating preliminary results
- Cleaning more data as necessary
Assumptions:
- Addresses a class of analytics problems that can be expressed as the minimization of convex loss functions
- Data cleaning operations are applied record by record (alternative: find-and-replace, where operations are applied to the full dirty dataset as a set of records)
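
For concreteness, that class of problems can be written as minimizing a sum of convex per-record losses over the fully cleaned relation; in standard notation (ours, not taken verbatim from the slides):

    \theta^{*} = \arg\min_{\theta} \sum_{i=1}^{N} \phi(x_i, y_i, \theta)

where \phi is a convex loss (for example, the logistic or hinge loss), (x_i, y_i) are the feature/label pairs of the cleaned records, and \theta is the model being trained.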

System Architecture
- Initialization - train the model on the dirty data
- Sampler - selects a batch of records from the dirty data in a randomized way, but can assign non-uniform probabilities
- Cleaner - user-specified cleaning function
- Updater - updates the model based on the cleaned data using an SGD step
- Detector (optional) - identifies records that are more likely to be dirty
- Estimator (optional) - uses previously cleaned data to estimate the effect of cleaning a record on the model
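
To make the division of labor concrete, the sketch below wires these components together for a single iteration. It is a minimal Python illustration with hypothetical interfaces (likely_dirty, sampling_probabilities, sample, step, and the variable names are ours, not the authors' API):

    # Illustrative sketch of one ActiveClean iteration (hypothetical interfaces).
    def activeclean_iteration(theta, R_dirty, R_clean, sampler, cleaner, updater,
                              detector=None, estimator=None):
        # Optionally restrict attention to records the detector flags as likely dirty.
        candidates = detector.likely_dirty(R_dirty) if detector else R_dirty
        # The estimator can suggest non-uniform sampling probabilities.
        probs = estimator.sampling_probabilities(candidates, theta) if estimator else None
        batch, batch_probs = sampler.sample(candidates, probs)
        # The user-specified cleaning function is applied record by record.
        cleaned_batch = [cleaner(record) for record in batch]
        # SGD-style update mixing the cleaned batch with the already-clean data.
        theta = updater.step(theta, cleaned_batch, batch_probs, R_clean, R_dirty)
        # Bookkeeping: cleaned records move from the dirty set to the clean set.
        R_dirty = [r for r in R_dirty if r not in batch]
        R_clean = R_clean + cleaned_batch
        return theta, R_dirty, R_clean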

System Architecture (Block Diagram)

Model Update Algorithm

Updating the Model
- A modification of the stochastic gradient descent algorithm
- Stochastic gradient descent: iteratively minimizes the objective function
- Convex loss function: measures the accuracy (predictive error) of the statistical model
- Geometric interpretation
- ActiveClean: estimates the gradient by averaging the gradient of a (just-cleaned) sample of the dirty data with the gradient of the already-cleaned data
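
A minimal NumPy sketch of that update step, assuming the two partial gradients have already been computed; the weighting mirrors the averaging described above, and the variable names are ours:

    import numpy as np

    def activeclean_update(theta, grad_clean, grad_sample, n_clean, n_dirty, step_size):
        # grad_clean: average gradient over the already-cleaned records.
        # grad_sample: gradient estimated from the freshly cleaned sample of dirty records.
        # The two parts are weighted by the fraction of the dataset each represents.
        n_total = n_clean + n_dirty
        g = (n_clean / n_total) * grad_clean + (n_dirty / n_total) * grad_sample
        return theta - step_size * g

As cleaning progresses, n_clean grows and n_dirty shrinks, so the combined estimate leans increasingly on the gradient of the cleaned data.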

Detecting Dirty Data
The Detector returns:
- Whether a record is dirty
- And, if it is dirty, which attributes have errors
Requirement: there is an algorithm to enumerate the set of records that violate at least one rule
- Let the set of clean data be the union of the previously cleaned data and the set of records that satisfy all of the rules
- The set of dirty data is the set of records that violate at least one rule
- Apply the model updating algorithm
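
As a toy illustration of the rule-based split (the rules, attributes, and records below are invented for the example):

    # Toy rule-based detector: a record is dirty if it violates at least one rule.
    rules = [
        lambda r: r["age"] >= 0,                  # domain constraint (invented)
        lambda r: r["country"] in {"US", "UK"},   # value constraint (invented)
    ]

    def violated_rules(record):
        # Indices of violated rules; an empty list means the record satisfies all rules.
        return [i for i, rule in enumerate(rules) if not rule(record)]

    records = [{"age": 34, "country": "US"}, {"age": -1, "country": "Mars"}]
    R_dirty = [r for r in records if violated_rules(r)]
    R_clean = [r for r in records if not violated_rules(r)]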

Adaptive Detection
- Dirty data is categorized into u corruption classes using a multi-class classifier
- These classes may not align with the features, but every record is classified with at most one corruption category
- The repair step cleans the data and reports which of the u classes is associated with each record
Algorithm for updating the workflow (R_clean is the previously cleaned data and U_clean is a set of labels for each record indicating its error class and whether it is dirty):
1. Train a classifier to predict the label (Train)
2. Apply the classifier to the dirty data (Predict)
3. For all records that are predicted to be clean, remove them from R_dirty and append them to R_clean
4. Apply the model updating algorithm
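
A hedged sketch of that Train/Predict loop, using scikit-learn's LogisticRegression purely as a stand-in multi-class classifier (the paper does not prescribe a specific classifier, and the variable names are ours):

    from sklearn.linear_model import LogisticRegression

    def adaptive_detect(X_clean, U_clean, X_dirty):
        # X_clean: feature vectors of previously cleaned records.
        # U_clean: one label per cleaned record: 'clean' or one of the u corruption classes.
        # X_dirty: feature vectors of the remaining dirty records.
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_clean, U_clean)          # Train
        predicted = clf.predict(X_dirty)   # Predict
        # Records predicted 'clean' can be moved from R_dirty to R_clean;
        # the rest stay in R_dirty, grouped by predicted corruption class.
        return predicted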

Selecting Which Records to Clean
- Optimal sampling problem: minimize the variance of the sampled gradient with respect to the gradient of the dirty subset of the dataset
- Approach: use the Detector to estimate the cleaned values
- Estimator: estimates the clean gradient using the dirty gradient and the results from the Detector
- Linear approximation of the gradient: uses the average change of each feature value
- Limitation: couples the impact of the features and the model results
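
One way to read the optimal-sampling idea in code: sample dirty records with probability proportional to the norm of their estimated post-cleaning gradient, so the records expected to move the model most are cleaned first. A simplified NumPy sketch (the Estimator's linearization would supply est_grads; those details are omitted, and the names are ours):

    import numpy as np

    def sampling_probabilities(est_grads, eps=1e-8):
        # est_grads: (n_dirty, d) array, the Estimator's guess of each record's
        # gradient after cleaning; pick records in proportion to the gradient norm.
        norms = np.linalg.norm(est_grads, axis=1) + eps   # avoid zero probabilities
        return norms / norms.sum()

    def draw_batch(est_grads, batch_size, seed=0):
        rng = np.random.default_rng(seed)
        p = sampling_probabilities(est_grads)
        idx = rng.choice(len(p), size=batch_size, replace=False, p=p)
        return idx, p[idx]   # indices of records to clean, plus their sampling probabilities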

Experimental Setup
Comparisons to Naive-Mix, Naive-Sampling, Active Learning, and Oracle
Metrics:
- Model error - the distance between the trained model and the true model that would result if all data were cleaned
- Test error - the prediction accuracy of the model on a held-out set of clean data
Comparison axes:
- The sampling procedure used to pick the next set of records to clean
- The model update procedure used to incorporate the cleaned sample
Scenarios:
- Movie prediction from IMDB and Yahoo databases
- ProPublica's Dollars for Docs (DfD) database for medical donations
- A simulated ML pipeline (AMPLab's KeystoneML and Google's TensorFlow) to classify handwritten digits
- Simulated scenarios for predicting the income bracket of an adult
- A simulated scenario for predicting the onset of a seizure

Experimental Setup (Continued)
Simulated scenarios:
- Goal - understand which types of data corruption are amenable to data cleaning versus robust statistical techniques
- Four schemes: full data cleaning, a baseline of no cleaning, discarding dirty data, and robust logistic regression
- 5% of the training examples in each dataset were corrupted, both randomly and systematically
Results:
- The robust method works well on random corruption but falters on systematic corruption
- Discarding dirty data also works well for random corruption

Experimental Results
Movies - ActiveClean has superior convergence:
- The update algorithm correctly incorporates the raw data with the cleaned samples, leading to lower sensitivity to sampling error
- It selects records that are more likely to be dirty and that will most improve the model
DfD - ActiveClean can effectively identify and exploit the bias of systematic errors to quickly converge to the true clean model:
- Naive-Mix (the most commonly used approach) was almost completely ineffective
ML Pipeline - ActiveClean makes more progress towards the clean model with a smaller number of examples cleaned

Experimental Results (IMDB)

Experimental Results (DfD)

Experimental Results (ML Pipeline)

Experimental Results (Continued)
Simulated scenarios:
- ActiveClean is consistently better than Active Learning because it is a composable framework that supports different instantiations of the detection and prioritization modules while still guaranteeing convergence
- Naive-Mix is an unreliable methodology that lacks convergence guarantees and has lower efficiency than the other methodologies because it considers the entire dirty dataset
- Naive-Sampling does not use the dirty data, and its error is governed by the sample size; it outperforms ActiveClean when corruptions are severe or when initialization with the dirty model is inaccurate
- As corruption becomes more random, the classifier becomes increasingly erroneous; the break-even point is at about 50% corruption, beyond which the imperfect classifier misclassifies some dirty data points as clean

Experimental Results (Simulated Scenarios)

Conclusion
- Given: a dirty dataset, a convex model, and a general data cleaning procedure
- Problem addressed: cleaning data before user-specified modeling while preserving convergence
- ActiveClean improves the efficiency of cleaning in comparison to other state-of-the-art methods: it cleans fewer data records to achieve high predictive accuracy
- ActiveClean is especially useful for the cleaning of systematic corruption
Extensions:
- Broaden the scope of cleaning: look at errors in the dirty records that, if fixed, cause errors in the cleaned dataset
- Relax the requirement for convex models

Demo: http://automation.berkeley.edu/activecleandemo

Thanks! Any questions?