Towards Reliable Interactive Data Cleaning: A User Survey and Recommendations

Transcription:

Towards Reliable Interactive Data Cleaning: A User Survey and Recommendations. "Coax treasure out of messy, unstructured data." Sanjay Krishnan, Daniel Haas, Eugene Wu, Michael Franklin. HILDA 2016.

204 papers on data cleaning since 2012 in VLDB, ICDE, and SIGMOD (papers mentioning data cleaning in the title or abstract; possibly dirty data).

The tutorial you missed: How can statistical techniques improve the efficiency or reliability of data cleaning? (Data Cleaning with Statistics) How can we improve the reliability of statistical analytics with data cleaning? (Data Cleaning for Statistics)

Data cleaning with statistical techniques: ERACER 2010, Guided Data Repair 2011, Corleone 2014, Wisteria 2015, DeepDive 2014, Katara 2014, Trifacta 2015, Data Tamer 2013. Data cleaning for statistical analysis: Sensor Net/Stream+ 2000s, Scorpion 2013, SampleClean+ 2014, Unknown Unknowns 2016.

Prior work: 35 people from 25 organizations [Sean Kandel et al., VAST, 2012].

Research questions: In practice, how is data cleaned before analysis? What are the limitations of existing processes? How can database researchers contribute?

Survey questions included:

How do you determine whether the data is sufficiently clean to trust the analysis?
I am comfortable explaining when to use regularization for Support Vector Machines.
I am comfortable writing a program that reads a large textual log file on HDFS and computes the number of errors from each IP using Apache Spark or Hadoop. (A sketch of this task follows the list.)
Describe your company/organization and your role there.
Which of the following tools/interfaces/languages do data scientists at your organization prefer for manipulating data, including extraction, schema transformations, and outlier removal, to make analysis easier or more reliable? Please rank.
Which of the following most closely describes your job?
Describe your organization's data analysis pipeline, including data acquisition, pre-processing, and applications that use the processed data. If possible, list it as a sequence of steps that each describe the intended goal and the tools used.
Are any of these steps, or the downstream applications that use the data, affected by dirty data (i.e., inconsistent, missing, or duplicated records)? If so, please describe how you identify dirty records, repair dirty records, and maintain the processing pipeline.
How do you validate the correctness of the processing pipeline?
Does your organization employ teams of people or crowdsourcing for any of the steps described above?
Has the scale of your dataset ever made it challenging to clean?
I analyze my organization's customer data for modeling, forecasting, prediction, or recommendation.
Describe your data analysis, ideally as a sequence of steps that each describe the intended goal and the tools you use to achieve the goal. Include descriptions of where the data comes from (including the number and variety of sources), properties of the data (e.g., the format, amount), each preprocessing step, and the final result.
Is your analysis affected by dirty data (i.e., inconsistent, missing, or duplicated records)? If so, please provide examples of how the data is dirty, what the cleaned versions look like, and how it affects the final result.
Describe your data cleaning process, including how you identify errors, steps to mitigate errors in the future, and how you validate data cleaning with your analysis.
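The HDFS question above names a concrete task. A minimal PySpark sketch of what it asks for might look like the following; the log path and the line layout (IP as the first token, log level as the third) are illustrative assumptions, not from the survey:

# Minimal sketch (illustrative assumptions): count error lines per IP
# in a large log file on HDFS using PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ErrorsPerIP").getOrCreate()
lines = spark.sparkContext.textFile("hdfs:///logs/app.log")  # hypothetical path

errors_per_ip = (
    lines.map(lambda line: line.split())                     # tokenize each line
         .filter(lambda t: len(t) >= 3 and t[2] == "ERROR")  # keep error records
         .map(lambda t: (t[0], 1))                           # key by IP address
         .reduceByKey(lambda a, b: a + b)                     # count errors per IP
)

for ip, n in errors_per_ip.collect():
    print(ip, n)

spark.stop()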

Our Survey: Participants. Initial results from N = 29, largely from the technology sector. "We're not in Kansas any more!" Organization size: Small 7, Large 17, N/A 5. Job description: Infrastructure 10, Analysis 12, Both 7.

Our Survey: Tools. [Bar chart of preferred tools for manipulating data; categories: MapReduce, Python/Perl, Shell, Spreadsheet/GUI, SQL/Declarative.]

Assumption: The end-goal of data cleaning is clean data. Survey response: "We typically clean our data until the desired analytics works without error."

Assumption: Data cleaning is a sequential operation. Survey response: "[It's an] iterative process, where I assess the biggest problem, devise a fix, re-evaluate. It is dirty work."

Assumption: Data cleaning is performed by one person. Survey response: "There are often long back-and-forths with senior data scientists, devs, and the business units that provided the data on data quality."

Assumption: Data quality is easy to evaluate. Survey responses: "I wish there were a more rigorous way to do this, but we look at the models and guess if the data are correct." "Other than common sense we do not have a procedure to do this." "Usually [a data error] is only caught weeks later after someone notices."

How can database researchers contribute?

1. A New Architecture. [Diagram contrasting the current pipeline with future directions.] Current: data flows through I.T. infrastructure to the analyst/data scientist in a slow feedback loop (Extract, Dedup, Remove Outliers, Analysis, Inspect); current systems focus on these pipeline stages. Future directions: an interactive inspect, change, and auto-tune loop shared by I.T., analysts, and domain experts.

2. New Challenges: data cleaning evaluation, debugging workflows, usable collaborative interactions.

Evaluation: Metrics. Goal: a quantitative understanding of how well cleaning has worked. Techniques: gold-standard data, benchmark datasets, your idea here? Feedback: design systems that use data quality evaluations to optimize the pipeline [Patricia Arocena et al., VLDB 2016].
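One way to make the gold-standard technique concrete: treat every cell the cleaner changed as a proposed repair and score it against known-correct values. The sketch below is our illustration, not from the talk; repair_quality and the toy city-name data are hypothetical:

# Illustrative sketch: precision/recall of a cleaning step against
# gold-standard data. Each changed cell is a proposed repair.
def repair_quality(dirty, cleaned, gold):
    """dirty, cleaned, gold: dicts mapping cell id -> value."""
    proposed = {k for k in dirty if cleaned[k] != dirty[k]}   # cells we changed
    needed = {k for k in dirty if gold[k] != dirty[k]}        # cells actually wrong
    correct = {k for k in proposed if cleaned[k] == gold[k]}  # changes matching gold
    precision = len(correct) / len(proposed) if proposed else 1.0
    recall = len(correct) / len(needed) if needed else 1.0
    return precision, recall

# Toy example: one good repair (1), one missed error (2), one bad repair (3).
dirty = {1: "NYC", 2: "Bkln", 3: "Queens"}
cleaned = {1: "New York", 2: "Bkln", 3: "Qns"}
gold = {1: "New York", 2: "Brooklyn", 3: "Queens"}
print(repair_quality(dirty, cleaned, gold))  # (0.5, 0.5)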

Evaluation: Overfitting and Confirmation Bias.

Final thoughts: People are Everywhere. [UC Berkeley diagram: data generators, data processors, data scientists, and data consumers, connected by flows of data and predictions; E-mission and KeystoneML shown as examples. Icons created by Clara Joy from the Noun Project.]

Thank you! Now: questions at the panel. Later: check out our poster. Whenever: {sanjay, dhaas, franklin}@cs.berkeley.edu, ewu@cs.columbia.edu