Top 20 Data Quality Solutions for Data Science
Data Science & Business Analytics Meetup, Boulder, CO
2014-12-03
Ken Farmer

DQ Problems for Data Science Loom Large & Frequently
[Chart: widget counts over 24 months, annotated with the impacts of data quality defects]
- Strikingly visible defects
- Bad decisions
- Loss of credibility
- Loss of revenue

Let's Talk About Solutions
For example... proportionality is important

Solution #1: Quality Assurance (QA)
"If it's not tested, it's broken" - Bruce Eckel
Tests of application and system behavior and of input assumptions, performed after application changes.
Programmer testing:
- Unit testing
- Functional testing
QA team testing:
- Black box testing
- Regression testing
- System and integration testing
Data scientist testing:
- All of the above, especially programmer testing
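A minimal sketch of programmer-level testing, assuming a hypothetical clean_state() transform that normalizes US state codes; the function, test names, and data are illustrative only, not part of the original deck:

    import unittest

    def clean_state(value):
        """Normalize a US state value to a two-letter upper-case code."""
        if value is None or not value.strip():
            return 'UNK'
        return value.strip().upper()[:2]

    class TestCleanState(unittest.TestCase):
        def test_lowercase_input(self):
            self.assertEqual(clean_state('co'), 'CO')

        def test_whitespace_and_empty_input(self):
            self.assertEqual(clean_state('  co '), 'CO')
            self.assertEqual(clean_state(''), 'UNK')

    if __name__ == '__main__':
        unittest.main()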

Solution #2: Quality Control (QC)
Because the environment & inputs are ever-changing
[Chart: record counts tracked over 20 loads]
Track & analyze record counts:
- Identify partial data sets
- Identify upstream changes
Track & analyze rule counts:
- Identify upstream changes
Track & analyze replication & aggregation consistency:
- Identify process failures
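One way to implement record-count QC is sketched below, assuming each load's count is compared against a history of prior counts; the 50% tolerance and the sample numbers are illustrative assumptions:

    def check_record_count(new_count, history, tolerance=0.5):
        """Flag a load whose record count deviates sharply from the recent average."""
        if not history:
            return True  # nothing to compare against yet
        avg = sum(history) / len(history)
        deviation = abs(new_count - avg) / avg
        if deviation > tolerance:
            print(f"QC ALERT: count {new_count} deviates {deviation:.0%} from average {avg:.0f}")
            return False
        return True

    # A partial data set typically shows up as an abnormally low count
    history = [2_400_000, 2_500_000, 2_450_000]
    check_record_count(1_100_000, history)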

Solution #3: Data Profiling
Find out what the data looks like BEFORE you model it
Save an enormous amount of time:
- Quickly get the type, size, cardinality, and unknown values
- Share profiling results with users
Example deliverables:
- What is the distribution of data for every field?
- How do partitions affect data distributions?
- What correlations exist between fields?
Example profile for field f250_499 (field number 59):
- Wrong field count: 0; type: string; min: C; max: M
- Unique values: 11; known values: 10; case: upper
- Min/max/mean length: 1 / 1 / 1.0
- Top values: E, F, G, C, H, I, J, M, K (from 10056 occurrences down to 11)
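A minimal profiling sketch that computes a few of the statistics shown above for a single string field; treating 'unk' as the unknown value is an assumption:

    from collections import Counter

    def profile_field(values, top_n=5):
        """Profile one field: cardinality, min/max, lengths, top values."""
        known = [v for v in values if v not in (None, '', 'unk')]
        lengths = [len(v) for v in known]
        counts = Counter(known)
        return {
            'count': len(values),
            'unique_values': len(counts),
            'min': min(known) if known else None,
            'max': max(known) if known else None,
            'min_length': min(lengths) if lengths else 0,
            'max_length': max(lengths) if lengths else 0,
            'mean_length': sum(lengths) / len(lengths) if lengths else 0.0,
            'top_values': counts.most_common(top_n),
        }

    print(profile_field(['E', 'E', 'F', 'G', 'C', 'E', '']))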

Solution #4: Process Auditing
Because with a lot of moving parts comes a lot of failures
Addresses:
- Slow processes
- Broken processes
Helps identify:
- Duplicate loads
- Missing files
- Pipeline status
Features:
- Tracks all processes: start, stop, times, return codes, batch_ids
- Alerting
Example products:
- Graphite - focus: resources
- Nagios - focus: process failure
(Auditing in the inspecting sense, not the searching sense.)
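A sketch of the kind of audit record described above (start, stop, return code, batch_id); the in-memory list stands in for an audit table, and all field names are assumptions:

    import time
    import uuid

    audit_log = []  # stand-in for an audit table

    def run_audited(step_name, batch_id, func):
        """Run one pipeline step and record start, stop, and return code."""
        record = {'step': step_name, 'batch_id': batch_id,
                  'run_id': str(uuid.uuid4()), 'start': time.time()}
        try:
            func()
            record['return_code'] = 0
        except Exception as exc:
            record['return_code'] = 1
            record['error'] = str(exc)
        record['stop'] = time.time()
        audit_log.append(record)
        if record['return_code'] != 0:
            print(f"ALERT: step {step_name} failed for batch {batch_id}")
        return record

    run_audited('load_customers', batch_id=20141203, func=lambda: None)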

Solution #5: Full Automation
Because manual processes account for a majority of problems
Addresses:
- Duplicate data
- Missing data
Top causes:
- Manual process restarts
- Manual data recoveries
- Manual process overrides
Solution characteristics:
- Test catastrophic failures
- Automate failure recoveries
- Consider Netflix's Chaos Monkey

Solution #6: Changed Data Capture (CDC)
Because identifying changes is harder than most people realize
Typical alternatives:
Application timestamp:
- pro: may already be there
- con: reliability challenges
Triggers:
- pro: simple for downstream
- con: reliability challenges
- con: requires changes to the source
File-image delta:
- pro: very accurate
- con: implementation effort
[Diagram: file-image delta - sort and dedup both the source and target data, compute the file delta & transform, and classify each row as same, insert, delete, change-new, or change-old]
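The file-image delta itself reduces to a keyed comparison of yesterday's and today's full images. A minimal sketch, assuming both images have already been sorted and deduped into dicts keyed by primary key:

    def file_image_delta(old_image, new_image):
        """Classify each key as insert, delete, change, or same."""
        inserts = [k for k in new_image if k not in old_image]
        deletes = [k for k in old_image if k not in new_image]
        changes = [k for k in new_image
                   if k in old_image and new_image[k] != old_image[k]]
        same = [k for k in new_image
                if k in old_image and new_image[k] == old_image[k]]
        return {'insert': inserts, 'delete': deletes, 'change': changes, 'same': same}

    old = {1: ('alice', 'CO'), 2: ('bob', 'NY')}
    new = {1: ('alice', 'CA'), 3: ('carol', 'TX')}
    print(file_image_delta(old, new))  # 1 changed, 2 deleted, 3 inserted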

Solution #7: Master Data Management (MDM)
Because sharing reference data eliminates many issues
Addresses:
- Data consistency between multiple systems
Features:
- Centralized storage of reference data
- Versioning of data
- Access via multiple protocols
Example reference data: industry, states, lat/long geocoding, GeoIP, products, customers, sales regions
[Diagram: MDM as the hub feeding Finance, Billing, CRM, SFA, and the DW]

Solution #8: Extrapolate for Missing Data
Because if done well it can simplify queries
Features:
- Requires sufficient data to identify a pattern
- Identify generated data (see Quality Indicator & Dimension)
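One simple variant of filling missing values is linear interpolation between the nearest known neighbors, with a parallel flag so generated values can feed the quality indicator of Solution #14. The sketch below is illustrative only, not the deck's method:

    def fill_missing(series):
        """Interpolate None gaps in a numeric series and flag generated values."""
        filled, generated = list(series), [False] * len(series)
        for i, value in enumerate(series):
            if value is not None:
                continue
            left = next((j for j in range(i - 1, -1, -1) if series[j] is not None), None)
            right = next((j for j in range(i + 1, len(series)) if series[j] is not None), None)
            if left is None or right is None:
                continue  # not enough data on both sides to fill the gap
            frac = (i - left) / (right - left)
            filled[i] = series[left] + frac * (series[right] - series[left])
            generated[i] = True
        return filled, generated

    print(fill_missing([100, None, None, 130, 140]))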

Solution #9: Crowd-sourced Data Cleansing
Because cleansing & enriching data can benefit from consensus
Features:
- Collect consensus answers from workers
- Workers can come from an external market
- Workers can come from your own team
Example products: CrowdFlower, Mechanical Turk
Simple data scenario:
- Correct obvious problems: spelling, grammar, missing descriptions
- Use public resources
Complex data scenario:
- Correct sophisticated problems: provide scores, provide descriptions
- Correct complex data
- Use internal & external resources
- Leverage crowdsourcing services for coordination

Solution #10: Keep Original Values
Because your transformations will fail & you will change your mind
Features:
Keep an archive copy of the source data:
- Pro: can be very highly compressed
- Pro: can be kept off-server
- Con: cannot be easily queried
Or keep original values alongside the transformed data:
- Pro: can be easily queried
- Con: may be used when it should not be
- Con: may double the volume of data to host
Example (os_orig -> os):
- Win2k -> win2000
- MS Win -> win2000
- Win 2000 -> win2000
- Windoze 2k -> win2000
- Windows 2000 SP4 -> win2000
- MS Win 2k SP2 -> win2000
- ms win 2000 server ed -> win2000
- win2ksp3 -> win2000
- Win 2k server sp9 -> win2000
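A sketch of carrying the original value alongside the standardized one, using the OS example above; the mapping table and field names are illustrative assumptions:

    OS_MAP = {
        'win2k': 'win2000',
        'ms win': 'win2000',
        'win 2000': 'win2000',
        'windows 2000 sp4': 'win2000',
    }

    def transform_os(record):
        """Return the record with both the original and the standardized OS value."""
        original = record['os']
        standardized = OS_MAP.get(original.strip().lower(), 'unk')
        return {**record, 'os_orig': original, 'os': standardized}

    print(transform_os({'host': 'web01', 'os': 'Win 2000'}))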

Solution #11: Keep Usage Logs
Because knowing who got data can help you minimize impact
Solution types:
Log application queries:
- Pro: database audit tools can do this
- Con: requires process audit logs to translate queries into data content
Store the data that was delivered:
- Pro: can precisely identify who got what, and when
- Con: requires development; only works with certain tools (e.g., a RESTful API)

Solution #12: Cleanup at Source
Because it's cheaper to clean at the source than downstream
Features:
- Clean & scrub data en route to the target database
- But also: give clean-up tasks back to the source system

Solution #13: Static vs Dynamic, Strong vs Weak Typing & Structure
Because this will be debated endlessly
Static vs dynamic schemas:
- Dynamic examples: MongoDB, JSON in Postgres, etc.
- Dynamic schemas optimize for the writer at the cost of the reader
Static vs dynamic typing:
- Static typing yields fewer defects* (* see the GitHub language study under Resources)
- But maybe not better data quality
Declarative constraints:
- Examples: primary key, foreign key, uniqueness, and check constraints
- Code: ALTER TABLE foo ADD CONSTRAINT ck1 CHECK (open_date <= close_date)

Solution #14: Data Quality Indicator & Dimension
Because it allows users to know what is damaged
Example: a single id that represents the status of multiple fields.
Bitmap example:
- quality_id is a 16-bit integer
- each bit represents a single field (col1 ... coln)
- a bit value of 0 vs 1 records that field's status (e.g., clean vs damaged)
- translate the integer to per-field status with a UDF or a lookup table
Example quality_id values: 0000000000000000, 0000000000000001, 0000000000000010, 0000000000000011, 0000000000000100, 0000000000000101, 0000000000000111
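A minimal sketch of packing and unpacking a bitmap quality_id, assuming a set bit marks a damaged field (the 0-vs-1 legend is a convention choice here); the field list is illustrative:

    FIELDS = ['col1', 'col2', 'col3', 'col4']  # one bit per field, illustrative

    def encode_quality(damaged_fields):
        """Pack per-field status into one integer: bit set = field damaged."""
        quality_id = 0
        for i, name in enumerate(FIELDS):
            if name in damaged_fields:
                quality_id |= 1 << i
        return quality_id

    def decode_quality(quality_id):
        """Translate the integer back to a per-field status map (the UDF's role)."""
        return {name: bool(quality_id & (1 << i)) for i, name in enumerate(FIELDS)}

    qid = encode_quality({'col1', 'col3'})
    print(qid, format(qid, '016b'), decode_quality(qid))  # 5 0000000000000101 ...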

Solution #15: Generate Test Data
Because manual processes account for a majority of problems
Three basic types:
Highly-deterministic:
- Contains very consistent data
- Great for benchmarking
- Cheap to build
Highly-realistic:
- Produced through a simulation
- Great for demos
- Great for sanity-checking analysis
- Hard to build
Extreme:
- Contains extreme values
- Great for bounds-checking
- Cheap to build
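A sketch of the highly-deterministic and extreme generator types (the highly-realistic type needs a domain simulation and is omitted); field names, ranges, and hostile values are illustrative assumptions:

    import random

    def deterministic_rows(n):
        """Highly-deterministic: predictable values, good for benchmarking."""
        return [{'id': i, 'state': 'CO', 'amount': 100.0} for i in range(n)]

    def extreme_rows(n, seed=0):
        """Extreme: boundary and hostile values, good for bounds-checking."""
        random.seed(seed)
        extremes = ['', None, 'X' * 255, '-1', '999999999', "O'Brien", 'NULL']
        return [{'id': i, 'state': random.choice(extremes),
                 'amount': random.choice([0.0, -1.0, 1e18, float('nan')])}
                for i in range(n)]

    print(deterministic_rows(2))
    print(extreme_rows(2))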

Solution #16: Use the Data!
Data that goes unused:
- Will decay over time
- Will lose pieces of data
- Will lose metadata
Data in a report:
- Looks great
- Obvious defects can be seen
- But it doesn't drive anything, and it doesn't get tested
Data driving a business process:
- Looks great
- Gets tested & corrected in order to run the business

Solution #17: Push Transformations Upstream
Because you don't want to become source-system implementation experts
Move transformations from data scientists to ETL developers:
- Eliminates inconsistencies between runs
- [Diagram: each scientist pulls already-transformed data from the DW / Hadoop instead of transforming the source individually]
Move transformations from ETL developers to the source system:
- Eliminates unnecessary source-system knowledge
- Decouples systems

Solution #18: Documentation (Metadata)
Field metadata:
- Name, description, type, length, unknown value, case, security
Extended metadata:
- Lineage
- Data profile
- Common values/codes: their meaning, their frequency
- Validation rules & results
- Transformation rules
Source code, e.g.:
    if gender == 'm':
        return 'male'
    else:
        return 'female'
Report & tool documentation:
- Description
- Filtering rules
- Transformation rules

Solution #19: Fix the Culture (ha)
Because you need support for priorities, resources, and time
The single most important thing to do:
- Establish a policy of transparency - this is 90% of it
Everything else follows from transparency:
- Establish a policy of automation
- Establish a policy of measuring
- Plus everything we already covered
What doesn't work?
- Asking management to mandate quality

Solution #20: Data Defect Tracking
Because you won't remember why a set of data was bad in 6 months
- Like bug tracking, but for sets of data
- Will explain anomalies later
- Can be used for data annotation
- Is simple; it just needs to be used

Resources & Thanks
- International Association for Information & Data Quality (IAIDQ)
- http://www.iqtrainwrecks.com/
- Improving Data Warehouse and Business Information Quality, Larry English
- The Data Warehouse Institute (TDWI)
- A Large Scale Study of Programming Languages and Code Quality in Github: http://macbeth.cs.ucdavis.edu/lang_study.pdf

Solution List
[Summary table: LOW/MEDIUM ratings for each solution across the Source, ETL, DEST/DW, and Consume stages]
1. QA
2. QC
3. Data Profiling
4. Process Auditing
5. Full Automation
6. Changed Data Capture
7. MDM
8. Extrapolate Missing Data
9. Crowdsource Cleansing
10. Keep original values
11. Keep usage logs
12. Cleanup at source
13. Static vs Dynamic
14. DQ Dimension
15. Generate Test Data
16. Use the Data
17. Push Transforms Upstream
18. Documentation
19. Fix the Culture
20. Data Defect Tracking