Data Cleansing. LIU Jingyuan (Vislab), WANG Yilei (Theoretical group)


What is Data Cleansing? Data cleansing (data cleaning) is the process of detecting and correcting (or removing) errors or inconsistencies from a record set, table, or database.

Before cleansing:
Name | Age | Gender | Salary
Peter | 23 | M | 16,330 HKD
Tom | 34M | | 20,000 HKD
Sue | 21 | F | 2,548 USD

After cleansing:
Name | Age | Gender | Salary
Peter | 23 | M | 16,330 HKD
Tom | 34 | M | 20,000 HKD
Sue | 21 | F | 20,000 HKD

Why we need Data Cleansing: Errors universally exist in real-world data (erroneous measurements, lazy input habits, omissions, etc.). Erroneous data leads to false conclusions and misdirected investments, for example when keeping track of employees, customers, or sales volume. Erroneous data also leads to unnecessary costs and possibly loss of reputation, for example through invalid mailing addresses or inaccurate buying habits and preferences.

Data Anomalies: We use the term anomalies for the errors to be detected or corrected. Classification of data anomalies: Syntactical anomalies describe characteristics concerning the format and values used for the representation of entities. Semantic anomalies hinder the data collection from being a comprehensive and non-redundant representation of the mini-world. Coverage anomalies decrease the amount of entities and entity properties from the mini-world that are represented in the data collection.

Syntactical Anomalies: Lexical errors denote discrepancies between the structure of the data items and the specified format, e.g. the degree #t of a tuple differs from #R, the degree of its relation schema.

Data table with lexical errors (the number of values per tuple does not match the four schema attributes):
Name | Age | Gender | Size
Peter | 23 | M | 7 1
Tom | 34 | M
Sue | 21 | 5 8

Syntactical Anomalies (continued): Domain format errors are errors where the given value of an attribute does not conform to the anticipated format, e.g. a required name format of "FirstName, LastName".

Data table with domain format errors (Ross Geller lacks the required comma):
Name | Age | Gender
Rachel, Green | 24 | F
Monica, Geller | 24 | F
Ross Geller | 26 | M

Syntactical Anomalies (continued): Irregularities are concerned with the non-uniform use of values, units, and abbreviations.

Data table with irregularities (Sue's salary is given in a different currency):
Name | Age | Gender | Salary
Peter | 23 | M | 16,330 HKD
Tom | 34 | M | 20,000 HKD
Sue | 21 | F | 2,548 USD

Semantic Anomalies: Integrity constraint violations describe tuples that do not satisfy one or more integrity constraints, which encode our understanding of the mini-world by restricting the set of valid instances (e.g. AGE >= 0). Contradictions are values, within a tuple or between tuples, that violate some kind of dependency between the values (e.g. a contradiction between AGE and DATE_OF_BIRTH).
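
A minimal sketch of how such semantic anomalies can be flagged, assuming a simple list-of-dicts table with hypothetical AGE and DATE_OF_BIRTH fields:

```python
from datetime import date

# Hypothetical employee records; field names and values are illustrative only.
records = [
    {"name": "Peter", "age": 23, "date_of_birth": date(2002, 5, 1)},
    {"name": "Tom", "age": -3, "date_of_birth": date(1991, 7, 9)},   # violates AGE >= 0
    {"name": "Sue", "age": 40, "date_of_birth": date(2004, 2, 2)},   # contradicts DATE_OF_BIRTH
]

def check_semantic_anomalies(rows, today=date(2025, 1, 1)):
    """Flag integrity constraint violations (AGE >= 0) and
    contradictions between AGE and DATE_OF_BIRTH."""
    problems = []
    for r in rows:
        if r["age"] < 0:
            problems.append((r["name"], "integrity violation: AGE < 0"))
        derived_age = (today - r["date_of_birth"]).days // 365
        if abs(derived_age - r["age"]) > 1:  # allow one year of tolerance
            problems.append((r["name"], "contradiction: AGE vs DATE_OF_BIRTH"))
    return problems

print(check_semantic_anomalies(records))
```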

Semantic Anomalies (continued): Duplicates are two or more tuples representing the same entity of the mini-world; the values of these tuples may differ, which can also be a specific case of contradiction. Invalid tuples are tuples that do not display anomalies of the classes defined above but still do not represent valid entities of the mini-world.

Coverage Anomalies: Missing values or missing tuples. For example, Tom's salary is missing, and Sue's tuple is missing entirely even though she is an employee of the company.

Name | Age | Gender | Salary
Peter | 23 | M | 16,330 HKD
Tom | 34 | M | NULL

Data Anomalies (summary):
Syntactical anomalies: lexical errors, domain format errors, irregularities
Semantic anomalies: integrity constraint violations, contradictions, duplicates, invalid tuples
Coverage anomalies: missing values, missing tuples

Data Quality: Data quality is defined as an aggregated value over a set of quality criteria. With a data quality measure we can decide whether we need to perform data cleansing on a data collection, and assess and compare the performance of different data cleansing methods.

Data Quality Hierarchy of data quality criteria:

[Figure: hierarchy of data quality criteria (completeness, validity, schema conformance, uniformity, density, uniqueness) and the data anomalies affecting each criterion (lexical errors, domain format errors, irregularities, constraint violations, missing values, missing tuples, duplicates, invalid tuples).]

Process of Data Cleansing

Process of Data Cleansing: 1. Data Auditing finds the types of anomalies contained in the data. 2. Workflow Specification decides on the data cleansing workflow, a sequence of operations on the data intended to detect and eliminate the anomalies. 3. Workflow Execution executes the workflow after its specification and the verification of its correctness. 4. Post-Processing and Controlling inspects the results again to find tuples that are still not correct, which are then corrected manually.

Data Cleansing Methods: 1. Anomaly Detection: (a) rule-based detection, (b) pattern enforcement, (c) duplicate detection. 2. Error Correction, in terms of the signals used: (a) integrity constraints, (b) external information, (c) quantitative statistics.

1. Anomaly Detection (a) Rule-based detection specifies a collection of rules that clean data must obey. Rules are represented as multi-attribute functional dependencies (FDs) or user-defined functions. Typical mistakes and the heuristics that detect them (see the sketch below):
Illegal values: values should not fall outside the permissible range (min, max).
Misspellings: sorting on values often brings misspelled values next to correct values.
Missing values: the presence of a default value may indicate that the real value is missing.
Duplicates: sorting values by number of occurrences; more than one occurrence indicates possible duplicates.
Ref: Rahm E, Do H H. Data cleaning: Problems and current approaches. IEEE Data Eng. Bull., 2000, 23(4): 3-13.
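
A minimal sketch of rule-based detection, assuming rules expressed as simple range checks and user-defined predicates over a hypothetical employee table:

```python
# Each rule is an (attribute, predicate, description) triple; any row for which
# a predicate returns False is reported as a rule violation.
rules = [
    ("age", lambda v: v is not None and 0 <= v <= 120, "age must be in [0, 120]"),
    ("gender", lambda v: v in {"M", "F"}, "gender must be 'M' or 'F'"),
    ("salary_hkd", lambda v: v is None or v > 0, "salary must be positive"),
]

rows = [
    {"name": "Peter", "age": 23, "gender": "M", "salary_hkd": 16330},
    {"name": "Tom", "age": 34, "gender": "34M", "salary_hkd": 20000},  # bad gender value
    {"name": "Sue", "age": 21, "gender": "F", "salary_hkd": -1},       # bad salary value
]

def detect_rule_violations(rows, rules):
    """Return (row index, attribute, rule description) for every violated rule."""
    violations = []
    for i, row in enumerate(rows):
        for attr, pred, desc in rules:
            if not pred(row.get(attr)):
                violations.append((i, attr, desc))
    return violations

for i, attr, desc in detect_rule_violations(rows, rules):
    print(f"row {i}: {attr} violates rule: {desc}")
```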

1. Anomaly Detection (b) Pattern enforcement utilizes syntactic or semantic patterns in the data and detects cells that do not conform to those patterns. This is the focus of data mining models including clustering, summarization, association discovery and sequence discovery, e.g. relationships holding between several attributes of the form (A_i1 = a_i1 AND A_i2 = a_i2) implies (A_j1 = a_j1 AND A_j2 = a_j2). Ref: Abedjan Z, Chu X, Deng D, et al. Detecting data errors: Where are we and what needs to be done? Proceedings of the VLDB Endowment, 2016, 9(12): 993-1004.
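
As an illustration (not taken from the referenced paper), a simple syntactic pattern check might profile the dominant character shape of a column and flag the cells that deviate from it:

```python
from collections import Counter

def shape(value):
    """Map each character to a coarse class: digit -> 'd', letter -> 'a', other kept as-is."""
    return "".join("d" if c.isdigit() else "a" if c.isalpha() else c for c in value)

def flag_pattern_outliers(column):
    """Return values whose syntactic shape differs from the column's most common shape."""
    shapes = [shape(v) for v in column]
    dominant, _ = Counter(shapes).most_common(1)[0]
    return [v for v, s in zip(column, shapes) if s != dominant]

zip_codes = ["10115", "20095", "8OO23", "50667"]  # '8OO23' uses the letter O instead of zero
print(flag_pattern_outliers(zip_codes))  # ['8OO23']
```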

1. Anomaly Detection (c) Duplicate detection identifies multiple records that refer to the same entity. In the process, conflicting values for the same attribute can be found, indicating possible errors. Duplicate representations might differ slightly in their values, so well-chosen similarity measures improve the effectiveness of duplicate detection, and dedicated algorithms have been developed to search for duplicates in very large volumes of data. Ref: Naumann F, Herschel M. An introduction to duplicate detection. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.

1. Anomaly Detection: Duplicate detection, similarity measure 1: the Jaccard coefficient compares two token sets P and Q: Jaccard(P, Q) = |P ∩ Q| / |P ∪ Q|. For example, tokenize("Thomas Sean Connery") = {Thomas, Sean, Connery} and tokenize("Sir Sean Connery") = {Sir, Sean, Connery}, so Jaccard(Thomas Sean Connery, Sir Sean Connery) = 2/4. Similarity measure 2: edit distance. Ref: Naumann F, Herschel M. An introduction to duplicate detection. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
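
A small sketch of the token-based Jaccard similarity used above:

```python
def jaccard(a, b):
    """Token-based Jaccard coefficient between two strings."""
    p, q = set(a.split()), set(b.split())
    return len(p & q) / len(p | q)

print(jaccard("Thomas Sean Connery", "Sir Sean Connery"))  # 2/4 = 0.5
```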

1. Anomaly Detection: Duplicate detection algorithm: to avoid the cost of comparing all pairs of records, the sorted-neighborhood method first assigns a sorting key to each record and sorts all records according to that key. Then a window of fixed size slides over the sorted records, and only pairs of records that appear in the same window are compared (see the sketch below). Ref: Naumann F, Herschel M. An introduction to duplicate detection. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
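
A minimal sketch of the sorted-neighborhood method; the sorting key and similarity threshold below are illustrative assumptions, with the Jaccard measure from the previous sketch as the comparison function:

```python
def jaccard(a, b):
    p, q = set(a.split()), set(b.split())
    return len(p & q) / len(p | q)

def sorted_neighborhood(records, key, similar, window=3):
    """Sort records by a key, then compare only records that fall
    within the same sliding window of the given size."""
    ordered = sorted(records, key=key)
    pairs = []
    for i, rec in enumerate(ordered):
        for other in ordered[i + 1 : i + window]:
            if similar(rec, other):
                pairs.append((rec, other))
    return pairs

names = ["Sean Connery", "Sir Sean Connery", "Peter Chan", "Tom Wong"]
print(sorted_neighborhood(
    names,
    key=lambda n: n.split()[-1],                # illustrative sorting key: last token
    similar=lambda a, b: jaccard(a, b) >= 0.5,  # flag pairs with high token overlap
))  # [('Sean Connery', 'Sir Sean Connery')]
```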

2. Error Correction (a) Integrity Constraints Functional Dependencies:

2. Error Correction (a) Integrity Constraints: example repair from the reference: for tuple t_4, modify name to "Alice Smith" and street to "17 bridge" so that the functional dependencies are satisfied again. Ref: Bohannon P, Fan W, Flaster M, et al. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. ACM, 2005: 143-154.

2. Error Correction (a) Integrity Constraints: The cost-based model finds another database that is consistent and minimally differs from the original database. Each tuple t is assigned a weight ω(t); the cost of a modification is the weight times the distance, according to a similarity metric, between the original and the repaired value, summed over the modified attributes: cost(t) = ω(t) · Σ_{A ∈ attr(R_i)} dis(D(t, A), D'(t, A)). Ref: Bohannon P, Fan W, Flaster M, et al. A cost-based model and effective heuristic for repairing constraints by value modification. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. ACM, 2005: 143-154.
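
A sketch of the repair cost for a single tuple under these assumptions: weight times the summed per-attribute distance, with a normalized edit-similarity standing in for the paper's similarity metric:

```python
from difflib import SequenceMatcher

def dis(original, repaired):
    """Distance between two attribute values, taken as 1 - similarity ratio."""
    return 1.0 - SequenceMatcher(None, str(original), str(repaired)).ratio()

def repair_cost(original_tuple, repaired_tuple, weight=1.0):
    """cost(t) = w(t) * sum over attributes A of dis(D(t, A), D'(t, A))."""
    return weight * sum(
        dis(original_tuple[a], repaired_tuple[a]) for a in original_tuple
    )

# Hypothetical repair of tuple t4 (values are illustrative only).
t4 = {"name": "Alice Smth", "street": "17 bridge"}
t4_repaired = {"name": "Alice Smith", "street": "17 bridge"}
print(repair_cost(t4, t4_repaired, weight=0.8))
```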

2. Error Correction (a) Integrity Constraints: Denial constraints are a more expressive, first-order-logic formalism than classical integrity constraints: they involve order predicates (>, <) and can compare different attributes within the same predicate.

2. Error Correction (a) Integrity Constraints: A denial constraint states that a set of predicates cannot all hold together for any combination of tuples in a relation: ∀ t_α, t_β ∈ R: ¬(p_1 ∧ ... ∧ p_m). For example, there cannot exist two persons who live in the same zip code where one has a lower salary and a higher tax rate: ∀ t_α, t_β ∈ R: ¬(t_α.ZIP = t_β.ZIP ∧ t_α.SAL < t_β.SAL ∧ t_α.TR > t_β.TR). Ref: Chu X, Ilyas I F, Papotti P. Discovering denial constraints. Proceedings of the VLDB Endowment, 2013, 6(13): 1498-1509.
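
A minimal sketch that checks the ZIP/salary/tax-rate denial constraint above over all tuple pairs of a small hypothetical table:

```python
from itertools import permutations

people = [
    {"name": "Peter", "zip": "999077", "sal": 16330, "tr": 0.15},
    {"name": "Tom", "zip": "999077", "sal": 20000, "tr": 0.10},  # higher salary but lower tax rate than Peter
    {"name": "Sue", "zip": "518000", "sal": 12000, "tr": 0.08},
]

def denial_violations(rows):
    """Return tuple pairs (t_a, t_b) with t_a.ZIP = t_b.ZIP,
    t_a.SAL < t_b.SAL and t_a.TR > t_b.TR."""
    return [
        (a["name"], b["name"])
        for a, b in permutations(rows, 2)
        if a["zip"] == b["zip"] and a["sal"] < b["sal"] and a["tr"] > b["tr"]
    ]

print(denial_violations(people))  # [('Peter', 'Tom')]
```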

2. Error Correction (b) External Information: External information includes dictionaries, knowledge bases and annotations by experts. It is used to detect data entry errors and correct them automatically. For example, misspellings can be identified and corrected with dictionary lookup, and dictionaries of geographic names and zip codes help to correct address data. Attribute dependencies (birthday and age, total price and unit price/quantity, city and phone area code) can be used to detect wrong values and to substitute missing values.
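
A simple sketch of dictionary-based misspelling correction, assuming a small hypothetical dictionary of city names:

```python
from difflib import get_close_matches

cities = ["Hong Kong", "Shenzhen", "Guangzhou", "Shanghai"]

def correct_city(value, dictionary=cities, cutoff=0.8):
    """Replace a value with its closest dictionary entry if the match is close enough."""
    match = get_close_matches(value, dictionary, n=1, cutoff=cutoff)
    return match[0] if match else value

print(correct_city("Hong Kng"))  # 'Hong Kong'
print(correct_city("Shenzen"))   # 'Shenzhen'
print(correct_city("Atlantis"))  # unchanged: no close match in the dictionary
```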

2. Error Correction (c) Quantitative Statistics: A relational dependency network (RDN) captures attribute dependencies with a graphical model in order to propagate inferences throughout the database. Compared with conventional conditional models, it handles datasets whose instances are statistically dependent: when relational data exhibit autocorrelation, inferences about one object can inform inferences about related objects. Compared with other probabilistic relational models, an RDN can also represent cyclic autocorrelation dependencies. Ref: Neville J, Jensen D. Relational dependency networks. Journal of Machine Learning Research, 2007, 8(Mar): 653-692.

2. Error Correction (c) Quantitative Statistics There are three graphs associated with relational data: Data graph: Each node has a number of associated attributes. A probabilistic relational model represents a joint distribution over the values of the attributes in the data graph.

2. Error Correction (c) Quantitative Statistics There are three graphs associated with relational data: Model graph: Represents the dependencies among attributes. Attributes of an item can depend probabilistically on other attributes of the same item, as well as on attributes of other related objects.

2. Error Correction (c) Quantitative Statistics There are three graphs associated with relational data: Inference graph: During inference, an inference graph is instantiated to represent the probabilistic dependencies among all the variables in a test set.

2. Error Correction (c) Quantitative Statistics: Learning an RDN by maximum pseudolikelihood: 1. learn the dependency structure among the attributes of each object type; 2. estimate the parameters of the local probability models for an attribute given its parents. For example, if p(x_i | X \ {x_i}) = α·x_j + β·x_k, then the parent set of x_i is PA_i = {x_j, x_k}.

2. Error Correction (c) Quantitative Statistics: Inference by Gibbs sampling: 1. create the inference graph, where the values of all unobserved variables are initialized to values drawn from their prior distributions; 2. given the current state of the rest of the graph, Gibbs sampling iteratively relabels each unobserved variable by drawing from its local conditional distribution; 3. after convergence the values are drawn from a stationary distribution and the samples can be used to estimate the probabilities of interest (a toy sketch follows below).
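
Not RDN-specific, but a toy sketch of the Gibbs sampling mechanics described above, for two correlated Gaussian variables whose local conditional distributions are known in closed form:

```python
import math
import random

def gibbs_bivariate_normal(rho=0.8, n_samples=10000, burn_in=1000):
    """Gibbs sampler for (X, Y) ~ standard bivariate normal with correlation rho.
    Conditionals: X | Y=y ~ N(rho*y, 1 - rho^2), and symmetrically for Y | X=x."""
    x, y = random.gauss(0, 1), random.gauss(0, 1)  # initialize from the priors
    sd = math.sqrt(1 - rho ** 2)
    samples = []
    for i in range(n_samples + burn_in):
        x = random.gauss(rho * y, sd)  # redraw x from its local conditional
        y = random.gauss(rho * x, sd)  # redraw y from its local conditional
        if i >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal()
# Use the stationary samples to estimate a probability of interest, e.g. P(X > 0 and Y > 0).
print(sum(1 for x, y in samples if x > 0 and y > 0) / len(samples))
```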

Conclusion: Data cleansing is the process of detecting and correcting errors and inconsistencies. The process of data cleansing is a sequence of operations intended to enhance the overall data quality of a data collection. Many data cleansing methods exist, aiming at error detection and error correction in the different steps of the data cleansing process.

Thank you!