Data Cleansing
LIU Jingyuan, Vislab; WANG Yilei, Theoretical group
What is Data Cleansing
Data cleansing (data cleaning) is the process of detecting and correcting (or removing) errors or inconsistencies from a record set, table, or database.

Before cleansing:
Name | Age | Gender | Salary
Peter | 23 | M | 16,330 HKD
Tom | 34 | M | 20,000 HKD
Sue | 21 | F | 2,548 USD

After cleansing:
Name | Age | Gender | Salary
Peter | 23 | M | 16,330 HKD
Tom | 34 | M | 20,000 HKD
Sue | 21 | F | 20,000 HKD
Why we need Data Cleansing
Errors universally exist in real-world data: erroneous measurements, lazy input habits, omissions, etc.
Erroneous data leads to false conclusions and misdirected investments, e.g. when keeping track of employees, customers, or sales volume.
Erroneous data also leads to unnecessary costs and possibly loss of reputation, e.g. invalid mailing addresses or inaccurate buying habits and preferences.
Data Anomalies
We use the term anomalies for the errors to be detected or corrected. Classification of data anomalies:
Syntactical Anomalies describe characteristics concerning the format and values used for the representation of the entities.
Semantic Anomalies hinder the data collection from being a comprehensive and non-redundant representation of the mini-world.
Coverage Anomalies decrease the amount of entities and entity properties from the mini-world that are represented in the data collection.
Syntactical Anomalies
Lexical errors name discrepancies between the structure of the data items and the specified format: the degree #t of a tuple differs from #R, the degree of the relation schema.

Data table with lexical errors:
Name | Age | Gender | Size
Peter | 23 | M | 7 | 1
Tom | 34 | M
Sue | 21 | 5 | 8
Syntactical Anomalies: Lexical errors, Domain format errors
Domain format errors specify errors where the given value for an attribute does not conform with the anticipated format. Required format of Name: FirstName, LastName.

Data table with domain format errors:
Name | Age | Gender
Rachel, Green | 24 | F
Monica, Geller | 24 | F
Ross Geller | 26 | M
Syntactical Anomalies: Lexical errors, Domain format errors, Irregularities
Irregularities are concerned with the non-uniform use of values, units and abbreviations.

Data table with irregularities:
Name | Age | Gender | Salary
Peter | 23 | M | 16,330 HKD
Tom | 34 | M | 20,000 HKD
Sue | 21 | F | 2,548 USD
Semantic Anomalies
Integrity constraint violations describe tuples that do not satisfy some integrity constraints, which are used to describe our understanding of the mini-world by restricting the set of valid instances (e.g. AGE >= 0).
Contradictions are values within one tuple or between tuples that violate some kind of dependency between the values (e.g. a contradiction between AGE and DATE_OF_BIRTH).
Semantic Anomalies: Integrity constraint violations, Contradictions, Duplicates, Invalid tuples
Duplicates are two or more tuples representing the same entity from the mini-world. The values of these tuples may differ, in which case they are also specific cases of contradictions.
Invalid tuples are tuples that do not display any of the anomalies defined above but still do not represent valid entities from the mini-world.
Coverage Anomalies
Missing values or tuples: Tom's salary is missing; Sue's tuple is missing entirely, although she is an employee of this company.

Name | Age | Gender | Salary
Peter | 23 | M | 16,330 HKD
Tom | 34 | M | NULL
Data Anomalies
Syntactical Anomalies: Lexical errors, Domain format errors, Irregularities
Semantic Anomalies: Integrity constraint violations, Contradictions, Duplicates, Invalid tuples
Coverage Anomalies: Missing values, Missing tuples
Data Quality
Data quality is defined as an aggregated value over a set of quality criteria. With data quality measures, we can:
Decide whether we need to do data cleansing on a data collection
Assess and compare the performance of different data cleansing methods
Data Quality Hierarchy of data quality criteria:
Figure: Data anomalies affecting data quality criteria. Quality criteria: Completeness, Validity, Schema conformance, Uniformity, Density, Uniqueness. Anomalies: lexical errors, domain format errors, irregularities, constraint violations, missing values, missing tuples, duplicates, invalid tuples.
Process of Data Cleansing
Process of Data Cleansing
1. Data Auditing: find the types of anomalies contained in the data.
2. Workflow Specification: decide the data cleansing workflow, a sequence of operations on the data intended to detect and eliminate anomalies.
3. Workflow Execution: execute the workflow after specification and verification of its correctness.
4. Post-Processing and Controlling: inspect the results again to find tuples that are still not correct, which should be corrected manually.
Data Cleansing Methods
1. Anomaly Detection: (a) rule-based detection, (b) pattern enforcement detection, (c) duplicate detection
2. Error Correction, in terms of the signals used: (a) integrity constraints, (b) external information, (c) quantitative statistics
1. Anomaly Detection
(a) Rule-based detection specifies a collection of rules that clean data must obey. Rules are represented as multi-attribute functional dependencies (FDs) or user-defined functions (see the sketch after the table).

Mistake | Heuristic
Illegal values | Values should not fall outside the permissible range (min, max)
Misspellings | Sorting on values often brings misspelled values next to correct values
Missing values | Presence of a default value may indicate that the real value is missing
Duplicates | Sorting values by number of occurrences; more than one occurrence indicates duplicates

Ref: Rahm E, Do H H. Data cleaning: Problems and current approaches[J]. IEEE Data Eng. Bull., 2000, 23(4): 3-13.
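A minimal Python sketch of rule-based detection, assuming a hypothetical list of employee records; the three rules (age range, default or missing salary, gender domain) are illustrative examples, not rules taken from the reference above.

```python
# Minimal sketch of rule-based anomaly detection (illustrative rules only).
records = [
    {"name": "Peter", "age": 23,  "gender": "M", "salary": 16330},
    {"name": "Tom",   "age": 340, "gender": "M", "salary": None},  # illegal age, missing salary
    {"name": "Sue",   "age": 21,  "gender": "F", "salary": 0},     # default value 0 may mean "missing"
]

rules = [
    ("age out of range",     lambda r: r["age"] is None or not (0 <= r["age"] <= 120)),
    ("salary missing",       lambda r: r["salary"] in (None, 0)),
    ("gender not in domain", lambda r: r["gender"] not in ("M", "F")),
]

for r in records:
    for label, violated in rules:
        if violated(r):
            print(f"{r['name']}: {label}")
```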
1. Anomaly Detection
(b) Pattern enforcement utilizes syntactic or semantic patterns in the data and detects cells that do not conform with the patterns. This is the focus of data mining models including clustering, summarization, association discovery and sequence discovery, e.g. relationships holding between several attributes such as A_i1 = a_i1 ∧ A_i2 = a_i2 ⇒ A_j1 = a_j1 ∧ A_j2 = a_j2.
Ref: Abedjan Z, Chu X, Deng D, et al. Detecting data errors: Where are we and what needs to be done?[J]. Proceedings of the VLDB Endowment, 2016, 9(12): 993-1004.
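A small sketch of pattern enforcement, under the assumption that within a ZIP code the dominant city name is the correct pattern; the ZIP/City pairs below are made up for illustration.

```python
from collections import Counter, defaultdict

# Sketch: learn the dominant value pattern ZIP -> City and flag cells that break it.
rows = [("10001", "New York"), ("10001", "New York"), ("10001", "Boston"),  # likely error
        ("02115", "Boston"), ("02115", "Boston")]

by_zip = defaultdict(Counter)
for zip_code, city in rows:
    by_zip[zip_code][city] += 1

for zip_code, city in rows:
    dominant, _ = by_zip[zip_code].most_common(1)[0]
    if city != dominant:
        print(f"Pattern violation: ZIP {zip_code} usually maps to {dominant}, found {city}")
```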
1. Anomaly Detection
(c) Duplicate detection identifies multiple records for the same entity. In the process, conflicting values for the same attribute can be found, indicating possible errors. Duplicate representations may differ slightly in their values, so well-chosen similarity measures improve the effectiveness of duplicate detection. Specialized algorithms are developed to search for duplicates in very large volumes of data.
Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
1. Anomaly Detection: Duplicate detection
Similarity measure 1: Jaccard coefficient, comparing two token sets P and Q:
Jaccard(P, Q) = |P ∩ Q| / |P ∪ Q|
tokenize("Thomas Sean Connery") = {Thomas, Sean, Connery}
tokenize("Sir Sean Connery") = {Sir, Sean, Connery}
Jaccard(Thomas Sean Connery, Sir Sean Connery) = 2/4
Similarity measure 2: Edit distance
Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
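A sketch of the two similarity measures in Python; tokenization here is simple whitespace splitting, which matches the Connery example above.

```python
def jaccard(a: str, b: str) -> float:
    """Token-based Jaccard coefficient: |P ∩ Q| / |P ∪ Q|."""
    p, q = set(a.split()), set(b.split())
    return len(p & q) / len(p | q)

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]

print(jaccard("Thomas Sean Connery", "Sir Sean Connery"))       # 2/4 = 0.5
print(edit_distance("Thomas Sean Connery", "Sir Sean Connery"))
```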
1. Anomaly Detection: Duplicate detection
Detection algorithm: to avoid the cost of pair-wise comparisons, the sorted-neighborhood method first assigns a sorting key to each record and sorts all records according to that key. Then only pairs of records that appear in the same sliding window are compared.
Ref: Naumann F, Herschel M. An introduction to duplicate detection[J]. Synthesis Lectures on Data Management, 2010, 2(1): 1-87.
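A sketch of the sorted-neighborhood idea, assuming name strings as records and the lowercased last name as the sorting key; the key choice, window size and threshold are illustrative assumptions.

```python
def name_similarity(a: str, b: str) -> float:
    # Token-based Jaccard similarity (same measure as on the previous slide).
    p, q = set(a.lower().split()), set(b.lower().split())
    return len(p & q) / len(p | q)

def sorted_neighborhood(records, key, window=3, threshold=0.5):
    """Sort records by the key, then compare only records falling inside the same sliding window."""
    records = sorted(records, key=key)
    candidates = []
    for i in range(len(records)):
        for j in range(i + 1, min(i + window, len(records))):
            if name_similarity(records[i], records[j]) >= threshold:
                candidates.append((records[i], records[j]))
    return candidates

names = ["Sean Connery", "Thomas Sean Connery", "Sir Sean Connery", "Peter Chan", "Petr Chan"]
print(sorted_neighborhood(names, key=lambda n: n.split()[-1].lower()))
```

The sorting key acts as a blocking criterion: records that sort far apart are never compared, which is what keeps the method cheaper than all-pairs comparison.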
2. Error Correction (a) Integrity Constraints Functional Dependencies:
2. Error Correction (a) Integrity Constraints
Tuple t4: modify name to "Alice Smith" and street to "17 bridge".
Ref: Bohannon P, Fan W, Flaster M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]// Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. ACM, 2005: 143-154.
2. Error Correction (a) Integrity Constraints
The cost-based model finds another database that is consistent and minimally differs from the original database. Assign a weight to each tuple; the cost of a modification is the weight times the distance, according to a similarity metric, between the original value and the repaired value:
cost(t) = ω(t) · Σ_{A ∈ attr(R_i)} dis(D(t, A), D′(t, A))
Ref: Bohannon P, Fan W, Flaster M, et al. A cost-based model and effective heuristic for repairing constraints by value modification[C]// Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data. ACM, 2005: 143-154.
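A sketch of the cost computation, assuming tuples represented as Python dicts and a generic string-similarity distance (difflib) as a stand-in for the paper's metric; tuple values and weight are made up.

```python
from difflib import SequenceMatcher

def dis(v1, v2) -> float:
    """String distance in [0, 1]: 0 means identical (stand-in for the paper's similarity metric)."""
    return 1.0 - SequenceMatcher(None, str(v1), str(v2)).ratio()

def repair_cost(original: dict, repaired: dict, weight: float) -> float:
    """cost(t) = w(t) * sum over attributes A of dis(D(t, A), D'(t, A))."""
    return weight * sum(dis(original[a], repaired[a]) for a in original)

t_original = {"name": "Alice Smth", "street": "17 brdge"}
t_repaired = {"name": "Alice Smith", "street": "17 bridge"}
print(repair_cost(t_original, t_repaired, weight=1.0))
```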
2. Error Correction (a) Integrity Constraints
Denial constraints are a more expressive first-order formalism than standard integrity constraints in that they involve order predicates (>, <) and can compare different attributes in the same predicate.
2. Error Correction (a) Integrity Constraints
A denial constraint expresses that a set of predicates cannot be true together for any combination of tuples in a relation:
∀ t_α, t_β, … ∈ R: ¬(p_1 ∧ … ∧ p_m)
e.g. there cannot exist two persons who live in the same zip code where one has a lower salary and a higher tax rate:
∀ t_α, t_β ∈ R: ¬(t_α.ZIP = t_β.ZIP ∧ t_α.SAL < t_β.SAL ∧ t_α.TR > t_β.TR)
Ref: Chu X, Ilyas I F, Papotti P. Discovering denial constraints[J]. Proceedings of the VLDB Endowment, 2013, 6(13): 1498-1509.
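A minimal sketch of checking the ZIP/salary/tax-rate denial constraint above over all tuple pairs; the person records are invented for illustration.

```python
from itertools import permutations

# Sketch: flag tuple pairs violating the denial constraint
# "no two persons in the same ZIP where one has a lower salary but a higher tax rate".
persons = [
    {"name": "A", "ZIP": "94305", "SAL": 50000, "TR": 0.30},
    {"name": "B", "ZIP": "94305", "SAL": 40000, "TR": 0.35},  # violates the constraint together with A
    {"name": "C", "ZIP": "10001", "SAL": 60000, "TR": 0.25},
]

for ta, tb in permutations(persons, 2):
    if ta["ZIP"] == tb["ZIP"] and ta["SAL"] < tb["SAL"] and ta["TR"] > tb["TR"]:
        print(f"Denial constraint violated by {ta['name']} and {tb['name']}")
```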
2. Error Correction (b) External Information
External information includes dictionaries, knowledge bases and annotations by experts. It is used to detect data entry errors and correct them automatically. For example, misspellings can be identified and corrected based on dictionary lookup, and dictionaries of geographic names and zip codes help to correct address data. Attribute dependencies (birthday and age, total price and unit price/quantity, city and phone area code) can be used to detect wrong values and substitute missing values.
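A sketch of dictionary-based correction using fuzzy lookup; the city dictionary and cutoff are hypothetical choices for illustration.

```python
import difflib

# Sketch: dictionary-based correction of misspelled city names.
city_dictionary = ["Hong Kong", "Shenzhen", "Guangzhou", "Shanghai"]

def correct_city(value: str, cutoff: float = 0.7) -> str:
    # Look up the closest dictionary entry; keep the original if nothing is close enough.
    matches = difflib.get_close_matches(value, city_dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else value

print(correct_city("Hong Kongg"))   # -> "Hong Kong"
print(correct_city("Shengzhen"))    # -> "Shenzhen"
```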
2. Error Correction (c) Quantitative Statistics
A relational dependency network (RDN) captures attribute dependencies with a graphical model in order to propagate inferences throughout the database. Compared with conventional conditional models, this model can handle datasets with statistically dependent instances: when relational data exhibit autocorrelation, inferences about one object can inform inferences about related objects. Compared with other probabilistic relational models, this model can deal with cyclic autocorrelation dependencies.
Ref: Neville J, Jensen D. Relational dependency networks[J]. Journal of Machine Learning Research, 2007, 8(Mar): 653-692.
2. Error Correction (c) Quantitative Statistics There are three graphs associated with relational data: Data graph: Each node has a number of associated attributes. A probabilistic relational model represents a joint distribution over the values of the attributes in the data graph.
2. Error Correction (c) Quantitative Statistics There are three graphs associated with relational data: Model graph: Represents the dependencies among attributes. Attributes of an item can depend probabilistically on other attributes of the same item, as well as on attributes of other related objects.
2. Error Correction (c) Quantitative Statistics There are three graphs associated with relational data: Inference graph: During inference, an inference graph is instantiated to represent the probabilistic dependencies among all the variables in a test set.
2. Error Correction (c) Quantitative Statistics
Learning an RDN: maximum pseudolikelihood estimation.
1. Learn the dependency structure among the attributes of each object type;
2. Estimate the parameters of the local probability models for an attribute given its parents.
If p(x_i | X \ {x_i}) = α·x_j + β·x_k, then PA_i = {x_j, x_k}.
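A toy sketch of the structure-selection idea, assuming linear local models (the RDN paper actually learns relational probability trees): attributes with a non-negligible coefficient in x_i's local model are taken as its parents PA_i. The data and the 0.2 threshold are synthetic assumptions.

```python
import numpy as np

# Sketch: pick the parent set PA_i of each attribute by fitting a linear local model
# p(x_i | rest) and keeping the attributes with non-negligible coefficients.
rng = np.random.default_rng(0)
n = 200
x_j = rng.normal(size=n)
x_k = rng.normal(size=n)
x_i = 2.0 * x_j - 1.5 * x_k + rng.normal(scale=0.1, size=n)  # x_i truly depends on x_j and x_k
X = np.column_stack([x_i, x_j, x_k])
names = ["x_i", "x_j", "x_k"]

for i in range(X.shape[1]):
    others = [j for j in range(X.shape[1]) if j != i]
    coef, *_ = np.linalg.lstsq(X[:, others], X[:, i], rcond=None)
    parents = [names[j] for j, c in zip(others, coef) if abs(c) > 0.2]
    print(f"PA({names[i]}) = {parents}")
```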
2. Error Correction (c) Quantitative Statistics
Inference: Gibbs sampling.
1. Create the inference graph, where the values of all unobserved variables are initialized to values drawn from their prior distributions;
2. Gibbs sampling then iteratively relabels each unobserved variable by drawing from its local conditional distribution, given the current state of the rest of the graph;
3. Eventually the values are drawn from a stationary distribution, and the samples can be used to estimate probabilities of interest.
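A skeleton of the Gibbs sampling loop, assuming the local conditional distributions are supplied as a callback that draws a new value for a variable given the current state; this shows only the loop structure, not the RDN-specific conditionals.

```python
import random

# Sketch of the Gibbs sampling loop: repeatedly relabel each unobserved variable
# by drawing from its local conditional distribution given the current state.
def gibbs_sample(unobserved, local_conditional, n_iterations=1000, burn_in=200):
    state = {v: random.random() for v in unobserved}   # 1. initialize from a (uniform) prior
    samples = []
    for it in range(n_iterations):
        for v in unobserved:                           # 2. resample each variable in turn
            state[v] = local_conditional(v, state)
        if it >= burn_in:                              # 3. keep samples only after burn-in
            samples.append(dict(state))
    return samples
```

Probabilities of interest can then be estimated as relative frequencies over the returned samples.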
Conclusion
Data cleansing is the process of detecting and correcting errors and inconsistencies. The process of data cleansing is a sequence of operations intended to enhance the overall data quality of a data collection. Many data cleansing methods exist, aiming at error detection and error correction in the different steps of the data cleansing process.
Thank you!