Data Quality: the Other Face of Big Data. Divesh Srivastava AT&T Labs-Research


1 Data Quality: the Other Face of Big Data Divesh Srivastava AT&T Labs-Research

2 Data Quality I am a manager I am also a researcher working on data quality 2

3 Big Data Big data is different things to different people Volume, velocity, variety, variability, value, veracity 3

4 Big Data + Data Quality
Big data is all about the V's:
  Size: huge volume of data from multiple sources
  Speed: dynamic data, collected and analyzed at high velocity
  Complexity: large variety of data and sources
  Evolution: considerable variability of data and semantics over time
Goal: to extract significant value from big data
Key stumbling block: data quality
  Raw data is often of questionable veracity
  How do we obtain high-quality information?

5 Big Data + Data Quality 5

6 Data Can Be Erroneous
"The story, marked 'Hold for release Do not use', was sent in error to the news service's thousands of corporate clients."

7 Data Quality: By the Numbers
Impact of poor data quality
  Erroneous data costs US businesses $600 billion/year [E02]
  In data warehouse (DW) projects, data cleaning takes 30-80% of time and budget
  The data quality tools market is growing at 16% annually, well above the 7% average for other IT segments [G07]
How much data is erroneous?
  Enterprise data error rates: 1-5% on average, above 30% in some cases [R98]

8 Case Study: Big Data Quality [LDL+12] Study on two domains Belief of clean data Poor quality data can have big impact #Sources Period #Objects #Localattrs #Globalattrs Considered items Stock 55 7/ * *20 Flight 38 12/ * *31 8

9 Case Study: Big Data Quality Is the data consistent? Tolerance to 1% value difference 9

10 Case Study: Big Data Quality
Why such inconsistency? Semantic ambiguity
  Example: Nasdaq vs. Yahoo! Finance, comparing fields such as Day's Range and wk Range

11 Case Study: Big Data Quality
Why such inconsistency? Unit errors
  Example: 76.82B vs. 76,821,000

12 Case Study: Big Data Quality
Why such inconsistency? Instance ambiguity

13 Case Study: Big Data Quality
Why such inconsistency? Pure errors
  FlightView | FlightAware | Orbitz
  6:15 PM | 6:22 PM | 6:15 PM
  9:40 PM | 8:33 PM | 9:54 PM

14 Case Study: Big Data Quality Why such inconsistency? Random sample of 20 data items + 5 items with largest # of values 14

15 Case Study: Big Data Quality Copying between sources? 15

16 Case Study: Big Data Quality Copying on erroneous data? 16

17 Case Study: Lessons Learned
Big data has considerable inconsistency
  Even in domains where poor-quality data can have a big impact
  Semantic ambiguity, out-of-date data, unexplainable errors
Data sources often copy from each other
  Copying can happen on erroneous data, spreading poor-quality data

18 Small Data Quality: How Was It Achieved? Specify all domain knowledge as integrity constraints on data Reject updates that do not preserve integrity constraints Works well when the domain is well understood and static 18

19 Big Data Quality: A Different Approach?
Big data: integrity constraints cannot be specified a priori
  Data variety, volume ⇒ complete domain knowledge is infeasible
  Data velocity, variability ⇒ domain knowledge becomes obsolete
  Too much rejected data ⇒ small data

20 Big Data Quality: A Different Approach?
Solution: let the data speak for itself
  Learn models (semantics) from the data
  Identify data glitches as violations of the learned models
  Repair data glitches and models in a timely manner

21 In This Talk
A focus on well-structured data and logic-based data quality
  Models: logical constraints, e.g., (C)FDs, IDs, MDs, EGDs, DCs
  Glitches: groups of cells, i.e., (tuple-id, attribute) pairs
  Repairs: cost-based modifications to the data and models
What we do not discuss in this talk
  Logic-based: consistent query answering, without data repairs
  Statistics-based: statistical models, statistical anomaly detection
  Unstructured data: quality of audio, video, extracted data

22 Outline Introduction Identifying inconsistencies Repairing inconsistencies 22

23 Identifying Inconsistencies Small data: specify semantics as integrity constraints on data Big data: let the data speak for itself Learn models (e.g., constraints, rules, patterns) from the data Identify data glitches as violations of the learned models 23

24 Example: Functional Dependencies
Tid | Name | Type | Country | Customer | Price | Tax
t1 | Harry Potter | Book | France | Andrew | 10 | 0
t2 | Harry Potter | Book | France | Barbara | 10 | 0
t3 | Harry Potter | Book | France | Candy | |
t4 | Lord of the Rings | Book | USA | David | 25 | 0
t5 | Lord of the Rings | Book | USA | Eran | |
t6 | Armani Suit | Clothing | USA | Frank | |
t7 | Armani Suit | Clothing | USA | George | |
t8 | Kaminski Hat | Clothing | Australia | Harry | |
t9 | Kaminski Hat | Clothing | Australia | Ian | |
FD: [Name, Type, Country] → [Price, Tax]
FDs are used to check consistency

25 Example: Functional Dependencies
(Same table as slide 24, with an X marking a violation)
FD: [Name, Type, Country] → [Price, Tax]
FDs are used to check consistency

26 Identifying Inconsistencies: Impact of Big Data Variety, variability of data: one size does not fit all Learn conditional models (contextual semantics) 26

27 Example: Conditional FD
(Same table as slide 24)
CFD: [Name = *, Type = Clothing, Country = *] → [Price, Tax]
CFDs are used to check consistency of a subset of the table

28 Identifying Inconsistencies: Impact of Big Data
Variety, variability of data: exact vs approximate models
  Exact approaches can lead to over-fitting and a large number of patterns
  Approximate approaches can have few violations: these are glitches
  Statistically robust measures: use support and confidence

29 Example: Conditional FD
(Same table as slide 24)
CFD: [Name = *, Type = *, Country = France] → [Price, Tax]
Holds approximately: support = 3/9, confidence = 2/3

30 Learning CFDs
Given an FD, learn a good pattern tableau from data [GKK+08]
FD: [Name, Type, Country] → [Price, Tax]
Learned pattern tableau:
  Name | Type | Country | Price | Tax | Support | Confidence
  * | Clothing | * | * | * | 4/9 | 4/4
  * | * | France | * | 0 | 3/9 | 2/3
Global support = 7/9, global confidence = 6/7, local confidence = 2/3
Learn FD and a good pattern tableau from data [FGL+09]
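The support and confidence figures on this slide can be checked with a small sketch. It assumes that confidence is the largest fraction of covered tuples that can be kept so that the FD holds among them, and it ignores the constant Tax = 0 in the second tableau row. The Price and Tax values for t3 and t5 to t9 are hypothetical placeholders (they are not legible in the table), chosen only so that the slide's numbers come out; the function names are illustrative.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# (Name, Type, Country, Price, Tax). Values marked "assumed" are hypothetical.
ROWS = [
    ("Harry Potter", "Book", "France", 10, 0),          # t1
    ("Harry Potter", "Book", "France", 10, 0),          # t2
    ("Harry Potter", "Book", "France", 10, 5),          # t3 (assumed)
    ("Lord of the Rings", "Book", "USA", 25, 0),        # t4
    ("Lord of the Rings", "Book", "USA", 25, 0),        # t5 (assumed)
    ("Armani Suit", "Clothing", "USA", 200, 20),        # t6 (assumed)
    ("Armani Suit", "Clothing", "USA", 200, 20),        # t7 (assumed)
    ("Kaminski Hat", "Clothing", "Australia", 50, 5),   # t8 (assumed)
    ("Kaminski Hat", "Clothing", "Australia", 50, 5),   # t9 (assumed)
]

def matches(pattern, x):
    """True if every pattern position is a wildcard or equals the tuple value."""
    return all(p == "*" or p == v for p, v in zip(pattern, x))

def support_confidence(rows, patterns):
    """Support and confidence of a tableau (a list of antecedent patterns)."""
    groups = defaultdict(Counter)   # X value -> multiset of Y values
    for *x, price, tax in rows:
        if any(matches(p, x) for p in patterns):
            groups[tuple(x)][(price, tax)] += 1
    covered = sum(sum(c.values()) for c in groups.values())
    # Keep the most frequent Y value per X value; the rest are violations.
    kept = sum(max(c.values()) for c in groups.values())
    confidence = Fraction(kept, covered) if covered else Fraction(1)
    return Fraction(covered, len(rows)), confidence

clothing = ("*", "Clothing", "*")
france = ("*", "*", "France")
print(support_confidence(ROWS, [clothing]))          # one tableau row
print(support_confidence(ROWS, [france]))            # the other tableau row
print(support_confidence(ROWS, [clothing, france]))  # whole tableau
```

Passing a single pattern gives the local figures of one tableau row; passing both gives the global figures.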

31 Learning Pattern Tableaux
Generate smallest tableau with support and global confidence
  NP-complete
  Provably hard to approximate
Generate smallest tableau with support and local confidence
  NP-complete
  But

32 Identifying Inconsistencies: Impact of Big Data
Variety, variability of data
  One size does not fit all
  Exact vs approximate models
Volume of data
  Scalable algorithms: trade-off between efficiency and accuracy

33 Efficiency vs Accuracy
Trade off running time against accuracy of the solution
Input: FD: X → Y, global support and local confidence thresholds
  Consider all instantiations of the FD antecedent (X)
  Prune based on local confidence
  Apply partial greedy set cover until support is reached
Output: log(n)-approximation in tableau size
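The last step above can be sketched as a partial greedy set cover. This is a minimal illustration, not the authors' implementation: it assumes the candidate patterns have already been pruned by local confidence, and it represents each pattern by the set of tuple ids it matches in the example table. The function name and the 0.6 support threshold are hypothetical.

```python
def greedy_partial_cover(candidates, n_rows, min_support):
    """Pick patterns greedily until at least min_support * n_rows rows are covered.

    candidates maps each pattern to the set of row ids it matches.
    Returns the chosen patterns (the tableau) and the covered row ids.
    """
    target = min_support * n_rows
    covered, tableau = set(), []
    remaining = dict(candidates)
    while len(covered) < target and remaining:
        # Choose the pattern that covers the most rows not covered so far.
        best = max(remaining, key=lambda p: len(remaining[p] - covered))
        gain = remaining.pop(best) - covered
        if not gain:
            break  # nothing left can improve coverage
        tableau.append(best)
        covered |= gain
    return tableau, covered

# Row ids matched by three patterns in the nine-row example table.
candidates = {
    ("*", "Clothing", "*"): {6, 7, 8, 9},
    ("*", "*", "USA"): {4, 5, 6, 7},
    ("*", "*", "Australia"): {8, 9},
}
tableau, covered = greedy_partial_cover(candidates, n_rows=9, min_support=0.6)
print(tableau, sorted(covered))
```

Stopping as soon as the support target is met is what makes the cover partial: rows that no confident pattern explains are simply left uncovered.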

34 Example: Pattern as a Set
(Same table as slide 24)
Each pattern is viewed as the set of tuples it matches:
  Pattern [Name = *, Type = Clothing, Country = *]
  Pattern [Name = *, Type = *, Country = USA]

35 Efficiency vs Accuracy
(Input, steps and output as on slide 33)
All instantiations of the FD antecedent: X = [Name, Type, Country]
  [HP, Book, France], [HP, Book, *], [HP, *, France], [HP, *, *], ...
  $|X| = d$, number of data records = $N$
  Number of patterns can be up to $N \cdot 2^d$

36 Efficiency vs Accuracy
(Same content as slide 35)
Too many patterns (sets) to consider in each iteration!
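The pattern count can be illustrated by enumerating, for each antecedent value, every way of replacing attributes with wildcards. This is a sketch; the function names are illustrative and the four antecedent values are those of the example table.

```python
from itertools import product

def generalizations(x):
    """All 2^d patterns obtained by keeping or wildcarding each attribute of x."""
    return [
        tuple(v if keep else "*" for v, keep in zip(x, mask))
        for mask in product([True, False], repeat=len(x))
    ]

def all_patterns(antecedents):
    """Distinct patterns over a collection of antecedent values."""
    return {p for x in antecedents for p in generalizations(x)}

ANTECEDENTS = [
    ("Harry Potter", "Book", "France"),
    ("Lord of the Rings", "Book", "USA"),
    ("Armani Suit", "Clothing", "USA"),
    ("Kaminski Hat", "Clothing", "Australia"),
]
patterns = all_patterns(ANTECEDENTS)
print(len(generalizations(ANTECEDENTS[0])), len(patterns))
```

Patterns shared between tuples are counted once, so the number of distinct patterns stays below the bound of N times 2^d, but it still grows exponentially in d.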

37 Efficiency vs Accuracy
Problem: $N \cdot 2^d$ patterns to consider in partial greedy coverage
Solution: do not instantiate the entire search space of X a priori
Incremental generation of the search space: on-demand algorithm!
Pattern lattice (most general pattern at the top):
  [*, *, *]
  [HP, *, *] [*, *, France] [*, Book, *] [*, *, USA] [LotR, *, *]
  [HP, Book, *] [*, Book, France] [LotR, Book, *]
  [HP, Book, France] ... [LotR, Book, USA]

43 Efficiency vs Accuracy
On-demand algorithm: incremental generation of the search space
  Start from the root pattern of the lattice
  If the local confidence of a pattern is below the threshold, explore its unpruned children
  If the local confidence of a pattern meets the threshold, prune the sub-lattice incident on it
  Same search space exploration as partial greedy set cover!
Pattern lattice (most general pattern at the top):
  [*, *, *]
  [HP, *, *] [*, *, France] [*, Book, *] [*, *, USA] [LotR, *, *]
  [HP, Book, *] [*, Book, France] [LotR, Book, *]
  [HP, Book, France] ... [LotR, Book, USA]
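The on-demand exploration can be sketched as a breadth-first walk down the lattice. This is a simplified illustration rather than the published algorithm: it only collects the patterns that meet the confidence threshold and leaves the set-cover selection aside. Price and Tax for t3 and t5 to t9 are hypothetical placeholders, and the function names are illustrative.

```python
from collections import Counter, defaultdict, deque

# (Name, Type, Country, Price, Tax); Price/Tax of t3 and t5-t9 are assumed.
ROWS = [
    ("Harry Potter", "Book", "France", 10, 0),
    ("Harry Potter", "Book", "France", 10, 0),
    ("Harry Potter", "Book", "France", 10, 5),
    ("Lord of the Rings", "Book", "USA", 25, 0),
    ("Lord of the Rings", "Book", "USA", 25, 0),
    ("Armani Suit", "Clothing", "USA", 200, 20),
    ("Armani Suit", "Clothing", "USA", 200, 20),
    ("Kaminski Hat", "Clothing", "Australia", 50, 5),
    ("Kaminski Hat", "Clothing", "Australia", 50, 5),
]

def matches(pattern, x):
    return all(p == "*" or p == v for p, v in zip(pattern, x))

def local_confidence(rows, pattern):
    """Fraction of matching rows that can be kept so that the FD holds."""
    groups = defaultdict(Counter)
    for *x, price, tax in rows:
        if matches(pattern, x):
            groups[tuple(x)][(price, tax)] += 1
    covered = sum(sum(c.values()) for c in groups.values())
    kept = sum(max(c.values()) for c in groups.values())
    return kept / covered if covered else 0.0

def children(rows, pattern):
    """Patterns with one more constant, instantiated only from matching rows."""
    kids = set()
    for *x, _price, _tax in rows:
        if matches(pattern, x):
            for i, p in enumerate(pattern):
                if p == "*":
                    kids.add(pattern[:i] + (x[i],) + pattern[i + 1:])
    return kids

def explore(rows, min_conf, d=3):
    root = ("*",) * d
    queue, seen, accepted = deque([root]), {root}, []
    while queue:
        pat = queue.popleft()
        # Skip anything inside the sub-lattice of an accepted pattern.
        if any(matches(a, pat) for a in accepted):
            continue
        if local_confidence(rows, pat) >= min_conf:
            accepted.append(pat)      # accept and do not expand further
            continue
        for kid in children(rows, pat):
            if kid not in seen:
                seen.add(kid)
                queue.append(kid)
    return accepted

print(explore(ROWS, min_conf=1.0))
```

Because the walk is level by level, every more general pattern is examined before its specializations, so the pruning check only needs to look at patterns accepted so far.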

44 Identifying Inconsistencies: Impact of Big Data
Variety, variability of data
  One size does not fit all
  Exact vs approximate models
Volume of data
  Scalable algorithms: trade-off between efficiency and accuracy
Velocity of data
  Incremental, streaming algorithms

45 Other Data Quality Models Inclusion dependencies: every manager is an employee Sequential dependencies: consecutive polls must be 3-5min apart Matching dependencies: if similar name, address must be same Conservation dependencies: router in-traffic = router out-traffic Denial constraints: single tax exemption cannot exceed salary 45

46 Outline Introduction Identifying inconsistencies Repairing inconsistencies 46

47 Repair Techniques
Glitch repairs, using source analysis [DBS09a]
  Introduced the idea of copy detection for structured data
Glitch repairs by value modification, for FDs + InDs [BFF+05]
  Introduced the idea of cell equivalence classes
Glitch + model repairs, for FDs [CM11]
  Introduced the idea of model repairs

48 Repairs Using Source Analysis [DBS09a]
Problem: Given a database D obtained from a set of sources with overlapping data items, and a single FD C, such that each (Si, C) is consistent but (D, C) is inconsistent, find the best repair D′ of D
Result: Using source quality and copy detection is essential
Key ideas:
  Focus on value modifications of FD RHS attributes
  Learning and using source quality is better than naïve voting
  Copy detection between sources can prevent cabals

49-56 Repairs Using Source Analysis
Repairs: voting + source quality + copy detection
Resolves inconsistency across a diversity of sources
  Voting: supports difference of opinion
  Source quality: gives more weight to knowledgeable sources
  Copy detection: reduces weight of copier sources
USP | S1 | S2 | S3 | S4 | S5
Jagadish | UM | ATT | UM | UM | UI
Dewitt | MSR | MSR | UW | UW | UW
Bernstein | MSR | MSR | MSR | MSR | MSR
Carey | UCI | ATT | BEA | BEA | BEA
Franklin | UCB | UCB | UMD | UMD | UMD
(Slides 50-53 show only sources S1-S3; slides 49 and 54-56 show all five sources.)

57 Basic Solution: Naïve Voting Supports difference of opinion, allows conflict resolution Works well for independent sources that have similar accuracy When sources have different accuracies Need to give more weight to votes by knowledgeable sources When sources copy from other sources Need to reduce the weight of votes by copiers 57
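Naïve voting on the affiliation table from the preceding slides can be sketched as follows. The table values are taken from the slides; the function name is illustrative. The example shows how the majority changes once S4 and S5, which largely repeat S3, are added.

```python
from collections import Counter

ITEMS = ["Jagadish", "Dewitt", "Bernstein", "Carey", "Franklin"]
CLAIMS = {
    "S1": ["UM", "MSR", "MSR", "UCI", "UCB"],
    "S2": ["ATT", "MSR", "MSR", "ATT", "UCB"],
    "S3": ["UM", "UW", "MSR", "BEA", "UMD"],
    "S4": ["UM", "UW", "MSR", "BEA", "UMD"],
    "S5": ["UI", "UW", "MSR", "BEA", "UMD"],
}

def naive_vote(claims, sources):
    """For each item, return the value claimed by the most sources."""
    result = {}
    for i, item in enumerate(ITEMS):
        counts = Counter(claims[s][i] for s in sources)
        result[item] = counts.most_common(1)[0][0]
    return result

three = naive_vote(CLAIMS, ["S1", "S2", "S3"])
five = naive_vote(CLAIMS, ["S1", "S2", "S3", "S4", "S5"])
print(three["Dewitt"], five["Dewitt"])
print(three["Franklin"], five["Franklin"])
```

With three sources the vote for Dewitt and Franklin follows S1 and S2; with five it follows S3, S4 and S5. Every source counts equally here, which is the limitation that source accuracy and copy detection are meant to address.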

58 Source Accuracy [YHY08, DBS09a]
Need to give more weight to knowledgeable sources
Computing source accuracy: $A(S) = \mathrm{Avg}_{v_i(D) \in S} \Pr(v_i(D)\ \text{true} \mid \Phi)$
  $v_i(D) \in S$: S provides value $v_i$ on data item D
  $\Phi$: observations on all data items by the set of sources
  $\Pr(v_i(D)\ \text{true} \mid \Phi)$: probability of $v_i(D)$ being true
How to compute $\Pr(v_i(D)\ \text{true} \mid \Phi)$?

59 Source Accuracy
Input: data item D, $val(D) = \{v_0, v_1, \dots, v_n\}$, $\Phi$
Output: $\Pr(v_i(D)\ \text{true} \mid \Phi)$, for $i = 0, \dots, n$ (sum = 1)
Based on Bayes' rule, need $\Pr(\Phi \mid v_i(D)\ \text{true})$
Under independence, need $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true})$
  If S provides $v_i$: $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = A(S)$
  If S does not: $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = (1 - A(S))/n$
Challenge: inter-dependence between source accuracy and value probability?

60 Source Accuracy
Iterate: source accuracy → source vote count → value vote count → value probability → source accuracy; continue until source accuracy converges
  Source accuracy: $A(S) = \mathrm{Avg}_{v(D) \in S} \Pr(v(D) \mid \Phi)$
  Source vote count: $A'(S) = \ln \frac{n A(S)}{1 - A(S)}$
  Value vote count: $C(v(D)) = \sum_{S \in \bar{S}(v(D))} A'(S)$, where $\bar{S}(v(D))$ is the set of sources providing $v$ on D
  Value probability: $\Pr(v(D) \mid \Phi) = \frac{e^{C(v(D))}}{\sum_{v_0 \in val(D)} e^{C(v_0(D))}}$
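The iteration between source accuracy and value probability can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the published algorithm: it ignores copying, normalizes only over the values that some source actually provides, clamps accuracies away from 0 and 1 to keep the logarithm finite, and runs a fixed number of rounds instead of testing convergence. The parameters `n_false`, `init_acc` and `rounds` are hypothetical choices; the claims are sources S1 to S3 from the affiliation table.

```python
import math
from collections import defaultdict

CLAIMS = {
    "S1": {"Jagadish": "UM", "Dewitt": "MSR", "Bernstein": "MSR",
           "Carey": "UCI", "Franklin": "UCB"},
    "S2": {"Jagadish": "ATT", "Dewitt": "MSR", "Bernstein": "MSR",
           "Carey": "ATT", "Franklin": "UCB"},
    "S3": {"Jagadish": "UM", "Dewitt": "UW", "Bernstein": "MSR",
           "Carey": "BEA", "Franklin": "UMD"},
}

def discover(claims, n_false=10, init_acc=0.8, rounds=20):
    """Alternate between value probabilities and source accuracies."""
    acc = {s: init_acc for s in claims}
    probs = {}
    for _ in range(rounds):
        # Value vote count C(v(D)): sum of source vote counts A'(S).
        score = defaultdict(lambda: defaultdict(float))
        for s, vals in claims.items():
            a = min(max(acc[s], 0.01), 0.99)          # clamp (assumption)
            weight = math.log(n_false * a / (1 - a))  # A'(S)
            for item, value in vals.items():
                score[item][value] += weight
        # Value probability: normalized exponentials of the vote counts.
        probs = {}
        for item, sc in score.items():
            z = sum(math.exp(c) for c in sc.values())
            probs[item] = {v: math.exp(c) / z for v, c in sc.items()}
        # Source accuracy: average probability of the values it provides.
        acc = {
            s: sum(probs[i][v] for i, v in vals.items()) / len(vals)
            for s, vals in claims.items()
        }
    return acc, probs

acc, probs = discover(CLAIMS)
for source in sorted(acc):
    print(source, round(acc[source], 3))
```

On this input the source that agrees most often with the others ends up with the highest accuracy, and its values gain probability in turn, which is the inter-dependence the slide points out.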

61 Copy Detection
Are Source 1 and Source 2 dependent? Not necessarily
Source 1 on USA Presidents: 1st: George Washington; 2nd: John Adams; 3rd: Thomas Jefferson; 4th: James Madison; 41st: George H.W. Bush; 42nd: William J. Clinton; 43rd: George W. Bush; 44th: Barack Obama
Source 2 on USA Presidents: 1st: George Washington; 2nd: John Adams; 3rd: Thomas Jefferson; 4th: James Madison; 41st: George H.W. Bush; 42nd: William J. Clinton; 43rd: George W. Bush; 44th: Barack Obama

62 Copy Detection
Are Source 1 and Source 2 dependent? Very likely
Source 1 on USA Presidents: 1st: George Washington; 2nd: Benjamin Franklin; 3rd: John F. Kennedy; 4th: Abraham Lincoln; 41st: George W. Bush; 42nd: Hillary Clinton; 43rd: Dick Cheney; 44th: Barack Obama
Source 2 on USA Presidents: 1st: George Washington; 2nd: Benjamin Franklin; 3rd: John F. Kennedy; 4th: Abraham Lincoln; 41st: George W. Bush; 42nd: Hillary Clinton; 43rd: Dick Cheney; 44th: John McCain

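63 Copy Detection: Bayesian Analysis
Data items shared by S1 and S2 fall into three groups: different values ($O_d$), same true value ($O_t$), same false value ($O_f$)
Goal: $\Pr(S_1 \perp S_2 \mid \Phi)$, $\Pr(S_1 \sim S_2 \mid \Phi)$ (sum = 1), where $\perp$ denotes independence and $\sim$ denotes copying
According to Bayes' rule, we need $\Pr(\Phi \mid S_1 \perp S_2)$, $\Pr(\Phi \mid S_1 \sim S_2)$
Key: compute $\Pr(\Phi_D \mid S_1 \perp S_2)$, $\Pr(\Phi_D \mid S_1 \sim S_2)$ for each $D \in S_1 \cap S_2$

64 Copy Detection: Bayesian Analysis
Pr | Independence | | Copying
$O_t$ | $A^2$ | $<$ | $A \cdot c + A^2 (1 - c)$
$O_f$ | $\frac{(1-A)^2}{n}$ | $\ll$ | $(1-A) \cdot c + \frac{(1-A)^2}{n} (1 - c)$
$O_d$ | $P_d = 1 - A^2 - \frac{(1-A)^2}{n}$ | $>$ | $P_d (1 - c)$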
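The table of per-item probabilities can be turned into a posterior probability of copying with Bayes' rule. The sketch below is a simplification under stated assumptions: both sources share one accuracy A, the copy rate c and the prior probability of copying are fixed, hypothetical values, and the direction of copying is ignored. The function name is illustrative.

```python
import math

def copy_posterior(k_t, k_f, k_d, A=0.8, n=10, c=0.8, prior_copy=0.5):
    """Posterior probability of copying, given counts of shared items with
    the same true value (k_t), same false value (k_f), different values (k_d)."""
    p_d = 1 - A**2 - (1 - A)**2 / n
    indep = {"t": A**2, "f": (1 - A)**2 / n, "d": p_d}
    copying = {
        "t": A * c + A**2 * (1 - c),
        "f": (1 - A) * c + (1 - A)**2 / n * (1 - c),
        "d": p_d * (1 - c),
    }
    counts = {"t": k_t, "f": k_f, "d": k_d}
    # Work in log space; items are assumed independent given the hypothesis.
    log_indep = math.log(1 - prior_copy) + sum(
        k * math.log(indep[o]) for o, k in counts.items())
    log_copy = math.log(prior_copy) + sum(
        k * math.log(copying[o]) for o, k in counts.items())
    return 1 / (1 + math.exp(log_indep - log_copy))

# Presidents example: eight shared true values versus
# one shared true value, six shared false values and one disagreement.
p_shared_truth = copy_posterior(k_t=8, k_f=0, k_d=0)
p_shared_errors = copy_posterior(k_t=1, k_f=6, k_d=1)
print(round(p_shared_truth, 3), round(p_shared_errors, 6))
```

Sharing true values raises the posterior only mildly, since accurate independent sources agree on the truth anyway, whereas sharing several false values pushes it close to 1.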

65 Iterative Process
Step 1: Copy Detection
Step 2: Truth Discovery
Step 3: Accuracy Computation
Typically converges when #objs >> #srcs

67 Repairs Using Value Modification
Problem: Given a database D and a set C of FD and InD constraints such that (D, C) is inconsistent, find a repair D′ of D with minimum cost(D′)
Result: The problem is NP-hard even for only FDs or only InDs
Key ideas:
  Focus on value modifications of FD RHS attributes
  Cost model for repairs is based on value accuracy and repair similarity
  Equivalence classes of cells with identical values in the repair permit a delayed assignment of a value to an equivalence class

68-71 Repairs Using Value Modification
CUSTOMER
TId | Tel | Name | Street | City | State | Zip | Wt
t1 | | Alice Smith | 17 Bridge | Midville | AZ | |
t2 | | Bob Jones | 5 Valley | Centre | NY | |
t3 | | Bob Jones | 5 Valley | Centre | NJ | |
t4 | | Ali Smith | 27 Bridge | Midville | AZ | |
EQUIP
TId | Tel | SerNo | EqMfct | EqModel | InstDate | Wt
t5 | | L55001 | LU | ze400 | Jan-03 | 2
t6 | | L55011 | LU | ze400 | Mar-03 | 1
Slides 68-69: InD: Equip[Tel] ⊆ Customer[Tel]
Slides 70-71: FD: Customer[Tel] → Customer[Name, Street, City, State, Zip] (slide 71 marks a violation with an X)

72 Repairs Using Value Modification
Repair alternatives when records t_i and t_j violate FD: X → Y
  Value modification of LHS attributes X
    Modify t_j[X] to a value different from t_i[X]
    Unclear what (different) value should be assigned to t_j[X]
  Value modification of RHS attributes Y
    Modify t_j[Y] to equal t_i[Y], or vice versa
    Use cost of repair to choose between alternatives
FD violations can always be repaired by modifying RHS attributes Y
Naïve approach can lead to non-termination

73-77 Repairs Using Value Modification
CUSTOMER and EQUIP tables as on slide 68, except that t4 now reads (Alice Smith, 17 Bridge, Midville, AZ); on slides 76-77, t3.State additionally changes from NJ to NY
Slide 73: FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
Slides 74-75: the following FDs are added (slide 75 marks a violation with an X)
  FD: Customer[Zip] → Customer[City, State]
  FD: Customer[Name, Street, Zip] → Customer[Tel]
Slide 76: same three FDs, with t3.State = NY
Slide 77: InD: Equip[Tel] ⊆ Customer[Tel] is considered together with the three FDs (a violation is marked with an X)

78 Repairs Using Value Modification
Repair alternatives when record t_i violates InD: R_i[X] ⊆ R_j[Y]
  Value modification of t_i[X]: modify t_i[X] to a value t_j[Y] for some t_j in R_j
  Value modification of t_j[Y]: modify t_j[Y] for some t_j in R_j to equal t_i[X]
  Use cost of repair to choose between alternatives

79 Repairs Using Value Modification
CUSTOMER and EQUIP tables as on slide 68, with t4 = (Alice Smith, 17 Bridge, Midville, AZ) and t3.State = NY
InD: Equip[Tel] ⊆ Customer[Tel]
FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
FD: Customer[Zip] → Customer[City, State]
FD: Customer[Name, Street, Zip] → Customer[Tel]

80 Repairs Using Value Modification
CUSTOMER and EQUIP tables as on slide 68 (original values)
Greedily build equivalence classes of cells
  {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)}
  {(t1, Name), (t4, Name)}

81 Repairs Using Value Modification
CUSTOMER and EQUIP tables as on slide 68 (original values)
Greedily build equivalence classes of cells, assign unique value
  {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)}
  {(t1, Name), (t4, Name)}: Alice Smith
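The equivalence-class idea can be sketched with a union-find structure over cells. This is an illustration only: it handles a single FD, Tel → Name, and picks the value with the largest total weight for each class, which is a simplification of the cost model in [BFF+05]. The Tel and weight values are hypothetical, since they are not legible in the table; the class and function names are illustrative.

```python
from collections import defaultdict

# (tid, Tel, Name, weight); Tel and weight values are assumed for illustration.
ROWS = [
    ("t1", "555-0101", "Alice Smith", 2),
    ("t2", "555-0102", "Bob Jones", 1),
    ("t3", "555-0102", "Bob Jones", 1),
    ("t4", "555-0101", "Ali Smith", 1),
]

class CellClasses:
    """Union-find over cells, i.e. (tuple id, attribute) pairs."""

    def __init__(self):
        self.parent = {}

    def find(self, cell):
        self.parent.setdefault(cell, cell)
        while self.parent[cell] != cell:
            self.parent[cell] = self.parent[self.parent[cell]]  # path halving
            cell = self.parent[cell]
        return cell

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def repair_names(rows):
    """Merge Name cells of rows that share a Tel, then assign one value per class."""
    classes = CellClasses()
    first_cell_for_tel = {}
    for tid, tel, _name, _weight in rows:
        cell = (tid, "Name")
        classes.find(cell)
        if tel in first_cell_for_tel:
            classes.union(cell, first_cell_for_tel[tel])
        else:
            first_cell_for_tel[tel] = cell
    # Delayed assignment: choose the value only once the classes are final.
    weight = defaultdict(lambda: defaultdict(int))
    for tid, _tel, name, w in rows:
        weight[classes.find((tid, "Name"))][name] += w
    return {
        tid: max(weight[classes.find((tid, "Name"))].items(),
                 key=lambda kv: kv[1])[0]
        for tid, *_rest in rows
    }

repaired = repair_names(ROWS)
print(repaired)
```

Grouping cells first and choosing the value afterwards avoids committing to a value that a later merge would have to undo.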

83 Repairing Data and Constraints
Motivation: variability of data semantics over time
Problem: Given a database D and FD constraints C such that (D, C) is inconsistent, find a repair (D′, C′) with minimum cost
Key ideas:
  Allow value modifications of FD RHS or LHS attributes
  Allow modifications of FDs in C by augmenting the LHS
  Cost model for repairs is based on minimum description length

84-85 Repairing Data and Constraints
Tid | District | Region | Municipal | AC | Tel | Street | Zip | City | State
t1 | Brookside | Granville | Glendale | | | Boxwood | | NY | NY
t2 | Brookside | Granville | Glendale | | | Boxwood | | NY | NY
t3 | Brookside | Granville | Glendale | | | Westlane | | NY | MA
t4 | Brookside | Granville | Guild | | | Squire | | Boston | MA
t5 | Brookside | Granville | Guild | | | Squire | | Boston | MA
t6 | Brookside | Granville | Queen | | | Main | | Chicago | IL
t7 | Brookside | Granville | Queen | | | Main | | Chicago | IL
t8 | Brookside | Granville | Queen | | | Main | | Chicago | IL
t9 | Brookside | Granville | Queen | | | Bay | | Chicago | IL
FD: [District, Region] → [AC, City, State]
Slide 85: expensive repair using only value modifications

86 Repairing Data and Constraints
Repair alternatives when records t_i and t_j violate FD: X → Y
  Value modification of RHS attributes Y
  Value modification of LHS attributes X
    Modify t_j[X] to a value different from t_i[X], supported by the data
  Repair constraints by augmenting LHS (X) with a new attribute
    New attribute provides additional context
Choose from alternatives using MDL-based cost model

87 MDL-Based Cost Model Quantifies trade-off of a data repair versus a constraint repair Cost-model based on the three properties Accuracy: value modifications must minimize distance Redundancy: value modifications must be well supported in data, constraint repairs must result in a higher degree of consistency Conciseness: repaired constraints should explain, but not overfit Minimum description length (MDL) principle Length of model + length to encode data given the model 87

88 Repairing Data and Constraints
(Same table as slide 84)
Cheap repair of constraints and data
  FD: [District, Region, Municipal] → [AC, City, State]
  t3.State = NY
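The difference between the expensive and the cheap repair can be checked by counting how many tuples must change for an FD to hold. The sketch below is a rough proxy for repair cost, not the MDL-based cost model: it counts, per LHS group, the tuples that disagree with the most frequent RHS value. It checks only City and State on the right-hand side (an assumption made here to keep the example small), and the function name is illustrative.

```python
from collections import Counter, defaultdict

COLS = ("District", "Region", "Municipal", "City", "State")
ROWS = [
    ("Brookside", "Granville", "Glendale", "NY", "NY"),      # t1
    ("Brookside", "Granville", "Glendale", "NY", "NY"),      # t2
    ("Brookside", "Granville", "Glendale", "NY", "MA"),      # t3
    ("Brookside", "Granville", "Guild", "Boston", "MA"),     # t4
    ("Brookside", "Granville", "Guild", "Boston", "MA"),     # t5
    ("Brookside", "Granville", "Queen", "Chicago", "IL"),    # t6
    ("Brookside", "Granville", "Queen", "Chicago", "IL"),    # t7
    ("Brookside", "Granville", "Queen", "Chicago", "IL"),    # t8
    ("Brookside", "Granville", "Queen", "Chicago", "IL"),    # t9
]

def tuples_to_change(rows, lhs, rhs):
    """Minimum number of tuples whose RHS must change for lhs -> rhs to hold."""
    index = {col: i for i, col in enumerate(COLS)}
    groups = defaultdict(Counter)   # LHS value -> multiset of RHS values
    for row in rows:
        key = tuple(row[index[c]] for c in lhs)
        groups[key][tuple(row[index[c]] for c in rhs)] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values())

original = tuples_to_change(ROWS, ["District", "Region"], ["City", "State"])
augmented = tuples_to_change(
    ROWS, ["District", "Region", "Municipal"], ["City", "State"])
print(original, augmented)
```

Under the original FD most tuples would need new City and State values, while adding Municipal to the LHS leaves only t3 to fix, in line with the repair t3.State = NY.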

89 Repairing Data and Constraints
MDL: length of model + length to encode data given the model
FD: [District, Region, Municipal] → [AC, City, State]
Cells shown as *** are encoded by the model:
Tid | District | Region | Municipal | AC | Tel | Street | Zip | City | State
t1 | Brookside | Granville | Glendale | | | Boxwood | | NY | NY
t2 | Brookside | Granville | Glendale | *** | | Boxwood | | *** | ***
t3 | Brookside | Granville | Glendale | *** | | Westlane | | *** | MA
t4 | Brookside | Granville | Guild | | | Squire | | Boston | MA
t5 | Brookside | Granville | Guild | *** | | Squire | | *** | ***
t6 | Brookside | Granville | Queen | | | Main | | Chicago | IL
t7 | Brookside | Granville | Queen | *** | | Main | | *** | ***
t8 | Brookside | Granville | Queen | *** | | Main | | *** | ***
t9 | Brookside | Granville | Queen | *** | | Bay | | *** | ***

90 Conclusions Big data quality (veracity) is an important area of research Challenges due to volume, velocity, variety, variability Much interesting work has been done in this area Learn models (semantics) from the data Identify data glitches as violations of the learned models Repair data glitches and models in a timely manner A lot more research needs to be done! 90

91 Crowdsourcing Improving data quality by crowdsourcing 91

92 Source Exploration Tool Data.gov 92

93 Bibliography
[BFF+05] Philip Bohannon, Michael Flaster, Wenfei Fan, Rajeev Rastogi: A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. SIGMOD 2005.
[CM11] Fei Chiang, Renée J. Miller: A Unified Model for Data and Constraint Repair. ICDE 2011.
[DBS09a] Xin Luna Dong, Laure Berti-Equille, Divesh Srivastava: Integrating Conflicting Data: The Role of Source Dependence. PVLDB 2(1) (2009).
[GKK+08] Lukasz Golab, Howard J. Karloff, Flip Korn, Divesh Srivastava, Bei Yu: On Generating Near-Optimal Tableaux for Conditional Functional Dependencies. PVLDB 1(1) (2008).
[LDL+12] Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, Divesh Srivastava: Truth Finding on the Deep Web: Is the Problem Solved? PVLDB 6(2) (2012).
