Data Quality: the Other Face of Big Data. Divesh Srivastava AT&T Labs-Research



2 Data Quality I am a manager I am also a researcher working on data quality 2

3 Big Data Big data is different things to different people Volume, velocity, variety, variability, value, veracity 3

4 Big Data + Data Quality
- Big data: all about the V's
  - Size: huge volume of data from multiple sources
  - Speed: dynamic data, collected and analyzed at high velocity
  - Complexity: large variety of data and sources
  - Evolution: considerable variability of data and semantics over time
- Goal: to extract significant value from big data
- Key stumbling block: data quality
  - Raw data is often of questionable veracity
  - How do we obtain high-quality information?


6 Data Can Be Erroneous
"The story, marked 'Hold for release - Do not use', was sent in error to the news service's thousands of corporate clients."

7 Data Quality: By the Numbers
- Impact of poor data quality
  - Erroneous data costs US businesses $600 billion/year [E02]
  - In DW projects, data cleaning takes 30-80% of time and budget
  - Data quality tools market is growing at 16% annually, well above the 7% average for other IT segments [G07]
- How much data is erroneous?
  - Enterprise data error rates: average of 1-5%, some > 30% [R98]

8 Case Study: Big Data Quality [LDL+12]
- Study on two domains
  - Belief of clean data
  - Poor quality data can have big impact
- Table columns: #Sources, Period, #Objects, #Local attrs, #Global attrs, Considered items
  - Stock: 55 7/ * *20
  - Flight: 38 12/ * *31

9 Case Study: Big Data Quality Is the data consistent? Tolerance to 1% value difference 9

10 Case Study: Big Data Quality
- Why such inconsistency? Semantic ambiguity
- (figure: Nasdaq vs. Yahoo! Finance, with fields "Day's Range" and "wk Range")

11 Case Study: Big Data Quality
- Why such inconsistency? Unit errors
- Example: 76.82B vs. 76,821,000

12 Case Study: Big Data Quality Why such inconsistency? Instance ambiguity 12

13 Case Study: Big Data Quality
- Why such inconsistency? Pure errors

FlightView | FlightAware | Orbitz
6:15 PM | 6:22 PM | 6:15 PM
9:40 PM | 8:33 PM | 9:54 PM

14 Case Study: Big Data Quality Why such inconsistency? Random sample of 20 data items + 5 items with largest # of values 14

15 Case Study: Big Data Quality Copying between sources? 15

16 Case Study: Big Data Quality Copying on erroneous data? 16

17 Case Study: Lessons Learned
- Big data has considerable inconsistency
  - Even in domains where poor quality data can have big impact
  - Semantic ambiguity, out-of-date data, unexplainable errors
- Data sources often copy from each other
  - Copying can happen on erroneous data, spreading poor quality data

18 Small Data Quality: How Was It Achieved? Specify all domain knowledge as integrity constraints on data Reject updates that do not preserve integrity constraints Works well when the domain is well understood and static 18

19 Big Data Quality: A Different Approach?
- Big data: integrity constraints cannot be specified a priori
  - Data variety, volume → complete domain knowledge is infeasible
  - Data velocity, variability → domain knowledge becomes obsolete
  - Too much rejected data → small data

20 Big Data Quality: A Different Approach?
- Big data: integrity constraints cannot be specified a priori
  - Data variety, volume → complete domain knowledge is infeasible
  - Data velocity, variability → domain knowledge becomes obsolete
- Solution: let the data speak for itself
  - Learn models (semantics) from the data
  - Identify data glitches as violations of the learned models
  - Repair data glitches and models in a timely manner

21 In This Talk
- A focus on well-structured data and logic-based data quality
  - Models: logical constraints, e.g., (C)FDs, IDs, MDs, EGDs, DCs
  - Glitches: groups of cells, i.e., (tuple-id, attribute) pairs
  - Repairs: cost-based modifications to the data and models
- What we do not discuss in this talk
  - Logic-based: consistent query answering, without data repairs
  - Statistics-based: statistical models, statistical anomaly detection
  - Unstructured data: quality of audio, video, extracted data

22 Outline Introduction Identifying inconsistencies Repairing inconsistencies 22

23 Identifying Inconsistencies Small data: specify semantics as integrity constraints on data Big data: let the data speak for itself Learn models (e.g., constraints, rules, patterns) from the data Identify data glitches as violations of the learned models 23

24 Example: Functional Dependencies

Tid | Name | Type | Country | Customer | Price | Tax
t1 | Harry Potter | Book | France | Andrew | 10 | 0
t2 | Harry Potter | Book | France | Barbara | 10 | 0
t3 | Harry Potter | Book | France | Candy | |
t4 | Lord of the Rings | Book | USA | David | 25 | 0
t5 | Lord of the Rings | Book | USA | Eran | |
t6 | Armani Suit | Clothing | USA | Frank | |
t7 | Armani Suit | Clothing | USA | George | |
t8 | Kaminski Hat | Clothing | Australia | Harry | |
t9 | Kaminski Hat | Clothing | Australia | Ian | |

- FD: [Name, Type, Country] → [Price, Tax]
- FDs used to check consistency

25 Example: Functional Dependencies
- Same table and FD as slide 24, with a violation of the FD marked (X)
- FD: [Name, Type, Country] → [Price, Tax]
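To make "FDs used to check consistency" concrete, here is a minimal sketch of a violation check: group tuples by the FD's left-hand side and flag any group whose right-hand sides disagree. The function name `fd_violations` is illustrative, and the Tax value for t3 is a made-up assumption, since the transcript does not preserve it.

```python
from collections import defaultdict

def fd_violations(rows, lhs, rhs):
    """Return {lhs_values: [tids]} for groups that agree on lhs but not on rhs."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)
    violations = {}
    for key, members in groups.items():
        # More than one distinct RHS value within a group violates the FD.
        if len({tuple(r[a] for a in rhs) for r in members}) > 1:
            violations[key] = [r["Tid"] for r in members]
    return violations

COLUMNS = ["Tid", "Name", "Type", "Country", "Price", "Tax"]
DATA = [
    ("t1", "Harry Potter", "Book", "France", 10, 0),
    ("t2", "Harry Potter", "Book", "France", 10, 0),
    ("t3", "Harry Potter", "Book", "France", 10, 3),  # hypothetical Tax value
    ("t4", "Lord of the Rings", "Book", "USA", 25, 0),
]
rows = [dict(zip(COLUMNS, r)) for r in DATA]
violations = fd_violations(rows, ["Name", "Type", "Country"], ["Price", "Tax"])
print(violations)
```

The check reports the whole group rather than a single tuple, because an FD violation only says that the tuples disagree, not which one is wrong.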

26 Identifying Inconsistencies: Impact of Big Data Variety, variability of data: one size does not fit all Learn conditional models (contextual semantics) 26

27 Example: Conditional FD
- Same table as slide 24
- CFD: [Name = *, Type = Clothing, Country = *] → [Price, Tax]
- CFDs used to check consistency of a subset of the table

28 Identifying Inconsistencies: Impact of Big Data
- Variety, variability of data: exact vs approximate models
  - Exact approaches can lead to over-fitting, large number of patterns
  - Approximate approaches can have few violations: these are glitches
  - Statistically robust measures: use support and confidence

29 Example: Conditional FD
- Same table as slide 24
- CFD: [Name = *, Type = *, Country = France] → [Price, Tax]
- Holds approximately: support = 3/9, confidence = 2/3
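The support and confidence figures above can be reproduced with a short sketch. It assumes support is the fraction of tuples matching the pattern, and confidence is the largest fraction of matching tuples that can be kept without violating the FD. The Price and Tax values for t3 and t5 through t9 are hypothetical, because the transcript lost them; `support_and_confidence` is an illustrative name.

```python
from collections import Counter, defaultdict
from fractions import Fraction

LHS, RHS = ["Name", "Type", "Country"], ["Price", "Tax"]

def support_and_confidence(rows, pattern):
    """pattern lists only the constrained attributes; omitted ones are wildcards."""
    covered = [r for r in rows if all(r[a] == v for a, v in pattern.items())]
    if not covered:
        return Fraction(0), Fraction(0)
    groups = defaultdict(Counter)
    for r in covered:
        groups[tuple(r[a] for a in LHS)][tuple(r[a] for a in RHS)] += 1
    # Keep the most frequent RHS in each LHS group; the rest would be glitches.
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return Fraction(len(covered), len(rows)), Fraction(kept, len(covered))

COLUMNS = ["Tid", "Name", "Type", "Country", "Price", "Tax"]
DATA = [
    ("t1", "Harry Potter", "Book", "France", 10, 0),
    ("t2", "Harry Potter", "Book", "France", 10, 0),
    ("t3", "Harry Potter", "Book", "France", 10, 3),        # hypothetical
    ("t4", "Lord of the Rings", "Book", "USA", 25, 0),
    ("t5", "Lord of the Rings", "Book", "USA", 25, 2),      # hypothetical
    ("t6", "Armani Suit", "Clothing", "USA", 500, 40),      # hypothetical
    ("t7", "Armani Suit", "Clothing", "USA", 500, 40),      # hypothetical
    ("t8", "Kaminski Hat", "Clothing", "Australia", 80, 8), # hypothetical
    ("t9", "Kaminski Hat", "Clothing", "Australia", 80, 8), # hypothetical
]
rows = [dict(zip(COLUMNS, r)) for r in DATA]
france = support_and_confidence(rows, {"Country": "France"})
clothing = support_and_confidence(rows, {"Type": "Clothing"})
print(france, clothing)
```

`Fraction` is used so the results compare exactly against ratios such as 3/9 and 2/3.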

30 Learning CFDs
- Given an FD, learn a good pattern tableau from data [GKK+08]
  - FD: [Name, Type, Country] → [Price, Tax]
  - Learned pattern tableau:

Name | Type | Country | Price | Tax | Support | Confidence
* | Clothing | * | * | * | 4/9 | 4/4
* | * | France | * | 0 | 3/9 | 2/3

  - Global support = 7/9, global confidence = 6/7, local confidence = 2/3
- Learn FD and a good pattern tableau from data [FGL+09]

31 Learning Pattern Tableaux
- Generate smallest tableau with support and global confidence
  - NP-complete
  - Provably hard to approximate
- Generate smallest tableau with support and local confidence
  - NP-complete
  - But approximable: see the log(n)-approximation on slide 33

32 Identifying Inconsistencies: Impact of Big Data
- Variety, variability of data
  - One size does not fit all
  - Exact vs approximate models
- Volume of data
  - Scalable algorithms: trade-off between efficiency and accuracy

33 Efficiency vs Accuracy
- Trade off running time against accuracy of solution
- Input: FD: X → Y, global support and local confidence
  - Consider all instantiations of FD antecedent (X)
  - Prune based on local confidence
  - Apply partial greedy set cover until support is reached
- Output: log(n)-approximation in tableau size
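The "partial greedy set cover" step can be sketched as follows. Each candidate pattern is treated as the set of tuple ids it covers, and patterns are picked greedily until the requested fraction of tuples is covered. The candidate sets below are hypothetical and are assumed to have already passed the local-confidence pruning; `greedy_partial_cover` is an illustrative name, not the authors' code.

```python
def greedy_partial_cover(candidates, n_tuples, min_support):
    """candidates: {pattern: set of covered tuple ids}. Returns (tableau, covered)."""
    covered, tableau = set(), []
    target = min_support * n_tuples
    while len(covered) < target:
        # Pick the pattern that covers the most not-yet-covered tuples.
        best = max(candidates, key=lambda p: len(candidates[p] - covered), default=None)
        if best is None or not (candidates[best] - covered):
            break  # no pattern adds coverage; the target is unreachable
        tableau.append(best)
        covered |= candidates[best]
    return tableau, covered

# Hypothetical candidates over tuples t1..t9 of the running example.
candidates = {
    ("*", "Clothing", "*"): {"t6", "t7", "t8", "t9"},
    ("*", "*", "France"): {"t1", "t2", "t3"},
    ("*", "*", "Australia"): {"t8", "t9"},
}
tableau, covered = greedy_partial_cover(candidates, n_tuples=9, min_support=0.75)
print(tableau, len(covered))
```

Stopping as soon as the support target is met is what makes the cover "partial": the tableau does not need to explain every tuple.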

34 Example: Pattern as a Set
- Same table as slide 24; each pattern is the set of tuples it matches
- Pattern [Name = *, Type = Clothing, Country = *]
- Pattern [Name = *, Type = *, Country = USA]

35-36 Efficiency vs Accuracy
- Algorithm as on slide 33: consider all instantiations of X, prune on local confidence, apply partial greedy set cover
- All instantiations of FD antecedent: X = [Name, Type, Country]
  - [HP, Book, France], [HP, Book, *], [HP, *, France], [HP, *, *], ...
- |X| = d, # of data records = N
  - # of patterns can be up to N·2^d
- Too many patterns (sets) to consider in each iteration!

37 Efficiency vs Accuracy
- Problem: N·2^d patterns to consider in partial greedy coverage
- Solution: do not instantiate the entire search space of X a priori
- Incremental generation of search space: on-demand algorithm!
- Pattern lattice (figure), from most general to most specific:
  - [*, *, *]
  - [HP, *, *], [*, *, France], [*, Book, *], [*, *, USA], [LotR, *, *]
  - [HP, Book, *], [*, Book, France], [LotR, Book, *]
  - [HP, Book, France], ..., [LotR, Book, USA]

38-43 Efficiency vs Accuracy
- Pattern lattice as on slide 37, from [*, *, *] down to [HP, Book, France], ..., [LotR, Book, USA]
- On-demand algorithm: incremental generation of search space
  - Start from the root pattern of the lattice
  - If the local confidence of a pattern is below the threshold, explore its unpruned children
  - If the local confidence of a pattern meets the threshold, prune the sub-lattice incident on it
- Same search space exploration as partial greedy set cover!
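The lattice exploration can be illustrated with a simplified sketch. It only shows the lazy generation of patterns: a pattern that meets the confidence threshold is kept and nothing below it is generated, while a pattern below the threshold is specialized one attribute at a time. This is an assumption-laden illustration, not the published algorithm, which interleaves this exploration with the greedy selection; the rows, the two-attribute antecedent, and all function names are hypothetical.

```python
from collections import Counter, defaultdict, deque

def local_confidence(rows, pattern, lhs, rhs):
    covered = [r for r in rows if all(v == "*" or r[a] == v for a, v in zip(lhs, pattern))]
    if not covered:
        return 0.0, covered
    groups = defaultdict(Counter)
    for r in covered:
        groups[tuple(r[a] for a in lhs)][tuple(r[a] for a in rhs)] += 1
    kept = sum(c.most_common(1)[0][1] for c in groups.values())
    return kept / len(covered), covered

def subsumes(general, specific):
    return all(g == "*" or g == s for g, s in zip(general, specific))

def explore(rows, lhs, rhs, min_conf):
    root = tuple("*" for _ in lhs)
    queue, seen, accepted = deque([root]), {root}, []
    while queue:
        pattern = queue.popleft()
        if any(subsumes(p, pattern) for p in accepted):
            continue  # lies in a pruned sub-lattice
        conf, covered = local_confidence(rows, pattern, lhs, rhs)
        if not covered:
            continue
        if conf >= min_conf:
            accepted.append(pattern)  # its children are never generated
            continue
        for i, attr in enumerate(lhs):  # specialize one wildcard at a time
            if pattern[i] != "*":
                continue
            for value in sorted({r[attr] for r in covered}):
                child = pattern[:i] + (value,) + pattern[i + 1:]
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return accepted

rows = [  # hypothetical rows
    {"Type": "Book", "Country": "France", "Price": 10},
    {"Type": "Book", "Country": "France", "Price": 12},
    {"Type": "Clothing", "Country": "USA", "Price": 50},
    {"Type": "Clothing", "Country": "USA", "Price": 50},
    {"Type": "Book", "Country": "USA", "Price": 25},
]
accepted = explore(rows, ["Type", "Country"], ["Price"], min_conf=1.0)
print(accepted)
```

Breadth-first order matters here: every pattern at one level is evaluated before any pattern at the next level is dequeued, so the subsumption check sees all more general accepted patterns.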

44 Identifying Inconsistencies: Impact of Big Data
- Variety, variability of data
  - One size does not fit all
  - Exact vs approximate models
- Volume of data
  - Scalable algorithms: trade-off between efficiency and accuracy
- Velocity of data
  - Incremental, streaming algorithms

45 Other Data Quality Models
- Inclusion dependencies: every manager is an employee
- Sequential dependencies: consecutive polls must be 3-5 min apart
- Matching dependencies: if names are similar, addresses must be the same
- Conservation dependencies: router in-traffic = router out-traffic
- Denial constraints: a single tax exemption cannot exceed salary

46 Outline Introduction Identifying inconsistencies Repairing inconsistencies 46

47 Repair Techniques
- Glitch repairs, using source analysis [DBS09a]
  - Introduced the idea of copy detection for structured data
- Glitch repairs by value modification, for FDs + InDs [BFF+05]
  - Introduced the idea of cell equivalence classes
- Glitch + model repairs, for FDs [CM11]
  - Introduced the idea of model repairs

48 Repairs Using Source Analysis [DBS09a]
- Problem: given a database D obtained from a set of sources with overlapping data items, and a single FD C such that each (Si, C) is consistent but (D, C) is inconsistent, find the best repair D' of D
- Result: using source quality and copy detection is essential
- Key ideas:
  - Focus on value modifications of FD RHS attributes
  - Learning and using source quality is better than naïve voting
  - Copy detection between sources can prevent cabals

49-56 Repairs Using Source Analysis
- Repairs: voting + source quality + copy detection
- Resolves inconsistency across diversity of sources

USP | S1 | S2 | S3 | S4 | S5
Jagadish | UM | ATT | UM | UM | UI
Dewitt | MSR | MSR | UW | UW | UW
Bernstein | MSR | MSR | MSR | MSR | MSR
Carey | UCI | ATT | BEA | BEA | BEA
Franklin | UCB | UCB | UMD | UMD | UMD

- Voting (shown on S1-S3): supports difference of opinion
- Source quality (shown on S1-S3): gives more weight to knowledgeable sources
- Copy detection (shown on S1-S5): reduces weight of copier sources

57 Basic Solution: Naïve Voting
- Supports difference of opinion, allows conflict resolution
- Works well for independent sources that have similar accuracy
- When sources have different accuracies
  - Need to give more weight to votes by knowledgeable sources
- When sources copy from other sources
  - Need to reduce the weight of votes by copiers
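A minimal sketch of naïve voting on the affiliation table from the source-analysis slides shows why it breaks down. The function name `vote` is illustrative; the data is the table as transcribed.

```python
from collections import Counter

CLAIMS = {
    "Jagadish":  {"S1": "UM",  "S2": "ATT", "S3": "UM",  "S4": "UM",  "S5": "UI"},
    "Dewitt":    {"S1": "MSR", "S2": "MSR", "S3": "UW",  "S4": "UW",  "S5": "UW"},
    "Bernstein": {"S1": "MSR", "S2": "MSR", "S3": "MSR", "S4": "MSR", "S5": "MSR"},
    "Carey":     {"S1": "UCI", "S2": "ATT", "S3": "BEA", "S4": "BEA", "S5": "BEA"},
    "Franklin":  {"S1": "UCB", "S2": "UCB", "S3": "UMD", "S4": "UMD", "S5": "UMD"},
}

def vote(claims, sources):
    """Pick, per data item, the value provided by the most sources in `sources`."""
    result = {}
    for item, by_source in claims.items():
        tally = Counter(v for s, v in by_source.items() if s in sources)
        result[item] = tally.most_common(1)[0][0]
    return result

three = vote(CLAIMS, {"S1", "S2", "S3"})
five = vote(CLAIMS, {"S1", "S2", "S3", "S4", "S5"})
print(three["Dewitt"], five["Dewitt"])
```

With S1 to S3 only, Dewitt resolves to MSR; adding S4 and S5, which largely agree with S3, flips the answer to UW. If those two sources are copiers, as the slides suggest, the extra votes carry no independent evidence.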

58 Source Accuracy [YHY08, DBS09a]
- Need to give more weight to knowledgeable sources
- Computing source accuracy: $A(S) = \mathrm{Avg}_{v_i(D) \in S} \Pr(v_i(D)\ \text{true} \mid \Phi)$
  - $v_i(D) \in S$: S provides value $v_i$ on data item D
  - $\Phi$: observations on all data items by the set of sources
  - $\Pr(v_i(D)\ \text{true} \mid \Phi)$: probability of $v_i(D)$ being true
- How to compute $\Pr(v_i(D)\ \text{true} \mid \Phi)$?

59 Source Accuracy
- Input: data item D, $val(D) = \{v_0, v_1, \ldots, v_n\}$, $\Phi$
- Output: $\Pr(v_i(D)\ \text{true} \mid \Phi)$, for $i = 0, \ldots, n$ (sum = 1)
- Based on Bayes' rule, need $\Pr(\Phi \mid v_i(D)\ \text{true})$
- Under independence, need $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true})$
  - If S provides $v_i$: $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = A(S)$
  - If S does not: $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = (1 - A(S))/n$
- Challenge: inter-dependence between source accuracy and value probability?

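60 (figure: cycle through source accuracy, source vote count, value vote count, value probability; continue until source accuracy converges)
- Source accuracy: $A(S) = \mathrm{Avg}_{v(D) \in S} \Pr(v(D) \mid \Phi)$
- Source vote count: $A'(S) = \ln \frac{n A(S)}{1 - A(S)}$
- Value vote count: $C(v(D)) = \sum_{S \in \mathcal{S}(v(D))} A'(S)$, where $\mathcal{S}(v(D))$ is the set of sources providing v on D
- Value probability: $\Pr(v(D) \mid \Phi) = \frac{e^{C(v(D))}}{\sum_{v_0 \in val(D)} e^{C(v_0(D))}}$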
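The cycle of vote counts, value probabilities and source accuracies can be sketched as a fixed-point iteration. This is an illustration under stated assumptions rather than the authors' implementation: `n_false`, the initial accuracy of 0.8, the fixed number of rounds, and the clamping of accuracies to [0.01, 0.99] are all choices made here, and the data is the S1 to S3 part of the affiliation table.

```python
import math
from collections import defaultdict

CLAIMS = {
    "Jagadish":  {"S1": "UM",  "S2": "ATT", "S3": "UM"},
    "Dewitt":    {"S1": "MSR", "S2": "MSR", "S3": "UW"},
    "Bernstein": {"S1": "MSR", "S2": "MSR", "S3": "MSR"},
    "Carey":     {"S1": "UCI", "S2": "ATT", "S3": "BEA"},
    "Franklin":  {"S1": "UCB", "S2": "UCB", "S3": "UMD"},
}

def truth_discovery(claims, n_false=10, init_acc=0.8, rounds=20):
    """Assumes each item has n_false + 1 possible values, at most that many observed."""
    sources = sorted({s for by_source in claims.values() for s in by_source})
    acc = {s: init_acc for s in sources}
    probs = {}
    for _ in range(rounds):
        for item, by_source in claims.items():
            count = defaultdict(float)  # value vote count C(v(D))
            for s, v in by_source.items():
                count[v] += math.log(n_false * acc[s] / (1 - acc[s]))  # A'(S)
            unseen = n_false + 1 - len(count)  # unobserved values have C = 0
            z = sum(math.exp(c) for c in count.values()) + unseen
            probs[item] = {v: math.exp(c) / z for v, c in count.items()}
        for s in sources:
            ps = [probs[i][by[s]] for i, by in claims.items() if s in by]
            # Clamp so that the logarithm above stays finite (an assumption of this sketch).
            acc[s] = min(max(sum(ps) / len(ps), 0.01), 0.99)
    return acc, probs

acc, probs = truth_discovery(CLAIMS)
print(acc)
```

On this data S1 ends up with the highest accuracy, so its lone vote for Carey's affiliation outweighs the single votes from S2 and S3, which plain voting could not separate.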

61 Copy Detection
- Are Source 1 and Source 2 dependent? Not necessarily
- Source 1 on USA Presidents: 1st: George Washington; 2nd: John Adams; 3rd: Thomas Jefferson; 4th: James Madison; ...; 41st: George H.W. Bush; 42nd: William J. Clinton; 43rd: George W. Bush; 44th: Barack Obama
- Source 2 on USA Presidents: 1st: George Washington; 2nd: John Adams; 3rd: Thomas Jefferson; 4th: James Madison; ...; 41st: George H.W. Bush; 42nd: William J. Clinton; 43rd: George W. Bush; 44th: Barack Obama

62 Copy Detection
- Are Source 1 and Source 2 dependent? Very likely
- Source 1 on USA Presidents: 1st: George Washington; 2nd: Benjamin Franklin; 3rd: John F. Kennedy; 4th: Abraham Lincoln; ...; 41st: George W. Bush; 42nd: Hillary Clinton; 43rd: Dick Cheney; 44th: Barack Obama
- Source 2 on USA Presidents: 1st: George Washington; 2nd: Benjamin Franklin; 3rd: John F. Kennedy; 4th: Abraham Lincoln; ...; 41st: George W. Bush; 42nd: Hillary Clinton; 43rd: Dick Cheney; 44th: John McCain

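63 Copy Detection: Bayesian Analysis
- Data items shared by S1 and S2: different values ($O_d$); same values that are true ($O_t$); same values that are false ($O_f$)
- Goal: $\Pr(S_1 \perp S_2 \mid \Phi)$, $\Pr(S_1 \sim S_2 \mid \Phi)$ (sum = 1), where $\perp$ denotes independence and $\sim$ denotes copying
- According to Bayes' rule, we need $\Pr(\Phi \mid S_1 \perp S_2)$, $\Pr(\Phi \mid S_1 \sim S_2)$
- Key: compute $\Pr(\Phi_D \mid S_1 \perp S_2)$, $\Pr(\Phi_D \mid S_1 \sim S_2)$, for each $D \in S_1 \cap S_2$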

64 Copy Detection: Bayesian Analysis

Pr | Independence | | Copying
$O_t$ | $A^2$ | $<$ | $A \cdot c + A^2 (1 - c)$
$O_f$ | $\frac{(1-A)^2}{n}$ | $\ll$ | $(1-A) \cdot c + \frac{(1-A)^2}{n} (1 - c)$
$O_d$ | $P_d = 1 - A^2 - \frac{(1-A)^2}{n}$ | $>$ | $P_d (1 - c)$
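Plugging the three probabilities from the table into Bayes' rule gives a small sketch of the copy-detection posterior. It assumes both sources share one accuracy A, as in the table, and that data items are conditionally independent; the default values for A, n, c and the prior are assumptions chosen for illustration, and `copy_probability` is a hypothetical name.

```python
import math

def copy_probability(k_t, k_f, k_d, A=0.8, n=10, c=0.8, prior=0.5):
    """Posterior probability of copying, given counts of shared items:
    k_t same true value, k_f same false value, k_d different values."""
    p_t = A * A
    p_f = (1 - A) ** 2 / n
    p_d = 1 - p_t - p_f
    log_indep = k_t * math.log(p_t) + k_f * math.log(p_f) + k_d * math.log(p_d)
    log_copy = (
        k_t * math.log(A * c + p_t * (1 - c))
        + k_f * math.log((1 - A) * c + p_f * (1 - c))
        + k_d * math.log(p_d * (1 - c))
    )
    log_odds = math.log(prior) - math.log(1 - prior) + log_copy - log_indep
    return 1 / (1 + math.exp(-log_odds))

all_true = copy_probability(k_t=8, k_f=0, k_d=0)      # slide 61: eight shared true values
shared_false = copy_probability(k_t=1, k_f=6, k_d=1)  # slide 62: six shared false values
print(round(all_true, 3), round(shared_false, 6))
```

Sharing true values is only weak evidence, because independent accurate sources agree on the truth anyway. Sharing the same false values is strong evidence, since under independence that event has probability $(1-A)^2/n$ per item.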

65 Iterative Process
- Step 1: copy detection
- Step 2: truth discovery
- Step 3: accuracy computation
- The three steps are repeated in a cycle; typically converges when #objs >> #srcs

66 Repair Techniques
- Glitch repairs, using source analysis [DBS09a]
  - Introduced the idea of copy detection for structured data
- Glitch repairs by value modification, for FDs + InDs [BFF+05]
  - Introduced the idea of cell equivalence classes
- Glitch + model repairs, for FDs [CM11]
  - Introduced the idea of model repairs

67 Repairs Using Value Modification
- Problem: given a database D and FD and InD constraints C such that (D, C) is inconsistent, find a repair D' of D with minimum cost(D')
- Result: the problem is NP-hard even for only FDs or only InDs
- Key ideas:
  - Focus on value modifications of FD RHS attributes
  - Cost model for repairs is based on value accuracy and repair similarity
  - Equivalence classes of cells with identical values in the repair permit a delayed assignment of a value to an equivalence class

68-71 Repairs Using Value Modification

CUSTOMER
TId | Tel | Name | Street | City | State | Zip | Wt
t1 | | Alice Smith | 17 Bridge | Midville | AZ | |
t2 | | Bob Jones | 5 Valley | Centre | NY | |
t3 | | Bob Jones | 5 Valley | Centre | NJ | |
t4 | | Ali Smith | 27 Bridge | Midville | AZ | |

EQUIP
TId | Tel | SerNo | EqMfct | EqModel | InstDate | Wt
t5 | | L55001 | LU | ze400 | Jan-03 | 2
t6 | | L55011 | LU | ze400 | Mar-03 | 1

- Slides 68-69: InD: Equip[Tel] ⊆ Customer[Tel]
- Slides 70-71: FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]; slide 71 marks a violation (X)

72 Repairs Using Value Modification
- Repair alternatives when records t_i and t_j violate FD: X → Y
  - Value modification of LHS attributes X
    - Modify t_j[X] to a value different from t_i[X]
    - Unclear what (different) value should be assigned to t_j[X]
  - Value modification of RHS attributes Y
    - Modify t_j[Y] to equal t_i[Y] or vice versa
    - Use cost of repair to choose between alternatives
- FD violations can always be repaired by modifying RHS attributes Y
  - Naïve approach can lead to non-termination

73-77 Repairs Using Value Modification
- Tables as on slide 68, with the following changes and constraints
- Slide 73: t4 modified to Name = Alice Smith, Street = 17 Bridge
  - FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
- Slides 74-75: two more FDs are considered; slide 75 marks a violation (X)
  - FD: Customer[Zip] → Customer[City, State]
  - FD: Customer[Name, Street, Zip] → Customer[Tel]
- Slide 76 (title ends with "?"): t3 modified to State = NY
- Slide 77: the InD is considered again; a violation is marked (X)
  - InD: Equip[Tel] ⊆ Customer[Tel]

78 Repairs Using Value Modification
- Repair alternatives when record t_i violates InD: R_i[X] ⊆ R_j[Y]
  - Value modification of t_i[X]
    - Modify t_i[X] to a value t_j[Y] for some t_j in R_j
  - Value modification of t_j[Y]
    - Modify t_j[Y] for some t_j in R_j to equal t_i[X]
  - Use cost of repair to choose between alternatives

79 Repairs Using Value Modification
- Tables as on slide 77 (t4 = Alice Smith, 17 Bridge; t3 State = NY), with no violation marked
- InD: Equip[Tel] ⊆ Customer[Tel]
- FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
- FD: Customer[Zip] → Customer[City, State]
- FD: Customer[Name, Street, Zip] → Customer[Tel]

80-81 Repairs Using Value Modification
- Original tables as on slide 68
- Greedily build equivalence classes of cells, then assign a unique value to each class
  - {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)}
  - {(t1, Name), (t4, Name)}: Alice Smith
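Cell equivalence classes map naturally onto a union-find structure: cells that must end up equal are merged first, and a value is chosen per class only at the end. The sketch below uses a weighted majority to pick the value, which is a simplification of the cost model described on the slides; the tuple weights and the class `CellClasses` are hypothetical.

```python
from collections import Counter, defaultdict

class CellClasses:
    """Union-find over cells, where a cell is a (tuple id, attribute) pair."""
    def __init__(self):
        self.parent = {}

    def find(self, cell):
        self.parent.setdefault(cell, cell)
        while self.parent[cell] != cell:
            self.parent[cell] = self.parent[self.parent[cell]]  # path halving
            cell = self.parent[cell]
        return cell

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

    def classes(self):
        out = defaultdict(set)
        for cell in list(self.parent):
            out[self.find(cell)].add(cell)
        return list(out.values())

def assign_value(cells, values, weights):
    """Delayed assignment: weighted majority among the current cell values."""
    tally = Counter()
    for tid, attr in cells:
        tally[values[(tid, attr)]] += weights[tid]
    return tally.most_common(1)[0][0]

eq = CellClasses()
for a, b in [(("t2", "Tel"), ("t3", "Tel")), (("t5", "Tel"), ("t6", "Tel")),
             (("t3", "Tel"), ("t5", "Tel")), (("t1", "Name"), ("t4", "Name"))]:
    eq.union(a, b)

values = {("t1", "Name"): "Alice Smith", ("t4", "Name"): "Ali Smith"}
weights = {"t1": 2, "t4": 1}  # hypothetical tuple weights
name_class = next(c for c in eq.classes() if ("t1", "Name") in c)
chosen = assign_value(name_class, values, weights)
print(chosen)
```

Deferring the value choice until the classes are final avoids the oscillation that can occur when each violation is patched with a concrete value as soon as it is found.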

82 Repair Techniques
- Glitch repairs, using source analysis [DBS09a]
  - Introduced the idea of copy detection for structured data
- Glitch repairs by value modification, for FDs + InDs [BFF+05]
  - Introduced the idea of cell equivalence classes
- Glitch + model repairs, for FDs [CM11]
  - Introduced the idea of model repairs

83 Repairing Data and Constraints
- Motivation: variability of data semantics over time
- Problem: given a database D and FD constraints C such that (D, C) is inconsistent, find a repair (D', C') with minimum cost
- Key ideas:
  - Allow value modifications of FD RHS or LHS attributes
  - Allow modifications of FDs in C by augmenting the LHS
  - Cost model for repairs is based on minimum description length

84-85 Repairing Data and Constraints

Tid | District | Region | Municipal | AC | Tel | Street | Zip | City | State
t1 | Brookside | Granville | Glendale | | | Boxwood | | NY | NY
t2 | Brookside | Granville | Glendale | | | Boxwood | | NY | NY
t3 | Brookside | Granville | Glendale | | | Westlane | | NY | MA
t4 | Brookside | Granville | Guild | | | Squire | | Boston | MA
t5 | Brookside | Granville | Guild | | | Squire | | Boston | MA
t6 | Brookside | Granville | Queen | | | Main | | Chicago | IL
t7 | Brookside | Granville | Queen | | | Main | | Chicago | IL
t8 | Brookside | Granville | Queen | | | Main | | Chicago | IL
t9 | Brookside | Granville | Queen | | | Bay | | Chicago | IL

- FD: [District, Region] → [AC, City, State]
- Expensive repair using only value modifications

86 Repairing Data and Constraints
- Repair alternatives when records t_i and t_j violate FD: X → Y
  - Value modification of RHS attributes Y
  - Value modification of LHS attributes X
    - Modify t_j[X] to a value different from t_i[X], supported by the data
  - Repair constraints by augmenting LHS (X) with a new attribute
    - New attribute provides additional context
- Choose from alternatives using MDL-based cost model

87 MDL-Based Cost Model
- Quantifies the trade-off of a data repair versus a constraint repair
- Cost model based on three properties
  - Accuracy: value modifications must minimize distance
  - Redundancy: value modifications must be well supported in the data; constraint repairs must result in a higher degree of consistency
  - Conciseness: repaired constraints should explain, but not overfit
- Minimum description length (MDL) principle
  - Length of model + length to encode data given the model

88 Repairing Data and Constraints
- Same table as slide 84
- Cheap repair of constraints and data
  - FD: [District, Region, Municipal] → [AC, City, State]
  - t3.State = NY
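The gap between the expensive and the cheap repair can be checked by counting how many tuples need an RHS change under each constraint. The sketch below uses only the Municipal, City and State values that survive in the transcript, leaves AC out for that reason, and counts data modifications only; it does not implement the MDL cost model. The name `tuples_to_modify` is illustrative.

```python
from collections import Counter, defaultdict

def tuples_to_modify(rows, lhs, rhs):
    """Minimum number of tuples whose RHS must change so that lhs -> rhs holds."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][tuple(r[a] for a in rhs)] += 1
    # In each LHS group, keep the most common RHS and modify the rest.
    return sum(sum(c.values()) - c.most_common(1)[0][1] for c in groups.values())

COLUMNS = ["Tid", "District", "Region", "Municipal", "City", "State"]
DATA = [
    ("t1", "Brookside", "Granville", "Glendale", "NY", "NY"),
    ("t2", "Brookside", "Granville", "Glendale", "NY", "NY"),
    ("t3", "Brookside", "Granville", "Glendale", "NY", "MA"),
    ("t4", "Brookside", "Granville", "Guild", "Boston", "MA"),
    ("t5", "Brookside", "Granville", "Guild", "Boston", "MA"),
    ("t6", "Brookside", "Granville", "Queen", "Chicago", "IL"),
    ("t7", "Brookside", "Granville", "Queen", "Chicago", "IL"),
    ("t8", "Brookside", "Granville", "Queen", "Chicago", "IL"),
    ("t9", "Brookside", "Granville", "Queen", "Chicago", "IL"),
]
rows = [dict(zip(COLUMNS, r)) for r in DATA]
original = tuples_to_modify(rows, ["District", "Region"], ["City", "State"])
augmented = tuples_to_modify(rows, ["District", "Region", "Municipal"], ["City", "State"])
print(original, augmented)
```

Under the original FD, five tuples would have to be rewritten to match the Chicago group. After adding Municipal to the LHS, only t3 remains inconsistent, which matches the single modification t3.State = NY on the slide.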

89 Repairing Data and Constraints

Tid | District | Region | Municipal | AC | Tel | Street | Zip | City | State
t1 | Brookside | Granville | Glendale | | | Boxwood | | NY | NY
t2 | Brookside | Granville | Glendale | *** | | Boxwood | | *** | ***
t3 | Brookside | Granville | Glendale | *** | | Westlane | | *** | MA
t4 | Brookside | Granville | Guild | | | Squire | | Boston | MA
t5 | Brookside | Granville | Guild | *** | | Squire | | *** | ***
t6 | Brookside | Granville | Queen | | | Main | | Chicago | IL
t7 | Brookside | Granville | Queen | *** | | Main | | *** | ***
t8 | Brookside | Granville | Queen | *** | | Main | | *** | ***
t9 | Brookside | Granville | Queen | *** | | Bay | | *** | ***

- MDL: length of model + length to encode data given the model
- FD: [District, Region, Municipal] → [AC, City, State]

90 Conclusions
- Big data quality (veracity) is an important area of research
  - Challenges due to volume, velocity, variety, variability
- Much interesting work has been done in this area
  - Learn models (semantics) from the data
  - Identify data glitches as violations of the learned models
  - Repair data glitches and models in a timely manner
- A lot more research needs to be done!

91 Crowdsourcing Improving data quality by crowdsourcing 91

92 Source Exploration Tool Data.gov 92

93 Bibliography
[BFF+05] Philip Bohannon, Michael Flaster, Wenfei Fan, Rajeev Rastogi: A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. SIGMOD 2005.
[CM11] Fei Chiang, Renée J. Miller: A Unified Model for Data and Constraint Repair. ICDE 2011.
[DBS09a] Xin Luna Dong, Laure Berti-Equille, Divesh Srivastava: Integrating Conflicting Data: The Role of Source Dependence. PVLDB 2(1), 2009.
[GKK+08] Lukasz Golab, Howard J. Karloff, Flip Korn, Divesh Srivastava, Bei Yu: On Generating Near-Optimal Tableaux for Conditional Functional Dependencies. PVLDB 1(1), 2008.
[LDL+12] Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, Divesh Srivastava: Truth Finding on the Deep Web: Is the Problem Solved? PVLDB 6(2), 2012.


More information

Bringing Order to Big Data. Jarek Szlichta

Bringing Order to Big Data. Jarek Szlichta Jarek Bringing Order to Big Data Conducted research was partially supported by IBM CAS 1 Data, Data Everywhere Open data Business Data Web Data Available at different formats. 2 Big Data to Data Science

More information

Ranking for Data Repairs

Ranking for Data Repairs Ranking for Data Repairs Mohamed Yakout, Ahmed K. Elmagarmid, Jennifer Neville Purdue University, West Lafayette, IN 47907, USA {myakout, ake, neville}@cs.purdue.edu Abstract Improving data quality is

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 4 - Schema Normalization

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 4 - Schema Normalization CSE 544 Principles of Database Management Systems Magdalena Balazinska Winter 2009 Lecture 4 - Schema Normalization References R&G Book. Chapter 19: Schema refinement and normal forms Also relevant to

More information

Improving Data Quality: Consistency and Accuracy

Improving Data Quality: Consistency and Accuracy Improving Data Quality: Consistency and Accuracy Gao Cong 1 Wenfei Fan 2,3 Floris Geerts 2,4,5 Xibei Jia 2 Shuai Ma 2 1 Microsoft Research Asia 2 University of Edinburgh 4 Hasselt University 3 Bell Laboratories

More information

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems Data Warehousing and Data Mining CPS 116 Introduction to Database Systems Announcements (December 1) 2 Homework #4 due today Sample solution available Thursday Course project demo period has begun! Check

More information

Exam Advanced Data Mining Date: Time:

Exam Advanced Data Mining Date: Time: Exam Advanced Data Mining Date: 11-11-2010 Time: 13.30-16.30 General Remarks 1. You are allowed to consult 1 A4 sheet with notes written on both sides. 2. Always show how you arrived at the result of your

More information

Extending Functional Dependency to Detect Abnormal Data in RDF Graphs

Extending Functional Dependency to Detect Abnormal Data in RDF Graphs Extending Functional Dependency to Detect Abnormal Data in RDF Graphs Yang Yu, Jeff Heflin SWAT Lab Department of Computer Science and Engineering Lehigh University PA, USA Outline Semantic Web data and

More information

Quotient Cube: How to Summarize the Semantics of a Data Cube

Quotient Cube: How to Summarize the Semantics of a Data Cube Quotient Cube: How to Summarize the Semantics of a Data Cube Laks V.S. Lakshmanan (Univ. of British Columbia) * Jian Pei (State Univ. of New York at Buffalo) * Jiawei Han (Univ. of Illinois at Urbana-Champaign)

More information

Relational model continued. Understanding how to use the relational model. Summary of board example: with Copies as weak entity

Relational model continued. Understanding how to use the relational model. Summary of board example: with Copies as weak entity COS 597A: Principles of Database and Information Systems Relational model continued Understanding how to use the relational model 1 with as weak entity folded into folded into branches: (br_, librarian,

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data

More information

Incognito: Efficient Full Domain K Anonymity

Incognito: Efficient Full Domain K Anonymity Incognito: Efficient Full Domain K Anonymity Kristen LeFevre David J. DeWitt Raghu Ramakrishnan University of Wisconsin Madison 1210 West Dayton St. Madison, WI 53706 Talk Prepared By Parul Halwe(05305002)

More information

Data X-Ray: A diagnostic tool for data errors Xiaolan Wang Xin Luna Dong Alexandra Meliou

Data X-Ray: A diagnostic tool for data errors Xiaolan Wang Xin Luna Dong Alexandra Meliou Data X-Ray: A diagnostic tool for data errors Xiaolan Wang Xin Luna Dong Alexandra Meliou UNIVERSITY OF MASSACHUSETTS, AMHERST College of Information and Computer Sciences MANY APPLICATIONS RELY ON DATA

More information

Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar

Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar School of Computer Science Faculty of Science University of Windsor How Big is this Big Data? 40 Billion Instagram Photos 300 Hours

More information

Identifying Useful Data Dependency Using Agree Set form Relational Database

Identifying Useful Data Dependency Using Agree Set form Relational Database Volume 1, Issue 6, September 2016 ISSN: 2456-0006 International Journal of Science Technology Management and Research Available online at: Identifying Useful Data Using Agree Set form Relational Database

More information

Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy. Xiaokui Xiao Nanyang Technological University

Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy. Xiaokui Xiao Nanyang Technological University Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy Xiaokui Xiao Nanyang Technological University Outline Privacy preserving data publishing: What and Why Examples of privacy attacks

More information

Asking the Right Questions in Crowd Data Sourcing

Asking the Right Questions in Crowd Data Sourcing MoDaS Mob Data Sourcing Asking the Right Questions in Crowd Data Sourcing Tova Milo Tel Aviv University Outline Introduction to crowd (data) sourcing Databases and crowds Declarative is good How to best

More information

Data Preprocessing. Slides by: Shree Jaswal

Data Preprocessing. Slides by: Shree Jaswal Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data

More information

The Relational Data Model

The Relational Data Model The Relational Data Model Lecture 6 1 Outline Relational Data Model Functional Dependencies Logical Schema Design Reading Chapter 8 2 1 The Relational Data Model Data Modeling Relational Schema Physical

More information

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2009 Lecture 3 - Schema Normalization

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2009 Lecture 3 - Schema Normalization CSE 544 Principles of Database Management Systems Magdalena Balazinska Fall 2009 Lecture 3 - Schema Normalization References R&G Book. Chapter 19: Schema refinement and normal forms Also relevant to this

More information

Textbook: Chapter 6! CS425 Fall 2013 Boris Glavic! Chapter 3: Formal Relational Query. Relational Algebra! Select Operation Example! Select Operation!

Textbook: Chapter 6! CS425 Fall 2013 Boris Glavic! Chapter 3: Formal Relational Query. Relational Algebra! Select Operation Example! Select Operation! Chapter 3: Formal Relational Query Languages CS425 Fall 2013 Boris Glavic Chapter 3: Formal Relational Query Languages Relational Algebra Tuple Relational Calculus Domain Relational Calculus Textbook:

More information

Functional Dependencies and Finding a Minimal Cover

Functional Dependencies and Finding a Minimal Cover Functional Dependencies and Finding a Minimal Cover Robert Soulé 1 Normalization An anomaly occurs in a database when you can update, insert, or delete data, and get undesired side-effects. These side

More information

Measuring and Evaluating Dissimilarity in Data and Pattern Spaces

Measuring and Evaluating Dissimilarity in Data and Pattern Spaces Measuring and Evaluating Dissimilarity in Data and Pattern Spaces Irene Ntoutsi, Yannis Theodoridis Database Group, Information Systems Laboratory Department of Informatics, University of Piraeus, Greece

More information

Lessons Learned and Research Agenda for Big Data Integration of Product Specifications (Discussion Paper)

Lessons Learned and Research Agenda for Big Data Integration of Product Specifications (Discussion Paper) Lessons Learned and Research Agenda for Big Data Integration of Product Specifications (Discussion Paper) Luciano Barbosa 1, Valter Crescenzi 2, Xin Luna Dong 3, Paolo Merialdo 2, Federico Piai 2, Disheng

More information

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA

Data Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI

More information

A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification

A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification Philip Bohannon Lucent Technologies Bell Laboratories bohannon@researchbell-labscom Wenfei Fan Univ of Edinburgh

More information

Exam I Computer Science 420 Dr. St. John Lehman College City University of New York 12 March 2002

Exam I Computer Science 420 Dr. St. John Lehman College City University of New York 12 March 2002 Exam I Computer Science 420 Dr. St. John Lehman College City University of New York 12 March 2002 NAME (Printed) NAME (Signed) E-mail Exam Rules Show all your work. Your grade will be based on the work

More information

ECE521 Lecture 18 Graphical Models Hidden Markov Models

ECE521 Lecture 18 Graphical Models Hidden Markov Models ECE521 Lecture 18 Graphical Models Hidden Markov Models Outline Graphical models Conditional independence Conditional independence after marginalization Sequence models hidden Markov models 2 Graphical

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 3 Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2011 Han, Kamber & Pei. All rights

More information

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?

Overview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)? Introduction to Data Warehousing and Business Intelligence Overview Why Business Intelligence? Data analysis problems Data Warehouse (DW) introduction A tour of the coming DW lectures DW Applications Loosely

More information

Announcements. CS 188: Artificial Intelligence Fall Reminder: CSPs. Today. Example: 3-SAT. Example: Boolean Satisfiability.

Announcements. CS 188: Artificial Intelligence Fall Reminder: CSPs. Today. Example: 3-SAT. Example: Boolean Satisfiability. CS 188: Artificial Intelligence Fall 2008 Lecture 5: CSPs II 9/11/2008 Announcements Assignments: DUE W1: NOW P1: Due 9/12 at 11:59pm Assignments: UP W2: Up now P2: Up by weekend Dan Klein UC Berkeley

More information

CS 188: Artificial Intelligence Fall 2008

CS 188: Artificial Intelligence Fall 2008 CS 188: Artificial Intelligence Fall 2008 Lecture 5: CSPs II 9/11/2008 Dan Klein UC Berkeley Many slides over the course adapted from either Stuart Russell or Andrew Moore 1 1 Assignments: DUE Announcements

More information

ERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution Leopoldo Bertossi Carleton University School of Computer Science Institute for Data Science Ottawa, Canada bertossi@scs.carleton.ca

More information

CSCI1270 Introduction to Database Systems

CSCI1270 Introduction to Database Systems CSCI1270 Introduction to Database Systems with thanks to Prof. George Kollios, Boston University Prof. Mitch Cherniack, Brandeis University Prof. Avi Silberschatz, Yale University 1.1 What is a Database

More information

Entity-Relationship Modelling. Entities Attributes Relationships Mapping Cardinality Keys Reduction of an E-R Diagram to Tables

Entity-Relationship Modelling. Entities Attributes Relationships Mapping Cardinality Keys Reduction of an E-R Diagram to Tables Entity-Relationship Modelling Entities Attributes Relationships Mapping Cardinality Keys Reduction of an E-R Diagram to Tables 1 Entity Sets A enterprise can be modeled as a collection of: entities, and

More information

Today. CS 188: Artificial Intelligence Fall Example: Boolean Satisfiability. Reminder: CSPs. Example: 3-SAT. CSPs: Queries.

Today. CS 188: Artificial Intelligence Fall Example: Boolean Satisfiability. Reminder: CSPs. Example: 3-SAT. CSPs: Queries. CS 188: Artificial Intelligence Fall 2007 Lecture 5: CSPs II 9/11/2007 More CSPs Applications Tree Algorithms Cutset Conditioning Today Dan Klein UC Berkeley Many slides over the course adapted from either

More information

Peter X. Gao, Andrew R. Curtis, Bernard Wong, S. Keshav. Cheriton School of Computer Science University of Waterloo

Peter X. Gao, Andrew R. Curtis, Bernard Wong, S. Keshav. Cheriton School of Computer Science University of Waterloo Peter X. Gao, Andrew R. Curtis, Bernard Wong, S. Keshav Cheriton School of Computer Science University of Waterloo August 15, 2012 1 = ~1M servers CO 2 of 280,000 cars 2 Datacenters and Request Routing

More information

Semantic Search at Bloomberg

Semantic Search at Bloomberg Semantic Search at Bloomberg Search Solutions 2017 Edgar Meij Team lead, R&D AI emeij@bloomberg.net @edgarmeij Bloomberg Professional Service Bloomberg at a glance Bloomberg Professional Service Trading

More information

GraDit: graph-based data repair algorithm for multiple data edits rule violations

GraDit: graph-based data repair algorithm for multiple data edits rule violations Journal of Physics: Conference Series PAPER OPEN ACCESS GraDit: graph-based data repair algorithm for multiple data edits rule violations To cite this article: Wa Ode Zuhayeni Madjida and I Gusti Bagus

More information

Scalable and Holistic Qualitative Data Cleaning

Scalable and Holistic Qualitative Data Cleaning Scalable and Holistic Qualitative Data Cleaning by Xu Chu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science

More information

CAS CS 460/660 Introduction to Database Systems. Fall

CAS CS 460/660 Introduction to Database Systems. Fall CAS CS 460/660 Introduction to Database Systems Fall 2017 1.1 About the course Administrivia Instructor: George Kollios, gkollios@cs.bu.edu MCS 283, Mon 2:30-4:00 PM and Tue 1:00-2:30 PM Teaching Fellows:

More information

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.

This tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining. About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts

More information

Open Data Integration. Renée J. Miller

Open Data Integration. Renée J. Miller Open Data Integration Renée J. Miller miller@northeastern.edu !2 Open Data Principles Timely & Comprehensive Accessible and Usable Complete - All public data is made available. Public data is data that

More information

Improving the Performance of OLAP Queries Using Families of Statistics Trees

Improving the Performance of OLAP Queries Using Families of Statistics Trees Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University

More information

Chapter 2: Relational Model

Chapter 2: Relational Model Chapter 2: Relational Model Database System Concepts, 5 th Ed. See www.db-book.com for conditions on re-use Chapter 2: Relational Model Structure of Relational Databases Fundamental Relational-Algebra-Operations

More information

Database System Concepts

Database System Concepts s Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth and Sudarshan. Chapter 2: Model Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2009/2010

More information

Big Data Challenges in Large IP Networks

Big Data Challenges in Large IP Networks Big Data Challenges in Large IP Networks Feature Extraction & Predictive Alarms for network management Wednesday 28 th Feb 2018 Dave Yearling British Telecommunications plc 2017 What we will cover Making

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Indexing Bi-temporal Windows. Chang Ge 1, Martin Kaufmann 2, Lukasz Golab 1, Peter M. Fischer 3, Anil K. Goel 4

Indexing Bi-temporal Windows. Chang Ge 1, Martin Kaufmann 2, Lukasz Golab 1, Peter M. Fischer 3, Anil K. Goel 4 Indexing Bi-temporal Windows Chang Ge 1, Martin Kaufmann 2, Lukasz Golab 1, Peter M. Fischer 3, Anil K. Goel 4 1 2 3 4 Outline Introduction Bi-temporal Windows Related Work The BiSW Index Experiments Conclusion

More information

COSC Dr. Ramon Lawrence. Emp Relation

COSC Dr. Ramon Lawrence. Emp Relation COSC 304 Introduction to Database Systems Normalization Dr. Ramon Lawrence University of British Columbia Okanagan ramon.lawrence@ubc.ca Normalization Normalization is a technique for producing relations

More information

Applied Databases. Sebastian Maneth. Lecture 5 ER Model, Normal Forms. University of Edinburgh - January 30 th, 2017

Applied Databases. Sebastian Maneth. Lecture 5 ER Model, Normal Forms. University of Edinburgh - January 30 th, 2017 Applied Databases Lecture 5 ER Model, Normal Forms Sebastian Maneth University of Edinburgh - January 30 th, 2017 Outline 2 1. Entity Relationship Model 2. Normal Forms From Last Lecture 3 the Lecturer

More information

Big Data Analytics. Rasoul Karimi

Big Data Analytics. Rasoul Karimi Big Data Analytics Rasoul Karimi Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 1 Outline

More information

CS411 Database Systems. 05: Relational Schema Design Ch , except and

CS411 Database Systems. 05: Relational Schema Design Ch , except and CS411 Database Systems 05: Relational Schema Design Ch. 3.1-3.5, except 3.4.2-3.4.3 and 3.5.3. 1 How does this fit in? ER Diagrams: Data Definition Translation to Relational Schema: Data Definition Relational

More information

On Data Dependencies in Dataspaces

On Data Dependencies in Dataspaces On Data Dependencies in Dataspaces Tsinghua University This is a joint work with Lei Chen (HKUST) and Philip S. Yu (UIC) 2011 On Data Dependencies in Dataspaces Introduction 1/24 Dataspaces provide a co-existing

More information

CS 188: Artificial Intelligence Spring Today

CS 188: Artificial Intelligence Spring Today CS 188: Artificial Intelligence Spring 2006 Lecture 7: CSPs II 2/7/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Today More CSPs Applications Tree Algorithms Cutset

More information

FINDING QUALITY IN QUANTITY: THE CHALLENGE OF DISCOVERING VALUABLE SOURCES FOR INTEGRATION

FINDING QUALITY IN QUANTITY: THE CHALLENGE OF DISCOVERING VALUABLE SOURCES FOR INTEGRATION FINDING QUALITY IN QUANTITY: THE CHALLENGE OF DISCOVERING VALUABLE SOURCES FOR INTEGRATION Theodoros Rekatsinas University of Maryland Amol Deshpande, Xin Luna Dong, Lise Getoor and Divesh Srivastava DATA,

More information

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems Data Warehousing & Mining CPS 116 Introduction to Database Systems Data integration 2 Data resides in many distributed, heterogeneous OLTP (On-Line Transaction Processing) sources Sales, inventory, customer,

More information

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety

Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety Abhishek

More information

Handout 12: Textual models

Handout 12: Textual models Handout 12: Textual models Taylor Arnold Loading and parsing the data The full text of all the State of the Union addresses through 2016 are available in the R package sotu, available on CRAN. The package

More information

Identity John Homewoner Mary Homewoner. Employer - Company Search nh medical center columbia hospital

Identity John Homewoner Mary Homewoner. Employer - Company Search nh medical center columbia hospital Summary of findings: ADV-120 Reference Number Requesting Lender Feb 03, 2016 12:06:32 PM EST Identity John Homewoner Mary Homewoner First Name Pass Pass Last Name Pass Pass SSN Discrepancy Discrepancy

More information

Data Preprocessing. Data Mining 1

Data Preprocessing. Data Mining 1 Data Preprocessing Today s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple, heterogenous sources.

More information

Software Defined Networking Security: Security for SDN and Security with SDN. Seungwon Shin Texas A&M University

Software Defined Networking Security: Security for SDN and Security with SDN. Seungwon Shin Texas A&M University Software Defined Networking Security: Security for SDN and Security with SDN Seungwon Shin Texas A&M University Contents SDN Basic Operation SDN Security Issues SDN Operation L2 Forwarding application

More information

Interactive Visualization of the Stock Market Graph

Interactive Visualization of the Stock Market Graph Interactive Visualization of the Stock Market Graph Presented by Camilo Rostoker rostokec@cs.ubc.ca Department of Computer Science University of British Columbia Overview 1. Introduction 2. The Market

More information

BAYESIAN NETWORKS STRUCTURE LEARNING

BAYESIAN NETWORKS STRUCTURE LEARNING BAYESIAN NETWORKS STRUCTURE LEARNING Xiannian Fan Uncertainty Reasoning Lab (URL) Department of Computer Science Queens College/City University of New York http://url.cs.qc.cuny.edu 1/52 Overview : Bayesian

More information

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control.

What is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control. What is Learning? CS 343: Artificial Intelligence Machine Learning Herbert Simon: Learning is any process by which a system improves performance from experience. What is the task? Classification Problem

More information

Relational Databases

Relational Databases Relational Databases Lecture 2 Chapter 3 Robb T. Koether Hampden-Sydney College Fri, Jan 18, 2013 Robb T. Koether (Hampden-Sydney College) Relational Databases Fri, Jan 18, 2013 1 / 26 1 Types of Databases

More information

STRATEGIC DIRECTION SUPPORTED: Organizational Sustainability.

STRATEGIC DIRECTION SUPPORTED: Organizational Sustainability. DATE: January 9, 2017 MEMO TO: Craig Taylor, Chair Operations Committee S. Michael Rummel, Chair Finance Committee FROM: Mary E. Kann Director of Administration RECOMMENDATION: Recommend approval of a

More information

BOOLEAN MATRIX FACTORIZATIONS. with applications in data mining Pauli Miettinen

BOOLEAN MATRIX FACTORIZATIONS. with applications in data mining Pauli Miettinen BOOLEAN MATRIX FACTORIZATIONS with applications in data mining Pauli Miettinen MATRIX FACTORIZATIONS BOOLEAN MATRIX FACTORIZATIONS o THE BOOLEAN MATRIX PRODUCT As normal matrix product, but with addition

More information

Consistent Query Answering

Consistent Query Answering Consistent Query Answering Opportunities and Limitations Jan Chomicki Dept. CSE University at Buffalo State University of New York http://www.cse.buffalo.edu/ chomicki 1 Integrity constraints Integrity

More information

Consistent Query Answering: Opportunities and Limitations

Consistent Query Answering: Opportunities and Limitations Consistent Query Answering: Opportunities and Limitations Jan Chomicki Dept. Computer Science and Engineering University at Buffalo, SUNY Buffalo, NY 14260-2000, USA chomicki@buffalo.edu Abstract This

More information

Approximation Algorithms for Clustering Uncertain Data

Approximation Algorithms for Clustering Uncertain Data Approximation Algorithms for Clustering Uncertain Data Graham Cormode AT&T Labs - Research graham@research.att.com Andrew McGregor UCSD / MSR / UMass Amherst andrewm@ucsd.edu Introduction Many applications

More information

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges

Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges Abhishek Santra 1 and Sanjukta Bhowmick 2 1 Information Technology Laboratory, CSE Department, University of

More information

COS 126 General Computer Science Fall Exam 1

COS 126 General Computer Science Fall Exam 1 COS 126 General Computer Science Fall 2005 Exam 1 This test has 9 questions worth a total of 50 points. You have 120 minutes. The exam is closed book, except that you are allowed to use a one page cheatsheet,

More information

Database Design Theory and Normalization. CS 377: Database Systems

Database Design Theory and Normalization. CS 377: Database Systems Database Design Theory and Normalization CS 377: Database Systems Recap: What Has Been Covered Lectures 1-2: Database Overview & Concepts Lecture 4: Representational Model (Relational Model) & Mapping

More information

11/04/16. Data Profiling. Helena Galhardas DEI/IST. References

11/04/16. Data Profiling. Helena Galhardas DEI/IST. References Data Profiling Helena Galhardas DEI/IST References Slides Data Profiling course, Felix Naumann, Trento, July 2015 Z. Abedjan, L. Golab, F. Naumann, Profiling Relational Data A Survey, VLDBJ 2015 T. Papenbrock

More information

Data Quality Problems beyond Consistency and Deduplication

Data Quality Problems beyond Consistency and Deduplication Data Quality Problems beyond Consistency and Deduplication Wenfei Fan Floris Geerts Shuai Ma Nan Tang Wenyuan Yu University of Edinburgh {wenfei@inf., fgeerts@inf., sma1@inf., ntang@inf., wenyuan.yu@}ed.ac.uk

More information

Query Answering over Functional Dependency Repairs

Query Answering over Functional Dependency Repairs Query Answering over Functional Dependency Repairs by Artur Galiullin A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in

More information

B.2 Measures of Central Tendency and Dispersion
