Data Quality: the Other Face of Big Data. Divesh Srivastava AT&T Labs-Research
1 Data Quality: the Other Face of Big Data Divesh Srivastava AT&T Labs-Research
2 Data Quality I am a manager I am also a researcher working on data quality 2
3 Big Data Big data is different things to different people Volume, velocity, variety, variability, value, veracity 3
4 Big Data + Data Quality. Big data: all about the V's. Size: huge volume of data from multiple sources. Speed: dynamic data, collected and analyzed at high velocity. Complexity: large variety of data and sources. Evolution: considerable variability of data and semantics over time. Goal: to extract significant value from big data. Key stumbling block: data quality. Raw data is often of questionable veracity. How do we obtain high-quality information?
5 Big Data + Data Quality 5
6 Data Can Be Erroneous. The story, marked "Hold for release Do not use", was sent in error to the news service's thousands of corporate clients.
7 Data Quality: By the Numbers. Impact of poor data quality: erroneous data costs US businesses $600 billion/year [E02]; in data warehouse (DW) projects, data cleaning takes 30-80% of time and budget; the data quality tools market is growing at 16% annually, well above the 7% average for other IT segments [G07]. How much data is erroneous: enterprise data error rates average 1-5%, with some above 30% [R98].
8 Case Study: Big Data Quality [LDL+12]. Study on two domains: belief of clean data; poor quality data can have big impact. Table columns: #Sources, Period, #Objects, #Local attrs, #Global attrs, Considered items. Stock: 55 sources, period 7/ * *20. Flight: 38 sources, period 12/ * *31.
9 Case Study: Big Data Quality Is the data consistent? Tolerance to 1% value difference 9
10 Case Study: Big Data Quality. Why such inconsistency? Semantic ambiguity. Nasdaq vs Yahoo! Finance: Day's Range: wk Range: Wk:
11 Case Study: Big Data Quality. Why such inconsistency? Unit errors: 76.82B vs 76,821,000
12 Case Study: Big Data Quality Why such inconsistency? Instance ambiguity 12
13 Case Study: Big Data Quality. Why such inconsistency? Pure errors. FlightView: 6:15 PM, 9:40 PM; FlightAware: 6:22 PM, 8:33 PM; Orbitz: 6:15 PM, 9:54 PM
14 Case Study: Big Data Quality Why such inconsistency? Random sample of 20 data items + 5 items with largest # of values 14
15 Case Study: Big Data Quality Copying between sources? 15
16 Case Study: Big Data Quality Copying on erroneous data? 16
17 Case Study: Lessons Learned. Big data has considerable inconsistency, even in domains where poor quality data can have big impact: semantic ambiguity, out-of-date data, unexplainable errors. Data sources often copy from each other. Copying can happen on erroneous data, spreading poor quality data.
18 Small Data Quality: How Was It Achieved? Specify all domain knowledge as integrity constraints on data Reject updates that do not preserve integrity constraints Works well when the domain is well understood and static 18
19 Big Data Quality: A Different Approach? Big data: integrity constraints cannot be specified a priori. Data variety, volume ⇒ complete domain knowledge is infeasible. Data velocity, variability ⇒ domain knowledge becomes obsolete. Too much rejected data ⇒ small data.
20 Big Data Quality: A Different Approach? Big data: integrity constraints cannot be specified a priori. Data variety, volume ⇒ complete domain knowledge is infeasible. Data velocity, variability ⇒ domain knowledge becomes obsolete. Solution: let the data speak for itself. Learn models (semantics) from the data. Identify data glitches as violations of the learned models. Repair data glitches and models in a timely manner.
21 In This Talk A focus on well-structured data and logic-based data quality Models: logical constraints, e.g., (C)FDs, IDs, MDs, EGDs, DCs Glitches: groups of cells, i.e., (tuple-id, attribute) pairs Repairs: cost-based modifications to the data and models What we do not discuss in this talk Logic-based: consistent query answering, without data repairs Statistics-based: statistical models, statistical anomaly detection Unstructured data: quality of audio, video, extracted data 21
22 Outline Introduction Identifying inconsistencies Repairing inconsistencies 22
23 Identifying Inconsistencies Small data: specify semantics as integrity constraints on data Big data: let the data speak for itself Learn models (e.g., constraints, rules, patterns) from the data Identify data glitches as violations of the learned models 23
24 Example: Functional Dependencies
Tid | Name | Type | Country | Customer | Price | Tax
t1 | Harry Potter | Book | France | Andrew | 10 | 0
t2 | Harry Potter | Book | France | Barbara | 10 | 0
t3 | Harry Potter | Book | France | Candy | |
t4 | Lord of the Rings | Book | USA | David | 25 | 0
t5 | Lord of the Rings | Book | USA | Eran | |
t6 | Armani Suit | Clothing | USA | Frank | |
t7 | Armani Suit | Clothing | USA | George | |
t8 | Kaminski Hat | Clothing | Australia | Harry | |
t9 | Kaminski Hat | Clothing | Australia | Ian | |
FD: [Name, Type, Country] → [Price, Tax]
FDs used to check consistency
25 Example: Functional Dependencies X
Tid | Name | Type | Country | Customer | Price | Tax
t1 | Harry Potter | Book | France | Andrew | 10 | 0
t2 | Harry Potter | Book | France | Barbara | 10 | 0
t3 | Harry Potter | Book | France | Candy | |
t4 | Lord of the Rings | Book | USA | David | 25 | 0
t5 | Lord of the Rings | Book | USA | Eran | |
t6 | Armani Suit | Clothing | USA | Frank | |
t7 | Armani Suit | Clothing | USA | George | |
t8 | Kaminski Hat | Clothing | Australia | Harry | |
t9 | Kaminski Hat | Clothing | Australia | Ian | |
FD: [Name, Type, Country] → [Price, Tax]
FDs used to check consistency
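A minimal sketch of how an FD such as [Name, Type, Country] → [Price, Tax] could be checked: group tuples by their LHS values and flag any group with more than one distinct RHS value. The function name `find_fd_violations` is hypothetical, and the Price/Tax values for t3 are assumed for illustration because the transcript lost them.

```python
from collections import defaultdict

def find_fd_violations(rows, lhs, rhs):
    """Group tuples by LHS values; a group with more than one distinct
    RHS value violates the FD lhs -> rhs."""
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        key = tuple(row[a] for a in lhs)
        val = tuple(row[a] for a in rhs)
        groups[key][val].append(row["Tid"])
    return {k: dict(v) for k, v in groups.items() if len(v) > 1}

COLUMNS = ("Tid", "Name", "Type", "Country", "Price", "Tax")
rows = [dict(zip(COLUMNS, r)) for r in [
    ("t1", "Harry Potter", "Book", "France", 10, 0),
    ("t2", "Harry Potter", "Book", "France", 10, 0),
    ("t3", "Harry Potter", "Book", "France", 12, 0),  # Price/Tax assumed
    ("t4", "Lord of the Rings", "Book", "USA", 25, 0),
]]

violations = find_fd_violations(
    rows, ["Name", "Type", "Country"], ["Price", "Tax"])
for key, by_value in violations.items():
    print(key, by_value)
```

Each reported group lists the conflicting RHS values together with the tuple ids that carry them, which is the "group of cells" view of a glitch used later in the talk.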
26 Identifying Inconsistencies: Impact of Big Data Variety, variability of data: one size does not fit all Learn conditional models (contextual semantics) 26
27 Example: Conditional FD
Tid | Name | Type | Country | Customer | Price | Tax
t1 | Harry Potter | Book | France | Andrew | 10 | 0
t2 | Harry Potter | Book | France | Barbara | 10 | 0
t3 | Harry Potter | Book | France | Candy | |
t4 | Lord of the Rings | Book | USA | David | 25 | 0
t5 | Lord of the Rings | Book | USA | Eran | |
t6 | Armani Suit | Clothing | USA | Frank | |
t7 | Armani Suit | Clothing | USA | George | |
t8 | Kaminski Hat | Clothing | Australia | Harry | |
t9 | Kaminski Hat | Clothing | Australia | Ian | |
CFD: [Name = *, Type = Clothing, Country = *] → [Price, Tax]
CFDs used to check consistency of subset of table
28 Identifying Inconsistencies: Impact of Big Data. Variety, variability of data: exact vs approximate models. Exact approaches can lead to over-fitting and a large number of patterns. Approximate approaches can have few violations: these are glitches. Statistically robust measures: use support and confidence.
29 Example: Conditional FD
Tid | Name | Type | Country | Customer | Price | Tax
t1 | Harry Potter | Book | France | Andrew | 10 | 0
t2 | Harry Potter | Book | France | Barbara | 10 | 0
t3 | Harry Potter | Book | France | Candy | |
t4 | Lord of the Rings | Book | USA | David | 25 | 0
t5 | Lord of the Rings | Book | USA | Eran | |
t6 | Armani Suit | Clothing | USA | Frank | |
t7 | Armani Suit | Clothing | USA | George | |
t8 | Kaminski Hat | Clothing | Australia | Harry | |
t9 | Kaminski Hat | Clothing | Australia | Ian | |
CFD: [Name = *, Type = *, Country = France] → [Price, Tax]
Holds approximately, support = 3/9, confidence = 2/3
30 Learning CFDs
Given an FD, learn a good pattern tableau from data [GKK+08]
FD: [Name, Type, Country] → [Price, Tax]
Learned pattern tableau:
Name | Type | Country | Price | Tax | Support | Confidence
* | Clothing | * | * | * | 4/9 | 4/4
* | * | France | * | 0 | 3/9 | 2/3
Global support = 7/9, global confidence = 6/7, local confidence = 2/3
Learn FD and a good pattern tableau from data [FGL+09]
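The support and confidence figures above can be reproduced with a short sketch. It assumes one common reading of the measures: support is the fraction of tuples matched by a pattern (or by any pattern of a tableau), and confidence is the largest fraction of the matched tuples that can be kept so that the FD holds exactly. Price/Tax values marked "assumed" are invented to fill cells the transcript lost, and the sketch ignores constants on the RHS (such as Tax = 0 in the France pattern).

```python
from collections import Counter, defaultdict
from fractions import Fraction

LHS, RHS = ("Name", "Type", "Country"), ("Price", "Tax")
COLUMNS = ("Tid",) + LHS + RHS
ROWS = [dict(zip(COLUMNS, r)) for r in [
    ("t1", "Harry Potter", "Book", "France", 10, 0),
    ("t2", "Harry Potter", "Book", "France", 10, 0),
    ("t3", "Harry Potter", "Book", "France", 12, 0),        # assumed
    ("t4", "Lord of the Rings", "Book", "USA", 25, 0),
    ("t5", "Lord of the Rings", "Book", "USA", 25, 0),      # assumed
    ("t6", "Armani Suit", "Clothing", "USA", 200, 8),       # assumed
    ("t7", "Armani Suit", "Clothing", "USA", 200, 8),       # assumed
    ("t8", "Kaminski Hat", "Clothing", "Australia", 40, 4), # assumed
    ("t9", "Kaminski Hat", "Clothing", "Australia", 40, 4), # assumed
]]

def matches(row, pattern):
    """A pattern lists only its constants; omitted attributes are wildcards."""
    return all(row[attr] == const for attr, const in pattern.items())

def max_consistent(rows):
    """Most tuples that can be kept so the FD holds: per LHS group,
    keep the tuples carrying the most frequent RHS value."""
    by_lhs = defaultdict(Counter)
    for r in rows:
        by_lhs[tuple(r[a] for a in LHS)][tuple(r[a] for a in RHS)] += 1
    return sum(max(c.values()) for c in by_lhs.values())

def support_confidence(rows, tableau):
    covered = [r for r in rows if any(matches(r, p) for p in tableau)]
    if not covered:
        return Fraction(0), Fraction(1)
    return (Fraction(len(covered), len(rows)),
            Fraction(max_consistent(covered), len(covered)))

clothing = {"Type": "Clothing"}
france = {"Country": "France"}
print(support_confidence(ROWS, [clothing]))          # local: 4/9 and 4/4
print(support_confidence(ROWS, [france]))            # local: 3/9 and 2/3
print(support_confidence(ROWS, [clothing, france]))  # global: 7/9 and 6/7
```

`Fraction` reduces its results, so 3/9 is shown as 1/3 and 4/4 as 1.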
31 Learning Pattern Tableaux. Generate smallest tableau with support and global confidence: NP-complete, provably hard to approximate. Generate smallest tableau with support and local confidence: NP-complete. But
32 Identifying Inconsistencies: Impact of Big Data. Variety, variability of data: one size does not fit all; exact vs approximate models. Volume of data: scalable algorithms, trade-off between efficiency and accuracy.
33 Efficiency vs Accuracy. Trade-off running time with accuracy of solution. Input: FD X → Y, global support and local confidence. Consider all instantiations of FD antecedent (X). Prune based on local confidence. Apply partial greedy set cover until support is reached. Output: log(n)-approximation in tableau size.
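The steps on this slide can be sketched as follows: enumerate every instantiation of the antecedent, drop patterns below the local confidence threshold, then greedily pick the pattern that covers the most still-uncovered tuples until the support target is met. This is an illustrative sketch rather than the implementation from [GKK+08]; the tie-breaking rule, the helper names, and the Price/Tax values marked "assumed" are all assumptions.

```python
from collections import Counter, defaultdict
from itertools import product

LHS, RHS = ("Name", "Type", "Country"), ("Price", "Tax")
COLUMNS = ("Tid",) + LHS + RHS
ROWS = [dict(zip(COLUMNS, r)) for r in [
    ("t1", "Harry Potter", "Book", "France", 10, 0),
    ("t2", "Harry Potter", "Book", "France", 10, 0),
    ("t3", "Harry Potter", "Book", "France", 12, 0),        # assumed
    ("t4", "Lord of the Rings", "Book", "USA", 25, 0),
    ("t5", "Lord of the Rings", "Book", "USA", 25, 0),      # assumed
    ("t6", "Armani Suit", "Clothing", "USA", 200, 8),       # assumed
    ("t7", "Armani Suit", "Clothing", "USA", 200, 8),       # assumed
    ("t8", "Kaminski Hat", "Clothing", "Australia", 40, 4), # assumed
    ("t9", "Kaminski Hat", "Clothing", "Australia", 40, 4), # assumed
]]

def confidence(rows):
    """Fraction of rows that can be kept so that the FD holds exactly."""
    by_lhs = defaultdict(Counter)
    for r in rows:
        by_lhs[tuple(r[a] for a in LHS)][tuple(r[a] for a in RHS)] += 1
    return sum(max(c.values()) for c in by_lhs.values()) / len(rows)

def all_patterns(rows):
    """Every instantiation of the antecedent: each attribute is either
    a row's constant or the wildcard '*'."""
    return {tuple(r[a] if keep else "*" for a, keep in zip(LHS, mask))
            for r in rows
            for mask in product((True, False), repeat=len(LHS))}

def covered_by(rows, pattern):
    return [r for r in rows
            if all(c == "*" or r[a] == c for a, c in zip(LHS, pattern))]

def greedy_tableau(rows, min_support, min_conf):
    # Prune on local confidence, remembering which tuples each pattern covers.
    cover = {}
    for p in all_patterns(rows):
        matched = covered_by(rows, p)
        if confidence(matched) >= min_conf:
            cover[p] = {r["Tid"] for r in matched}
    tableau, covered = [], set()
    # Partial greedy set cover: stop once the support target is reached.
    while len(covered) < min_support * len(rows) and cover:
        best = max(cover, key=lambda p: (len(cover[p] - covered),
                                         p.count("*"), p))
        if not cover[best] - covered:
            break
        covered |= cover.pop(best)
        tableau.append(best)
    return tableau, covered

tableau, covered = greedy_tableau(ROWS, min_support=0.6, min_conf=1.0)
print(tableau, sorted(covered))
```

Ties are broken toward more general patterns and then lexicographically, purely so the sketch is deterministic.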
34 Example: Pattern as a Set
Tid | Name | Type | Country | Customer | Price | Tax
t1 | Harry Potter | Book | France | Andrew | 10 | 0
t2 | Harry Potter | Book | France | Barbara | 10 | 0
t3 | Harry Potter | Book | France | Candy | |
t4 | Lord of the Rings | Book | USA | David | 25 | 0
t5 | Lord of the Rings | Book | USA | Eran | |
t6 | Armani Suit | Clothing | USA | Frank | |
t7 | Armani Suit | Clothing | USA | George | |
t8 | Kaminski Hat | Clothing | Australia | Harry | |
t9 | Kaminski Hat | Clothing | Australia | Ian | |
Pattern [Name = *, Type = Clothing, Country = *]
Pattern [Name = *, Type = *, Country = USA]
35 Efficiency vs Accuracy. Trade-off running time with accuracy of solution. Input: FD X → Y, global support and local confidence. Consider all instantiations of FD antecedent (X). Prune based on local confidence. Apply partial greedy set cover until support is reached. Output: log(n)-approximation in tableau size. All instantiations of FD antecedent X = [Name, Type, Country]: [HP, Book, France], [HP, Book, *], [HP, *, France], [HP, *, *], ... With |X| = d and # of data records = N, the # of patterns can be up to N·2^d.
36 Efficiency vs Accuracy. Trade-off running time with accuracy of solution. Input: FD X → Y, global support and local confidence. Consider all instantiations of FD antecedent (X). Prune based on local confidence. Apply partial greedy coverage until support is reached. Output: log(n)-approximation in tableau size. All instantiations of FD antecedent X = [Name, Type, Country]: [HP, Book, France], [HP, Book, *], [HP, *, France], [HP, *, *], ... With |X| = d and # of data records = N, the # of patterns can be up to N·2^d. Too many patterns (sets) to consider in each iteration!
37 Efficiency vs Accuracy. Problem: N·2^d patterns to consider in partial greedy coverage. Solution: do not instantiate the entire search space of X a priori. Incremental generation of search space: on-demand algorithm! Pattern lattice, from most general to most specific: [*, *, *]; then [HP, *, *], [*, *, France], [*, Book, *], [*, *, USA], [LotR, *, *]; then [HP, Book, *], [*, Book, France], [LotR, Book, *]; then [HP, Book, France], ..., [LotR, Book, USA].
38 Efficiency vs Accuracy. Pattern lattice: [*, *, *]; then [HP, *, *], [*, *, France], [*, Book, *], [*, *, USA], [LotR, *, *]; then [HP, Book, *], [*, Book, France], [LotR, Book, *]; then [HP, Book, France], ..., [LotR, Book, USA]. On-demand algorithm: incremental generation of search space. Start from root pattern of lattice.
39 Efficiency vs Accuracy. Pattern lattice: [*, *, *]; then [HP, *, *], [*, *, France], [*, Book, *], [*, *, USA], [LotR, *, *]; then [HP, Book, *], [*, Book, France], [LotR, Book, *]; then [HP, Book, France], ..., [LotR, Book, USA]. On-demand algorithm: incremental generation of search space. Start from root pattern of lattice.
40 Efficiency vs Accuracy. Pattern lattice: [*, *, *]; then [HP, *, *], [*, *, France], [*, Book, *], [*, *, USA], [LotR, *, *]; then [HP, Book, *], [*, Book, France], [LotR, Book, *]; then [HP, Book, France], ..., [LotR, Book, USA]. On-demand algorithm: incremental generation of search space. Start from root pattern of lattice. If local confidence of pattern < threshold, explore its unpruned children.
41 Efficiency vs Accuracy. Pattern lattice: [*, *, *]; then [HP, *, *], [*, *, France], [*, Book, *], [*, *, USA], [LotR, *, *]; then [HP, Book, *], [*, Book, France], [LotR, Book, *]; then [HP, Book, France], ..., [LotR, Book, USA]. On-demand algorithm: incremental generation of search space. Start from root pattern of lattice. If local confidence of pattern < threshold, explore its unpruned children.
42 Efficiency vs Accuracy. Pattern lattice: [*, *, *]; then [HP, *, *], [*, *, France], [*, Book, *], [*, *, USA], [LotR, *, *]; then [HP, Book, *], [*, Book, France], [LotR, Book, *]; then [HP, Book, France], ..., [LotR, Book, USA]. On-demand algorithm: incremental generation of search space. Start from root pattern of lattice. If local confidence of pattern < threshold, explore its unpruned children. If local confidence of pattern ≥ threshold, prune the sub-lattice incident on it.
43 Efficiency vs Accuracy. Pattern lattice: [*, *, *]; then [HP, *, *], [*, *, France], [*, Book, *], [*, *, USA], [LotR, *, *]; then [HP, Book, *], [*, Book, France], [LotR, Book, *]; then [HP, Book, France], ..., [LotR, Book, USA]. On-demand algorithm: incremental generation of search space. Start from root pattern of lattice. If local confidence of pattern < threshold, explore its unpruned children. If local confidence of pattern ≥ threshold, prune the sub-lattice incident on it. Same search space exploration as partial greedy set cover!
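A rough sketch of the on-demand exploration described above: start at the root pattern, expand a pattern's children only while its local confidence is below the threshold, and stop descending once a pattern meets the threshold. It only generates candidate patterns and leaves out the interleaving with greedy selection, so it is a simplification rather than the published algorithm. Helper names and the Price/Tax values marked "assumed" are hypothetical.

```python
from collections import Counter, defaultdict, deque

LHS, RHS = ("Name", "Type", "Country"), ("Price", "Tax")
COLUMNS = ("Tid",) + LHS + RHS
ROWS = [dict(zip(COLUMNS, r)) for r in [
    ("t1", "Harry Potter", "Book", "France", 10, 0),
    ("t2", "Harry Potter", "Book", "France", 10, 0),
    ("t3", "Harry Potter", "Book", "France", 12, 0),        # assumed
    ("t4", "Lord of the Rings", "Book", "USA", 25, 0),
    ("t5", "Lord of the Rings", "Book", "USA", 25, 0),      # assumed
    ("t6", "Armani Suit", "Clothing", "USA", 200, 8),       # assumed
    ("t7", "Armani Suit", "Clothing", "USA", 200, 8),       # assumed
    ("t8", "Kaminski Hat", "Clothing", "Australia", 40, 4), # assumed
    ("t9", "Kaminski Hat", "Clothing", "Australia", 40, 4), # assumed
]]

def confidence(rows):
    """Fraction of rows that can be kept so that the FD holds exactly."""
    by_lhs = defaultdict(Counter)
    for r in rows:
        by_lhs[tuple(r[a] for a in LHS)][tuple(r[a] for a in RHS)] += 1
    return sum(max(c.values()) for c in by_lhs.values()) / len(rows)

def covered_by(rows, pattern):
    return [r for r in rows
            if all(c == "*" or r[a] == c for a, c in zip(LHS, pattern))]

def generalizes(general, specific):
    return all(g == "*" or g == s for g, s in zip(general, specific))

def explore(rows, min_conf):
    """Breadth-first walk of the pattern lattice, generated on demand."""
    root = ("*",) * len(LHS)
    queue, seen, candidates = deque([root]), {root}, []
    while queue:
        p = queue.popleft()
        if any(generalizes(c, p) for c in candidates):
            continue                  # p lies in an already pruned sub-lattice
        matched = covered_by(rows, p)
        if confidence(matched) >= min_conf:
            candidates.append(p)      # good enough: do not descend further
            continue
        for i, const in enumerate(p): # otherwise instantiate one more wildcard
            if const != "*":
                continue
            for value in sorted({r[LHS[i]] for r in matched}):
                child = p[:i] + (value,) + p[i + 1:]
                if child not in seen:
                    seen.add(child)
                    queue.append(child)
    return candidates, seen

candidates, seen = explore(ROWS, min_conf=1.0)
print(candidates)
print(len(seen), "patterns generated")
```

On this toy table the walk generates fewer patterns than the 26 distinct instantiations a full enumeration would produce.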
44 Identifying Inconsistencies: Impact of Big Data. Variety, variability of data: one size does not fit all; exact vs approximate models. Volume of data: scalable algorithms, trade-off between efficiency and accuracy. Velocity of data: incremental, streaming algorithms.
45 Other Data Quality Models Inclusion dependencies: every manager is an employee Sequential dependencies: consecutive polls must be 3-5min apart Matching dependencies: if similar name, address must be same Conservation dependencies: router in-traffic = router out-traffic Denial constraints: single tax exemption cannot exceed salary 45
46 Outline Introduction Identifying inconsistencies Repairing inconsistencies 46
47 Repair Techniques. Glitch repairs, using source analysis [DBS09a]: introduced the idea of copy detection for structured data. Glitch repairs by value modification, for FDs + InDs [BFF+05]: introduced the idea of cell equivalence classes. Glitch + model repairs, for FDs [CM11]: introduced the idea of model repairs.
48 Repairs Using Source Analysis [DBS09a]. Problem: given a database D obtained from a set of sources with overlapping data items, and a single FD C such that each (S_i, C) is consistent but (D, C) is inconsistent, find the best repair D′ of D. Result: using source quality and copy detection is essential. Key ideas: focus on value modifications of FD RHS attributes; learning and using source quality is better than naïve voting; copy detection between sources can prevent cabals.
49 Repairs Using Source Analysis. Repairs: voting + source quality + copy detection. Resolves inconsistency across diversity of sources. Values provided by sources S1-S5 (USP): Jagadish: UM, ATT, UM, UM, UI; Dewitt: MSR, MSR, UW, UW, UW; Bernstein: MSR, MSR, MSR, MSR, MSR; Carey: UCI, ATT, BEA, BEA, BEA; Franklin: UCB, UCB, UMD, UMD, UMD.
50 Repairs Using Source Analysis. Repairs: voting + source quality + copy detection. Values provided by sources S1-S3 (USP): Jagadish: UM, ATT, UM; Dewitt: MSR, MSR, UW; Bernstein: MSR, MSR, MSR; Carey: UCI, ATT, BEA; Franklin: UCB, UCB, UMD.
51 Repairs Using Source Analysis. Repairs: voting + source quality + copy detection. Voting supports difference of opinion. Values provided by sources S1-S3 (USP): Jagadish: UM, ATT, UM; Dewitt: MSR, MSR, UW; Bernstein: MSR, MSR, MSR; Carey: UCI, ATT, BEA; Franklin: UCB, UCB, UMD.
52 Repairs Using Source Analysis. Repairs: voting + source quality + copy detection. Values provided by sources S1-S3 (USP): Jagadish: UM, ATT, UM; Dewitt: MSR, MSR, UW; Bernstein: MSR, MSR, MSR; Carey: UCI, ATT, BEA; Franklin: UCB, UCB, UMD.
53 Repairs Using Source Analysis. Repairs: voting + source quality + copy detection. Source quality gives more weight to knowledgeable sources. Values provided by sources S1-S3 (USP): Jagadish: UM, ATT, UM; Dewitt: MSR, MSR, UW; Bernstein: MSR, MSR, MSR; Carey: UCI, ATT, BEA; Franklin: UCB, UCB, UMD.
54 Repairs Using Source Analysis. Repairs: voting + source quality + copy detection. Values provided by sources S1-S5 (USP): Jagadish: UM, ATT, UM, UM, UI; Dewitt: MSR, MSR, UW, UW, UW; Bernstein: MSR, MSR, MSR, MSR, MSR; Carey: UCI, ATT, BEA, BEA, BEA; Franklin: UCB, UCB, UMD, UMD, UMD.
55 Repairs Using Source Analysis. Repairs: voting + source quality + copy detection. Values provided by sources S1-S5 (USP): Jagadish: UM, ATT, UM, UM, UI; Dewitt: MSR, MSR, UW, UW, UW; Bernstein: MSR, MSR, MSR, MSR, MSR; Carey: UCI, ATT, BEA, BEA, BEA; Franklin: UCB, UCB, UMD, UMD, UMD.
56 Repairs Using Source Analysis. Repairs: voting + source quality + copy detection. Copy detection reduces weight of copier sources. Values provided by sources S1-S5 (USP): Jagadish: UM, ATT, UM, UM, UI; Dewitt: MSR, MSR, UW, UW, UW; Bernstein: MSR, MSR, MSR, MSR, MSR; Carey: UCI, ATT, BEA, BEA, BEA; Franklin: UCB, UCB, UMD, UMD, UMD.
57 Basic Solution: Naïve Voting Supports difference of opinion, allows conflict resolution Works well for independent sources that have similar accuracy When sources have different accuracies Need to give more weight to votes by knowledgeable sources When sources copy from other sources Need to reduce the weight of votes by copiers 57
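As a small illustration of naïve voting, the sketch below takes the five-source table from the earlier slides and picks the most frequent value per row. The dictionary layout and the function name `naive_vote` are assumptions made for this example.

```python
from collections import Counter

# Values claimed by sources S1..S5, in that order, taken from the slides.
claims = {
    "Jagadish":  ["UM", "ATT", "UM", "UM", "UI"],
    "Dewitt":    ["MSR", "MSR", "UW", "UW", "UW"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey":     ["UCI", "ATT", "BEA", "BEA", "BEA"],
    "Franklin":  ["UCB", "UCB", "UMD", "UMD", "UMD"],
}

def naive_vote(values):
    """Return the value provided by the largest number of sources."""
    return Counter(values).most_common(1)[0][0]

resolved = {item: naive_vote(values) for item, values in claims.items()}
print(resolved)
```

In every row where S3, S4 and S5 agree, their shared value wins the vote. If those three sources are not independent, the vote is counting one opinion three times, which is the motivation for weighting by source accuracy and for copy detection.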
58 Source Accuracy [YHY08, DBS09a]. Need to give more weight to knowledgeable sources. Computing source accuracy: $A(S) = \mathrm{Avg}_{v_i(D) \in S} \Pr(v_i(D)\ \text{true} \mid \Phi)$, where $v_i(D) \in S$ means that S provides value $v_i$ on data item D, $\Phi$ denotes the observations on all data items by the set of sources, and $\Pr(v_i(D)\ \text{true} \mid \Phi)$ is the probability of $v_i(D)$ being true. How to compute $\Pr(v_i(D)\ \text{true} \mid \Phi)$?
59 Source Accuracy. Input: data item D, $val(D) = \{v_0, v_1, \ldots, v_n\}$, $\Phi$. Output: $\Pr(v_i(D)\ \text{true} \mid \Phi)$ for $i = 0, \ldots, n$ (sum = 1). Based on Bayes' rule, need $\Pr(\Phi \mid v_i(D)\ \text{true})$. Under independence, need $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true})$: if S provides $v_i$, $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = A(S)$; if S does not, $\Pr(\Phi_D(S) \mid v_i(D)\ \text{true}) = (1 - A(S))/n$. Challenge: inter-dependence between source accuracy and value probability?
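60 Iterative computation of value probability and source accuracy
Source accuracy: $A(S) = \mathrm{Avg}_{v(D) \in S} \Pr(v(D)\ \text{true} \mid \Phi)$
Source vote count: $A'(S) = \ln \frac{n A(S)}{1 - A(S)}$
Value vote count: $C(v(D)) = \sum_{S \in \bar{S}(v(D))} A'(S)$, where $\bar{S}(v(D))$ is the set of sources that provide v on D
Value probability: $\Pr(v(D)\ \text{true} \mid \Phi) = \frac{e^{C(v(D))}}{\sum_{v_0 \in val(D)} e^{C(v_0(D))}}$
Continue until source accuracy converges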
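The loop between source accuracy, vote counts and value probabilities can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' code: `n` (taken here as the number of false values per data item), the initial accuracy of 0.8, the clamping of accuracies to [0.01, 0.99] to keep the logarithm finite, and the restriction of val(D) to the observed values are all choices made for the example. The data is the three-source table from the earlier slides.

```python
import math

def truth_discovery(claims, n=10, init_accuracy=0.8, tol=1e-6, max_iters=100):
    """claims maps source -> {data item: value provided}."""
    accuracy = {s: init_accuracy for s in claims}
    items = {d for by_item in claims.values() for d in by_item}
    prob = {}
    for _ in range(max_iters):
        prob = {}
        for d in items:
            count = {}  # value vote count C(v(D))
            for s, by_item in claims.items():
                if d in by_item:
                    a = min(max(accuracy[s], 0.01), 0.99)  # clamp (assumption)
                    vote = math.log(n * a / (1 - a))        # source vote A'(S)
                    count[by_item[d]] = count.get(by_item[d], 0.0) + vote
            z = sum(math.exp(c) for c in count.values())
            prob[d] = {v: math.exp(c) / z for v, c in count.items()}
        # Source accuracy: average probability of the values it provides.
        updated = {s: sum(prob[d][v] for d, v in by_item.items()) / len(by_item)
                   for s, by_item in claims.items()}
        converged = max(abs(updated[s] - accuracy[s]) for s in claims) < tol
        accuracy = updated
        if converged:
            break
    return prob, accuracy

ITEMS = ["Jagadish", "Dewitt", "Bernstein", "Carey", "Franklin"]
claims = {
    "S1": dict(zip(ITEMS, ["UM", "MSR", "MSR", "UCI", "UCB"])),
    "S2": dict(zip(ITEMS, ["ATT", "MSR", "MSR", "ATT", "UCB"])),
    "S3": dict(zip(ITEMS, ["UM", "UW", "MSR", "BEA", "UMD"])),
}
prob, accuracy = truth_discovery(claims)
for item in ITEMS:
    print(item, max(prob[item], key=prob[item].get))
print(accuracy)
```

Unlike naïve voting, a row with three different values (Carey) is no longer a tie: the source that agrees with the majority elsewhere accumulates a higher accuracy and its vote carries more weight.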
61 Copy Detection. Are Source 1 and Source 2 dependent? Not necessarily. Source 1 on USA Presidents: 1st: George Washington; 2nd: John Adams; 3rd: Thomas Jefferson; 4th: James Madison; 41st: George H.W. Bush; 42nd: William J. Clinton; 43rd: George W. Bush; 44th: Barack Obama. Source 2 on USA Presidents: 1st: George Washington; 2nd: John Adams; 3rd: Thomas Jefferson; 4th: James Madison; 41st: George H.W. Bush; 42nd: William J. Clinton; 43rd: George W. Bush; 44th: Barack Obama.
62 Copy Detection. Are Source 1 and Source 2 dependent? Very likely. Source 1 on USA Presidents: 1st: George Washington; 2nd: Benjamin Franklin; 3rd: John F. Kennedy; 4th: Abraham Lincoln; 41st: George W. Bush; 42nd: Hillary Clinton; 43rd: Dick Cheney; 44th: Barack Obama. Source 2 on USA Presidents: 1st: George Washington; 2nd: Benjamin Franklin; 3rd: John F. Kennedy; 4th: Abraham Lincoln; 41st: George W. Bush; 42nd: Hillary Clinton; 43rd: Dick Cheney; 44th: John McCain.
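63 Copy Detection: Bayesian Analysis. For the data items shared by S1 and S2, the observations split into different values ($O_d$), same true value ($O_t$), and same false value ($O_f$). Goal: $\Pr(S_1 \perp S_2 \mid \Phi)$ and $\Pr(S_1 \sim S_2 \mid \Phi)$ (sum = 1), where $\perp$ denotes independence and $\sim$ denotes copying. According to Bayes' rule, we need $\Pr(\Phi \mid S_1 \perp S_2)$ and $\Pr(\Phi \mid S_1 \sim S_2)$. Key: compute $\Pr(\Phi_D \mid S_1 \perp S_2)$ and $\Pr(\Phi_D \mid S_1 \sim S_2)$ for each $D \in S_1 \cap S_2$.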
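64 Copy Detection: Bayesian Analysis. Probability of each observation type on a shared data item, under independence vs copying:
Observation | Independence | | Copying
$O_t$ (same true value) | $A^2$ | < | $A \cdot c + A^2 (1 - c)$
$O_f$ (same false value) | $\frac{(1-A)^2}{n}$ | << | $(1-A) \cdot c + \frac{(1-A)^2}{n} (1 - c)$
$O_d$ (different values) | $P_d = 1 - A^2 - \frac{(1-A)^2}{n}$ | > | $P_d (1 - c)$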
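The table above translates directly into a posterior for copying given counts of the three observation types. The sketch below assumes that A is a shared source accuracy, c is the probability that a copier copies a particular item, n is the number of false values per item, and the prior probability of copying is 0.5; the parameter values and the function name `copy_probability` are illustrative assumptions. The counts are read off the two "USA Presidents" examples.

```python
import math

def copy_probability(k_t, k_f, k_d, A=0.8, c=0.8, n=100, prior_copy=0.5):
    """Posterior probability that two sources are dependent, given how many
    shared items have the same true value (k_t), the same false value (k_f),
    or different values (k_d)."""
    p_d = 1 - A**2 - (1 - A)**2 / n
    log_indep = (k_t * math.log(A**2)
                 + k_f * math.log((1 - A)**2 / n)
                 + k_d * math.log(p_d))
    log_copy = (k_t * math.log(A * c + A**2 * (1 - c))
                + k_f * math.log((1 - A) * c + (1 - A)**2 / n * (1 - c))
                + k_d * math.log(p_d * (1 - c)))
    log_odds = log_copy - log_indep + math.log(prior_copy / (1 - prior_copy))
    # Numerically stable logistic function of the log-odds.
    if log_odds >= 0:
        return 1 / (1 + math.exp(-log_odds))
    e = math.exp(log_odds)
    return e / (1 + e)

# First example: eight shared values, all true.
same_true = copy_probability(k_t=8, k_f=0, k_d=0)
# Second example: one shared true value, six shared false values, one difference.
shared_false = copy_probability(k_t=1, k_f=6, k_d=1)
print(round(same_true, 3), round(shared_false, 6))
```

Sharing true values moves the posterior only mildly, because accurate independent sources would agree on them anyway, while sharing the same false values is strong evidence of copying.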
65 Iterative Process Typically converges when #objs >> #srcs Step 2 Truth Discovery Accuracy Computation Step 3 Copy Detection Step 1 65
66 Repair Techniques. Glitch repairs, using source analysis [DBS09a]: introduced the idea of copy detection for structured data. Glitch repairs by value modification, for FDs + InDs [BFF+05]: introduced the idea of cell equivalence classes. Glitch + model repairs, for FDs [CM11]: introduced the idea of model repairs.
67 Repairs Using Value Modification. Problem: given a database D and FD and InD constraints C such that (D, C) is inconsistent, find a repair D′ of D with minimum cost(D′). Result: the problem is NP-hard even for only FDs or only InDs. Key ideas: focus on value modifications of FD RHS attributes; the cost model for repairs is based on value accuracy and repair similarity; equivalence classes of cells with identical values in the repair permit a delayed assignment of a value to an equivalence class.
68 Repairs Using Value Modification. CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NJ; t Ali Smith 27 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. InD: Equip[Tel] ⊆ Customer[Tel]
69 Repairs Using Value Modification. CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NJ; t Ali Smith 27 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. InD: Equip[Tel] ⊆ Customer[Tel]
70 Repairs Using Value Modification. CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NJ; t Ali Smith 27 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
71 Repairs Using Value Modification X. CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NJ; t Ali Smith 27 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
72 Repairs Using Value Modification. Repair alternatives when records t_i and t_j violate FD X → Y. Value modification of LHS attributes X: modify t_j[X] to a value different from t_i[X]; it is unclear what (different) value should be assigned to t_j[X]. Value modification of RHS attributes Y: modify t_j[Y] to equal t_i[Y] or vice versa; use cost of repair to choose between alternatives. FD violations can always be repaired by modifying RHS attributes Y. Naïve approach can lead to non-termination.
73 Repairs Using Value Modification. CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NJ; t Alice Smith 17 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]
74 Repairs Using Value Modification. CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NJ; t Alice Smith 17 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]. FD: Customer[Zip] → Customer[City, State]. FD: Customer[Name, Street, Zip] → Customer[Tel]
75 Repairs Using Value Modification X. CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NJ; t Alice Smith 17 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]. FD: Customer[Zip] → Customer[City, State]. FD: Customer[Name, Street, Zip] → Customer[Tel]
76 Repairs Using Value Modification? CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NY; t Alice Smith 17 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]. FD: Customer[Zip] → Customer[City, State]. FD: Customer[Name, Street, Zip] → Customer[Tel]
77 Repairs Using Value Modification X. CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NY; t Alice Smith 17 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. InD: Equip[Tel] ⊆ Customer[Tel]. FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]. FD: Customer[Zip] → Customer[City, State]. FD: Customer[Name, Street, Zip] → Customer[Tel]
78 Repairs Using Value Modification. Repair alternatives when record t_i violates InD R_i[X] ⊆ R_j[Y]. Value modification of t_i[X]: modify t_i[X] to a value t_j[Y] for some t_j in R_j. Value modification of t_j[Y]: modify t_j[Y] for some t_j in R_j to equal t_i[X]. Use cost of repair to choose between alternatives.
79 Repairs Using Value Modification. CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NY; t Alice Smith 17 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. InD: Equip[Tel] ⊆ Customer[Tel]. FD: Customer[Tel] → Customer[Name, Street, City, State, Zip]. FD: Customer[Zip] → Customer[City, State]. FD: Customer[Name, Street, Zip] → Customer[Tel]
80 Repairs Using Value Modification. CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NJ; t Ali Smith 27 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. Greedily build equivalence classes of cells: {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)}, {(t1, Name), (t4, Name)}
81 Repairs Using Value Modification. CUSTOMER (TId, Tel, Name, Street, City, State, Zip, Wt): t Alice Smith 17 Bridge Midville AZ; t Bob Jones 5 Valley Centre NY; t Bob Jones 5 Valley Centre NJ; t Ali Smith 27 Bridge Midville AZ. EQUIP (Tid, Tel, SerNo, EqMfct, EqModel, InstDate, Wt): t L55001 LU ze400 Jan-03 2; t L55011 LU ze400 Mar-03 1. Greedily build equivalence classes of cells, assign unique value: {(t2, Tel), (t3, Tel), (t5, Tel), (t6, Tel)}; {(t1, Name), (t4, Name)} is assigned Alice Smith
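The idea of cell equivalence classes can be sketched with a union-find structure: cells that must end up with the same value are merged into one class first, and a value is assigned to the whole class afterwards. This is a simplified illustration, not the cost model of [BFF+05]: it handles a single FD, and it picks the value with the highest total tuple weight instead of using value similarity. The telephone numbers and weights are hypothetical because the transcript lost them.

```python
from collections import defaultdict

class UnionFind:
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def repair_rhs(rows, lhs, rhs, weight="Wt"):
    """Merge RHS cells of tuples that agree on the LHS, then give every
    class the value backed by the largest total weight."""
    uf, first_with_lhs = UnionFind(), {}
    for tid, row in rows.items():
        anchor = first_with_lhs.setdefault(tuple(row[a] for a in lhs), tid)
        for attr in rhs:
            uf.union((anchor, attr), (tid, attr))
    classes = defaultdict(list)
    for tid in rows:
        for attr in rhs:
            classes[uf.find((tid, attr))].append((tid, attr))
    repaired = {tid: dict(row) for tid, row in rows.items()}
    for cells in classes.values():
        score = defaultdict(int)
        for tid, attr in cells:
            score[rows[tid][attr]] += rows[tid][weight]
        best = max(score, key=lambda v: (score[v], v))
        for tid, attr in cells:          # delayed assignment to the class
            repaired[tid][attr] = best
    return list(classes.values()), repaired

rows = {  # Tel and Wt values are hypothetical
    "t1": {"Tel": "555-0100", "Name": "Alice Smith", "Street": "17 Bridge", "Wt": 2},
    "t2": {"Tel": "555-0101", "Name": "Bob Jones", "Street": "5 Valley", "Wt": 1},
    "t4": {"Tel": "555-0100", "Name": "Ali Smith", "Street": "27 Bridge", "Wt": 1},
}
classes, repaired = repair_rhs(rows, lhs=["Tel"], rhs=["Name", "Street"])
print(repaired["t4"])
```

Delaying the assignment means a class can keep growing as more constraints are processed before any value is committed, which avoids the back-and-forth changes a naïve repair can fall into.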
82 Repair Techniques. Glitch repairs, using source analysis [DBS09a]: introduced the idea of copy detection for structured data. Glitch repairs by value modification, for FDs + InDs [BFF+05]: introduced the idea of cell equivalence classes. Glitch + model repairs, for FDs [CM11]: introduced the idea of model repairs.
83 Repairing Data and Constraints. Motivation: variability of data semantics over time. Problem: given a database D and FD constraints C such that (D, C) is inconsistent, find a repair (D′, C′) with minimum cost. Key ideas: allow value modifications of FD RHS or LHS attributes; allow modifications of FDs in C by augmenting the LHS; the cost model for repairs is based on minimum description length.
84 Repairing Data and Constraints. Columns: Tid, District, Region, Municipal, AC, Tel, Street, Zip, City, State. t1 Brookside Granville Glendale Boxwood NY NY; t2 Brookside Granville Glendale Boxwood NY NY; t3 Brookside Granville Glendale Westlane NY MA; t4 Brookside Granville Guild Squire Boston MA; t5 Brookside Granville Guild Squire Boston MA; t6 Brookside Granville Queen Main Chicago IL; t7 Brookside Granville Queen Main Chicago IL; t8 Brookside Granville Queen Main Chicago IL; t9 Brookside Granville Queen Bay Chicago IL. FD: [District, Region] → [AC, City, State]
85 Repairing Data and Constraints. Columns: Tid, District, Region, Municipal, AC, Tel, Street, Zip, City, State. t1 Brookside Granville Glendale Boxwood NY NY; t2 Brookside Granville Glendale Boxwood NY NY; t3 Brookside Granville Glendale Westlane NY MA; t4 Brookside Granville Guild Squire Boston MA; t5 Brookside Granville Guild Squire Boston MA; t6 Brookside Granville Queen Main Chicago IL; t7 Brookside Granville Queen Main Chicago IL; t8 Brookside Granville Queen Main Chicago IL; t9 Brookside Granville Queen Bay Chicago IL. FD: [District, Region] → [AC, City, State]. Expensive repair using only value modifications.
86 Repairing Data and Constraints. Repair alternatives when records t_i and t_j violate FD X → Y. Value modification of RHS attributes Y. Value modification of LHS attributes X: modify t_j[X] to a value different from t_i[X], supported by the data. Repair constraints by augmenting LHS (X) with a new attribute: the new attribute provides additional context. Choose from alternatives using MDL-based cost model.
87 MDL-Based Cost Model Quantifies trade-off of a data repair versus a constraint repair Cost-model based on the three properties Accuracy: value modifications must minimize distance Redundancy: value modifications must be well supported in data, constraint repairs must result in a higher degree of consistency Conciseness: repaired constraints should explain, but not overfit Minimum description length (MDL) principle Length of model + length to encode data given the model 87
88 Repairing Data and Constraints. Columns: Tid, District, Region, Municipal, AC, Tel, Street, Zip, City, State. t1 Brookside Granville Glendale Boxwood NY NY; t2 Brookside Granville Glendale Boxwood NY NY; t3 Brookside Granville Glendale Westlane NY MA; t4 Brookside Granville Guild Squire Boston MA; t5 Brookside Granville Guild Squire Boston MA; t6 Brookside Granville Queen Main Chicago IL; t7 Brookside Granville Queen Main Chicago IL; t8 Brookside Granville Queen Main Chicago IL; t9 Brookside Granville Queen Bay Chicago IL. Cheap repair of constraints and data: FD [District, Region, Municipal] → [AC, City, State], and t3.State = NY.
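The difference between the expensive data-only repair and the cheap combined repair can be made concrete by counting, for each candidate FD, how many tuples fall outside the majority RHS value of their LHS group. This is only a proxy for the MDL-based cost of [CM11], used here for illustration. The rows come from the slide; the AC column is left out because its values were lost in the transcript.

```python
from collections import Counter, defaultdict

COLUMNS = ("Tid", "District", "Region", "Municipal", "Street", "City", "State")
ROWS = [dict(zip(COLUMNS, ("t%d" % i, "Brookside", "Granville") + rest))
        for i, rest in enumerate([
            ("Glendale", "Boxwood", "NY", "NY"),
            ("Glendale", "Boxwood", "NY", "NY"),
            ("Glendale", "Westlane", "NY", "MA"),
            ("Guild", "Squire", "Boston", "MA"),
            ("Guild", "Squire", "Boston", "MA"),
            ("Queen", "Main", "Chicago", "IL"),
            ("Queen", "Main", "Chicago", "IL"),
            ("Queen", "Main", "Chicago", "IL"),
            ("Queen", "Bay", "Chicago", "IL"),
        ], start=1)]

def tuples_to_change(rows, lhs, rhs):
    """Tuples whose RHS differs from the majority RHS of their LHS group."""
    by_lhs = defaultdict(Counter)
    for r in rows:
        by_lhs[tuple(r[a] for a in lhs)][tuple(r[a] for a in rhs)] += 1
    return sum(sum(c.values()) - max(c.values()) for c in by_lhs.values())

RHS = ["City", "State"]
base = ["District", "Region"]
for extra in (None, "Municipal", "Street"):
    lhs = base + ([extra] if extra else [])
    groups = len({tuple(r[a] for a in lhs) for r in ROWS})
    print(lhs, "tuples to change:", tuples_to_change(ROWS, lhs, RHS),
          "LHS groups:", groups)
```

Adding Municipal leaves a single tuple to fix (t3), in line with the slide. Adding Street removes every violation but splits the data into more groups, which illustrates why a cost model also needs to penalize overfitted constraints.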
89 Repairing Data and Constraints. Columns: Tid, District, Region, Municipal, AC, Tel, Street, Zip, City, State. t1 Brookside Granville Glendale Boxwood NY NY; t2 Brookside Granville Glendale *** Boxwood *** ***; t3 Brookside Granville Glendale *** Westlane *** MA; t4 Brookside Granville Guild Squire Boston MA; t5 Brookside Granville Guild *** Squire *** ***; t6 Brookside Granville Queen Main Chicago IL; t7 Brookside Granville Queen *** Main *** ***; t8 Brookside Granville Queen *** Main *** ***; t9 Brookside Granville Queen *** Bay *** ***. MDL: length of model + length to encode data given the model. FD: [District, Region, Municipal] → [AC, City, State]
90 Conclusions Big data quality (veracity) is an important area of research Challenges due to volume, velocity, variety, variability Much interesting work has been done in this area Learn models (semantics) from the data Identify data glitches as violations of the learned models Repair data glitches and models in a timely manner A lot more research needs to be done! 90
91 Crowdsourcing Improving data quality by crowdsourcing 91
92 Source Exploration Tool Data.gov 92
93 Bibliography
[BFF+05] Philip Bohannon, Michael Flaster, Wenfei Fan, Rajeev Rastogi: A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification. SIGMOD 2005.
[CM11] Fei Chiang, Renée J. Miller: A Unified Model for Data and Constraint Repair. ICDE 2011.
[DBS09a] Xin Luna Dong, Laure Berti-Equille, Divesh Srivastava: Integrating Conflicting Data: The Role of Source Dependence. PVLDB 2(1) (2009).
[GKK+08] Lukasz Golab, Howard J. Karloff, Flip Korn, Divesh Srivastava, Bei Yu: On Generating Near-Optimal Tableaux for Conditional Functional Dependencies. PVLDB 1(1) (2008).
[LDL+12] Xian Li, Xin Luna Dong, Kenneth Lyons, Weiyi Meng, Divesh Srivastava: Truth Finding on the Deep Web: Is the Problem Solved? PVLDB 6(2) (2012).
More informationImproving Data Quality: Consistency and Accuracy
Improving Data Quality: Consistency and Accuracy Gao Cong 1 Wenfei Fan 2,3 Floris Geerts 2,4,5 Xibei Jia 2 Shuai Ma 2 1 Microsoft Research Asia 2 University of Edinburgh 4 Hasselt University 3 Bell Laboratories
More informationData Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems
Data Warehousing and Data Mining CPS 116 Introduction to Database Systems Announcements (December 1) 2 Homework #4 due today Sample solution available Thursday Course project demo period has begun! Check
More informationExam Advanced Data Mining Date: Time:
Exam Advanced Data Mining Date: 11-11-2010 Time: 13.30-16.30 General Remarks 1. You are allowed to consult 1 A4 sheet with notes written on both sides. 2. Always show how you arrived at the result of your
More informationExtending Functional Dependency to Detect Abnormal Data in RDF Graphs
Extending Functional Dependency to Detect Abnormal Data in RDF Graphs Yang Yu, Jeff Heflin SWAT Lab Department of Computer Science and Engineering Lehigh University PA, USA Outline Semantic Web data and
More informationQuotient Cube: How to Summarize the Semantics of a Data Cube
Quotient Cube: How to Summarize the Semantics of a Data Cube Laks V.S. Lakshmanan (Univ. of British Columbia) * Jian Pei (State Univ. of New York at Buffalo) * Jiawei Han (Univ. of Illinois at Urbana-Champaign)
More informationRelational model continued. Understanding how to use the relational model. Summary of board example: with Copies as weak entity
COS 597A: Principles of Database and Information Systems Relational model continued Understanding how to use the relational model 1 with as weak entity folded into folded into branches: (br_, librarian,
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 3. Chapter 3: Data Preprocessing. Major Tasks in Data Preprocessing
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 1 Chapter 3: Data Preprocessing Data Preprocessing: An Overview Data Quality Major Tasks in Data Preprocessing Data Cleaning Data Integration Data
More informationIncognito: Efficient Full Domain K Anonymity
Incognito: Efficient Full Domain K Anonymity Kristen LeFevre David J. DeWitt Raghu Ramakrishnan University of Wisconsin Madison 1210 West Dayton St. Madison, WI 53706 Talk Prepared By Parul Halwe(05305002)
More informationData X-Ray: A diagnostic tool for data errors Xiaolan Wang Xin Luna Dong Alexandra Meliou
Data X-Ray: A diagnostic tool for data errors Xiaolan Wang Xin Luna Dong Alexandra Meliou UNIVERSITY OF MASSACHUSETTS, AMHERST College of Information and Computer Sciences MANY APPLICATIONS RELY ON DATA
More informationEffective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar
Effective Keyword Search over (Semi)-Structured Big Data Mehdi Kargar School of Computer Science Faculty of Science University of Windsor How Big is this Big Data? 40 Billion Instagram Photos 300 Hours
More informationIdentifying Useful Data Dependency Using Agree Set form Relational Database
Volume 1, Issue 6, September 2016 ISSN: 2456-0006 International Journal of Science Technology Management and Research Available online at: Identifying Useful Data Using Agree Set form Relational Database
More informationPrivacy Preserving Data Publishing: From k-anonymity to Differential Privacy. Xiaokui Xiao Nanyang Technological University
Privacy Preserving Data Publishing: From k-anonymity to Differential Privacy Xiaokui Xiao Nanyang Technological University Outline Privacy preserving data publishing: What and Why Examples of privacy attacks
More informationAsking the Right Questions in Crowd Data Sourcing
MoDaS Mob Data Sourcing Asking the Right Questions in Crowd Data Sourcing Tova Milo Tel Aviv University Outline Introduction to crowd (data) sourcing Databases and crowds Declarative is good How to best
More informationData Preprocessing. Slides by: Shree Jaswal
Data Preprocessing Slides by: Shree Jaswal Topics to be covered Why Preprocessing? Data Cleaning; Data Integration; Data Reduction: Attribute subset selection, Histograms, Clustering and Sampling; Data
More informationThe Relational Data Model
The Relational Data Model Lecture 6 1 Outline Relational Data Model Functional Dependencies Logical Schema Design Reading Chapter 8 2 1 The Relational Data Model Data Modeling Relational Schema Physical
More informationCSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2009 Lecture 3 - Schema Normalization
CSE 544 Principles of Database Management Systems Magdalena Balazinska Fall 2009 Lecture 3 - Schema Normalization References R&G Book. Chapter 19: Schema refinement and normal forms Also relevant to this
More informationTextbook: Chapter 6! CS425 Fall 2013 Boris Glavic! Chapter 3: Formal Relational Query. Relational Algebra! Select Operation Example! Select Operation!
Chapter 3: Formal Relational Query Languages CS425 Fall 2013 Boris Glavic Chapter 3: Formal Relational Query Languages Relational Algebra Tuple Relational Calculus Domain Relational Calculus Textbook:
More informationFunctional Dependencies and Finding a Minimal Cover
Functional Dependencies and Finding a Minimal Cover Robert Soulé 1 Normalization An anomaly occurs in a database when you can update, insert, or delete data, and get undesired side-effects. These side
More informationMeasuring and Evaluating Dissimilarity in Data and Pattern Spaces
Measuring and Evaluating Dissimilarity in Data and Pattern Spaces Irene Ntoutsi, Yannis Theodoridis Database Group, Information Systems Laboratory Department of Informatics, University of Piraeus, Greece
More informationLessons Learned and Research Agenda for Big Data Integration of Product Specifications (Discussion Paper)
Lessons Learned and Research Agenda for Big Data Integration of Product Specifications (Discussion Paper) Luciano Barbosa 1, Valter Crescenzi 2, Xin Luna Dong 3, Paolo Merialdo 2, Federico Piai 2, Disheng
More informationData Preprocessing Yudho Giri Sucahyo y, Ph.D , CISA
Obj ti Objectives Motivation: Why preprocess the Data? Data Preprocessing Techniques Data Cleaning Data Integration and Transformation Data Reduction Data Preprocessing Lecture 3/DMBI/IKI83403T/MTI/UI
More informationA Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification
A Cost-Based Model and Effective Heuristic for Repairing Constraints by Value Modification Philip Bohannon Lucent Technologies Bell Laboratories bohannon@researchbell-labscom Wenfei Fan Univ of Edinburgh
More informationExam I Computer Science 420 Dr. St. John Lehman College City University of New York 12 March 2002
Exam I Computer Science 420 Dr. St. John Lehman College City University of New York 12 March 2002 NAME (Printed) NAME (Signed) E-mail Exam Rules Show all your work. Your grade will be based on the work
More informationECE521 Lecture 18 Graphical Models Hidden Markov Models
ECE521 Lecture 18 Graphical Models Hidden Markov Models Outline Graphical models Conditional independence Conditional independence after marginalization Sequence models hidden Markov models 2 Graphical
More informationData Mining: Concepts and Techniques. (3 rd ed.) Chapter 3
Data Mining: Concepts and Techniques (3 rd ed.) Chapter 3 Jiawei Han, Micheline Kamber, and Jian Pei University of Illinois at Urbana-Champaign & Simon Fraser University 2011 Han, Kamber & Pei. All rights
More informationOverview. Introduction to Data Warehousing and Business Intelligence. BI Is Important. What is Business Intelligence (BI)?
Introduction to Data Warehousing and Business Intelligence Overview Why Business Intelligence? Data analysis problems Data Warehouse (DW) introduction A tour of the coming DW lectures DW Applications Loosely
More informationAnnouncements. CS 188: Artificial Intelligence Fall Reminder: CSPs. Today. Example: 3-SAT. Example: Boolean Satisfiability.
CS 188: Artificial Intelligence Fall 2008 Lecture 5: CSPs II 9/11/2008 Announcements Assignments: DUE W1: NOW P1: Due 9/12 at 11:59pm Assignments: UP W2: Up now P2: Up by weekend Dan Klein UC Berkeley
More informationCS 188: Artificial Intelligence Fall 2008
CS 188: Artificial Intelligence Fall 2008 Lecture 5: CSPs II 9/11/2008 Dan Klein UC Berkeley Many slides over the course adapted from either Stuart Russell or Andrew Moore 1 1 Assignments: DUE Announcements
More informationERBlox: Combining Matching Dependencies with Machine Learning for Entity Resolution Leopoldo Bertossi Carleton University School of Computer Science Institute for Data Science Ottawa, Canada bertossi@scs.carleton.ca
More informationCSCI1270 Introduction to Database Systems
CSCI1270 Introduction to Database Systems with thanks to Prof. George Kollios, Boston University Prof. Mitch Cherniack, Brandeis University Prof. Avi Silberschatz, Yale University 1.1 What is a Database
More informationEntity-Relationship Modelling. Entities Attributes Relationships Mapping Cardinality Keys Reduction of an E-R Diagram to Tables
Entity-Relationship Modelling Entities Attributes Relationships Mapping Cardinality Keys Reduction of an E-R Diagram to Tables 1 Entity Sets A enterprise can be modeled as a collection of: entities, and
More informationToday. CS 188: Artificial Intelligence Fall Example: Boolean Satisfiability. Reminder: CSPs. Example: 3-SAT. CSPs: Queries.
CS 188: Artificial Intelligence Fall 2007 Lecture 5: CSPs II 9/11/2007 More CSPs Applications Tree Algorithms Cutset Conditioning Today Dan Klein UC Berkeley Many slides over the course adapted from either
More informationPeter X. Gao, Andrew R. Curtis, Bernard Wong, S. Keshav. Cheriton School of Computer Science University of Waterloo
Peter X. Gao, Andrew R. Curtis, Bernard Wong, S. Keshav Cheriton School of Computer Science University of Waterloo August 15, 2012 1 = ~1M servers CO 2 of 280,000 cars 2 Datacenters and Request Routing
More informationSemantic Search at Bloomberg
Semantic Search at Bloomberg Search Solutions 2017 Edgar Meij Team lead, R&D AI emeij@bloomberg.net @edgarmeij Bloomberg Professional Service Bloomberg at a glance Bloomberg Professional Service Trading
More informationGraDit: graph-based data repair algorithm for multiple data edits rule violations
Journal of Physics: Conference Series PAPER OPEN ACCESS GraDit: graph-based data repair algorithm for multiple data edits rule violations To cite this article: Wa Ode Zuhayeni Madjida and I Gusti Bagus
More informationScalable and Holistic Qualitative Data Cleaning
Scalable and Holistic Qualitative Data Cleaning by Xu Chu A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Doctor of Philosophy in Computer Science
More informationCAS CS 460/660 Introduction to Database Systems. Fall
CAS CS 460/660 Introduction to Database Systems Fall 2017 1.1 About the course Administrivia Instructor: George Kollios, gkollios@cs.bu.edu MCS 283, Mon 2:30-4:00 PM and Tue 1:00-2:30 PM Teaching Fellows:
More informationThis tutorial has been prepared for computer science graduates to help them understand the basic-to-advanced concepts related to data mining.
About the Tutorial Data Mining is defined as the procedure of extracting information from huge sets of data. In other words, we can say that data mining is mining knowledge from data. The tutorial starts
More informationOpen Data Integration. Renée J. Miller
Open Data Integration Renée J. Miller miller@northeastern.edu !2 Open Data Principles Timely & Comprehensive Accessible and Usable Complete - All public data is made available. Public data is data that
More informationImproving the Performance of OLAP Queries Using Families of Statistics Trees
Improving the Performance of OLAP Queries Using Families of Statistics Trees Joachim Hammer Dept. of Computer and Information Science University of Florida Lixin Fu Dept. of Mathematical Sciences University
More informationChapter 2: Relational Model
Chapter 2: Relational Model Database System Concepts, 5 th Ed. See www.db-book.com for conditions on re-use Chapter 2: Relational Model Structure of Relational Databases Fundamental Relational-Algebra-Operations
More informationDatabase System Concepts
s Slides (fortemente) baseados nos slides oficiais do livro c Silberschatz, Korth and Sudarshan. Chapter 2: Model Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2009/2010
More informationBig Data Challenges in Large IP Networks
Big Data Challenges in Large IP Networks Feature Extraction & Predictive Alarms for network management Wednesday 28 th Feb 2018 Dave Yearling British Telecommunications plc 2017 What we will cover Making
More informationJoint Entity Resolution
Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute
More informationIndexing Bi-temporal Windows. Chang Ge 1, Martin Kaufmann 2, Lukasz Golab 1, Peter M. Fischer 3, Anil K. Goel 4
Indexing Bi-temporal Windows Chang Ge 1, Martin Kaufmann 2, Lukasz Golab 1, Peter M. Fischer 3, Anil K. Goel 4 1 2 3 4 Outline Introduction Bi-temporal Windows Related Work The BiSW Index Experiments Conclusion
More informationCOSC Dr. Ramon Lawrence. Emp Relation
COSC 304 Introduction to Database Systems Normalization Dr. Ramon Lawrence University of British Columbia Okanagan ramon.lawrence@ubc.ca Normalization Normalization is a technique for producing relations
More informationApplied Databases. Sebastian Maneth. Lecture 5 ER Model, Normal Forms. University of Edinburgh - January 30 th, 2017
Applied Databases Lecture 5 ER Model, Normal Forms Sebastian Maneth University of Edinburgh - January 30 th, 2017 Outline 2 1. Entity Relationship Model 2. Normal Forms From Last Lecture 3 the Lecturer
More informationBig Data Analytics. Rasoul Karimi
Big Data Analytics Rasoul Karimi Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 1 Outline
More informationCS411 Database Systems. 05: Relational Schema Design Ch , except and
CS411 Database Systems 05: Relational Schema Design Ch. 3.1-3.5, except 3.4.2-3.4.3 and 3.5.3. 1 How does this fit in? ER Diagrams: Data Definition Translation to Relational Schema: Data Definition Relational
More informationOn Data Dependencies in Dataspaces
On Data Dependencies in Dataspaces Tsinghua University This is a joint work with Lei Chen (HKUST) and Philip S. Yu (UIC) 2011 On Data Dependencies in Dataspaces Introduction 1/24 Dataspaces provide a co-existing
More informationCS 188: Artificial Intelligence Spring Today
CS 188: Artificial Intelligence Spring 2006 Lecture 7: CSPs II 2/7/2006 Dan Klein UC Berkeley Many slides from either Stuart Russell or Andrew Moore Today More CSPs Applications Tree Algorithms Cutset
More informationFINDING QUALITY IN QUANTITY: THE CHALLENGE OF DISCOVERING VALUABLE SOURCES FOR INTEGRATION
FINDING QUALITY IN QUANTITY: THE CHALLENGE OF DISCOVERING VALUABLE SOURCES FOR INTEGRATION Theodoros Rekatsinas University of Maryland Amol Deshpande, Xin Luna Dong, Lise Getoor and Divesh Srivastava DATA,
More informationData Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems
Data Warehousing & Mining CPS 116 Introduction to Database Systems Data integration 2 Data resides in many distributed, heterogeneous OLTP (On-Line Transaction Processing) sources Sales, inventory, customer,
More informationBig Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety
Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges Big Data Analytics Influx of data pertaining to the 4Vs, i.e. Volume, Veracity, Velocity and Variety Abhishek
More informationHandout 12: Textual models
Handout 12: Textual models Taylor Arnold Loading and parsing the data The full text of all the State of the Union addresses through 2016 are available in the R package sotu, available on CRAN. The package
More informationIdentity John Homewoner Mary Homewoner. Employer - Company Search nh medical center columbia hospital
Summary of findings: ADV-120 Reference Number Requesting Lender Feb 03, 2016 12:06:32 PM EST Identity John Homewoner Mary Homewoner First Name Pass Pass Last Name Pass Pass SSN Discrepancy Discrepancy
More informationData Preprocessing. Data Mining 1
Data Preprocessing Today s real-world databases are highly susceptible to noisy, missing, and inconsistent data due to their typically huge size and their likely origin from multiple, heterogenous sources.
More informationSoftware Defined Networking Security: Security for SDN and Security with SDN. Seungwon Shin Texas A&M University
Software Defined Networking Security: Security for SDN and Security with SDN Seungwon Shin Texas A&M University Contents SDN Basic Operation SDN Security Issues SDN Operation L2 Forwarding application
More informationInteractive Visualization of the Stock Market Graph
Interactive Visualization of the Stock Market Graph Presented by Camilo Rostoker rostokec@cs.ubc.ca Department of Computer Science University of British Columbia Overview 1. Introduction 2. The Market
More informationBAYESIAN NETWORKS STRUCTURE LEARNING
BAYESIAN NETWORKS STRUCTURE LEARNING Xiannian Fan Uncertainty Reasoning Lab (URL) Department of Computer Science Queens College/City University of New York http://url.cs.qc.cuny.edu 1/52 Overview : Bayesian
More informationWhat is Learning? CS 343: Artificial Intelligence Machine Learning. Raymond J. Mooney. Problem Solving / Planning / Control.
What is Learning? CS 343: Artificial Intelligence Machine Learning Herbert Simon: Learning is any process by which a system improves performance from experience. What is the task? Classification Problem
More informationRelational Databases
Relational Databases Lecture 2 Chapter 3 Robb T. Koether Hampden-Sydney College Fri, Jan 18, 2013 Robb T. Koether (Hampden-Sydney College) Relational Databases Fri, Jan 18, 2013 1 / 26 1 Types of Databases
More informationSTRATEGIC DIRECTION SUPPORTED: Organizational Sustainability.
DATE: January 9, 2017 MEMO TO: Craig Taylor, Chair Operations Committee S. Michael Rummel, Chair Finance Committee FROM: Mary E. Kann Director of Administration RECOMMENDATION: Recommend approval of a
More informationBOOLEAN MATRIX FACTORIZATIONS. with applications in data mining Pauli Miettinen
BOOLEAN MATRIX FACTORIZATIONS with applications in data mining Pauli Miettinen MATRIX FACTORIZATIONS BOOLEAN MATRIX FACTORIZATIONS o THE BOOLEAN MATRIX PRODUCT As normal matrix product, but with addition
More informationConsistent Query Answering
Consistent Query Answering Opportunities and Limitations Jan Chomicki Dept. CSE University at Buffalo State University of New York http://www.cse.buffalo.edu/ chomicki 1 Integrity constraints Integrity
More informationConsistent Query Answering: Opportunities and Limitations
Consistent Query Answering: Opportunities and Limitations Jan Chomicki Dept. Computer Science and Engineering University at Buffalo, SUNY Buffalo, NY 14260-2000, USA chomicki@buffalo.edu Abstract This
More informationApproximation Algorithms for Clustering Uncertain Data
Approximation Algorithms for Clustering Uncertain Data Graham Cormode AT&T Labs - Research graham@research.att.com Andrew McGregor UCSD / MSR / UMass Amherst andrewm@ucsd.edu Introduction Many applications
More informationHolistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges
Holistic Analysis of Multi-Source, Multi- Feature Data: Modeling and Computation Challenges Abhishek Santra 1 and Sanjukta Bhowmick 2 1 Information Technology Laboratory, CSE Department, University of
More informationCOS 126 General Computer Science Fall Exam 1
COS 126 General Computer Science Fall 2005 Exam 1 This test has 9 questions worth a total of 50 points. You have 120 minutes. The exam is closed book, except that you are allowed to use a one page cheatsheet,
More informationDatabase Design Theory and Normalization. CS 377: Database Systems
Database Design Theory and Normalization CS 377: Database Systems Recap: What Has Been Covered Lectures 1-2: Database Overview & Concepts Lecture 4: Representational Model (Relational Model) & Mapping
More information11/04/16. Data Profiling. Helena Galhardas DEI/IST. References
Data Profiling Helena Galhardas DEI/IST References Slides Data Profiling course, Felix Naumann, Trento, July 2015 Z. Abedjan, L. Golab, F. Naumann, Profiling Relational Data A Survey, VLDBJ 2015 T. Papenbrock
More informationData Quality Problems beyond Consistency and Deduplication
Data Quality Problems beyond Consistency and Deduplication Wenfei Fan Floris Geerts Shuai Ma Nan Tang Wenyuan Yu University of Edinburgh {wenfei@inf., fgeerts@inf., sma1@inf., ntang@inf., wenyuan.yu@}ed.ac.uk
More informationQuery Answering over Functional Dependency Repairs
Query Answering over Functional Dependency Repairs by Artur Galiullin A thesis presented to the University of Waterloo in fulfillment of the thesis requirement for the degree of Master of Mathematics in
More informationB.2 Measures of Central Tendency and Dispersion
Appendix B. Measures of Central Tendency and Dispersion B B. Measures of Central Tendency and Dispersion What you should learn Find and interpret the mean, median, and mode of a set of data. Determine
More informationSet Cover Algorithms For Very Large Datasets
Set Cover Algorithms For Very Large Datasets Graham Cormode Howard Karloff AT&T Labs-Research Tony Wirth University of Melbourne Set Cover? Given a collection of sets over a universe of items Find smallest
More informationIntroduction to Data Mining
Introduction to Data Mining *Some of the slides are from Jaideep Srivastava @ http://www.cs.umn.edu/faculty/srivasta.html Mike Kassoff @ http://logic.stanford.edu/classes/cs246/lect ures2001/mkassoff_lecture.ppt
More informationData Quality Problems beyond Consistency and Deduplication
Data Quality Problems beyond Consistency and Deduplication Wenfei Fan Floris Geerts Shuai Ma Nan Tang Wenyuan Yu University of Edinburgh {wenfei@inf., fgeerts@inf., sma1@inf., ntang@inf., wenyuan.yu@}ed.ac.uk
More informationarxiv: v2 [cs.db] 30 Dec 2017
Human-Centric Data Cleaning [Vision] El Kindi Rezig Mourad Ouzzani Ahmed K. Elmagarmid Walid G. Aref Purdue University Qatar Computing Research Institute erezig@cs.purdue.edu, mouzzani@hbku.edu.qa, aelmagarmid@hbku.edu.qa,
More informationBayesian Networks Inference (continued) Learning
Learning BN tutorial: ftp://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf TAN paper: http://www.cs.huji.ac.il/~nir/abstracts/frgg1.html Bayesian Networks Inference (continued) Learning Machine Learning
More informationCSE 562 Database Systems
Goal CSE 562 Database Systems Question: The relational model is great, but how do I go about designing my database schema? Database Design Some slides are based or modified from originals by Magdalena
More informationIntroduction to Database Systems. Announcements CSE 444. Review: Closure, Key, Superkey. Decomposition: Schema Design using FD
Introduction to Database Systems CSE 444 Lecture #9 Jan 29 2001 Announcements Mid Term on Monday (in class) Material in lectures Textbook Chapter 1.1, Chapter 2 (except 2.1 and ODL), Chapter 3 (except
More informationTopic 14: Scheduling COS 320. Compiling Techniques. Princeton University Spring Lennart Beringer
Topic 14: Scheduling COS 320 Compiling Techniques Princeton University Spring 2016 Lennart Beringer 1 The Back End Well, let s see Motivating example Starting point Motivating example Starting point Multiplication
More informationHorn Formulae. CS124 Course Notes 8 Spring 2018
CS124 Course Notes 8 Spring 2018 In today s lecture we will be looking a bit more closely at the Greedy approach to designing algorithms. As we will see, sometimes it works, and sometimes even when it
More informationManaging Inconsistencies in Collaborative Data Management
Managing Inconsistencies in Collaborative Data Management Eric Kao Logic Group Computer Science Department Stanford University Talk given at HP Labs on November 9, 2010 Structured Data Public Sources Company
More informationEfficient Approximation of Correlated Sums on Data Streams
Efficient Approximation of Correlated Sums on Data Streams Rohit Ananthakrishna Cornell University rohit@cs.cornell.edu Flip Korn AT&T Labs Research flip@research.att.com Abhinandan Das Cornell University
More informationThis lecture. Databases -Normalization I. Repeating Data. Redundancy. This lecture introduces normal forms, decomposition and normalization.
This lecture Databases -Normalization I This lecture introduces normal forms, decomposition and normalization (GF Royle 2006-8, N Spadaccini 2008) Databases - Normalization I 1 / 23 (GF Royle 2006-8, N
More informationUGuide User-Guided Discovery of FD-Detectable Errors
UGuide User-Guided Discovery of FD-Detectable Errors Saravanan Thirumuruganathan Laure Berti-Equille Mourad Ouzzani Jorge-Arnulfo Quiane-Ruiz Nan Tang Qatar Computing Research Institute HBKU, Research
More informationSpeeding Up Data Science: From a Data Management Perspective
Speeding Up Data Science: From a Data Management Perspective Jiannan Wang Database System Lab (DSL) Simon Fraser University NWDS Meeting, Jan 5, 2018 1 Simon Fraser University 2 SFU DB/DM Group Ke Wang
More information