Big Data Integration



Source schemas:
S1 (name, games, runs)
S2 (name, team, score)
S3 a: (id, name); b: (id, team, runs)
S4 (name, club, matches)
S5 (name, team, matches)

Mediated schema:
MS (n, t, g, s)

MSAM:
MS.n: S1.name, S2.name, ...
MS.t: S2.team, S4.club, ...
MS.g: S1.games, S4.matches, ...
MS.s: S1.runs, S2.score, ...

MSSM:
$\forall n, t, g, s\ \big( MS(n, t, g, s) \rightarrow S1(n, g, s) \mid S2(n, t, s) \mid \exists i\, (S3a(i, n) \,\&\, S3b(i, t, s)) \mid S4(n, t, g) \mid S5(n, t, g) \big)$
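The attribute correspondences listed under MSAM can be read as a lookup table from each mediated attribute to the source attribute that supplies it. The sketch below is only an illustration under that reading: it encodes just the pairs that are legible on the slide, and the function name `to_mediated` and the sample record are hypothetical.

```python
# Attribute correspondences copied from the MSAM lines of the slide.
# Only the legible pairs are included; sources S3 and S5 are not listed there.
MSAM = {
    "n": {"S1": "name", "S2": "name"},
    "t": {"S2": "team", "S4": "club"},
    "g": {"S1": "games", "S4": "matches"},
    "s": {"S1": "runs", "S2": "score"},
}


def to_mediated(source, record):
    """Rewrite one source record into the mediated schema MS(n, t, g, s).

    Mediated attributes with no known counterpart in `source` become None.
    """
    return {
        ms_attr: record.get(by_source[source]) if source in by_source else None
        for ms_attr, by_source in MSAM.items()
    }


# Hypothetical S1 record, used purely as an example.
row = to_mediated("S1", {"name": "Player A", "games": 10, "runs": 250})
print(row)
```

S1 has no team attribute, so `t` is left as `None` for that record; a query over MS would need another source, such as S2 or S4, to fill it in.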

Works cited on this slide: [8], [6], [9, 10].

Works cited on this slide: [12-14], [11], [15].


Works cited on this slide: [21], [24, 26], [29].

Works cited on this slide: [27, 28], [30], [31].

            S1    S2    S3    S4    S5
Jagadish    UM    ATT   UM    UM    UI
Dewitt      MSR   MSR   UW    UW    UW
Bernstein   MSR   MSR   MSR   MSR   MSR
Carey       UCI   ATT   BEA   BEA   BEA
Franklin    UCB   UCB   UMD   UMD   UMD

            S1    S2    S3
Jagadish    UM    ATT   UM
Dewitt      MSR   MSR   UW
Bernstein   MSR   MSR   MSR
Carey       UCI   ATT   BEA
Franklin    UCB   UCB   UMD
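One simple way to reconcile conflicting values such as those in the table is majority voting across sources. The sketch below is a minimal illustration of that baseline and is not necessarily the method the slides go on to present. `CLAIMS` copies the table; the function name `majority_vote` and its `sources` parameter are hypothetical.

```python
from collections import Counter

# Values claimed by each source, copied from the table.
CLAIMS = {
    "Jagadish":  {"S1": "UM",  "S2": "ATT", "S3": "UM",  "S4": "UM",  "S5": "UI"},
    "Dewitt":    {"S1": "MSR", "S2": "MSR", "S3": "UW",  "S4": "UW",  "S5": "UW"},
    "Bernstein": {"S1": "MSR", "S2": "MSR", "S3": "MSR", "S4": "MSR", "S5": "MSR"},
    "Carey":     {"S1": "UCI", "S2": "ATT", "S3": "BEA", "S4": "BEA", "S5": "BEA"},
    "Franklin":  {"S1": "UCB", "S2": "UCB", "S3": "UMD", "S4": "UMD", "S5": "UMD"},
}


def majority_vote(claims, sources=None):
    """Pick, for each entity, the value claimed by the most sources.

    If `sources` is given, only those sources are allowed to vote.
    """
    fused = {}
    for entity, by_source in claims.items():
        votes = [v for s, v in by_source.items() if sources is None or s in sources]
        fused[entity] = Counter(votes).most_common(1)[0][0]
    return fused


all_five = majority_vote(CLAIMS)
first_three = majority_vote(CLAIMS, sources={"S1", "S2", "S3"})
print(all_five["Dewitt"], first_three["Dewitt"])
```

With all five sources the vote selects UW for Dewitt, while restricting it to S1, S2 and S3 selects MSR, so plain voting is sensitive to which sources are counted. For Carey the three-source vote is a three-way tie, and this sketch resolves ties arbitrarily.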

Works cited on this slide: [37], [38].

Works cited on this slide: [37], [39].

Works cited on this slide: [32], [33], [34].

Works cited on the summary slide: [6], [8], [9], [10], [11], [12-14], [15], [21], [24], [26], [27], [28], [29], [30], [31], [32], [33], [34], [37], [38], [39].

References

1. Doan, A., A. Halevy, and Z. Ives, Principles of data integration. Elsevier, 2012.
2. Dong, X.L. and D. Srivastava, Big data integration. Synthesis Lectures on Data Management, 2015.
3. Chen, J., et al., Big data challenge: a data management perspective. Frontiers of Computer Science, 2013.
4. Bernstein, P.A., J. Madhavan, and E. Rahm, Generic schema matching, ten years later. Proceedings of the VLDB Endowment, 2011.
5. Fagin, R., et al., Clio: Schema mapping creation and data exchange. In Conceptual Modeling: Foundations and Applications, Springer Berlin Heidelberg, 2009.
6. Dong, X.L., A. Halevy, and C. Yu, Data integration with uncertainty. The VLDB Journal, 2009.
7. Franklin, M., A. Halevy, and D. Maier, From databases to dataspaces: a new abstraction for information management. ACM SIGMOD Record, 2005.
8. Das Sarma, A., X. Dong, and A. Halevy, Bootstrapping pay-as-you-go data integration systems. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, 2008.
9. Wang, Z., et al., A unified approach to matching semantic data on the web. Knowledge-Based Systems, 2013.
10. Kang, L., L. Yi, and L. Dong, Research on Construction Methods of Big Data Semantic Model. In Proceedings of the World Congress on Engineering, 2014.
11. Chuang, S.-L. and K.C.-C. Chang, Integrating web query results: holistic schema matching. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, 2008.
12. Cafarella, M.J., et al., Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment, 2008.
13. Das Sarma, A., et al., Finding related tables. In Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012.
14. Limaye, G., S. Sarawagi, and S. Chakrabarti, Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment, 2010.

15. Torre-Bastida, A.I., et al., Semantic Information Fusion of Linked Open Data and Social Big Data for the Creation of an Extended Corporate CRM Database. In Intelligent Distributed Computing VIII, 2015.
16. Kum, H.-C., et al., Privacy preserving interactive record linkage (PPIRL). Journal of the American Medical Informatics Association, 2014.
17. Fan, W., et al., Reasoning about record matching rules. Proceedings of the VLDB Endowment, 2009.
18. Fellegi, I.P. and A.B. Sunter, A theory for record linkage. Journal of the American Statistical Association, 1969.
19. Elmagarmid, A.K., P.G. Ipeirotis, and V.S. Verykios, Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering, 2007.
20. Hernández, M.A. and S.J. Stolfo, Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery, 1998.
21. Efthymiou, V., K. Stefanidis, and V. Christophides, Big data entity resolution: From highly to somehow similar entity descriptions in the Web. IEEE Big Data, 2015.
22. Köpcke, H., A. Thor, and E. Rahm, Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 2010.
23. Kolb, L., A. Thor, and E. Rahm, Load balancing for MapReduce-based entity resolution. In 2012 IEEE 28th International Conference on Data Engineering (ICDE), 2012.
24. Papadakis, G., et al., Meta-blocking: Taking entity resolution to the next level. IEEE Transactions on Knowledge and Data Engineering, 2014.
25. Li, F., et al., Distributed data management using MapReduce. ACM Computing Surveys (CSUR), 2014.
26. Efthymiou, V., et al., Parallel Meta-blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data. IEEE Big Data, 2015.
27. Whang, S.E. and H. Garcia-Molina, Incremental entity resolution on rules and data. The VLDB Journal, 2014.

28. Gruenheid, A., X.L. Dong, and D. Srivastava, Incremental record linkage. Proceedings of the VLDB Endowment, 2014.
29. Kannan, A., et al., Matching unstructured product offers to structured product specifications. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011.
30. Li, P., et al., Linking temporal records. Proceedings of the VLDB Endowment, 2011.
31. Guo, S., et al., Record linkage with uniqueness constraints and erroneous values. Proceedings of the VLDB Endowment, 2010.
32. Dong, X.L., L. Berti-Equille, and D. Srivastava, Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment, 2009.
33. Yin, X. and W. Tan, Semi-supervised truth discovery. In Proceedings of the 20th International Conference on World Wide Web, 2011.
34. Galland, A., et al., Corroborating information from disagreeing views. In Proceedings of the Third ACM International Conference on Web Search and Data Mining, 2010.
35. Zhao, B. and J. Han, A probabilistic model for estimating real-valued truth from conflicting sources. Proc. of QDB, 2012.
36. Pochampally, R., et al., Fusing data with correlations. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, 2014.
37. Dong, X.L., et al., From data fusion to knowledge fusion. Proceedings of the VLDB Endowment, 2014.
38. Liu, X., et al., Online data fusion. Proceedings of the VLDB Endowment, 2011.
39. Dong, X.L., L. Berti-Equille, and D. Srivastava, Truth discovery and copying detection in a dynamic world. Proceedings of the VLDB Endowment, 2009.
