Big Data Integration

8
S1: (name, games, runs)
S2: (name, team, score)
S3: a: (id, name); b: (id, team, runs)
S4: (name, club, matches)
S5: (name, team, matches)

9 USP
S1: (name, games, runs)
S2: (name, team, score)
S3: a: (id, name); b: (id, team, runs)
S4: (name, club, matches)
S5: (name, team, matches)
MS: (n, t, g, s)

10 USP
S1: (name, games, runs)
S2: (name, team, score)
S3: a: (id, name); b: (id, team, runs)
S4: (name, club, matches)
S5: (name, team, matches)
MS: (n, t, g, s)
MSAM:
MS.n: S1.name, S2.name, ...
MS.t: S2.team, S4.club, ...
MS.g: S1.games, S4.matches, ...
MS.s: S1.runs, S2.score, ...
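The attribute correspondences listed under MSAM can be read as a lookup from (source, attribute) pairs to mediated attributes. The following is a minimal Python sketch under that reading, not the deck's own implementation. The names ATTRIBUTE_MATCHING and to_mediated are hypothetical, and only the pairs visible on the slide are included.

```python
# Attribute correspondences as shown on the slide (the slide's lists are open-ended,
# so this table is deliberately incomplete).
ATTRIBUTE_MATCHING = {
    "n": [("S1", "name"), ("S2", "name")],
    "t": [("S2", "team"), ("S4", "club")],
    "g": [("S1", "games"), ("S4", "matches")],
    "s": [("S1", "runs"), ("S2", "score")],
}

# Invert the table once: (source, source attribute) -> mediated attribute.
_LOOKUP = {
    (source, attr): mediated
    for mediated, pairs in ATTRIBUTE_MATCHING.items()
    for source, attr in pairs
}


def to_mediated(source, record):
    """Rename the attributes of one source record to mediated-schema names.

    Attributes with no listed correspondence are dropped (hypothetical policy).
    """
    return {
        _LOOKUP[(source, attr)]: value
        for attr, value in record.items()
        if (source, attr) in _LOOKUP
    }
```

For example, to_mediated("S1", {"name": "A", "games": 3, "runs": 120}) yields a record keyed by n, g and s, while an S4 record keeps only t and g because the slide lists no correspondence for S4.name.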

11
S1: (name, games, runs)
S2: (name, team, score)
S3: a: (id, name); b: (id, team, runs)
S4: (name, club, matches)
S5: (name, team, matches)
MS: (n, t, g, s)
MSSM: for all n, t, g, s, MS(n, t, g, s) maps to
S1(n, g, s)
S2(n, t, s)
S3a(i, n) & S3b(i, t, s), for some i
S4(n, t, g)
S5(n, t, g)
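One way to read the MSSM mapping is that every source contributes tuples to MS, with S3 contributing through a join of its two tables on id. The Python sketch below illustrates that reading only. The connectives of the original formula are not legible in the transcription, so treating the sources as a union and filling unavailable attributes with None are assumptions, and populate_ms is a hypothetical name.

```python
from typing import NamedTuple, Optional


class MS(NamedTuple):
    """One tuple of the mediated schema MS(n, t, g, s)."""
    n: Optional[str]
    t: Optional[str]
    g: Optional[int]
    s: Optional[int]


def populate_ms(s1, s2, s3a, s3b, s4, s5):
    """Collect MS tuples from the five sources (assumed union semantics)."""
    rows = []
    rows += [MS(n, None, g, s) for (n, g, s) in s1]   # S1(n, g, s)
    rows += [MS(n, t, None, s) for (n, t, s) in s2]   # S2(n, t, s)
    names = dict(s3a)                                 # id -> name, assumes ids are unique
    rows += [MS(names[i], t, None, s)                 # S3a(i, n) & S3b(i, t, s)
             for (i, t, s) in s3b if i in names]
    rows += [MS(n, t, g, None) for (n, t, g) in s4]   # S4(n, t, g)
    rows += [MS(n, t, g, None) for (n, t, g) in s5]   # S5(n, t, g)
    return rows
```

Rows of S3b whose id has no partner in S3a are skipped, which mirrors the join in the mapping.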

12 (Slide text not recoverable from the transcription; citations on this slide: [8], [6], [9,10].)

13 (Slide text not recoverable from the transcription; citations on this slide: [-14], [11], [15].)

18 (Slide text not recoverable from the transcription; citations on this slide: [21], [24,26], [29].)

19 (Slide text not recoverable from the transcription; citations on this slide: [27,28], [30], [31].)

20
            S1    S2    S3    S4    S5
Jagadish    UM    ATT   UM    UM    UI
Dewitt      MSR   MSR   UW    UW    UW
Bernstein   MSR   MSR   MSR   MSR   MSR
Carey       UCI   ATT   BEA   BEA   BEA
Franklin    UCB   UCB   UMD   UMD   UMD

21 USP
            S1    S2    S3    S4    S5
Jagadish    UM    ATT   UM    UM    UI
Dewitt      MSR   MSR   UW    UW    UW
Bernstein   MSR   MSR   MSR   MSR   MSR
Carey       UCI   ATT   BEA   BEA   BEA
Franklin    UCB   UCB   UMD   UMD   UMD

22 USP
            S1    S2    S3
Jagadish    UM    ATT   UM
Dewitt      MSR   MSR   UW
Bernstein   MSR   MSR   MSR
Carey       UCI   ATT   BEA
Franklin    UCB   UCB   UMD
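The affiliation tables show sources disagreeing about the same researcher. The fusion method the slides go on to discuss is not legible in the transcription, so the sketch below swaps in plain majority voting as an illustration. It is not the deck's algorithm. CLAIMS copies the table values, while vote and fuse are hypothetical names.

```python
from collections import Counter

# Values claimed by S1..S5, in source order, copied from the table.
CLAIMS = {
    "Jagadish":  ["UM", "ATT", "UM", "UM", "UI"],
    "Dewitt":    ["MSR", "MSR", "UW", "UW", "UW"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey":     ["UCI", "ATT", "BEA", "BEA", "BEA"],
    "Franklin":  ["UCB", "UCB", "UMD", "UMD", "UMD"],
}


def vote(values):
    """Return the most frequent value, or None when the top count is tied."""
    counts = Counter(values)
    top = max(counts.values())
    winners = [value for value, count in counts.items() if count == top]
    return winners[0] if len(winners) == 1 else None


def fuse(claims, n_sources):
    """Majority vote per researcher over the first n_sources sources."""
    return {name: vote(values[:n_sources]) for name, values in claims.items()}
```

With all five sources, fuse(CLAIMS, 5) picks UW for Dewitt, BEA for Carey and UMD for Franklin. Restricted to S1 through S3, fuse(CLAIMS, 3) picks MSR for Dewitt and UCB for Franklin, and returns None for Carey because the three sources all differ. The outcome of naive voting therefore depends heavily on which sources are counted.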

24 (Slide text not recoverable from the transcription; citations on this slide: [37], [38].)

25 (Slide text not recoverable from the transcription; citations on this slide: [37], [39].)

26 (Slide text not recoverable from the transcription; citations on this slide: [32], [33], [34].)

27 (Slide text not recoverable from the transcription; citations on this slide: [10,9], [8,6], [30], [31], [29], [28,27], [11], [14-], [15], [32], [37], [39], [38], [33], [34].)

28
1. Doan, A., A. Halevy, and Z. Ives, Principles of data integration. Elsevier.
2. Dong, X.L. and D. Srivastava, Big data integration. Synthesis Lectures on Data Management.
3. Chen, J., et al., Big data challenge: a data management perspective. Frontiers of Computer Science.
4. Bernstein, P.A., J. Madhavan, and E. Rahm, Generic schema matching, ten years later. Proceedings of the VLDB Endowment.
5. Fagin, R., et al., Clio: Schema mapping creation and data exchange, in Conceptual Modeling: Foundations and Applications. Springer Berlin Heidelberg.
6. Dong, X.L., A. Halevy, and C. Yu, Data integration with uncertainty. The VLDB Journal - The International Journal on Very Large Data Bases.
7. Franklin, M., A. Halevy, and D. Maier, From databases to dataspaces: a new abstraction for information management. ACM SIGMOD Record.
8. Das Sarma, A., X. Dong, and A. Halevy, Bootstrapping pay-as-you-go data integration systems. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data.
9. Wang, Z., et al., A unified approach to matching semantic data on the web. Knowledge-Based Systems.
10. Kang, L., L. Yi, and L. Dong, Research on Construction Methods of Big Data Semantic Model. In Proceedings of the World Congress on Engineering.
11. Chuang, S.-L. and K.C.-C. Chang, Integrating web query results: holistic schema matching. In Proceedings of the 17th ACM conference on Information and knowledge management.
12. Cafarella, M.J., et al., Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment.
13. Das Sarma, A., et al., Finding related tables. In Proceedings of the 20 ACM SIGMOD International Conference on Management of Data.
14. Limaye, G., S. Sarawagi, and S. Chakrabarti, Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment.

29
15. Torre-Bastida, A.I., et al., Semantic Information Fusion of Linked Open Data and Social Big Data for the Creation of an Extended Corporate CRM Database, in Intelligent Distributed Computing VIII.
16. Kum, H.-C., et al., Privacy preserving interactive record linkage (PPIRL). Journal of the American Medical Informatics Association.
17. Fan, W., et al., Reasoning about record matching rules. Proceedings of the VLDB Endowment.
18. Fellegi, I.P. and A.B. Sunter, A theory for record linkage. Journal of the American Statistical Association.
19. Elmagarmid, A.K., P.G. Ipeirotis, and V.S. Verykios, Duplicate record detection: A survey. Knowledge and Data Engineering, IEEE Transactions on.
20. Hernández, M.A. and S.J. Stolfo, Real-world data is dirty: Data cleansing and the merge/purge problem. Data mining and knowledge discovery.
21. Efthymiou, V., K. Stefanidis, and V. Christophides, Big data entity resolution: From highly to somehow similar entity descriptions in the Web. IEEE Big Data.
22. Köpcke, H., A. Thor, and E. Rahm, Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment.
23. Kolb, L., A. Thor, and E. Rahm, Load balancing for mapreduce-based entity resolution. In Data Engineering (ICDE), 20 IEEE 28th International Conference on.
24. Papadakis, G., et al., Meta-blocking: Taking entity resolution to the next level. Knowledge and Data Engineering, IEEE Transactions on.
25. Li, F., et al., Distributed data management using MapReduce. ACM Computing Surveys (CSUR).
26. Efthymiou, V., et al., Parallel Meta-blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data. IEEE Big Data.
27. Whang, S.E. and H. Garcia-Molina, Incremental entity resolution on rules and data. The VLDB Journal - The International Journal on Very Large Data Bases.

30
28. Gruenheid, A., X.L. Dong, and D. Srivastava, Incremental record linkage. Proceedings of the VLDB Endowment.
29. Kannan, A., et al., Matching unstructured product offers to structured product specifications. In Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining.
30. Li, P., et al., Linking temporal records. Proceedings of the VLDB Endowment.
31. Guo, S., et al., Record linkage with uniqueness constraints and erroneous values. Proceedings of the VLDB Endowment.
32. Dong, X.L., L. Berti-Equille, and D. Srivastava, Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment.
33. Yin, X. and W. Tan, Semi-supervised truth discovery. In Proceedings of the 20th international conference on World wide web.
34. Galland, A., et al., Corroborating information from disagreeing views. In Proceedings of the third ACM international conference on Web search and data mining.
35. Zhao, B. and J. Han, A probabilistic model for estimating real-valued truth from conflicting sources. Proc. of QDB.
36. Pochampally, R., et al., Fusing data with correlations. In Proceedings of the 2014 ACM SIGMOD international conference on Management of data.
37. Dong, X.L., et al., From data fusion to knowledge fusion. Proceedings of the VLDB Endowment.
38. Liu, X., et al., Online data fusion. Proceedings of the VLDB Endowment.
39. Dong, X.L., L. Berti-Equille, and D. Srivastava, Truth discovery and copying detection in a dynamic world. Proceedings of the VLDB Endowment.

Entity Resolution with Heavy Indexing

Entity Resolution with Heavy Indexing Entity Resolution with Heavy Indexing Csaba István Sidló Data Mining and Web Search Group, Informatics Laboratory Institute for Computer Science and Control, Hungarian Academy of Sciences sidlo@ilab.sztaki.hu

More information

Comprehensive and Progressive Duplicate Entities Detection

Comprehensive and Progressive Duplicate Entities Detection Comprehensive and Progressive Duplicate Entities Detection Veerisetty Ravi Kumar Dept of CSE, Benaiah Institute of Technology and Science. Nagaraju Medida Assistant Professor, Benaiah Institute of Technology

More information

Computer-based Tracking Protocols: Improving Communication between Databases

Computer-based Tracking Protocols: Improving Communication between Databases Computer-based Tracking Protocols: Improving Communication between Databases Amol Deshpande Database Group Department of Computer Science University of Maryland Overview Food tracking and traceability

More information

Towards Efficient and Effective Semantic Table Interpretation Ziqi Zhang

Towards Efficient and Effective Semantic Table Interpretation Ziqi Zhang Towards Efficient and Effective Semantic Table Interpretation Ziqi Zhang Department of Computer Science, University of Sheffield Outline Define semantic table interpretation State-of-the-art and motivation

More information

Outline. Data Integration. Entity Matching/Identification. Duplicate Detection. More Resources. Duplicates Detection in Database Integration

Outline. Data Integration. Entity Matching/Identification. Duplicate Detection. More Resources. Duplicates Detection in Database Integration Outline Duplicates Detection in Database Integration Background HumMer Automatic Data Fusion System Duplicate Detection methods An efficient method using priority queue Approach based on Extended key Approach

More information

Learning mappings and queries

Learning mappings and queries Learning mappings and queries Marie Jacob University Of Pennsylvania DEIS 2010 1 Schema mappings Denote relationships between schemas Relates source schema S and target schema T Defined in a query language

More information

Visualizing semantic table annotations with TableMiner+

Visualizing semantic table annotations with TableMiner+ Visualizing semantic table annotations with TableMiner+ MAZUMDAR, Suvodeep and ZHANG, Ziqi Available from Sheffield Hallam University Research Archive (SHURA) at:

More information

Active Blocking Scheme Learning for Entity Resolution

Active Blocking Scheme Learning for Entity Resolution Active Blocking Scheme Learning for Entity Resolution Jingyu Shao and Qing Wang Research School of Computer Science, Australian National University {jingyu.shao,qing.wang}@anu.edu.au Abstract. Blocking

More information

HOLISTIC DATA INTEGRATION FOR BIG DATA

HOLISTIC DATA INTEGRATION FOR BIG DATA HOLISTIC DATA INTEGRATION FOR BIG DATA ERHARD RAHM, UNIVERSITY OF LEIPZIG, AUGUST 2016 www.scads.de GERMAN CENTERS FOR BIG DATA Two Centers of Excellence for Big Data in Germany ScaDS Dresden/Leipzig Berlin

More information

References Part I: Introduction

References Part I: Introduction References Part I: Introduction Bengio, Y., Goodfellow, I.J. & Courville, A., 2015. Deep learning. Nature, 521(7553), pp.436 444. Bishop, C.M., 2016. Pattern Recognition and Machine Learning, Springer

More information

Flexible Dataspace Management Through Model Management

Flexible Dataspace Management Through Model Management Flexible Dataspace Management Through Model Management Cornelia Hedeler, Khalid Belhajjame, Lu Mao, Norman W. Paton, Alvaro A.A. Fernandes, Chenjuan Guo, and Suzanne M. Embury School of Computer Science,

More information

ISSN (Online) ISSN (Print)

ISSN (Online) ISSN (Print) Accurate Alignment of Search Result Records from Web Data Base 1Soumya Snigdha Mohapatra, 2 M.Kalyan Ram 1,2 Dept. of CSE, Aditya Engineering College, Surampalem, East Godavari, AP, India Abstract: Most

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

Principles of Dataspaces

Principles of Dataspaces Principles of Dataspaces Seminar From Databases to Dataspaces Summer Term 2007 Monika Podolecheva University of Konstanz Department of Computer and Information Science Tutor: Prof. M. Scholl, Alexander

More information

Comparison of Online Record Linkage Techniques

Comparison of Online Record Linkage Techniques International Research Journal of Engineering and Technology (IRJET) e-issn: 2395-0056 Volume: 02 Issue: 09 Dec-2015 p-issn: 2395-0072 www.irjet.net Comparison of Online Record Linkage Techniques Ms. SRUTHI.

More information

Deduplication of Hospital Data using Genetic Programming

Deduplication of Hospital Data using Genetic Programming Deduplication of Hospital Data using Genetic Programming P. Gujar Department of computer engineering Thakur college of engineering and Technology, Kandiwali, Maharashtra, India Priyanka Desai Department

More information

A Novel Approach On simplifying Document Annotation Using Content and Querying Assessment

A Novel Approach On simplifying Document Annotation Using Content and Querying Assessment A Novel Approach On simplifying Document Annotation Using Content and Querying Assessment Nomula Ramesh PG Scholar, Department CSE Krishnamurthy Institute of Technology & Engineering, Ghatkesar, R.R, Telangana.

More information

Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules

Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules Probabilistic Scoring Methods to Assist Entity Resolution Systems Using Boolean Rules Fumiko Kobayashi, John R Talburt Department of Information Science University of Arkansas at Little Rock 2801 South

More information

Introduction Data Integration Summary. Data Integration. COCS 6421 Advanced Database Systems. Przemyslaw Pawluk. CSE, York University.

Introduction Data Integration Summary. Data Integration. COCS 6421 Advanced Database Systems. Przemyslaw Pawluk. CSE, York University. COCS 6421 Advanced Database Systems CSE, York University March 20, 2008 Agenda 1 Problem description Problems 2 3 Open questions and future work Conclusion Bibliography Problem description Problems Why

More information

Exploring Schema Repositories with Schemr

Exploring Schema Repositories with Schemr Exploring Schema Repositories with Schemr Kuang Chen and Akshay Kannan University of California, Berkeley kuangc@cs.berkeley.edu, akannan@cs.berkeley.edu Jayant Madhavan and Alon Halevy Google, Inc. jayant@google.com,

More information

Survey Result on Privacy Preserving Techniques in Data Publishing

Survey Result on Privacy Preserving Techniques in Data Publishing Survey Result on Privacy Preserving Techniques in Data Publishing S.Deebika PG Student, Computer Science and Engineering, Vivekananda College of Engineering for Women, Namakkal India A.Sathyapriya Assistant

More information

A Mapping Approach for Fully Virtual Data Integration System Processes

A Mapping Approach for Fully Virtual Data Integration System Processes A Mapping Approach for Fully Virtual Data Integration System Processes Ali Z. El Qutaany 1 PhD Student, Faculty of Computers and Information, Cairo University Cairo, Egypt Osman M. Hegazi 2 Professor,

More information

A Learning Method for Entity Matching

A Learning Method for Entity Matching A Learning Method for Entity Matching Jie Chen Cheqing Jin Rong Zhang Aoying Zhou Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, China 5500002@ecnu.cn,

More information

A Novel Vision for Navigation and Enrichment in Cultural Heritage Collections

A Novel Vision for Navigation and Enrichment in Cultural Heritage Collections A Novel Vision for Navigation and Enrichment in Cultural Heritage Collections Joffrey Decourselle, Audun Vennesland, Trond Aalberg, Fabien Duchateau & Nicolas Lumineau 08/09/2015 - SW4CH Workshop, Poitiers

More information

Data Partitioning for Parallel Entity Matching

Data Partitioning for Parallel Entity Matching Data Partitioning for Parallel Entity Matching Toralf Kirsten, Lars Kolb, Michael Hartung, Anika Groß, Hanna Köpcke, Erhard Rahm Department of Computer Science, University of Leipzig 04109 Leipzig, Germany

More information

Record Linkage using Probabilistic Methods and Data Mining Techniques

Record Linkage using Probabilistic Methods and Data Mining Techniques Doi:10.5901/mjss.2017.v8n3p203 Abstract Record Linkage using Probabilistic Methods and Data Mining Techniques Ogerta Elezaj Faculty of Economy, University of Tirana Gloria Tuxhari Faculty of Economy, University

More information

Schema Matching Using Directed Graph Matching

Schema Matching Using Directed Graph Matching Schema Matching Using Directed Graph Matching [1] K.AMSHAKALA Department of Computer Science Engineering and Information Technology Coimbatore Institute of Technology, Coimbatore, INDIA Email:amshakalacse@yahoo.com

More information

Object Matching for Information Integration: A Profiler-Based Approach

Object Matching for Information Integration: A Profiler-Based Approach Object Matching for Information Integration: A Profiler-Based Approach AnHai Doan Ying Lu Yoonkyong Lee Jiawei Han {anhai,yinglu,ylee11,hanj}@cs.uiuc.edu Department of Computer Science University of Illinois,

More information

Probabilistic Estimates of Attribute Statistics and Match Likelihood for People Entity Resolution

Probabilistic Estimates of Attribute Statistics and Match Likelihood for People Entity Resolution Probabilistic Estimates of Attribute Statistics and Match Likelihood for People Entity Resolution Xin Wang Ang Sun Hakan Kardes Siddharth Agrawal Lin Chen Andrew Borthwick Data Research Intelius Inc Bellevue

More information

2002 Journal of Software

2002 Journal of Software 1000-9825/2002/13(11)2076-07 2002 Journal of Software Vol13 No11 ( 200433); ( 200433) E-mail: zmguo@fudaneducn http://wwwfudaneducn : : ; ; ; ; : TP311 : A (garbage in garbage out) (dirty data) (duplicate

More information

Handling instance coreferencing in the KnoFuss architecture

Handling instance coreferencing in the KnoFuss architecture Handling instance coreferencing in the KnoFuss architecture Andriy Nikolov, Victoria Uren, Enrico Motta and Anne de Roeck Knowledge Media Institute, The Open University, Milton Keynes, UK {a.nikolov, v.s.uren,

More information

Semantics Representation of Probabilistic Data by Using Topk-Queries for Uncertain Data

Semantics Representation of Probabilistic Data by Using Topk-Queries for Uncertain Data PP 53-57 Semantics Representation of Probabilistic Data by Using Topk-Queries for Uncertain Data 1 R.G.NishaaM.E (SE), 2 N.GayathriM.E(SE) 1 Saveetha engineering college, 2 SSN engineering college Abstract:

More information

Stanford Warren Ascherman Professor of Engineering, Emeritus Computer Science

Stanford Warren Ascherman Professor of Engineering, Emeritus Computer Science Stanford Warren Ascherman Professor of Engineering, Emeritus Computer Science Bio ACADEMIC APPOINTMENTS Emeritus Faculty, Acad Council, Computer Science Teaching COURSES 2016-17 Mining Massive Data Sets:

More information

Matching and Alignment: What is the Cost of User Post-match Effort?

Matching and Alignment: What is the Cost of User Post-match Effort? Matching and Alignment: What is the Cost of User Post-match Effort? (Short paper) Fabien Duchateau 1 and Zohra Bellahsene 2 and Remi Coletta 2 1 Norwegian University of Science and Technology NO-7491 Trondheim,

More information

Deep Web Content Mining

Deep Web Content Mining Deep Web Content Mining Shohreh Ajoudanian, and Mohammad Davarpanah Jazi Abstract The rapid expansion of the web is causing the constant growth of information, leading to several problems such as increased

More information

PRIOR System: Results for OAEI 2006

PRIOR System: Results for OAEI 2006 PRIOR System: Results for OAEI 2006 Ming Mao, Yefei Peng University of Pittsburgh, Pittsburgh, PA, USA {mingmao,ypeng}@mail.sis.pitt.edu Abstract. This paper summarizes the results of PRIOR system, which

More information

Top-k Keyword Search Over Graphs Based On Backward Search

Top-k Keyword Search Over Graphs Based On Backward Search Top-k Keyword Search Over Graphs Based On Backward Search Jia-Hui Zeng, Jiu-Ming Huang, Shu-Qiang Yang 1College of Computer National University of Defense Technology, Changsha, China 2College of Computer

More information

Twitter data Analytics using Distributed Computing

Twitter data Analytics using Distributed Computing Twitter data Analytics using Distributed Computing Uma Narayanan Athrira Unnikrishnan Dr. Varghese Paul Dr. Shelbi Joseph Research Scholar M.tech Student Professor Assistant Professor Dept. of IT, SOE

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

Implementation of an Efficient Approach for Duplicate Detection System

Implementation of an Efficient Approach for Duplicate Detection System Implementation of an Efficient Approach for Duplicate Detection System Ruchira Deshpande #1, Sonali Bodkhe #2 #1,2 Department of Computer Science & Engineering Rashtrasant Tukadoji Maharaj Nagpur University

More information

Redundancy-Driven Web Data Extraction and Integration

Redundancy-Driven Web Data Extraction and Integration Redundancy-Driven Web Data Extraction and Integration Lorenzo Blanco, Mirko Bronzi, Valter Crescenzi, Paolo Merialdo, Paolo Papotti Università degli Studi Roma Tre Dipartimento di Informatica e Automazione

More information

EXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES

EXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES EXTRACT THE TARGET LIST WITH HIGH ACCURACY FROM TOP-K WEB PAGES B. GEETHA KUMARI M. Tech (CSE) Email-id: Geetha.bapr07@gmail.com JAGETI PADMAVTHI M. Tech (CSE) Email-id: jageti.padmavathi4@gmail.com ABSTRACT:

More information

Dataspaces: A New Abstraction for Data Management. Mike Franklin, Alon Halevy, David Maier, Jennifer Widom

Dataspaces: A New Abstraction for Data Management. Mike Franklin, Alon Halevy, David Maier, Jennifer Widom Dataspaces: A New Abstraction for Data Management Mike Franklin, Alon Halevy, David Maier, Jennifer Widom Today s Agenda Why databases are great. What problems people really have Why databases are not

More information

Data Cleansing. LIU Jingyuan, Vislab WANG Yilei, Theoretical group

Data Cleansing. LIU Jingyuan, Vislab WANG Yilei, Theoretical group Data Cleansing LIU Jingyuan, Vislab WANG Yilei, Theoretical group What is Data Cleansing Data cleansing (data cleaning) is the process of detecting and correcting (or removing) errors or inconsistencies

More information

Data Quality: the Other Face of Big Data. Divesh Srivastava AT&T Labs-Research

Data Quality: the Other Face of Big Data. Divesh Srivastava AT&T Labs-Research Data Quality: the Other Face of Big Data Divesh Srivastava AT&T Labs-Research Data Quality I am a manager I am also a researcher working on data quality 2 Big Data Big data is different things to different

More information

Advances in Data Management - Web Data Integration A.Poulovassilis

Advances in Data Management - Web Data Integration A.Poulovassilis Advances in Data Management - Web Data Integration A.Poulovassilis 1 1 Integrating Deep Web Data Traditionally, the web has made available vast amounts of information in unstructured form (i.e. text).

More information

Effective Semantic Search over Huge RDF Data

Effective Semantic Search over Huge RDF Data Effective Semantic Search over Huge RDF Data 1 Dinesh A. Zende, 2 Chavan Ganesh Baban 1 Assistant Professor, 2 Post Graduate Student Vidya Pratisthan s Kamanayan Bajaj Institute of Engineering & Technology,

More information

An Extension of NDT to Model Entity Reconciliation Problems

An Extension of NDT to Model Entity Reconciliation Problems J. G. Enríquez, F. J. Domínguez-Mayo, J. A. García-García and M. J. Escalona Computer Languages and Systems Department, University of Seville, Av. Reina Mercedes s/n, 41012, Seville, Spain Keywords: Abstract:

More information

Mining Trusted Information in Medical Science: An Information Network Approach

Mining Trusted Information in Medical Science: An Information Network Approach Mining Trusted Information in Medical Science: An Information Network Approach Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign Collaborated with many, especially Yizhou

More information

Identifying Value Mappings for Data Integration: An Unsupervised Approach

Identifying Value Mappings for Data Integration: An Unsupervised Approach Identifying Value Mappings for Data Integration: An Unsupervised Approach Jaewoo Kang 1, Dongwon Lee 2, and Prasenjit Mitra 2 1 NC State University, Raleigh NC 27695, USA 2 Penn State University, University

More information

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) 4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015) Benchmark Testing for Transwarp Inceptor A big data analysis system based on in-memory computing Mingang Chen1,2,a,

More information

Truth Finding with Attribute Partitioning

Truth Finding with Attribute Partitioning Truth Finding with Attribute Partitioning M. Lamine Ba Institut Mines Télécom Télécom ParisTech; CNRS LTCI Paris, France ba@telecom-paristech.fr Roxana Horincar Institut Mines Télécom Télécom ParisTech;

More information

Research of Data Cleaning Methods Based on Dependency Rules

Research of Data Cleaning Methods Based on Dependency Rules Research of Data Cleaning Methods Based on Dependency Rules Yang Bao, Shi Wei Deng, Wang Qun Lin Abstract his paper introduces the concept and principle of data cleaning, analyzes the types and causes

More information

Schema Integration Based on Uncertain Semantic Mappings

Schema Integration Based on Uncertain Semantic Mappings chema Integration Based on Uncertain emantic Mappings Matteo Magnani 1, Nikos Rizopoulos 2, Peter M c.brien 2, and Danilo Montesi 3 1 Department of Computer cience, University of Bologna, Via Mura A.Zamboni

More information

Semi-automatic Generation of Active Ontologies from Web Forms

Semi-automatic Generation of Active Ontologies from Web Forms Semi-automatic Generation of Active Ontologies from Web Forms Martin Blersch, Mathias Landhäußer, and Thomas Mayer (IPD) 1 KIT The Research University in the Helmholtz Association www.kit.edu "Create an

More information

A Survey on Data Extraction and Data Duplication Detection

A Survey on Data Extraction and Data Duplication Detection A Survey on Data Extraction and Data Duplication Detection Yashika A. Shah e-mail: yashah0694@gmail.com Snehal S. Zade e-mail: snehalzade12@gmail.com Smita M. Raut e-mail: swatiraut93@gmail.com Shraddha

More information

Obtaining Rough Set Approximation using MapReduce Technique in Data Mining

Obtaining Rough Set Approximation using MapReduce Technique in Data Mining Obtaining Rough Set Approximation using MapReduce Technique in Data Mining Varda Dhande 1, Dr. B. K. Sarkar 2 1 M.E II yr student, Dept of Computer Engg, P.V.P.I.T Collage of Engineering Pune, Maharashtra,

More information

Computational Cost of Querying for Related Entities in Different Ontologies

Computational Cost of Querying for Related Entities in Different Ontologies Computational Cost of Querying for Related Entities in Different Ontologies Chung Ming Cheung Yinuo Zhang Anand Panangadan Viktor K. Prasanna University of Southern California Los Angeles, CA 90089, USA

More information

AN EFFICIENT PROCESSING OF WEBPAGE METADATA AND DOCUMENTS USING ANNOTATION Sabna N.S 1, Jayaleshmi S 2

AN EFFICIENT PROCESSING OF WEBPAGE METADATA AND DOCUMENTS USING ANNOTATION Sabna N.S 1, Jayaleshmi S 2 AN EFFICIENT PROCESSING OF WEBPAGE METADATA AND DOCUMENTS USING ANNOTATION Sabna N.S 1, Jayaleshmi S 2 1 M.Tech Scholar, Dept of CSE, LBSITW, Poojappura, Thiruvananthapuram sabnans1988@gmail.com 2 Associate

More information

International Journal of Research in Computer and Communication Technology, Vol 3, Issue 11, November

International Journal of Research in Computer and Communication Technology, Vol 3, Issue 11, November Annotation Wrapper for Annotating The Search Result Records Retrieved From Any Given Web Database 1G.LavaRaju, 2Darapu Uma 1,2Dept. of CSE, PYDAH College of Engineering, Patavala, Kakinada, AP, India ABSTRACT:

More information

Ontology Augmentation Through Matching with Web Tables

Ontology Augmentation Through Matching with Web Tables Ontology Augmentation Through Matching with Web Tables Oliver Lehmberg 1 and Oktie Hassanzadeh 2 1 University of Mannheim, B6 26, 68159 Mannheim, Germany 2 IBM Research, Yorktown Heights, New York, U.S.A.

More information

10th International Workshop on Quality in Databases QDB 2012

10th International Workshop on Quality in Databases QDB 2012 10th International Workshop on Quality in Databases QDB 2012 Xin Luna Dong AT&T Labs-Research, USA lunadong@research.att.com 1. QDB GOALS The problem of low-quality data in databases, data warehouses,

More information

Similarity Joins of Text with Incomplete Information Formats

Similarity Joins of Text with Incomplete Information Formats Similarity Joins of Text with Incomplete Information Formats Shaoxu Song and Lei Chen Department of Computer Science Hong Kong University of Science and Technology {sshaoxu,leichen}@cs.ust.hk Abstract.

More information

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116

[Gidhane* et al., 5(7): July, 2016] ISSN: IC Value: 3.00 Impact Factor: 4.116 IJESRT INTERNATIONAL JOURNAL OF ENGINEERING SCIENCES & RESEARCH TECHNOLOGY AN EFFICIENT APPROACH FOR TEXT MINING USING SIDE INFORMATION Kiran V. Gaidhane*, Prof. L. H. Patil, Prof. C. U. Chouhan DOI: 10.5281/zenodo.58632

More information

Prof. Dr. Christian Bizer

Prof. Dr. Christian Bizer STI Summit July 6 th, 2011, Riga, Latvia Global Data Integration and Global Data Mining Prof. Dr. Christian Bizer Freie Universität ität Berlin Germany Outline 1. Topology of the Web of Data What data

More information

XML Schema Matching Using Structural Information

XML Schema Matching Using Structural Information XML Schema Matching Using Structural Information A.Rajesh Research Scholar Dr.MGR University, Maduravoyil, Chennai S.K.Srivatsa Sr.Professor St.Joseph s Engineering College, Chennai ABSTRACT Schema matching

More information

HeteroClass: A Framework for Effective Classification from Heterogeneous Databases

CS512 Project Report Mayssam Sayyadian May 2006 Abstract Classification is an important data mining task and it has been

Map-Reduce for Cube Computation

Prof. Pramod Patil 1, Prini Kotian 2, Aishwarya Gaonkar 3, Sachin Wani 4, Pramod Gaikwad 5 Department of Computer Science, Dr.D.Y.Patil Institute of Engineering and

Quotient Cube: How to Summarize the Semantics of a Data Cube

Laks V.S. Lakshmanan (Univ. of British Columbia) * Jian Pei (State Univ. of New York at Buffalo) * Jiawei Han (Univ. of Illinois at Urbana-Champaign)

An Uncertain Data Integration System

Author manuscript, published in "Int. Conf. On Ontologies, DataBases, and Applications of Semantics (ODBASE), France" Naser Ayat #1, Hamideh Afsarmanesh #2, Reza Akbarinia

Evaluation of Keyword Search System with Ranking

P.Saranya, Dr.S.Babu UG Scholar, Department of CSE, Final Year, IFET College of Engineering, Villupuram, Tamil Nadu, India Associate Professor, Department

Identifying Value Mappings for Data Integration: An Unsupervised Approach

Jaewoo Kang 1, Dongwon Lee 2, and Prasenjit Mitra 2 1 NC State University, Raleigh NC 27695, USA 2 Penn State University, University

Symmetrically Exploiting XML

Shuohao Zhang and Curtis Dyreson School of E.E. and Computer Science Washington State University Pullman, Washington, USA The 15th International World Wide Web Conference

Arbee L.P. Chen ( 陳良弼 )

Asia University Taichung, Taiwan EDUCATION Phone: (04)23323456x1011 Email: arbee@asia.edu.tw - Ph.D. in Computer Engineering, Department of Electrical Engineering, University of

An Iterative Approach to Record Deduplication

M. Roshini Karunya, S. Lalitha, B.Tech., M.E., II ME (CSE), Gnanamani College of Technology, A.K.Samuthiram, India 1 Assistant Professor, Gnanamani College

Answering Structured Queries on Unstructured Data

Jing Liu and Xin Dong University of Washington Seattle, WA 9895 {liujing, lunadong}@cs.washington.edu Alon Halevy Google Inc. Mountain View, CA 9422 halevy@google.com

Enriching Knowledge Domain Visualizations: Analysis of a Record Linkage and Information Fusion Approach to Citation Data

Marie B. Synnestvedt, MSEd 1, 2 1 Drexel University College of Information Science

Survey on Community Question Answering Systems

World Journal of Technology, Engineering and Research, Volume 3, Issue 1 (2018) 114-119 Contents available at WJTER World Journal of Technology, Engineering and Research Journal Homepage: www.wjter.com

Classification of Contradiction Patterns

Heiko Müller, Ulf Leser, and Johann-Christoph Freytag Humboldt-Universität zu Berlin, Unter den Linden 6, D-10099 Berlin, Germany, {hmueller, leser, freytag}@informatik.hu-berlin.de

itrails: Pay-as-you-go Information Integration in Dataspaces

Marcos Vaz Salles Jens Dittrich Shant Karakashian Olivier Girard Lukas Blunschi ETH Zurich VLDB 2007 Outline Motivation itrails Experiments

Outline. Part I. Introduction Part II. ML for DI. Part III. DI for ML Part IV. Conclusions and research direction

Outline Part I. Introduction Part II. ML for DI ML for entity linkage ML for data extraction ML for data fusion ML for schema alignment Part III. DI for ML Part IV. Conclusions and research direction Data

Searching SNT in XML Documents Using Reduction Factor

Mary Posonia A Department of computer science, Sathyabama University, Tamilnadu, Chennai, India maryposonia@sathyabamauniversity.ac.in http://www.sathyabamauniversity.ac.in

A Survey on Keyword Diversification Over XML Data

ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology An ISO 3297: 2007 Certified Organization Volume 6, Special Issue 5,

A Hierarchical Document Clustering Approach with Frequent Itemsets

Cheng-Jhe Lee, Chiun-Chieh Hsu, and Da-Ren Chen Abstract In order to effectively retrieve required information from the large amount of

Annotating Multiple Web Databases Using Svm

M.Yazhmozhi 1, M. Lavanya 2, Dr. N. Rajkumar 3 PG Scholar, Department of Software Engineering, Sri Ramakrishna Engineering College, Coimbatore, India 1, 3 Head

Record Linkage with Uniqueness Constraints and Erroneous Values

ABSTRACT Songtao Guo AT&T Interactive Research sguo@attinteractive.com Divesh Srivastava AT&T Labs-Research divesh@research.att.com Many

A Holistic Solution for Duplicate Entity Identification in Deep Web Data Integration

2010 Sixth International Conference on Semantics, Knowledge and Grids Wei Liu 1,2, Xiaofeng Meng 3 1 Institute of Computer

Introduction & Administrivia

Information Retrieval Evangelos Kanoulas ekanoulas@uva.nl Section 1: Unstructured data Sec. 8.1 2 Big Data Growth of global data volume data everywhere! Web data: observation,

A Clustering-Based Framework to Control Block Sizes for Entity Resolution

Jeffrey Fisher Research School of Computer Science Australian National University Canberra ACT 0200 jeffrey.fisher@anu.edu.au Peter

Functional Dependencies and Single Valued Normalization (Up to BCNF)

Harsh Srivastava 1, Jyotiraditya Tripathi 2, Dr. Preeti Tripathi 3 1 & 2 M.Tech. Student, Centre for Computer Sci. & Tech. Central University

MINING OF LARGE SCALE DATA USING BESTPEER++ STRATEGY

*S. ANUSUYA,*R.B. ARUNA,*V. DEEPASRI,**DR.T. AMITHA *UG Students, **Professor Department Of Computer Science and Engineering Dhanalakshmi College of

A Review Paper on Query Optimization for Crowdsourcing Systems

Rohini Pingle M.E. Computer Engineering, Gokhale Education Society's, R. H. Sapat College of Engineering, Management Studies and Research,

An Efficient Technique for Tag Extraction and Content Retrieval from Web Pages

S.Sathya M.Sc 1, Dr. B.Srinivasan M.C.A., M.Phil, M.B.A., Ph.D., 2 1 Mphil Scholar, Department of Computer Science, Gobi Arts

Extraction of Automatic Search Result Records Using Content Density Algorithm Based on Node Similarity

Yasar Gozudeli*, Oktay Yildiz*, Hacer Karacan*, Muhammed R. Baker*, Ali Minnet**, Murat Kalender**,

A FRAMEWORK FOR EFFICIENT DATA SEARCH THROUGH XML TREE PATTERNS

SRIVANI SARIKONDA 1 PG Scholar Department of CSE P.SANDEEP REDDY 2 Associate professor Department of CSE DR.M.V.SIVA PRASAD 3 Principal Abstract:

Adaptive Windows for Duplicate Detection

Uwe Draisbach #1, Felix Naumann #2, Sascha Szott 3, Oliver Wonneberg +4 # Hasso-Plattner-Institute, Potsdam, Germany 1 uwe.draisbach@hpi.uni-potsdam.de 2 felix.naumann@hpi.uni-potsdam.de

Unity: Speeding the Creation of Community Vocabularies for Information Integration and Reuse

Ken Smith, Peter Mork, Len Seligman, Peter Leveille, Beth Yost, Maya Li, Chris Wolf The MITRE Corporation {kps,

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. Springer

Table of Contents 1. Introduction 1.1. What is the World Wide Web? 1.2. A Brief History of the Web

Distributed Database Management Systems M. Tamer Özsu and Patrick Valduriez

1998 M. Tamer Özsu and Patrick Valduriez Outline Introduction - Ch 1 Background - Ch 2, 3 Distributed DBMS Architecture - Ch 4 Distributed Database Design - Ch 5

IFRAT: An IoT Field Recognition Algorithm based on Time-series Data

Shuai Guo, Zhongwen Guo, Zhijin Qiu, Yingjian Liu and Yu Wang Ocean University of China, Qingdao, Shandong, China University of North Carolina at