Big Data Integration

Source schemas:
  S1  (name, games, runs)
  S2  (name, team, score)
  S3  a: (id, name); b: (id, team, runs)
  S4  (name, club, matches)
  S5  (name, team, matches)

Mediated schema:
  MS  (n, t, g, s)

Attribute matching (MSAM):
  MS.n: S1.name, S2.name, ...
  MS.t: S2.team, S4.club, ...
  MS.g: S1.games, S4.matches, ...
  MS.s: S1.runs, S2.score, ...
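As a rough illustration of how an attribute matching like MSAM can be used, the sketch below stores the listed pairs in a dictionary and renames the attributes of a source record to their mediated names. The function names and the sample records are hypothetical, and only the pairs that appear on the slide are included.

```python
# Attribute matching from the slide: mediated attribute -> matched source attributes.
# Only the pairs listed on the slide are included; the slide's lists are open-ended.
ATTRIBUTE_MATCHING = {
    "n": ["S1.name", "S2.name"],
    "t": ["S2.team", "S4.club"],
    "g": ["S1.games", "S4.matches"],
    "s": ["S1.runs", "S2.score"],
}


def invert_matching(matching):
    """Build a lookup from (source, attribute) to the mediated attribute."""
    lookup = {}
    for mediated_attr, source_attrs in matching.items():
        for qualified in source_attrs:
            source, attr = qualified.split(".", 1)
            lookup[(source, attr)] = mediated_attr
    return lookup


def to_mediated(source, record, lookup):
    """Rename the matched attributes of one source record; unmatched ones are dropped."""
    return {
        lookup[(source, attr)]: value
        for attr, value in record.items()
        if (source, attr) in lookup
    }


LOOKUP = invert_matching(ATTRIBUTE_MATCHING)

# Hypothetical records, invented for illustration only.
s1_record = {"name": "Player A", "games": 10, "runs": 250}
s4_record = {"name": "Player A", "club": "Club X", "matches": 10}

print(to_mediated("S1", s1_record, LOOKUP))
print(to_mediated("S4", s4_record, LOOKUP))
```

Because S4.name is not among the pairs listed for MS.n, the sketch drops it from the S4 record; a complete matching would presumably cover it.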

Schema mapping (MSSM):
  ∀ n, t, g, s (MS(n, t, g, s) ← S1(n, g, s) ∨ S2(n, t, s) ∨ ∃ i (S3a(i, n) & S3b(i, t, s)) ∨ S4(n, t, g) ∨ S5(n, t, g))
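One way to read the schema mapping is that every source contributes tuples to MS, with the two S3 tables joined on their shared id. The sketch below follows that reading. The function name, the use of None for values a source does not provide, and the sample rows are all assumptions made for illustration.

```python
def populate_mediated(s1, s2, s3a, s3b, s4, s5):
    """Collect MS(n, t, g, s) tuples; None marks a value the source does not provide."""
    ms = []
    ms += [(n, None, g, s) for (n, g, s) in s1]        # S1(n, g, s)
    ms += [(n, t, None, s) for (n, t, s) in s2]        # S2(n, t, s)
    names_by_id = dict(s3a)                            # S3a(i, n)
    ms += [(names_by_id[i], t, None, s)                # join S3a and S3b on i
           for (i, t, s) in s3b if i in names_by_id]
    ms += [(n, t, g, None) for (n, t, g) in s4]        # S4(n, t, g)
    ms += [(n, t, g, None) for (n, t, g) in s5]        # S5(n, t, g)
    return ms


# Hypothetical rows, invented for illustration only.
s1 = [("Player A", 10, 250)]
s2 = [("Player A", "Team X", 250)]
s3a = [(1, "Player B")]
s3b = [(1, "Team Y", 90), (2, "Team Z", 5)]  # id 2 has no name in S3a
s4 = [("Player C", "Club W", 7)]
s5 = []

for row in populate_mediated(s1, s2, s3a, s3b, s4, s5):
    print(row)
```

The S3b row with id 2 produces no mediated tuple because the join finds no matching name, which mirrors the existential quantifier over i in the mapping.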

            S1    S2    S3    S4    S5
Jagadish    UM    ATT   UM    UM    UI
Dewitt      MSR   MSR   UW    UW    UW
Bernstein   MSR   MSR   MSR   MSR   MSR
Carey       UCI   ATT   BEA   BEA   BEA
Franklin    UCB   UCB   UMD   UMD   UMD
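A simple baseline for resolving the conflicts in this table is majority voting over the values the sources report for each name. The sketch below applies it to the five sources. Voting is used here only as an assumed baseline, not as the method the slides advocate, and the function name is hypothetical.

```python
from collections import Counter

# Values reported by S1..S5 for each name, copied from the table.
CLAIMS = {
    "Jagadish":  ["UM", "ATT", "UM", "UM", "UI"],
    "Dewitt":    ["MSR", "MSR", "UW", "UW", "UW"],
    "Bernstein": ["MSR", "MSR", "MSR", "MSR", "MSR"],
    "Carey":     ["UCI", "ATT", "BEA", "BEA", "BEA"],
    "Franklin":  ["UCB", "UCB", "UMD", "UMD", "UMD"],
}


def majority_vote(values):
    """Return the most frequent value, or None when the top count is tied."""
    counts = Counter(values).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


fused = {name: majority_vote(values) for name, values in CLAIMS.items()}
for name, value in fused.items():
    print(name, value)
```

Voting treats every source as equally reliable and independent, so a value repeated by several sources wins regardless of where it originated.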

            S1    S2    S3
Jagadish    UM    ATT   UM
Dewitt      MSR   MSR   UW
Bernstein   MSR   MSR   MSR
Carey       UCI   ATT   BEA
Franklin    UCB   UCB   UMD
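With only S1 to S3, counting votes no longer settles every row. The sketch below parses the three-source table from text and returns every value that shares the top count, so ties stay visible. The parsing format and function names are assumptions made for illustration.

```python
from collections import Counter

# Rows of the three-source table: name followed by the S1, S2, S3 values.
TABLE = """
Jagadish UM ATT UM
Dewitt MSR MSR UW
Bernstein MSR MSR MSR
Carey UCI ATT BEA
Franklin UCB UCB UMD
"""


def top_values(values):
    """Return all values that share the highest count, sorted alphabetically."""
    counts = Counter(values)
    best = max(counts.values())
    return sorted(value for value, count in counts.items() if count == best)


def vote_table(text):
    """Map each name in the table to its list of top-voted values."""
    result = {}
    for line in text.strip().splitlines():
        name, *values = line.split()
        result[name] = top_values(values)
    return result


winners = vote_table(TABLE)
for name, values in winners.items():
    print(name, values)
```

On this subset Carey has three different values with one vote each, so voting alone cannot pick one. Dewitt and Franklin come out as MSR and UCB, whereas in the five-source table UW and UMD each hold three of the five votes.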

References

1. Doan, A., A. Halevy, and Z. Ives, Principles of data integration. Elsevier.
2. Dong, X.L. and D. Srivastava, Big data integration. Synthesis Lectures on Data Management.
3. Chen, J., et al., Big data challenge: a data management perspective. Frontiers of Computer Science.
4. Bernstein, P.A., J. Madhavan, and E. Rahm, Generic schema matching, ten years later. Proceedings of the VLDB Endowment.
5. Fagin, R., et al., Clio: Schema mapping creation and data exchange, in Conceptual Modeling: Foundations and Applications, Springer Berlin Heidelberg.
6. Dong, X.L., A. Halevy, and C. Yu, Data integration with uncertainty. The VLDB Journal: The International Journal on Very Large Data Bases.
7. Franklin, M., A. Halevy, and D. Maier, From databases to dataspaces: a new abstraction for information management. ACM SIGMOD Record.
8. Das Sarma, A., X. Dong, and A. Halevy, Bootstrapping pay-as-you-go data integration systems, in Proceedings of the 2008 ACM SIGMOD international conference on Management of data.
9. Wang, Z., et al., A unified approach to matching semantic data on the web. Knowledge-Based Systems.
10. Kang, L., L. Yi, and L. Dong, Research on Construction Methods of Big Data Semantic Model, in Proceedings of the World Congress on Engineering.
11. Chuang, S.-L. and K.C.-C. Chang, Integrating web query results: holistic schema matching, in Proceedings of the 17th ACM conference on Information and knowledge management.
12. Cafarella, M.J., et al., Webtables: exploring the power of tables on the web. Proceedings of the VLDB Endowment.
13. Das Sarma, A., et al., Finding related tables, in Proceedings of the 20 ACM SIGMOD International Conference on Management of Data.
14. Limaye, G., S. Sarawagi, and S. Chakrabarti, Annotating and searching web tables using entities, types and relationships. Proceedings of the VLDB Endowment.

15. Torre-Bastida, A.I., et al., Semantic Information Fusion of Linked Open Data and Social Big Data for the Creation of an Extended Corporate CRM Database, in Intelligent Distributed Computing VIII.
16. Kum, H.-C., et al., Privacy preserving interactive record linkage (PPIRL). Journal of the American Medical Informatics Association.
17. Fan, W., et al., Reasoning about record matching rules. Proceedings of the VLDB Endowment.
18. Fellegi, I.P. and A.B. Sunter, A theory for record linkage. Journal of the American Statistical Association.
19. Elmagarmid, A.K., P.G. Ipeirotis, and V.S. Verykios, Duplicate record detection: A survey. IEEE Transactions on Knowledge and Data Engineering.
20. Hernández, M.A. and S.J. Stolfo, Real-world data is dirty: Data cleansing and the merge/purge problem. Data Mining and Knowledge Discovery.
21. Efthymiou, V., K. Stefanidis, and V. Christophides, Big data entity resolution: From highly to somehow similar entity descriptions in the Web. IEEE Big Data.
22. Köpcke, H., A. Thor, and E. Rahm, Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment.
23. Kolb, L., A. Thor, and E. Rahm, Load balancing for mapreduce-based entity resolution, in Data Engineering (ICDE), 20 IEEE 28th International Conference on.
24. Papadakis, G., et al., Meta-blocking: Taking entity resolution to the next level. IEEE Transactions on Knowledge and Data Engineering.
25. Li, F., et al., Distributed data management using MapReduce. ACM Computing Surveys (CSUR).
26. Efthymiou, V., et al., Parallel Meta-blocking: Realizing Scalable Entity Resolution over Large, Heterogeneous Data. IEEE Big Data.
27. Whang, S.E. and H. Garcia-Molina, Incremental entity resolution on rules and data. The VLDB Journal: The International Journal on Very Large Data Bases.

28. Gruenheid, A., X.L. Dong, and D. Srivastava, Incremental record linkage. Proceedings of the VLDB Endowment, 2014.
29. Kannan, A., et al., Matching unstructured product offers to structured product specifications, in Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining.
30. Li, P., et al., Linking temporal records. Proceedings of the VLDB Endowment.
31. Guo, S., et al., Record linkage with uniqueness constraints and erroneous values. Proceedings of the VLDB Endowment.
32. Dong, X.L., L. Berti-Equille, and D. Srivastava, Integrating conflicting data: the role of source dependence. Proceedings of the VLDB Endowment.
33. Yin, X. and W. Tan, Semi-supervised truth discovery, in Proceedings of the 20th international conference on World wide web.
34. Galland, A., et al., Corroborating information from disagreeing views, in Proceedings of the third ACM international conference on Web search and data mining.
35. Zhao, B. and J. Han, A probabilistic model for estimating real-valued truth from conflicting sources. Proc. of QDB.
36. Pochampally, R., et al., Fusing data with correlations, in Proceedings of the 2014 ACM SIGMOD international conference on Management of data.
37. Dong, X.L., et al., From data fusion to knowledge fusion. Proceedings of the VLDB Endowment.
38. Liu, X., et al., Online data fusion. Proceedings of the VLDB Endowment.
39. Dong, X.L., L. Berti-Equille, and D. Srivastava, Truth discovery and copying detection in a dynamic world. Proceedings of the VLDB Endowment.
