Relational Clustering for Multi-type Entity Resolution

1 Relational Clustering for Multi-type Entity Resolution Indrajit Bhattacharya and Lise Getoor Department of Computer Science, University of Maryland Presented by Martin Leginus 13th of March, 2013

2 Agenda
Motivation, Related work, Use case scenarios, Problem formulation, Relational clustering, Similarity measures, Results, Discussion

3 Why is there a need for entity resolution?
The correspondence problem: two pictures refer to the same entity.
Natural language processing: recognizing which noun phrases refer to the same entity.
Data preprocessing: detection of duplicates.

4 Why is there a need for relational entity resolution?
Traditional approaches rely on textual similarity measures. Relational evidence might improve the accuracy of the resolution.
[Figure: Fig. 1 of "Collective Entity Resolution in Relational Data": (a) a reference graph for the simple example given in the text, over name references such as "J Doe", "Jim Doe", "James Doe", "Jason Doe", "Jon Doe", "Jonathan Doe", "Jean Doe", "Jeanette Doe" and "Jackie Doe"; (b) the resolved …]

5 Related work
Textual similarity is calculated between the descriptions of two entities.
Supervised algorithms learn string similarity measures from labelled data.
Performance is improved with a blocking approach.
Relational features have been considered for data integration problems.

6 Use case example
Two citations of the same paper:
Fast algorithms for mining association rules in large databases. Agrawal, Rakesh and Srikant, Ramakrishnan. In Proc. 20th Int. Conf. Very Large Data Bases, VLDB, 1994
Fast algorithms for mining association rules. Agrawal, R., Srikant, R. in VLDB-94, 1994
String edit distance does not work here. This is a multiple entity resolution problem, i.e., author, paper and venue entities all have to be resolved.

7 Joint resolution using entity relations
[Figure: the two citations drawn as hyper-edges h1 and h2. Reference labels: "R Agrawal", "R Srikant", "Fast Algorithms for Mining Association Rules", "VLDB 94" and "Rakesh Agrawal", "Ramakrishnan Srikant", "Fast Algorithms for Mining Association Rules in Large Databases", "Proc of the 20th Intl. Conference...". Node labels: references r1 to r8, entities e1 to e4, C1 and C2.]
Local and global resolution.
Positive and negative relational evidence.

8 Problem formulation
Entities and references are denoted by e and r.
Their attributes are denoted by e.a and r.a.
References are typed, and the type r.t is observed.
Each reference r corresponds to a hidden entity, so each r has an entity label r.e.
The problem is to discover the hidden set of entities $E = \{e_i\}$ and the entity label r.e of each reference.
References are observed as members of hyper-edges. The membership of a reference is stored in its hyper-edge label: r.h = h if $r \in h$.
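The notation above can be made concrete with a small data sketch. The following Python fragment is only an illustration: the class and field names are hypothetical, and the assignment of r1 to r8 to the labels of the two example citations is an assumption based on the order in which the labels appear in the figure.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Reference:
    name: str  # reference identifier such as "r1" (hypothetical field)
    t: str     # observed type r.t
    a: str     # observed attribute value r.a
    h: str     # hyper-edge label r.h


# Assumed mapping of r1..r8 to the two example citations (h1 and h2).
references = [
    Reference("r1", "author", "R Agrawal", "h1"),
    Reference("r2", "author", "R Srikant", "h1"),
    Reference("r3", "paper", "Fast Algorithms for Mining Association Rules", "h1"),
    Reference("r4", "venue", "VLDB 94", "h1"),
    Reference("r5", "author", "Rakesh Agrawal", "h2"),
    Reference("r6", "author", "Ramakrishnan Srikant", "h2"),
    Reference("r7", "paper", "Fast Algorithms for Mining Association Rules in Large Databases", "h2"),
    Reference("r8", "venue", "Proc of the 20th Intl. Conference...", "h2"),
]

# Hyper-edges are observed: group reference names by their label r.h.
hyper_edges = defaultdict(list)
for r in references:
    hyper_edges[r.h].append(r.name)

# The entity labels are hidden; this is the resolution we want to recover.
entity_label = {
    "r1": "e1", "r5": "e1",
    "r2": "e2", "r6": "e2",
    "r3": "e3", "r7": "e3",
    "r4": "e4", "r8": "e4",
}
```

Only `t`, `a` and `h` would be available to a resolution algorithm; `entity_label` plays the role of the ground truth.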

9 Problem formulation
[Figure: the two example citations as hyper-edges h1 and h2 over references r1 to r8, resolved to entities e1 to e4.]
The set of hidden entities is $E = \{e_1, e_2, e_3, e_4\}$, where
$r_1.E = r_5.E = e_1$, $r_2.E = r_6.E = e_2$, $r_3.E = r_7.E = e_3$, $r_4.E = r_8.E = e_4$.

10 Resolution by clustering
The goal is to group all references that correspond to the same entity into one cluster. The cluster membership of a reference is denoted r.c. All references in a cluster are of the same type.
1. At the beginning, each reference belongs to its own cluster.
2. At each step, the pair of clusters with the highest similarity, i.e. the pair most likely to be the same entity, is merged.
The general similarity is defined as
$sim(c_i, c_j) = (1 - \alpha)\, sim_{attr}(c_i, c_j) + \alpha\, sim_{rel}(c_i, c_j)$, where $0 \le \alpha \le 1$.
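The weighting between attribute and relational similarity can be sketched in a few lines of Python. The function name and the input values are hypothetical; the two component similarities are assumed to be computed elsewhere.

```python
def combined_similarity(sim_attr: float, sim_rel: float, alpha: float) -> float:
    """Convex combination of attribute and relational similarity."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return (1.0 - alpha) * sim_attr + alpha * sim_rel


# Illustrative values only: strong attribute match, weaker relational match.
score = combined_similarity(sim_attr=0.8, sim_rel=0.4, alpha=0.25)
```

With alpha = 0 the measure reduces to pure attribute similarity, and with alpha = 1 to pure relational similarity.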

11 Attribute and relational similarity
Attribute similarity: any basic similarity measure for two reference attributes. The similarity of two clusters is calculated between the most representative attributes of those clusters.
Relational similarity: a measure between two clusters that considers the clusters they link to via observed edges. Two variants are used: edge detail similarity and neighborhood similarity.

12 Edge detail similarity
Each cluster is associated with a set of hyper-edges:
$c.h = \{h \mid r.h = h \wedge r.c = c\}$
The similarity between two hyper-edges is defined as
$sim(h_i, h_j) = \sum_t sim_t(h_i, h_j)$
where
$sim_t(h_i, h_j) = Jaccard(\pi_t(h_i), \pi_t(h_j))$ and $\pi_t(h) = \{c \mid r.c = c \wedge c.t = t \wedge r.h = h\}$
The final similarity is defined as
$sim_{rel}(c_i, c_j) = \max_{(h_i, h_j)} sim(h_i, h_j)$, where $h_i \in c_i.h$ and $h_j \in c_j.h$
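A minimal Python sketch of edge detail similarity follows. It assumes that each hyper-edge is represented as a dictionary from type to the set of cluster labels of that type (the projection π_t(h)), and that the per-type Jaccard scores are combined by a plain sum; a weighted combination would be an equally plausible reading. All names and the example data are hypothetical.

```python
from itertools import product


def jaccard(a: set, b: set) -> float:
    """Jaccard coefficient of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def edge_similarity(h_i: dict, h_j: dict) -> float:
    """Sum over types t of Jaccard(pi_t(h_i), pi_t(h_j))."""
    types = set(h_i) | set(h_j)
    return sum(jaccard(h_i.get(t, set()), h_j.get(t, set())) for t in types)


def edge_detail_similarity(edges_i: list, edges_j: list) -> float:
    """Best-matching pair of hyper-edges between two clusters."""
    pairs = product(edges_i, edges_j)
    return max((edge_similarity(a, b) for a, b in pairs), default=0.0)


# Hypothetical hyper-edges: type -> set of cluster labels.
h1 = {"author": {"c1", "c2"}, "venue": {"c4"}}
h2 = {"author": {"c1", "c3"}, "venue": {"c4"}}
h3 = {"author": {"c9"}}

best = edge_detail_similarity([h1], [h2, h3])
```

Here `h1` and `h2` share one of three author clusters and the same venue cluster, so their similarity is 1/3 + 1, and that pair determines the maximum.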

13 Neighborhood similarity
The similarity between two clusters is defined as
$sim_{rel}(c_i, c_j) = Jaccard(N_t(c_i), N_t(c_j))$
where
$N_t(c) = \biguplus_{h \in c.h} \pi_t(h)$
is a multiset union, so the obtained neighborhoods are multisets.
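Neighborhood similarity can be sketched with `collections.Counter` as the multiset type. The multiset Jaccard used here (sum of minimum counts over sum of maximum counts) is one common definition and is an assumption, as are the function names and the representation of hyper-edges as dictionaries from type to cluster labels.

```python
from collections import Counter


def neighborhood(edges: list, t: str) -> Counter:
    """Multiset union of pi_t(h) over all hyper-edges h of a cluster."""
    n = Counter()
    for h in edges:
        n.update(h.get(t, []))
    return n


def multiset_jaccard(a: Counter, b: Counter) -> float:
    """Sum of min counts divided by sum of max counts."""
    inter = sum((a & b).values())  # Counter & keeps the minimum count
    union = sum((a | b).values())  # Counter | keeps the maximum count
    return inter / union if union else 0.0


# Hypothetical clusters, each described by the hyper-edges it takes part in.
edges_ci = [{"author": ["c2", "c3"]}, {"author": ["c2"]}]
edges_cj = [{"author": ["c2"]}, {"author": ["c4"]}]

n_i = neighborhood(edges_ci, "author")  # c2 twice, c3 once
n_j = neighborhood(edges_cj, "author")  # c2 once, c4 once
sim_rel = multiset_jaccard(n_i, n_j)
```

Unlike edge detail similarity, this measure ignores which neighbors co-occur in the same hyper-edge and only counts how often each neighbor appears.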

14 Implementation
Greedy agglomerative clustering merges the closest cluster pair at each step.
Candidate pairs are selected with a blocking approach and kept in a priority queue sorted by similarity.
During the initial phase, references with identical attributes ($v_1 = v_2$), or where one is the initialed form of the other, are merged.
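The greedy merging loop with a priority queue can be sketched as follows. This is a simplified illustration, not the presented implementation: it scores all pairs instead of using blocking, it does not re-score neighboring clusters after a merge (which relational similarity would require), and the stopping threshold, function names and toy similarity are all assumptions.

```python
import heapq
from itertools import combinations


def greedy_cluster(items, sim, threshold):
    """Repeatedly merge the most similar cluster pair until none reaches threshold."""
    clusters = {i: [x] for i, x in enumerate(items)}
    next_id = len(clusters)
    # heapq is a min-heap, so similarities are stored negated.
    heap = [(-sim(clusters[a], clusters[b]), a, b)
            for a, b in combinations(clusters, 2)]
    heapq.heapify(heap)
    while heap:
        neg_sim, a, b = heapq.heappop(heap)
        if -neg_sim < threshold:
            break  # every remaining pair is at most this similar
        if a not in clusters or b not in clusters:
            continue  # stale entry: one side was already merged away
        merged = clusters.pop(a) + clusters.pop(b)
        for other, members in clusters.items():
            heapq.heappush(heap, (-sim(merged, members), next_id, other))
        clusters[next_id] = merged
        next_id += 1
    return list(clusters.values())


def name_sim(x: str, y: str) -> float:
    """Toy attribute similarity: same last name and same first initial."""
    return 1.0 if x.split()[-1] == y.split()[-1] and x[0] == y[0] else 0.0


def cluster_sim(c1, c2) -> float:
    """Toy cluster similarity: best-matching pair of member names."""
    return max(name_sim(x, y) for x in c1 for y in c2)


names = ["R Agrawal", "Rakesh Agrawal", "R Srikant", "Ramakrishnan Srikant"]
result = greedy_cluster(names, cluster_sim, threshold=0.5)
```

Cluster identifiers are never reused, so an entry in the queue is stale exactly when one of its identifiers no longer exists; this avoids having to delete entries from the heap.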

15 Datasets and baseline methods
The CiteSeer dataset contains 2892 references with 1165 authors, contained in 1504 documents.
The arXiv dataset contains references with 9200 authors, contained in papers.
The baseline method ATTR is based on SoftTF-IDF, where the secondary distance measure can be Jaro-Winkler, Jaro or scaled Levenshtein distance.
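As an illustration of one of the secondary measures, the sketch below computes Levenshtein distance and scales it to a similarity in [0, 1]. The scaling by the length of the longer string is an assumption; the slides do not state which scaling the baseline uses, and this is not SoftTF-IDF itself.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def scaled_levenshtein_similarity(a: str, b: str) -> float:
    """1 - distance / length of the longer string (assumed scaling)."""
    longest = max(len(a), len(b))
    return 1.0 - levenshtein(a, b) / longest if longest else 1.0


distance = levenshtein("R Agrawal", "Rakesh Agrawal")
similarity = scaled_levenshtein_similarity("R Agrawal", "Rakesh Agrawal")
```

The abbreviated and full author names differ by five inserted characters out of fourteen, which shows why purely textual measures give such pairs only a moderate score.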

16 Accuracy results with different similarity measures

17 Precision, recall and F1 results for both datasets

18 Performance
[Figure: Execution time. x-axis: Number of References (in Thousands); y-axis: CPU time (secs). Series: ATTR*, RC (Nbr) with Bootstrap, RC (Nbr) w/o Bootstrap, RC (Edge) with Bootstrap, RC (Edge) w/o Bootstrap.]

19 Attribute vs relational similarity effects on accuracy
[Figure: six panels, "Varying alpha" with Jaro, Jaro Winkler and Scaled Levenshtein for CiteSeer and for HEP. x-axis: alpha; y-axis: best F. Series: RC-ER (Nbr), RC-ER (Edge), ATTR, ATTR*.]

20 Conclusions
Two relational similarity measures were introduced.
Relational similarity in combination with attribute similarity outperforms non-relational approaches.
Bootstrapping and blocking were used successfully to improve performance.
