Entity Resolution over Graphs


Bingxin Li
Supervisor: Dr. Qing Wang
Australian National University
Semester 1, 2014

Acknowledgements

I would like to take this opportunity to thank my supervisor, Dr. Qing Wang, who has given me valuable support, encouragement and advice, without which this work would not have been completed. I am thankful to my good friends, Sid Xiao and Nicolas Cui, for their support. I would also like to thank my parents and Dongxia Bai for their valuable support and constant encouragement.

Abstract

Entity resolution (ER) is a long-standing challenge in many areas, such as data management and information retrieval. Its task is to identify which records in one or more databases refer to the same real-world entity. Traditional entity resolution techniques are based on calculating the string similarity of attributes, such as name and address for people. However, records that look similar do not necessarily refer to the same entity, so traditional techniques have limitations when dealing with complex data sets. To take the relational similarity between records into consideration, collective entity resolution techniques have been proposed. This project investigates existing entity resolution techniques and proposes a new entity resolution technique over graphs, connectivity entity resolution, which can enrich collective entity resolution techniques. According to the experimental results, connectivity entity resolution achieves better results than collective entity resolution on the Cora data set.

Contents

1 Introduction
  1.1 Motivating example
  1.2 Objectives
  1.3 Contribution and organisation
2 Background of Entity Resolution
  2.1 General Entity Resolution Process
  2.2 Entity Resolution Process for this Project
3 Methodology
  3.1 Preliminaries
  3.2 Attribute-based Entity Resolution
    3.2.1 Levenshtein Similarity
    3.2.2 Cosine Similarity
    3.2.3 Jaccard Similarity
  3.3 Collective Entity Resolution
  3.4 Connectivity Entity Resolution
4 Experiments
  4.1 Data Set
  4.2 Measures
  4.3 Experiments on Attribute-based ER
  4.4 Shortest Distance in Record Graphs
  4.5 Experiment on Collective Entity Resolution
  4.6 Experiments on Connectivity Entity Resolution
  4.7 Results Comparison
5 Conclusion
Appendices
  A Relational Jaccard Similarity Evaluation
  B Density-based Similarity Evaluation
  C Codes

Chapter 1 Introduction

Entity resolution (ER) is the task of identifying which records in one or more databases refer to the same real-world entity. It is a well-known problem that arises in many areas, such as artificial intelligence, information retrieval and database management [3]. Entity resolution is also known by other names, including record linkage [9], entity matching [5], and deduplication or duplicate detection [7].

Traditional entity resolution techniques calculate the string similarity of attribute values (e.g. name and address for people) and use the result to decide whether two records should be matched. For example, to identify whether David Aha and D. Aha are the same person, we can calculate the string similarity between the two names. These techniques are often useful, but they do not always yield high accuracy. One of the reasons is that some attribute values may be similar without referring to the same real-world entity.

Some recent studies of entity resolution take relational similarity into account [Kalashnikov et al. 2005]; this is called collective entity resolution. Collective entity resolution techniques consider not only string similarity but also the relational similarity between records, yielding more accurate results at a higher computational cost. For example, with such a technique the records Jane Aha and Jane Lindsay are resolved using both string similarity and relational similarity: if Jane Aha and Jane Lindsay have the same parents, they are likely to be the same person even though the string similarity between the two names is not high. Collective entity resolution often uses the neighbours of nodes in a record graph as the relation. Although it is effective in addressing the entity resolution problem, its computational efficiency is often a concern.

1.1 Motivating example

Consider the problem of resolving researchers in a database. Assume that there are four records and each record has three attributes (name, affiliation, email):

r1: The name is M. J., the organisation is ANU and the email is mj@gmail.com;
r2: The name is Michael Jordan, the organisation is ANU and the email is mj@anu.edu.au;
r3: The name is M. Jordan, the organisation is UC and the email is mj@uc.edu.au;
r4: The name is M. J., the organisation is UNSW and the email is mj@unsw.edu.au.

Figure 1.1: An example database with four records

The example is shown in Figure 1.1. Now imagine that we would like to find out which of these four records refer to the same author entity.

(1) When using traditional entity resolution techniques, we can compare how similar their names and affiliations are. Because r1 and r2 have the same affiliation (ANU), and M. J., M. Jordan and Michael Jordan could be three forms of one name, we may determine that r1, r2 and r3 refer to the same entity. This gives us the set of clusters (r1, r2, r3), (r4), shown in Figure 1.2.

(2) When using collective entity resolution techniques, we would also consider how these records are related and how similar their relationships are. We may then conclude that r2 and r3 refer to the same entity, but r1 refers to a different one. This gives us the set of clusters (r1), (r2, r3), (r4), shown in Figure 1.3.

Figure 1.2: Clustering using traditional entity resolution

Figure 1.3: Clustering using collective entity resolution

(3) However, r4 may refer to the same entity as r2 and r3. Suppose that we have a record graph, shown in Figure 1.4, that represents the connectivity relationships between records. According to these relationships, we may obtain the set of clusters (r2, r3, r4), (r1), shown in Figure 1.5. This project focuses on exploring graph connectivity in entity resolution.

1.2 Objectives

This project investigates how collective entity resolution techniques can be enriched by graph properties such as the connectivity of records. The specific tasks are:

(1) To conduct a literature review on entity resolution techniques.
(2) To incorporate graph properties such as the connectivity of records into a framework of collective entity resolution.
(3) To evaluate the above approach using one or two real data sets.

Figure 1.4: Record graph

Figure 1.5: Clustering using connectivity entity resolution

1.3 Contribution and organisation

In this project, I propose a connectivity entity resolution approach based on a novel supervised relational clustering algorithm. The specific contributions of this report are as follows:

- I proposed a connectivity entity resolution approach and evaluated its effectiveness.
- I compared the effectiveness of collective entity resolution and connectivity entity resolution.
- I conducted experiments on a real-world dataset.

The rest of this report is organised as follows. In Chapter 2, I present related work on entity resolution. In Chapter 3, three methodologies for entity resolution are discussed. In Chapter 4, I present my results and discuss different similarity thresholds on a real-world dataset.

In Chapter 5, I conclude the report and discuss future work.

Chapter 2 Background of Entity Resolution

2.1 General Entity Resolution Process

A general entity resolution process [6] is shown in Figure 2.1. Given data from one or more databases, the first step is data pre-processing, which ensures that the data are in the same format. The second step is indexing, whose purpose is to reduce the quadratic complexity of comparing every record with every other record. The third step is record pair comparison, in which the candidate record pairs generated from the indexing data structures are compared. In the classification step, candidate record pairs are classified into matches, non-matches and potential matches. If record pairs are classified as potential matches, a clerical review is needed. Finally, the quality and completeness of the matched data are assessed in the evaluation step.

Figure 2.1: General entity resolution process [6]
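To illustrate how indexing can reduce the number of comparisons, the following is a minimal sketch of one common indexing idea, blocking, in which only records that share a key are compared. It is not the process from [6]. The class name `BlockingSketch`, the choice of the first character of the name as the blocking key, and the sample names are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the indexing (blocking) step; not the implementation from [6].
final class BlockingSketch {

    // Assumed blocking key: the lower-cased first character of the name.
    static String blockingKey(String name) {
        String trimmed = name.trim().toLowerCase();
        return trimmed.isEmpty() ? "" : trimmed.substring(0, 1);
    }

    // Returns index pairs {i, j} with i < j for records that share a blocking key.
    static List<int[]> candidatePairs(List<String> names) {
        Map<String, List<Integer>> blocks = new LinkedHashMap<>();
        for (int i = 0; i < names.size(); i++) {
            blocks.computeIfAbsent(blockingKey(names.get(i)), k -> new ArrayList<>()).add(i);
        }
        List<int[]> pairs = new ArrayList<>();
        for (List<Integer> block : blocks.values()) {
            for (int a = 0; a < block.size(); a++) {
                for (int b = a + 1; b < block.size(); b++) {
                    pairs.add(new int[] {block.get(a), block.get(b)});
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("David Aha", "D. Aha", "Jane Aha", "Jane Lindsay");
        // Only pairs inside the same block are compared, instead of all n(n-1)/2 pairs.
        for (int[] pair : candidatePairs(names)) {
            System.out.println(names.get(pair[0]) + " <-> " + names.get(pair[1]));
        }
    }
}
```

With these four sample names, two candidate pairs are generated instead of the six pairs of an exhaustive comparison. A coarser or finer key trades missed matches against the number of comparisons.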

2.2 Entity Resolution Process for this Project

Figure 2.2: Entity Resolution Process

In this project, I focus on clustering and matching records for entity resolution. The process is shown in Figure 2.2:

1. Retrieve records from one or more databases.
2. Construct an initial graph based on the records.
3. Decide matches based on attribute-based entity resolution.
4. Generate clusters or transform clusters:
   - Generate clusters using the threshold of attribute-based entity resolution.
   - Transform clusters using collective entity resolution or connectivity entity resolution, depending on the situation.
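Step 2 can be sketched as follows, assuming (as in the experiments of Section 4.4) that each record has a record id `aid` and a publication id `pid`, and that records sharing a publication are linked. The class name `RecordGraphSketch`, the `{aid, pid}` row layout and the adjacency-map representation are illustrative assumptions, not the report's actual code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical sketch of step 2: building a record graph from (aid, pid) rows.
final class RecordGraphSketch {

    // Each row is {aid, pid}. Records that share a pid are linked by an undirected edge.
    static Map<Integer, Set<Integer>> build(int[][] rows) {
        Map<Integer, Set<Integer>> adjacency = new TreeMap<>();
        Map<Integer, List<Integer>> recordsByPublication = new HashMap<>();
        for (int[] row : rows) {
            adjacency.computeIfAbsent(row[0], k -> new TreeSet<>());
            recordsByPublication.computeIfAbsent(row[1], k -> new ArrayList<>()).add(row[0]);
        }
        for (List<Integer> coAuthors : recordsByPublication.values()) {
            for (int a : coAuthors) {
                for (int b : coAuthors) {
                    if (a != b) {
                        adjacency.get(a).add(b);
                    }
                }
            }
        }
        return adjacency;
    }

    public static void main(String[] args) {
        // Records 1 and 2 share publication 1; record 3 is the only author of publication 2.
        int[][] rows = {{1, 1}, {2, 1}, {3, 2}};
        System.out.println(build(rows));
    }
}
```

In this small example, records 1 and 2 become neighbours while record 3 remains an isolated node.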

Chapter 3 Methodology

3.1 Preliminaries

The record graph is the foundation of all work in this project. Records from databases are treated as nodes, and the edges of the graph are specific relationships between records in the databases.

In general, a cluster is a group of records related by some specific relation, such that the records of each cluster differ from the records of other clusters with respect to that relation. In this project, a cluster is defined as the group of nodes referring to the same real-world entity. That is, in a record graph, a node refers to the same entity as the other nodes in its cluster and to a different entity from the nodes in other clusters.

A shortest path between two nodes in a graph is a path for which the sum of the edge weights is minimised, and the shortest distance is the total weight of a shortest path between the two nodes. In this project, the shortest distance is used as a parameter in connectivity entity resolution.

3.2 Attribute-based Entity Resolution

Attribute-based entity resolution is a traditional technique, and string similarity is a common measure for calculating the similarity between two records from their attribute values. String similarity can be computed by many different algorithms. This report discusses three common ones.

3.2.1 Levenshtein Similarity

Levenshtein similarity uses the Levenshtein distance to calculate the similarity between two strings, the source string $s_1$ and the target string $s_2$. The distance is the minimum number of deletions, insertions or substitutions required to transform $s_1$ into $s_2$. The greater the Levenshtein distance, the more different the strings are. The Levenshtein distance is one form of edit distance. It is calculated using a dynamic programming algorithm that builds a matrix recording the cost (number of edits) of transforming every prefix of $s_1$ into every prefix of $s_2$. A cell of this matrix is denoted by $d[i, j]$, with row $i$ ($0 \le i \le |s_1|$) and column $j$ ($0 \le j \le |s_2|$). The first row and column are filled in such that $d[i, 0] = i$ for $0 \le i \le |s_1|$ and $d[0, j] = j$ for $0 \le j \le |s_2|$. The remaining cells follow the formula:

\[
d[i, j] =
\begin{cases}
d[i-1, j-1] & \text{if } s_1[i] = s_2[j] \\
\min\bigl(d[i-1, j] + 1,\; d[i, j-1] + 1,\; d[i-1, j-1] + 1\bigr) & \text{if } s_1[i] \neq s_2[j]
\end{cases}
\]

where the three terms of the minimum correspond to a deletion, an insertion and a substitution, respectively.

The Levenshtein similarity is calculated from the Levenshtein distance by the formula:

\[
Sim_{Levenshtein}(s_1, s_2) = 1 - \frac{Distance_{Levenshtein}(s_1, s_2)}{\max(|s_1|, |s_2|)}
\]

The higher the edit distance, the lower the similarity between the two strings.

3.2.2 Cosine Similarity

Cosine similarity is another similarity measure. It is often used in information retrieval to calculate the similarity of two documents, but it is also useful for calculating string similarity. It is a measure of similarity between two vectors of an inner product space that measures the cosine of the angle between them. It is derived from the Euclidean dot product formula:

\[
a \cdot b = \|a\| \, \|b\| \cos\theta
\]

Given two attribute vectors $A$ and $B$, the cosine similarity is:

\[
Sim_{Cosine} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \, \sqrt{\sum_{i=1}^{n} (B_i)^2}}
\]

The value of $Sim_{Cosine}$ ranges from -1, meaning exactly opposite, to 1, meaning exactly the same; 0 indicates independence, and values in between indicate intermediate degrees of similarity.

3.2.3 Jaccard Similarity

Jaccard similarity (also called the Jaccard similarity coefficient) is a measure for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]

(If $A$ and $B$ are both empty, $J(A, B) = 0$.) Clearly, $0 \le J(A, B) \le 1$.

For string similarity, I use 2-grams to transform two strings into two sets $A$ and $B$. For example, if the source string is "similarity" and the target string is "similar", they are transformed into set A: {si, im, mi, il, la, ar, ri, it, ty} and set B: {si, im, mi, il, la, ar}. Then $|A \cap B| = 6$ and $|A \cup B| = 9$, so

\[
Sim_{Jaccard} = \frac{|A \cap B|}{|A \cup B|} = \frac{6}{9} \approx 0.67
\]

In this project, string similarity is calculated using Jaccard similarity. In practice, the three measures do not produce very different results for string similarity, but Jaccard similarity is more efficient.

To some extent, attribute-based similarity is enough to solve entity resolution. However, when two different records have similar attribute strings, attribute-based similarity alone cannot resolve them. In this situation, approaches to entity resolution over graphs, such as collective entity resolution, can be useful.

3.3 Collective Entity Resolution

Collective entity resolution is one approach to entity resolution over graphs. Its main idea is relational similarity. For example, if two records have the same name attribute, D. Aha, the neighbours of the two D. Aha nodes in a graph can provide evidence. If the two D. Aha records have different parents, there is a high probability that they do not refer to the same entity. In collective entity resolution, resolution decisions are not made independently; instead, one resolution decision affects other resolutions via edges or relations.

In addition, entity resolution can be framed as a clustering problem whenever a similarity measure between pairs of records is given. The goal is to cluster the records so that only those that correspond to the same entity are assigned to the same cluster [1]. Attribute-based similarity was discussed in the previous section. For collective entity resolution, I use the clustering approach combined with attribute-based similarity. We define the similarity of two clusters $c_i$ and $c_j$ as:

\[
sim(c_i, c_j) = (1 - \alpha) \, sim_A(c_i, c_j) + \alpha \, sim_R(c_i, c_j), \quad 0 \le \alpha \le 1
\]

where $sim_A()$ is the similarity of the attributes and $sim_R()$ is the relational similarity between the records in the two clusters; $c_i$ and $c_j$ stand for different clusters. The weight $\alpha$ balances the two similarities to ensure high accuracy in different situations, and it depends on the data set. I determine the value of $\alpha$ by analysis in the experiments.

For relational similarity, the relationships or neighbours of records in a record graph are regarded as important factors. In this project, I use the number of common neighbours to calculate a Jaccard similarity, which is the main measure for collective entity resolution. Here, Jaccard similarity takes the neighbourhoods of clusters as its arguments. For two different clusters $c_i$ and $c_j$, it is given by:

\[
Sim_{Jaccard}(c_i, c_j) = \frac{|Nbr(c_i) \cap Nbr(c_j)|}{|Nbr(c_i) \cup Nbr(c_j)|}
\]

Here $|Nbr(c_i) \cap Nbr(c_j)|$ is the number of common neighbours of clusters $c_i$ and $c_j$, and $|Nbr(c_i) \cup Nbr(c_j)|$ is the number of all neighbours of $c_i$ and $c_j$.

3.4 Connectivity Entity Resolution

In this project, the graph is the core concept. Clusters in graphs can also be defined through connectivity, by calculating the number of paths that exist between each pair of nodes [8]. For nodes to belong to the same cluster, they should be highly connected to each other [4]. Through the connectivity between nodes in a graph, we can obtain more relationships that provide evidence for clustering. The formula for connectivity entity resolution is:

\[
sim(c_i, c_j) = (1 - \beta) \, sim_A(c_i, c_j) + \beta \, sim_C(c_i, c_j), \quad 0 \le \beta \le 1
\]

where $sim_A()$ is the attribute-based similarity and $sim_C()$ is the connectivity similarity between two nodes in a record graph; $c_i$ and $c_j$ are different clusters. The parameter $\beta$ plays the same role as $\alpha$ in the previous section. I determine it in the experiments.
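Both combined scores above have the same shape, a weighted sum of an attribute-based similarity and a graph-based similarity. The following is a minimal sketch of that shape, using 2-gram Jaccard similarity for the attribute part and neighbourhood Jaccard similarity for the relational part. The class and method names, the neighbour ids and the weight 0.2 used in `main` are assumptions for illustration only, not the report's implementation.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; the class and method names are illustrative only.
final class CombinedSimilaritySketch {

    // Character 2-grams of a string, collected as a set.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) {
            grams.add(s.substring(i, i + 2));
        }
        return grams;
    }

    // Jaccard coefficient: intersection size over union size (0 if both sets are empty).
    static <T> double jaccard(Set<T> a, Set<T> b) {
        if (a.isEmpty() && b.isEmpty()) {
            return 0.0;
        }
        Set<T> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        int unionSize = a.size() + b.size() - intersection.size();
        return (double) intersection.size() / unionSize;
    }

    // (1 - weight) * simA + weight * simOther, where weight plays the role of alpha or beta.
    static double combined(double simA, double simOther, double weight) {
        if (weight < 0.0 || weight > 1.0) {
            throw new IllegalArgumentException("weight must lie in [0, 1]");
        }
        return (1.0 - weight) * simA + weight * simOther;
    }

    public static void main(String[] args) {
        double simA = jaccard(bigrams("similarity"), bigrams("similar"));
        // Assumed neighbour ids of two clusters in a record graph.
        Set<Integer> nbrI = new HashSet<>(Arrays.asList(1, 2, 3));
        Set<Integer> nbrJ = new HashSet<>(Arrays.asList(2, 3, 4));
        double simR = jaccard(nbrI, nbrJ);
        // The weight 0.2 is an arbitrary illustrative value.
        System.out.println(combined(simA, simR, 0.2));
    }
}
```

Keeping the weighted sum separate from the two similarity functions means the same `combined` method can serve both the collective and the connectivity variant, with only the second argument changing.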
For the connectivity similarity $sim_C$, I use the density-based similarity $Sim_{Density}$, defined as:

\[
Sim_{Density} = \frac{NumNode(c_i, c_j)}{dis(c_i, c_j)}
\]

where $dis(c_i, c_j)$ is the shortest distance between $c_i$ and $c_j$ in the record graph.

Figure 3.1: Density

Figure 3.1 illustrates the meaning of density-based similarity. It shows two graphs representing two real-world situations. In graph 1, node 1 and node 2 have one common neighbour, node 3. In graph 2, they have three common neighbours (node 3, node 4 and node 5). Clearly, the relationship between node 1 and node 2 is stronger in graph 2 than in graph 1.
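The two graphs described for Figure 3.1 can be worked through with a small sketch. It assumes that NumNode counts the common neighbours of the two nodes and that every edge has weight 1, so that the shortest distance is a hop count found by breadth-first search. Both are assumptions made for illustration, as are the class name `DensitySketch` and the convention of returning 0 for identical or unconnected nodes.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of density-based similarity between two nodes.
final class DensitySketch {

    // Builds an undirected adjacency map from {u, v} edge pairs.
    static Map<Integer, Set<Integer>> graph(int[][] edges) {
        Map<Integer, Set<Integer>> adj = new HashMap<>();
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
        return adj;
    }

    static Set<Integer> neighbours(Map<Integer, Set<Integer>> adj, int node) {
        return adj.getOrDefault(node, Collections.<Integer>emptySet());
    }

    // Breadth-first search on unit-weight edges; returns -1 if b is unreachable from a.
    static int shortestDistance(Map<Integer, Set<Integer>> adj, int a, int b) {
        if (a == b) {
            return 0;
        }
        Map<Integer, Integer> dist = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        dist.put(a, 0);
        queue.add(a);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int v : neighbours(adj, u)) {
                if (!dist.containsKey(v)) {
                    dist.put(v, dist.get(u) + 1);
                    if (v == b) {
                        return dist.get(v);
                    }
                    queue.add(v);
                }
            }
        }
        return -1;
    }

    // Assumption: NumNode is the number of common neighbours of a and b.
    static double density(Map<Integer, Set<Integer>> adj, int a, int b) {
        int dis = shortestDistance(adj, a, b);
        if (dis <= 0) {
            return 0.0; // assumed convention for identical or unconnected nodes
        }
        Set<Integer> common = new HashSet<>(neighbours(adj, a));
        common.retainAll(neighbours(adj, b));
        return (double) common.size() / dis;
    }

    public static void main(String[] args) {
        Map<Integer, Set<Integer>> graph1 = graph(new int[][] {{1, 3}, {3, 2}});
        Map<Integer, Set<Integer>> graph2 =
                graph(new int[][] {{1, 3}, {1, 4}, {1, 5}, {3, 2}, {4, 2}, {5, 2}});
        System.out.println(density(graph1, 1, 2));
        System.out.println(density(graph2, 1, 2));
    }
}
```

Under these assumptions the pair (node 1, node 2) scores higher in graph 2 than in graph 1, since the shortest distance is the same in both graphs but graph 2 has three common neighbours instead of one.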

Chapter 4 Experiments

All experiments were run on a MacBook Air running OS X, with a 1.3 GHz Intel Core i5 processor and 4 GB of 1600 MHz DDR3 memory. The code is written in Java, using the Eclipse IDE.

4.1 Data Set

I experiment with the Cora dataset. The Cora dataset has three tables: authors, publications and venues. I used only the authors table in my experiments. This table has 4571 records and five attributes, shown in Table 4.1. The attribute cluster is the ground truth, which provides the true clusters against which entity resolution techniques are evaluated. That is the reason I chose the authors table of the Cora dataset.

Attribute            Description
aid (primary key)    author id
pid (foreign key)    publication id
no                   authorship order
name                 author's name
cluster              cluster id (the ground truth)

Table 4.1: Attributes of the authors table

4.2 Measures

In this experiment, I used four counts to evaluate entity resolution results:

True Positive (TP): the number of matches that appear in both the identified clusters and the ground truth clusters.

False Positive (FP): the number of matches that are in the identified clusters but do not belong to the ground truth clusters.

True Negative (TN): the number of record pairs that are matched neither in the identified clusters nor in the ground truth clusters.

False Negative (FN): the number of matches that are in the ground truth clusters but not in the identified clusters.

After these counts are calculated, the following three measures are used to evaluate entity resolution.

Precision is defined as:

\[
Precision = \frac{TP}{TP + FP}
\]

A high precision value means that most of the matches returned by the approach are correct.

Recall is defined as:

\[
Recall = \frac{TP}{TP + FN}
\]

A high recall value means that the approach returned most of the true matches.

The F1 score is calculated from precision and recall and is their harmonic mean:

\[
F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}
\]

The F1 score ranges from 0 to 1, where 1 is the best score and 0 is the worst.

4.3 Experiments on Attribute-based ER

Attribute-based entity resolution [2] computes the similarity $sim_A(r_i, r_j)$ for each pair of records $r_i$, $r_j$ based on their attribute values, and only those pairs whose similarity is above some threshold are considered matches. I used Jaccard similarity with 2-grams as the string similarity method. The results are shown in Table 4.2. In general, as the threshold increases from 0.3 to 1, precision becomes higher and recall becomes lower. The highest F1 is obtained at a threshold of 0.4. Although precision mainly increases from 0.3 to 1, it reaches its highest point at a threshold of 0.9. Why is the highest precision not reached at a threshold of 1? I analysed this and found that many different persons have the same name, such as M. Ahlshkog.

Table 4.2: String similarity (columns: Threshold, TP, FP, FN, Precision, Recall, F1)

4.4 Shortest Distance in Record Graphs

The rest of the work is based on entity resolution over graphs. First, I generated a record graph $G(N_i, E_i)$. I used aid as the node id, so each node ($N_i$) in the graph stands for one record in the authors table. Then I used the co-author relationship to generate edges ($E_i$) between nodes. In the authors table, the value of the attribute pid provides this relationship. For example, if the records with aid = 1 and aid = 2 both have pid = 1, then they are co-authors of each other. This graph is called the initial graph.

Graphs are generally complex, so the distance between two nodes can be more than one. My emphasis is on the shortest distance between two nodes, which provides evidence of a relation between them. I used Dijkstra's algorithm to calculate the shortest distance (the code is in Appendix C). The shortest distances for the initial graph are presented in Table 4.3.

Table 4.3: Distance (rows: attribute-based similarity; columns: shortest distance, "nodis" for pairs with no distance, and "sum")

Table 4.3 shows how the number of pairs at each distance changes as the attribute-based similarity changes. First, each row has the same total number of pairs, which supports the correctness of this experiment. Second, consider the number of pairs at each distance and attribute-based similarity. A shortest distance of 0 means that the two records are in the same cluster. Therefore, when the attribute-based similarity is 0.9, the number of pairs with a shortest distance of 0 is the smallest, while the number of pairs with no distance is the largest. This is because an attribute-based similarity of 0.9 is the strongest restriction in this setting, and most nodes have no relationships with other nodes in the record graph.

Figure 4.1: Shortest Distance based Graph

Based on the initial graph, I apply attribute-based entity resolution, i.e. string similarity. Specifically, I merge clusters when the string similarity $sim_A \ge 0.9$, because precision is high at $sim_A = 0.9$. This ensures that the new graph (based on $sim_A \ge 0.9$) has high accuracy. However, recall is low at $sim_A = 0.9$, which is why the pairs of nodes need to be evaluated by collective entity resolution or connectivity entity resolution. After merging, I obtain an updated graph called the base-graph (the code is in the Appendix). Then I analyse the shortest distance between all pairs in the base-graph, because the shortest distance is a parameter of density-based similarity. In Figure 4.1, we can see that pairs with distance = 2 make up the majority.
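The measures of Section 4.2, which the following sections report, can be sketched as pairwise counts over cluster labels. In this sketch each record carries one identified cluster id and one ground truth cluster id, and every pair of records is classified as TP, FP, FN or TN. The class name `PairwiseMeasuresSketch`, the label arrays in `main`, and the choice to return 0 when a denominator is zero are assumptions for illustration.

```java
// Hypothetical sketch of the pairwise measures; names and sample labels are illustrative.
final class PairwiseMeasuresSketch {

    // Returns {TP, FP, FN, TN} over all record pairs, given one cluster id per record.
    static long[] countPairs(int[] identified, int[] truth) {
        long tp = 0, fp = 0, fn = 0, tn = 0;
        for (int i = 0; i < identified.length; i++) {
            for (int j = i + 1; j < identified.length; j++) {
                boolean matched = identified[i] == identified[j];
                boolean trueMatch = truth[i] == truth[j];
                if (matched && trueMatch) {
                    tp++;
                } else if (matched) {
                    fp++;
                } else if (trueMatch) {
                    fn++;
                } else {
                    tn++;
                }
            }
        }
        return new long[] {tp, fp, fn, tn};
    }

    // Assumed convention: a measure with a zero denominator is reported as 0.
    static double precision(long tp, long fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    static double recall(long tp, long fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    static double f1(double precision, double recall) {
        return precision + recall == 0.0 ? 0.0 : 2.0 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        // Four records: identified clusters (r1, r2, r3), (r4); made-up truth (r1), (r2, r3, r4).
        long[] c = countPairs(new int[] {0, 0, 0, 1}, new int[] {0, 1, 1, 1});
        double p = precision(c[0], c[1]);
        double r = recall(c[0], c[2]);
        System.out.println("P=" + p + " R=" + r + " F1=" + f1(p, r));
    }
}
```

Counting over pairs rather than over clusters matches the definitions above, where a match is a pair of records placed in the same cluster.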

4.5 Experiment on Collective Entity Resolution

Based on the base-graph, I evaluated the relational Jaccard similarity for all pairs of nodes with string similarity ranging from 0.4 to 0.9. Because this evaluation is time-consuming, and so that the results of each stage could be checked for problems, I ran it separately for each interval of string similarity: between 0.4 and 0.5, between 0.5 and 0.6, between 0.6 and 0.7, between 0.7 and 0.8, and between 0.8 and 0.9 (see Appendix A). The final results are summarised as follows.

Figure 4.2: Final Jaccard

According to Figure 4.2, F1 is highest when $sim_R = 0.01$. Recall the formula:

\[
sim(c_i, c_j) = (1 - \alpha) \, sim_A(c_i, c_j) + \alpha \, sim_R(c_i, c_j), \quad 0 \le \alpha \le 1
\]

Based on the previous results, I choose a weight of 0.8 for $sim_A$ (that is, $\alpha = 0.2$ in the formula above), which gives the best F1. The threshold parameter $x$ must also be tuned. Thus, I evaluate the rule $0.8 \, sim_A + 0.2 \, sim_R \ge x$; the results are shown in Figure 4.3. F1 is highest when $x = 0.36$.


Figure 4.3: Final Jaccard

4.6 Experiments on Connectivity Entity Resolution

Based on the base-graph, I evaluated the density-based similarity for all pairs of nodes with string similarity ranging from 0.4 to 0.9. Because this evaluation is time-consuming, and so that the results of each stage could be checked for problems, I ran it separately for each interval of string similarity: between 0.4 and 0.5, between 0.5 and 0.6, between 0.6 and 0.7, between 0.7 and 0.8, and between 0.8 and 0.9 (see Appendix B). The final results are summarised as follows.

Figure 4.4: Final density

In Figure 4.4, F1 reaches its highest point when density = 0.5. Recall the formula:

\[
sim(c_i, c_j) = (1 - \beta) \, sim_A(c_i, c_j) + \beta \, sim_C(c_i, c_j), \quad 0 \le \beta \le 1
\]

Based on the previous results, I choose a weight of 0.8 for $sim_A$ (that is, $\beta = 0.2$ in the formula above), which gives the best F1. The threshold parameter $x$ must also be tuned, so I evaluate the rule $0.8 \, sim_A + 0.2 \, sim_C \ge x$.

Figure 4.5: Final Density

The parameter $x$ is difficult to estimate, so I tried many values over different ranges. The results are shown in Figure 4.5. F1 is highest when $x = 6$.


4.7 Results Comparison

Based on the previous two sections, collective entity resolution achieves its highest F1 with the rule $0.8 \, sim_A + 0.2 \, sim_R \ge 0.36$, and connectivity entity resolution achieves its highest F1 with the rule $0.8 \, sim_A + 0.2 \, sim_C \ge 6$. The results are compared in Figure 4.6.

Figure 4.6: Comparison

In this project, the F1 of collective entity resolution is lower than the F1 of connectivity entity resolution. Therefore, connectivity entity resolution is effective on this data set.

Chapter 5 Conclusion

In this project, I investigated existing entity resolution techniques: attribute-based entity resolution techniques and collective entity resolution techniques. I then proposed the connectivity entity resolution technique and evaluated its effectiveness on the Cora dataset. In the experiments, connectivity entity resolution produced better results than collective entity resolution. However, the experiments raised two main challenges:

- How to construct a graph for graph-based entity resolution techniques, such as collective entity resolution. In this report, the base graph is generated using $sim_A \ge 0.9$.
- How to tune the parameters for different features in relational similarity or density-based similarity. In this report, I estimated the parameters by repeated trials, and chose the rule $0.8 \, sim_A + 0.2 \, sim_R \ge 0.36$ for collective entity resolution and $0.8 \, sim_A + 0.2 \, sim_C \ge 6$ for connectivity entity resolution.

Although the results are better, the approach was evaluated on only one data set. In future work, it should be evaluated on more data sets to verify its effectiveness, and its efficiency should also be evaluated.

Appendices

Appendix A Relational Jaccard Similarity Evaluation

Figure A.1: $sim_R$ with $0.8 \le sim_A < 0.9$
Figure A.2: $sim_R$ with $0.7 \le sim_A < 0.8$
Figure A.3: $sim_R$ with $0.6 \le sim_A < 0.7$
Figure A.4: $sim_R$ with $0.5 \le sim_A < 0.6$
Figure A.5: $sim_R$ with $0.4 \le sim_A < 0.5$

Appendix B Density-based Similarity Evaluation

Figure B.1: $sim_C$ with $0.8 \le sim_A < 0.9$
Figure B.2: $sim_C$ with $0.7 \le sim_A < 0.8$
Figure B.3: $sim_C$ with $0.6 \le sim_A < 0.7$
Figure B.4: $sim_C$ with $0.5 \le sim_A < 0.6$
Figure B.5: $sim_C$ with $0.4 \le sim_A < 0.5$

Table B.1: Different parameters in connectivity entity resolution results (columns: $0.8 \, sim_A + 0.2 \, sim_C \ge x$, Precision, Recall, F1)

Appendix C Codes

Shortest Distance

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;
    import java.util.Queue;

    // Dijkstra's algorithm over the record graph: returns the shortest distance from n1 to n2.
    // These methods belong to the graph class, which holds the node map `nodes`
    // and uses the project's Node and Edge classes.
    private int dijkstraSearch(Node n1, Node n2) {
        // Order the queue by tentative distance, smallest first.
        Comparator<Node> order = new Comparator<Node>() {
            public int compare(Node node1, Node node2) {
                return Integer.compare(node1.getTargetDistance(), node2.getTargetDistance());
            }
        };
        Queue<Node> queue = new PriorityQueue<Node>(nodes.values().size(), order);
        List<Node> nodeList = new ArrayList<Node>(nodes.values());
        for (Node n : nodeList) {
            // Use half of MAX_VALUE as "infinity" so that adding an edge weight cannot overflow.
            n.setTargetDistance(Integer.MAX_VALUE / 2);
        }
        n1.setTargetDistance(0);
        queue.addAll(nodes.values());
        while (!queue.isEmpty()) {
            Node u = queue.poll();
            for (Edge edge : u.getEdges()) {
                // Pick the endpoint of the edge that is not u.
                Node v = edge.getNode1().equals(u) ? edge.getNode2() : edge.getNode1();
                relax(u, v, edge.getDistance(), queue);
            }
        }
        return n2.getTargetDistance();
    }

    // Updates v's tentative distance if the path through u is shorter.
    private void relax(Node u, Node v, int distance, Queue<Node> queue) {
        if (v.getTargetDistance() > u.getTargetDistance() + distance) {
            v.setTargetDistance(u.getTargetDistance() + distance);
            // Re-insert v so that the queue reflects its new priority.
            queue.remove(v);
            queue.add(v);
        }
    }

Bibliography

[1] I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5.
[2] W. Cohen, P. Ravikumar, and S. Fienberg. A comparison of string metrics for matching names and records. In KDD Workshop on Data Cleaning and Object Consolidation, volume 3, pages 73-78.
[3] L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment, 5(12).
[4] E. Hartuv and R. Shamir. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(4).
[5] H. Köpcke and E. Rahm. Frameworks for entity matching: A comparison. Data & Knowledge Engineering, 69(2).
[6] H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 3(1-2).
[7] D. T. Meyer and W. J. Bolosky. A study of practical deduplication. ACM Transactions on Storage (TOS), 7(4):14.
[8] S. E. Schaeffer. Graph clustering. Computer Science Review, 1(1):27-64.
[9] W. E. Winkler. Overview of record linkage and current research directions. In Bureau of the Census. Citeseer.
