Entity Resolution over Graphs


Bingxin Li
Supervisor: Dr. Qing Wang
Australian National University
Semester 1, 2014

Acknowledgements

I would like to take this opportunity to thank my supervisor, Dr. Qing Wang, who has given me valuable support, encouragement and advice, without which this work would not have been completed. I am thankful to my good friends, Sid Xiao and Nicolas Cui, for their support. I would also like to thank my parents and Dongxia Bai for their valuable support and constant encouragement.

Abstract

Entity resolution (ER) is a long-standing challenge in many areas, such as data management and information retrieval. Its goal is to identify which records in one or more databases refer to the same real-world entity. Traditional entity resolution techniques are based on calculating the string similarity of attributes, such as the name and address of a person. However, records that look similar do not necessarily refer to the same entity, so traditional techniques have limitations when dealing with complex data sets. To take the relational similarity between records into account, collective entity resolution techniques have been proposed. This project investigates existing entity resolution techniques and proposes a new entity resolution technique over graphs, connectivity entity resolution, which can enrich collective entity resolution techniques. According to the experimental results, connectivity entity resolution achieves better results than collective entity resolution on the Cora data set.

Contents

1 Introduction
  1.1 Motivating example
  1.2 Objectives
  1.3 Contribution and organisation
2 Background of Entity Resolution
  2.1 General Entity Resolution Process
  2.2 Entity Resolution Process for this Project
3 Methodology
  3.1 Preliminaries
  3.2 Attribute-based Entity Resolution
    3.2.1 Levenshtein Similarity
    3.2.2 Cosine Similarity
    3.2.3 Jaccard Similarity
  3.3 Collective Entity Resolution
  3.4 Connectivity Entity Resolution
4 Experiments
  4.1 Data Set
  4.2 Measures
  4.3 Experiments on Attribute-based ER
  4.4 Shortest Distance in Record Graphs
  4.5 Experiment on Collective Entity Resolution
  4.6 Experiments on Connectivity Entity Resolution
  4.7 Results Comparison
5 Conclusion
Appendices
  A Relational Jaccard Similarity Evaluation
  B Density-based Similarity Evaluation
  C Codes

Chapter 1 Introduction

Entity resolution (ER) is the task of identifying which records in one or more databases refer to the same real-world entity. It is a well-known problem in many areas, such as artificial intelligence, information retrieval and database management [3]. Entity resolution is also known by other names, including record linkage [9], entity matching [5], and deduplication or duplicate detection [7].

Traditional entity resolution techniques calculate the string similarity of attribute values (e.g. the name and address of a person) and use the result to decide whether two records should be matched. For example, to identify whether David Aha and D. Aha are the same person, we can calculate the string similarity between "David Aha" and "D. Aha". These techniques are often useful, but they do not always yield high accuracy. One reason is that attribute values may be similar without referring to the same real-world entity.

Some recent studies of entity resolution take relational similarity into account [Kalashnikov et al. 2005]; this is called collective entity resolution. Collective entity resolution techniques consider not only the string similarity but also the relational similarity between records, yielding more accurate results at a higher computational cost. For example, with such a technique the records Jane Aha and Jane Lindsay are resolved using both string similarity and relational similarity. If Jane Aha and Jane Lindsay have the same parents, they are taken to be the same person even though the string similarity between the two names is not high. Collective entity resolution often uses the neighbours of nodes in a record graph as the relation. Although it is effective at addressing the entity resolution problem, its computational efficiency is often a concern.

1.1 Motivating example

Let us consider the problem of resolving the researchers in a database. Assume that there are four records and each record has three attributes (name, affiliation, email):

r1: The name is M. J., the organisation is ANU and the email is mj@gmail.com;
r2: The name is Michael Jordan, the organisation is ANU and the email is mj@anu.edu.au;
r3: The name is M. Jordan, the organisation is UC and the email is mj@uc.edu.au;
r4: The name is M. J., the organisation is UNSW and the email is mj@unsw.edu.au.

Figure 1.1: An example database with four records

The example is shown in Figure 1.1. Now imagine that we would like to find out, given these four records, which of these researchers refer to the same author entity.

(1) When using traditional entity resolution techniques, we can compare how similar their names and affiliations are. Because r1 and r2 have the same affiliation (ANU), and M. J., M. Jordan and Michael Jordan could be three forms of one name, we may determine that r1, r2 and r3 refer to the same entity. This gives us a set of clusters (r1, r2, r3), (r4), shown in Figure 1.2.

(2) When using collective entity resolution techniques, we would consider how these records are related and how similar their relationships are. Suppose that, on this basis, we conclude that r2 and r3 refer to the same entity, but r1 refers to a different one. This gives us a set of clusters (r1), (r2, r3), (r4), shown in Figure 1.3.

Figure 1.2: Clustering using traditional entity resolution

Figure 1.3: Clustering using collective entity resolution

(3) However, r4 may refer to the same entity as r2 and r3. Suppose that we have a record graph, shown in Figure 1.4, that presents the connectivity relationships of the records. According to these relationships, we may obtain the set of clusters (r2, r3, r4), (r1), shown in Figure 1.5.

This project focuses on exploring graph connectivity in entity resolution.

1.2 Objectives

This project investigates how collective entity resolution techniques can be enriched by graph properties such as the connectivity of records. The specific tasks are:

(1) To conduct a literature review on entity resolution techniques.
(2) To incorporate graph properties such as the connectivity of records into a framework of collective entity resolution.
(3) To evaluate the above approach using one or two real data sets.

Figure 1.4: Record graph

Figure 1.5: Clustering using connectivity entity resolution

1.3 Contribution and organisation

In this project, I propose a connectivity entity resolution approach based on a novel supervised relational clustering algorithm. The specific contributions of this report are as follows:

- I proposed a connectivity entity resolution approach and evaluated its effectiveness.
- I compared the effectiveness of collective entity resolution and connectivity entity resolution.
- I conducted experiments on a real-world dataset.

The rest of this report is organised as follows. In Chapter 2, I present related work on entity resolution. In Chapter 3, three methodologies for entity resolution are discussed. In Chapter 4, I present my results and discuss different similarity thresholds on a real-world dataset.

In Chapter 5, I conclude the report and discuss future work.

Chapter 2 Background of Entity Resolution

2.1 General Entity Resolution Process

A general entity resolution process [6] is shown in Figure 2.1. Given data from databases, the first step is data pre-processing, which ensures that the data are in the same format. The second step is indexing, whose purpose is to reduce the quadratic complexity of the entity resolution process. The third step is record pair comparison, applied to the candidate record pairs generated from the indexing data structures. The next step is classification, in which candidate record pairs are classified into matches, non-matches and potential matches. Record pairs classified as potential matches require a clerical review. Finally, the quality and completeness of the matched data are assessed in the evaluation step.

Figure 2.1: General entity resolution process [6]

2.2 Entity Resolution Process for this Project

Figure 2.2: Entity Resolution Process

In this project, I focus on clustering and matching records for entity resolution. The process is shown in Figure 2.2:

1. Retrieve records from one or more databases.
2. Construct an initial graph based on the records.
3. Decide matches based on attribute-based entity resolution.
4. Generate clusters or transform clusters:
   - Generate clusters using the threshold of attribute-based entity resolution.
   - Transform clusters using collective entity resolution or connectivity entity resolution, depending on the situation.

Chapter 3 Methodology

3.1 Preliminaries

The record graph is the foundation of all work in this project. Records from the databases are the nodes of the record graph, and the edges of the graph are specific relationships between records in the databases.

In general, a cluster is a group of records that share a specific relation, and the records of one cluster differ from the records of every other cluster with respect to that relation. In this project, a cluster is defined as the group of nodes referring to the same real-world entity. That is, in a record graph, a node refers to the same entity as the other nodes in its cluster and to a different entity from the nodes in other clusters.

A shortest path between two nodes in a graph is a path for which the sum of the edge weights is minimised, and the shortest distance is the total weight of that path. In this project, the shortest distance is used as a parameter in connectivity entity resolution.

3.2 Attribute-based Entity Resolution

Attribute-based entity resolution is a traditional technique. String similarity is a common measure of the similarity between two records, calculated by comparing the strings of their attribute values. It can be implemented with many different algorithms; this report discusses three common ones.

3.2.1 Levenshtein Similarity

Levenshtein similarity uses the Levenshtein distance to calculate the similarity between two strings, the source string $s_1$ and the target string $s_2$. The distance is the number of deletions, insertions, or substitutions required to transform $s_1$ into $s_2$. The greater the Levenshtein distance, the more different the strings are. The Levenshtein distance is one form of edit distance.

The edit distance is calculated using a dynamic programming algorithm that builds a matrix of the cost (number of edits) of getting from every prefix of $s_1$ to every prefix of $s_2$. A cell of this matrix is denoted by $d[i, j]$ for row $i$ ($0 \le i \le |s_1|$) and column $j$ ($0 \le j \le |s_2|$). The first row and column are filled in such that $d[i, 0] = i$ for $0 \le i \le |s_1|$ and $d[0, j] = j$ for $0 \le j \le |s_2|$. The remaining cells follow the formula:

\[
d[i, j] =
\begin{cases}
d[i-1, j-1] & \text{if } s_1[i] = s_2[j] \\
\min \begin{cases}
d[i-1, j] + 1 & \text{(a deletion)} \\
d[i, j-1] + 1 & \text{(an insertion)} \\
d[i-1, j-1] + 1 & \text{(a substitution)}
\end{cases} & \text{if } s_1[i] \ne s_2[j]
\end{cases}
\]

The Levenshtein similarity is calculated from the Levenshtein distance by the formula:

\[
Sim_{Levenshtein}(s_1, s_2) = 1 - \frac{Distance_{Levenshtein}(s_1, s_2)}{\max(|s_1|, |s_2|)}
\]

The higher the edit distance, the lower the similarity between the two strings.

3.2.2 Cosine Similarity

Cosine similarity is another measure of similarity. It is often used in information retrieval to calculate the similarity of two documents, but it is also useful for calculating string similarity. It measures the similarity between two vectors of an inner product space as the cosine of the angle between them. It uses the Euclidean dot product formula:

\[
a \cdot b = \|a\| \, \|b\| \cos\theta
\]

Given two vectors of attributes $A$ and $B$, the cosine similarity is:

\[
Sim_{Cosine} = \cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2} \, \sqrt{\sum_{i=1}^{n} (B_i)^2}}
\]

The result of $Sim_{Cosine}$ ranges from -1, meaning exactly opposite, to 1, meaning exactly the same; 0 indicates independence, and values in between indicate intermediate degrees of similarity.

3.2.3 Jaccard Similarity

Jaccard similarity (also called the Jaccard similarity coefficient) is a measure for comparing the similarity and diversity of sample sets. The Jaccard coefficient measures the similarity between finite sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

\[
J(A, B) = \frac{|A \cap B|}{|A \cup B|}
\]

(If $A$ and $B$ are both empty, $J(A, B) = 0$.) Clearly, $0 \le J(A, B) \le 1$.

For string similarity, I use 2-grams to transform two strings into two sets $A$ and $B$. For example, if the source string is "similarity" and the target string is "similar", they are transformed into set A: {si, im, mi, il, la, ar, ri, it, ty} and set B: {si, im, mi, il, la, ar}. Then $|A \cap B| = 6$ and $|A \cup B| = 9$, so

\[
Sim_{Jaccard} = \frac{|A \cap B|}{|A \cup B|} = 6/9 \approx 0.67
\]

In this project, string similarity is calculated using Jaccard similarity. In practice, the three measures give similar results for string similarity, but Jaccard similarity is more efficient to compute.

To some extent, attribute-based similarity is enough to solve entity resolution. However, when two different records have similar attribute strings, entity resolution cannot be solved using attribute-based similarity alone. In this situation, approaches to entity resolution over graphs, such as collective entity resolution, can be useful.

3.3 Collective Entity Resolution

Collective entity resolution is one approach to entity resolution over graphs. Its main idea is relational similarity. For example, if two records have the same name attribute, D. Aha, the neighbours of the two D. Aha nodes in a graph can provide further evidence. If the two D. Aha records have different parents, there is a high probability that they do not refer to the same entity. In collective entity resolution, resolutions are not made independently; instead, one resolution decision affects other resolutions via edges or relations.

In addition, entity resolution can be treated as a clustering problem whenever a similarity measure between pairs of records is given.
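The string measures of Section 3.2 that feed the attribute-based similarity can be sketched in Java as follows. This is a minimal sketch rather than the project's code: the class and method names are hypothetical, the empty-set convention follows the definition given in Section 3.2.3, and the value returned for two empty strings in levenshteinSimilarity is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; names are illustrative and not taken from the project code.
final class StringSimilarity {

    // Levenshtein distance using the dynamic-programming matrix d[i][j] of Section 3.2.1.
    static int levenshteinDistance(String s1, String s2) {
        int[][] d = new int[s1.length() + 1][s2.length() + 1];
        for (int i = 0; i <= s1.length(); i++) d[i][0] = i;   // first column
        for (int j = 0; j <= s2.length(); j++) d[0][j] = j;   // first row
        for (int i = 1; i <= s1.length(); i++) {
            for (int j = 1; j <= s2.length(); j++) {
                if (s1.charAt(i - 1) == s2.charAt(j - 1)) {
                    d[i][j] = d[i - 1][j - 1];
                } else {
                    int deletion = d[i - 1][j] + 1;
                    int insertion = d[i][j - 1] + 1;
                    int substitution = d[i - 1][j - 1] + 1;
                    d[i][j] = Math.min(deletion, Math.min(insertion, substitution));
                }
            }
        }
        return d[s1.length()][s2.length()];
    }

    // 1 - distance / max(|s1|, |s2|); returning 1.0 for two empty strings is an assumption.
    static double levenshteinSimilarity(String s1, String s2) {
        int longest = Math.max(s1.length(), s2.length());
        if (longest == 0) return 1.0;
        return 1.0 - (double) levenshteinDistance(s1, s2) / longest;
    }

    // Set of 2-grams (adjacent character pairs) of a string.
    static Set<String> bigrams(String s) {
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 2 <= s.length(); i++) grams.add(s.substring(i, i + 2));
        return grams;
    }

    // Jaccard coefficient |A n B| / |A u B| over the 2-gram sets of the two strings.
    static double bigramJaccard(String s1, String s2) {
        Set<String> a = bigrams(s1);
        Set<String> b = bigrams(s2);
        if (a.isEmpty() && b.isEmpty()) return 0.0; // convention stated in Section 3.2.3
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }
}
```

Calling StringSimilarity.bigramJaccard("similarity", "similar") reproduces the worked example of Section 3.2.3, with 6 shared 2-grams out of 9 in the union.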
The goal is to cluster the records so that only those that correspond to the same entity are assigned to the same cluster [1]. Attribute-based similarity was discussed in the previous section. For collective entity resolution, I combine the clustering approach with attribute-based similarity. We define the similarity of two clusters $c_i$ and $c_j$ as:

\[
sim(c_i, c_j) = (1 - \alpha) \, sim_A(c_i, c_j) + \alpha \, sim_R(c_i, c_j), \quad 0 \le \alpha \le 1
\]

Here, $sim_A()$ is the similarity of the attributes and $sim_R()$ is the relational similarity between the records in the two clusters, and $c_i$ and $c_j$ stand for different clusters. The purpose of $\alpha$ is to maintain high accuracy in different situations; its value depends on the data set, and I obtained it by analysis in the experiments.

For relational similarity, the relationships or neighbours of records in a record graph are regarded as important factors. In this project, I use the number of common neighbours to calculate the Jaccard similarity, which is the main measure for collective entity resolution. Here, Jaccard similarity takes the neighbourhoods of clusters as its parameters. For two different clusters $c_i$ and $c_j$, the Jaccard similarity is given by the following formula:

\[
Sim_{Jaccard}(c_i, c_j) = \frac{|Nbr(c_i) \cap Nbr(c_j)|}{|Nbr(c_i) \cup Nbr(c_j)|}
\]

$|Nbr(c_i) \cap Nbr(c_j)|$ is the number of common neighbours of cluster $c_i$ and cluster $c_j$, and $|Nbr(c_i) \cup Nbr(c_j)|$ is the number of all neighbours of clusters $c_i$ and $c_j$.

3.4 Connectivity Entity Resolution

In this project, the graph is the core concept. Clusters in graphs can also be defined through connectivity, by calculating the number of paths that exist between each pair of nodes [8]. For nodes to belong to the same cluster, they should be highly connected to each other [4]. Through the connectivity between nodes in a graph, we are able to obtain more relationships that provide evidence for clustering. The formula for connectivity entity resolution is:

\[
sim(c_i, c_j) = (1 - \beta) \, sim_A(c_i, c_j) + \beta \, sim_C(c_i, c_j), \quad 0 \le \beta \le 1
\]

where $sim_A()$ is the attribute-based similarity and $sim_C()$ is the connectivity similarity between two nodes in a record graph, and $c_i$ and $c_j$ are different clusters. The parameter $\beta$ plays a role similar to that of $\alpha$ in the previous section. I determine it in the experiments.
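The neighbourhood-based Jaccard similarity and the weighted combinations with $\alpha$ and $\beta$ can be sketched as follows. This is an illustrative sketch only: the class name, the method names, the representation of a neighbourhood as a set of integer node ids, and the value returned for two empty neighbourhoods are all assumptions, not details taken from the project code.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper class; names and data representation are assumptions.
final class ClusterSimilarity {

    // Jaccard coefficient over the neighbourhoods Nbr(c_i) and Nbr(c_j),
    // each given as a set of node ids.
    static double relationalJaccard(Set<Integer> nbrI, Set<Integer> nbrJ) {
        if (nbrI.isEmpty() && nbrJ.isEmpty()) return 0.0; // assumed convention
        Set<Integer> common = new HashSet<>(nbrI);
        common.retainAll(nbrJ);                  // Nbr(c_i) intersected with Nbr(c_j)
        Set<Integer> all = new HashSet<>(nbrI);
        all.addAll(nbrJ);                        // Nbr(c_i) united with Nbr(c_j)
        return (double) common.size() / all.size();
    }

    // (1 - weight) * simA + weight * simOther, where weight plays the role of
    // alpha (with simOther = sim_R) or beta (with simOther = sim_C).
    static double combined(double simA, double simOther, double weight) {
        if (weight < 0.0 || weight > 1.0) {
            throw new IllegalArgumentException("weight must lie in [0, 1]");
        }
        return (1.0 - weight) * simA + weight * simOther;
    }
}
```

For instance, two clusters with neighbourhoods {3, 4, 5} and {4, 5, 6} share two of four neighbours, so relationalJaccard gives 0.5. The same combined method serves both formulas because they differ only in which second similarity is supplied.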
For the connectivity similarity $sim_C$, I use the density-based similarity $Sim_{Density}$ as the measure. The formula for the density-based similarity is:

\[
Sim_{Density} = \frac{NumNode(c_i, c_j)}{dis(c_i, c_j)}
\]

Figure 3.1: Density

Figure 3.1 illustrates the meaning of density-based similarity. It contains two graphs representing two situations in the real world. In graph 1, node 1 and node 2 have one common neighbour, node 3. In graph 2, however, node 1 and node 2 have three common neighbours (node 3, node 4 and node 5). The relationship between node 1 and node 2 is therefore stronger in graph 2 than in graph 1.
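As a small worked example of the Figure 3.1 comparison, the two graphs can be scored under three assumptions that the text does not state explicitly: that the density-based similarity is the ratio of NumNode to dis, that NumNode counts the common neighbours of node 1 and node 2, and that node 1 and node 2 are two unit-weight hops apart in both graphs.

```latex
\[
Sim_{Density}^{(\text{graph 1})} = \frac{1}{2} = 0.5,
\qquad
Sim_{Density}^{(\text{graph 2})} = \frac{3}{2} = 1.5
\]
```

Under these assumptions, the pair in graph 2 scores three times higher than the pair in graph 1, which matches the qualitative reading that its relationship is stronger.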

Chapter 4 Experiments

In this project, I used a MacBook Air running OS X to execute all the experiments. Its processor is a 1.3 GHz Intel Core i5 and its memory is 4 GB of 1600 MHz DDR3. The code is written in Java, with Eclipse as the IDE.

4.1 Data Set

I experiment with the Cora dataset. The Cora dataset has three tables: authors, publications and venues. I only used the authors table in my experiments. This table has 4571 records and five attributes, shown in Table 4.1. The attribute cluster is the ground truth, which provides the true clusters against which entity resolution techniques can be evaluated. That is the reason I chose the authors table of the Cora dataset.

Attribute            Description
aid (primary key)    author id
id (foreign key)     publication id
no                   authorship order
name                 author's name
cluster              cluster id (the ground truth)

Table 4.1: Attributes of the authors table

4.2 Measures

In this experiment, I used four counts to evaluate entity resolution results.

True Positive (TP): the number of matches that appear in both the identified clusters and the ground-truth clusters.

False Positive (FP): the number of matches that are in the identified clusters but do not belong to the ground-truth clusters.

True Negative (TN): the number of matches that belong neither to the identified clusters nor to the ground-truth clusters.

False Negative (FN): the number of matches that are in the ground-truth clusters but not in the identified clusters.

After the above counts are calculated, we use the following three measures to evaluate entity resolution.

Precision is defined as:

\[
Precision = \frac{TP}{TP + FP}
\]

A high precision value means that most of the matches returned by the approach or algorithm are correct.

Recall is defined as:

\[
Recall = \frac{TP}{TP + FN}
\]

A high recall value means that the approach or algorithm returned most of the pairs that are true matches.

The F1 score is calculated from the precision and recall and is their harmonic mean. The formula is:

\[
F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}
\]

The F1 score ranges from 0 to 1, where 1 is the best score and 0 is the worst.

4.3 Experiments on Attribute-based ER

Attribute-based entity resolution [2] computes the similarity $sim_A(r_i, r_j)$ for each pair of records $r_i$, $r_j$ based on their attribute values, and only those pairs whose similarity is above some threshold are considered matches. I used Jaccard similarity with 2-grams as the string similarity method. The results are shown in Table 4.2.

Table 4.2: String similarity (columns: Threshold, TP, FP, FN, Precision, Recall, F1)

In general, when the threshold changes from 0.3 to 1, precision values become higher and recall values become lower. The highest F1 is reached at a threshold of 0.4. Although precision mainly increases from 0.3 to 1, it reaches its highest point at a threshold of 0.9. Why is the highest precision not at threshold 1? I analysed this problem and found that many different persons have the same name, such as M. Ahlshkog.

4.4 Shortest Distance in Record Graphs

All the remaining work is based on entity resolution over graphs. First, I generated a record graph $G(N_i, E_i)$. For the authors table, I used the aid as the node id, so each node ($N_i$) in the graph stands for one record in the authors table. Then I used the co-author relationship to generate edges ($E_i$) between nodes. In the authors table, the value of the attribute pid provides this relationship. For example, if record aid = 1 and record aid = 2 both have pid = 1, then records aid = 1 and aid = 2 are co-authors of each other. This graph is called the initial graph.

Graphs are usually complex, so the distance between two nodes can be more than one. My emphasis is on the shortest distance between two nodes, which provides evidence of one kind of relation between them. I used Dijkstra's algorithm to calculate the shortest distance (the code is in Appendix C). The shortest distance results for the initial graph are presented in Table 4.3.

Table 4.3: Distance (header: sim/dis, nodis, sum)

In Table 4.3, we can clearly see how the number of pairs at each distance changes as the attribute-based similarity changes. First, each row has the same total number of pairs, which serves as a check on the correctness of this experiment. Next, I analyse the number of pairs at different distances and attribute-based similarities. A shortest distance of 0 means that two records are in the same cluster. Therefore, when the attribute-based similarity is 0.9, the number of pairs with a shortest distance of 0 is the smallest, whereas the number of pairs with no distance is the largest. This is because an attribute-based similarity of 0.9 is the strongest restriction in this situation, and most nodes have no relationship with other nodes in the record graph.

Based on the initial graph, I use attribute-based entity resolution, i.e. string similarity, to merge clusters. Specifically, I merge clusters when the string similarity satisfies $sim_A \ge 0.9$, because precision is high at $sim_A = 0.9$. This ensures that the new graph (based on $sim_A \ge 0.9$) has high accuracy. However, recall is low at $sim_A = 0.9$, which is why the remaining pairs of nodes need to be evaluated by collective entity resolution or connectivity entity resolution. After this step, I obtain an updated graph called the base-graph (the code is in the Appendix).

Then I analyse the shortest distance between all pairs in the new graph (the base-graph), because the shortest distance is a parameter of the density-based similarity. In Figure 4.1, we can see that pairs with distance = 2 make up the majority.

Figure 4.1: Shortest Distance based Graph
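The construction of the initial graph from co-authorship, as described in this section, can be sketched as follows. This is a hedged illustration, not the project's implementation: the class and method names are hypothetical, each record is assumed to be an {aid, pid} pair, every edge is assumed to have weight 1, and a breadth-first search is swapped in for Dijkstra's algorithm because the two give the same shortest distance when all edges have unit weight.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; names and the {aid, pid} row layout are assumptions.
final class CoAuthorGraph {

    // Builds an undirected graph: one node per aid, an edge between two
    // records whenever they share the same pid (co-authors).
    static Map<Integer, Set<Integer>> build(int[][] rows) {
        Map<Integer, List<Integer>> authorsByPublication = new HashMap<>();
        Map<Integer, Set<Integer>> adjacency = new HashMap<>();
        for (int[] row : rows) {
            int aid = row[0];
            int pid = row[1];
            adjacency.computeIfAbsent(aid, k -> new HashSet<>());
            authorsByPublication.computeIfAbsent(pid, k -> new ArrayList<>()).add(aid);
        }
        for (List<Integer> authors : authorsByPublication.values()) {
            for (int a : authors) {
                for (int b : authors) {
                    if (a != b) adjacency.get(a).add(b);
                }
            }
        }
        return adjacency;
    }

    // Shortest distance in hops via breadth-first search; returns -1 when
    // there is no path between the two nodes.
    static int shortestDistance(Map<Integer, Set<Integer>> adjacency, int source, int target) {
        if (source == target) return 0;
        Map<Integer, Integer> distance = new HashMap<>();
        Deque<Integer> queue = new ArrayDeque<>();
        distance.put(source, 0);
        queue.add(source);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            for (int v : adjacency.getOrDefault(u, Collections.<Integer>emptySet())) {
                if (!distance.containsKey(v)) {
                    distance.put(v, distance.get(u) + 1);
                    if (v == target) return distance.get(v);
                    queue.add(v);
                }
            }
        }
        return -1;
    }
}
```

With the rows {1, 1}, {2, 1}, {3, 2}, the records aid = 1 and aid = 2 become neighbours, matching the example in the text, while aid = 3 has no path to either of them.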

4.5 Experiment on Collective Entity Resolution

Based on the base-graph, I evaluated the relational Jaccard similarity for all pairs of nodes with string similarity ranging from 0.4 to 0.9. Because this evaluation is time-consuming, and so that the results of each range could be checked for problems, I obtained the results in separate ranges of string similarity: between 0.4 and 0.5, between 0.5 and 0.6, between 0.6 and 0.7, between 0.7 and 0.8, and between 0.8 and 0.9 (see the appendices). The final results are summarised in Figure 4.2.

Figure 4.2: final jaccard

According to Figure 4.2, the F1 is highest when $sim_R = 0.01$. For the formula

\[
sim(c_i, c_j) = (1 - \alpha) \, sim_A(c_i, c_j) + \alpha \, sim_R(c_i, c_j), \quad 0 \le \alpha \le 1,
\]

and based on the previous results, I choose $\alpha = 0.8$, which gives the best F1. I also needed to try different values of the parameter x. Using the formula 0.8 sim_A sim_R x for the evaluation, the results are shown in Figure 4.3. When x = 0.36, F1 is highest.

Figure 4.3: Final Jaccard

4.6 Experiments on Connectivity Entity Resolution

Based on the base-graph, I evaluated the density-based similarity for all pairs of nodes with string similarity ranging from 0.4 to 0.9. Because this evaluation is time-consuming, and so that the results of each range could be checked for problems, I obtained the results in separate ranges of string similarity: between 0.4 and 0.5, between 0.5 and 0.6, between 0.6 and 0.7, between 0.7 and 0.8, and between 0.8 and 0.9 (see the appendices). The final results are summarised in Figure 4.4.

Figure 4.4: final density

In Figure 4.4, F1 reaches its highest point when density = 0.5. For the formula

\[
sim(c_i, c_j) = (1 - \beta) \, sim_A(c_i, c_j) + \beta \, sim_C(c_i, c_j), \quad 0 \le \beta \le 1,
\]

and based on the previous results, I choose $\beta = 0.8$, which gives the best F1. I also needed to try different values of the parameter x, using the formula 0.8 sim_A sim_C x for the evaluation.

Figure 4.5: Final Density

The parameter x is difficult to estimate, and I tried many values over different ranges. The results are shown in Figure 4.5. When x = 6, F1 is highest.

4.7 Results Comparison

Based on the previous two sections, collective entity resolution achieves its highest F1 using 0.8 sim_A sim_R 0.36, and connectivity entity resolution achieves its highest F1 using 0.8 sim_A sim_C 6. The results are compared in Figure 4.6.

Figure 4.6: Comparison

In this project, the F1 for collective entity resolution is lower than the F1 for connectivity entity resolution. Therefore, connectivity entity resolution is effective.
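For reference, the precision, recall and F1 values compared in this chapter follow the definitions of Section 4.2, which can be sketched as below. The class and method names are hypothetical, and returning 0 when a denominator is zero is an assumed convention that the text does not specify.

```java
// Hypothetical helper class; names are illustrative and not taken from the project code.
final class PairwiseMetrics {

    // Precision = TP / (TP + FP); defined here as 0 when no matches were identified.
    static double precision(long tp, long fp) {
        return tp + fp == 0 ? 0.0 : (double) tp / (tp + fp);
    }

    // Recall = TP / (TP + FN); defined here as 0 when the ground truth has no matches.
    static double recall(long tp, long fn) {
        return tp + fn == 0 ? 0.0 : (double) tp / (tp + fn);
    }

    // F1 = 2 * P * R / (P + R), the harmonic mean of precision and recall.
    static double f1(double precision, double recall) {
        double sum = precision + recall;
        return sum == 0.0 ? 0.0 : 2.0 * precision * recall / sum;
    }
}
```

With made-up counts of TP = 8, FP = 2 and FN = 8, precision is 0.8 and recall is 0.5; the F1 score then sits between the two and closer to the lower value, which is why a high threshold with high precision but low recall does not give the best F1.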

Chapter 5 Conclusion

In this project, I investigated existing entity resolution techniques: attribute-based entity resolution techniques and collective entity resolution techniques. Then I proposed the connectivity entity resolution technique and evaluated its effectiveness on the Cora dataset. In the experiments, the results of connectivity entity resolution are better than the results of collective entity resolution.

However, the experiments raised two main challenges:

- How to construct a graph for graph-based entity resolution techniques, such as collective entity resolution. In this report, the base graph is generated using $sim_A \ge 0.9$.
- How to tune the parameters for different features in relational similarity or density-based similarity. In this report, I estimated the parameters through repeated trials, and then chose the formula 0.8 sim_A sim_R 0.36 for collective entity resolution and 0.8 sim_A sim_C 6 for connectivity entity resolution.

Although the results are better, the approach was evaluated on only one data set. In future work, more data sets need to be evaluated to verify the effectiveness of connectivity entity resolution and to assess its efficiency.

Appendices

Appendix A Relational Jaccard Similarity Evaluation

Figure A.1: sim_R with 0.8 ≤ sim_A < 0.9
Figure A.2: sim_R with 0.7 ≤ sim_A < 0.8
Figure A.3: sim_R with 0.6 ≤ sim_A < 0.7
Figure A.4: sim_R with 0.5 ≤ sim_A < 0.6
Figure A.5: sim_R with 0.4 ≤ sim_A < 0.5

Appendix B Density-based Similarity Evaluation

Figure B.1: sim_C with 0.8 ≤ sim_A < 0.9
Figure B.2: sim_C with 0.7 ≤ sim_A < 0.8
Figure B.3: sim_C with 0.6 ≤ sim_A < 0.7
Figure B.4: sim_C with 0.5 ≤ sim_A < 0.6
Figure B.5: sim_C with 0.4 ≤ sim_A < 0.5

Table B.1: Different parameters in connectivity entity resolution results (columns: 0.8 sim_A sim_C x, Precision, Recall, F1)

Appendix C Codes

Shortest Distance

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.PriorityQueue;
    import java.util.Queue;

    // Returns the shortest distance from n1 to n2 using Dijkstra's algorithm.
    // "nodes" is the map of all nodes held by the enclosing class.
    private int dijkstraSearch(Node n1, Node n2) {
        // Order nodes by their tentative distance, smallest first.
        Comparator<Node> order = new Comparator<Node>() {
            public int compare(Node node1, Node node2) {
                return Integer.compare(node1.getTargetDistance(), node2.getTargetDistance());
            }
        };
        Queue<Node> queue = new PriorityQueue<Node>(nodes.values().size(), order);
        List<Node> nodeList = new ArrayList<Node>(nodes.values());
        // Initialise every node to a large distance; MAX_VALUE / 2 avoids overflow in relax().
        for (Node n : nodeList) {
            n.setTargetDistance(Integer.MAX_VALUE / 2);
        }
        n1.setTargetDistance(0);
        queue.addAll(nodes.values());
        while (!queue.isEmpty()) {
            Node u = queue.poll();
            for (Edge edge : u.getEdges()) {
                // Pick the endpoint of the edge that is not u.
                Node v = edge.getNode1().equals(u) ? edge.getNode2() : edge.getNode1();
                relax(u, v, edge.getDistance(), queue);
            }
        }
        return n2.getTargetDistance();
    }

    // Updates v's tentative distance if the path through u is shorter,
    // re-inserting v so that the queue reflects its new priority.
    private void relax(Node u, Node v, int distance, Queue<Node> queue) {
        if (v.getTargetDistance() > u.getTargetDistance() + distance) {
            v.setTargetDistance(u.getTargetDistance() + distance);
            queue.remove(v);
            queue.add(v);
        }
    }

Bibliography

[1] I. Bhattacharya and L. Getoor. Collective entity resolution in relational data. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):5,
[2] W. Cohen, P. Ravikumar, and S. Fienberg. A comparison of string metrics for matching names and records. In KDD Workshop on Data Cleaning and Object Consolidation, volume 3, pages 73-78,
[3] L. Getoor and A. Machanavajjhala. Entity resolution: theory, practice & open challenges. Proceedings of the VLDB Endowment, 5(12): ,
[4] E. Hartuv and R. Shamir. A clustering algorithm based on graph connectivity. Information Processing Letters, 76(4): ,
[5] H. Köpcke and E. Rahm. Frameworks for entity matching: A comparison. Data & Knowledge Engineering, 69(2): ,
[6] H. Köpcke, A. Thor, and E. Rahm. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment, 3(1-2): ,
[7] D. T. Meyer and W. J. Bolosky. A study of practical deduplication. ACM Transactions on Storage (TOS), 7(4):14,
[8] S. E. Schaeffer. Graph clustering. Computer Science Review, 1(1):27-64,
[9] W. E. Winkler. Overview of record linkage and current research directions. In Bureau of the Census. Citeseer,


An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction

An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction International Journal of Engineering Science Invention Volume 2 Issue 1 January. 2013 An Approach for Privacy Preserving in Association Rule Mining Using Data Restriction Janakiramaiah Bonam 1, Dr.RamaMohan

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

Data Clustering. Danushka Bollegala

Data Clustering. Danushka Bollegala Data Clustering Danushka Bollegala Outline Why cluster data? Clustering as unsupervised learning Clustering algorithms k-means, k-medoids agglomerative clustering Brown s clustering Spectral clustering

More information

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University

Shingling Minhashing Locality-Sensitive Hashing. Jeffrey D. Ullman Stanford University Shingling Minhashing Locality-Sensitive Hashing Jeffrey D. Ullman Stanford University 2 Wednesday, January 13 Computer Forum Career Fair 11am - 4pm Lawn between the Gates and Packard Buildings Policy for

More information

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering INF4820 Algorithms for AI and NLP Evaluating Classifiers Clustering Erik Velldal & Stephan Oepen Language Technology Group (LTG) September 23, 2015 Agenda Last week Supervised vs unsupervised learning.

More information

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X

International Journal of Scientific Research & Engineering Trends Volume 4, Issue 6, Nov-Dec-2018, ISSN (Online): X Analysis about Classification Techniques on Categorical Data in Data Mining Assistant Professor P. Meena Department of Computer Science Adhiyaman Arts and Science College for Women Uthangarai, Krishnagiri,

More information

Analyzing Dshield Logs Using Fully Automatic Cross-Associations

Analyzing Dshield Logs Using Fully Automatic Cross-Associations Analyzing Dshield Logs Using Fully Automatic Cross-Associations Anh Le 1 1 Donald Bren School of Information and Computer Sciences University of California, Irvine Irvine, CA, 92697, USA anh.le@uci.edu

More information

INTRODUCTION TO MACHINE LEARNING. Measuring model performance or error

INTRODUCTION TO MACHINE LEARNING. Measuring model performance or error INTRODUCTION TO MACHINE LEARNING Measuring model performance or error Is our model any good? Context of task Accuracy Computation time Interpretability 3 types of tasks Classification Regression Clustering

More information

Adaptive Temporal Entity Resolution on Dynamic Databases

Adaptive Temporal Entity Resolution on Dynamic Databases Adaptive Temporal Entity Resolution on Dynamic Databases Peter Christen 1 and Ross Gayler 2 1 Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National

More information

Distance-based Outlier Detection: Consolidation and Renewed Bearing

Distance-based Outlier Detection: Consolidation and Renewed Bearing Distance-based Outlier Detection: Consolidation and Renewed Bearing Gustavo. H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy September 15, 2010 Table of contents Introduction

More information

Document Clustering: Comparison of Similarity Measures

Document Clustering: Comparison of Similarity Measures Document Clustering: Comparison of Similarity Measures Shouvik Sachdeva Bhupendra Kastore Indian Institute of Technology, Kanpur CS365 Project, 2014 Outline 1 Introduction The Problem and the Motivation

More information

The Application of K-medoids and PAM to the Clustering of Rules

The Application of K-medoids and PAM to the Clustering of Rules The Application of K-medoids and PAM to the Clustering of Rules A. P. Reynolds, G. Richards, and V. J. Rayward-Smith School of Computing Sciences, University of East Anglia, Norwich Abstract. Earlier research

More information

Ontology-based Integration and Refinement of Evaluation-Committee Data from Heterogeneous Data Sources

Ontology-based Integration and Refinement of Evaluation-Committee Data from Heterogeneous Data Sources Indian Journal of Science and Technology, Vol 8(23), DOI: 10.17485/ijst/2015/v8i23/79342 September 2015 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Ontology-based Integration and Refinement of Evaluation-Committee

More information

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

Concept Tree Based Clustering Visualization with Shaded Similarity Matrices Syracuse University SURFACE School of Information Studies: Faculty Scholarship School of Information Studies (ischool) 12-2002 Concept Tree Based Clustering Visualization with Shaded Similarity Matrices

More information

code pattern analysis of object-oriented programming languages

code pattern analysis of object-oriented programming languages code pattern analysis of object-oriented programming languages by Xubo Miao A thesis submitted to the School of Computing in conformity with the requirements for the degree of Master of Science Queen s

More information

Understanding Clustering Supervising the unsupervised

Understanding Clustering Supervising the unsupervised Understanding Clustering Supervising the unsupervised Janu Verma IBM T.J. Watson Research Center, New York http://jverma.github.io/ jverma@us.ibm.com @januverma Clustering Grouping together similar data

More information

Privacy Preserving Probabilistic Record Linkage

Privacy Preserving Probabilistic Record Linkage Privacy Preserving Probabilistic Record Linkage Duncan Smith (Duncan.G.Smith@Manchester.ac.uk) Natalie Shlomo (Natalie.Shlomo@Manchester.ac.uk) Social Statistics, School of Social Sciences University of

More information

RLC RLC RLC. Merge ToolBox MTB. Getting Started. German. Record Linkage Software, Version RLC RLC RLC. German. German.

RLC RLC RLC. Merge ToolBox MTB. Getting Started. German. Record Linkage Software, Version RLC RLC RLC. German. German. German RLC German RLC German RLC Merge ToolBox MTB German RLC Record Linkage Software, Version 0.742 Getting Started German RLC German RLC 12 November 2012 Tobias Bachteler German Record Linkage Center

More information

Handling Missing Values via Decomposition of the Conditioned Set

Handling Missing Values via Decomposition of the Conditioned Set Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,

More information

Community detection algorithms survey and overlapping communities. Presented by Sai Ravi Kiran Mallampati

Community detection algorithms survey and overlapping communities. Presented by Sai Ravi Kiran Mallampati Community detection algorithms survey and overlapping communities Presented by Sai Ravi Kiran Mallampati (sairavi5@vt.edu) 1 Outline Various community detection algorithms: Intuition * Evaluation of the

More information

Feature Subset Selection using Clusters & Informed Search. Team 3

Feature Subset Selection using Clusters & Informed Search. Team 3 Feature Subset Selection using Clusters & Informed Search Team 3 THE PROBLEM [This text box to be deleted before presentation Here I will be discussing exactly what the prob Is (classification based on

More information

Record Linkage using Probabilistic Methods and Data Mining Techniques

Record Linkage using Probabilistic Methods and Data Mining Techniques Doi:10.5901/mjss.2017.v8n3p203 Abstract Record Linkage using Probabilistic Methods and Data Mining Techniques Ogerta Elezaj Faculty of Economy, University of Tirana Gloria Tuxhari Faculty of Economy, University

More information

Automation of data mapping using machine learning techniques

Automation of data mapping using machine learning techniques Data integration using machine learning Automation of data mapping using machine learning techniques Master of Science Thesis Complex Adaptive Systems MARCUS BIRGERSSON GUSTAV HANSSON Department of Computer

More information

Automatic Shadow Removal by Illuminance in HSV Color Space

Automatic Shadow Removal by Illuminance in HSV Color Space Computer Science and Information Technology 3(3): 70-75, 2015 DOI: 10.13189/csit.2015.030303 http://www.hrpub.org Automatic Shadow Removal by Illuminance in HSV Color Space Wenbo Huang 1, KyoungYeon Kim

More information

Clustering & Classification (chapter 15)

Clustering & Classification (chapter 15) Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical

More information

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri

Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Near Neighbor Search in High Dimensional Data (1) Dr. Anwar Alhenshiri Scene Completion Problem The Bare Data Approach High Dimensional Data Many real-world problems Web Search and Text Mining Billions

More information

Towards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search.

Towards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search. Towards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search. Dmitri V. Kalashnikov Rabia Nuray-Turan Sharad Mehrotra Dept of Computer Science University of California, Irvine

More information

Prelim 2 Solution. CS 2110, April 26, 2016, 7:30 PM

Prelim 2 Solution. CS 2110, April 26, 2016, 7:30 PM Prelim Solution CS 110, April 6, 016, 7:0 PM 1 5 Total Question True/False Complexity Heaps Trees Graphs Max 10 0 0 0 0 100 Score Grader The exam is closed book and closed notes. Do not begin until instructed.

More information

Weka ( )

Weka (  ) Weka ( http://www.cs.waikato.ac.nz/ml/weka/ ) The phases in which classifier s design can be divided are reflected in WEKA s Explorer structure: Data pre-processing (filtering) and representation Supervised

More information

Automatic training example selection for scalable unsupervised record linkage

Automatic training example selection for scalable unsupervised record linkage Automatic training example selection for scalable unsupervised record linkage Peter Christen Department of Computer Science, The Australian National University, Canberra, Australia Contact: peter.christen@anu.edu.au

More information

Leveraging Transitive Relations for Crowdsourced Joins*

Leveraging Transitive Relations for Crowdsourced Joins* Leveraging Transitive Relations for Crowdsourced Joins* Jiannan Wang #, Guoliang Li #, Tim Kraska, Michael J. Franklin, Jianhua Feng # # Department of Computer Science, Tsinghua University, Brown University,

More information

PROGRAM EFFICIENCY & COMPLEXITY ANALYSIS

PROGRAM EFFICIENCY & COMPLEXITY ANALYSIS Lecture 03-04 PROGRAM EFFICIENCY & COMPLEXITY ANALYSIS By: Dr. Zahoor Jan 1 ALGORITHM DEFINITION A finite set of statements that guarantees an optimal solution in finite interval of time 2 GOOD ALGORITHMS?

More information

3-D MRI Brain Scan Classification Using A Point Series Based Representation

3-D MRI Brain Scan Classification Using A Point Series Based Representation 3-D MRI Brain Scan Classification Using A Point Series Based Representation Akadej Udomchaiporn 1, Frans Coenen 1, Marta García-Fiñana 2, and Vanessa Sluming 3 1 Department of Computer Science, University

More information

SimEval - A Tool for Evaluating the Quality of Similarity Functions

SimEval - A Tool for Evaluating the Quality of Similarity Functions SimEval - A Tool for Evaluating the Quality of Similarity Functions Carlos A. Heuser Francisco N. A. Krieser Viviane Moreira Orengo UFRGS - Instituto de Informtica Caixa Postal 15.064-91501-970 - Porto

More information

Prelim 2 Solution. CS 2110, April 26, 2016, 5:30 PM

Prelim 2 Solution. CS 2110, April 26, 2016, 5:30 PM Prelim Solution CS 110, April 6, 016, 5:0 PM 1 5 Total Question True/False Complexity Heaps Trees Graphs Max 10 0 0 0 0 100 Score Grader The exam is closed book and closed notes. Do not begin until instructed.

More information

Towards Incremental Grounding in Tuffy

Towards Incremental Grounding in Tuffy Towards Incremental Grounding in Tuffy Wentao Wu, Junming Sui, Ye Liu University of Wisconsin-Madison ABSTRACT Markov Logic Networks (MLN) have become a powerful framework in logical and statistical modeling.

More information

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Neelam Singh neelamjain.jain@gmail.com Neha Garg nehagarg.february@gmail.com Janmejay Pant geujay2010@gmail.com

More information

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016

CPSC 340: Machine Learning and Data Mining. Non-Parametric Models Fall 2016 CPSC 340: Machine Learning and Data Mining Non-Parametric Models Fall 2016 Admin Course add/drop deadline tomorrow. Assignment 1 is due Friday. Setup your CS undergrad account ASAP to use Handin: https://www.cs.ubc.ca/getacct

More information

Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data

Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data Analysis of Dendrogram Tree for Identifying and Visualizing Trends in Multi-attribute Transactional Data D.Radha Rani 1, A.Vini Bharati 2, P.Lakshmi Durga Madhuri 3, M.Phaneendra Babu 4, A.Sravani 5 Department

More information

VisoLink: A User-Centric Social Relationship Mining

VisoLink: A User-Centric Social Relationship Mining VisoLink: A User-Centric Social Relationship Mining Lisa Fan and Botang Li Department of Computer Science, University of Regina Regina, Saskatchewan S4S 0A2 Canada {fan, li269}@cs.uregina.ca Abstract.

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

A note on using the F-measure for evaluating data linkage algorithms

A note on using the F-measure for evaluating data linkage algorithms Noname manuscript No. (will be inserted by the editor) A note on using the for evaluating data linkage algorithms David Hand Peter Christen Received: date / Accepted: date Abstract Record linkage is the

More information

Change Detection in Remotely Sensed Images Based on Image Fusion and Fuzzy Clustering

Change Detection in Remotely Sensed Images Based on Image Fusion and Fuzzy Clustering International Journal of Electronics Engineering Research. ISSN 0975-6450 Volume 9, Number 1 (2017) pp. 141-150 Research India Publications http://www.ripublication.com Change Detection in Remotely Sensed

More information

Results of NBJLM for OAEI 2010

Results of NBJLM for OAEI 2010 Results of NBJLM for OAEI 2010 Song Wang 1,2, Gang Wang 1 and Xiaoguang Liu 1 1 College of Information Technical Science, Nankai University Nankai-Baidu Joint Lab, Weijin Road 94, Tianjin, China 2 Military

More information
