Entity Resolution, Clustering Author References

Similar documents
Collective Entity Resolution in Relational Data

Craig Knoblock University of Southern California. These slides are based in part on slides from Sheila Tejada and Misha Bilenko

Behavioral Data Mining. Lecture 18 Clustering

CS 2750 Machine Learning. Lecture 19. Clustering. CS 2750 Machine Learning. Clustering. Groups together similar instances in the data sample

Data Linkage Methods: Overview of Computer Science Research

CS 1675 Introduction to Machine Learning Lecture 18. Clustering. Clustering. Groups together similar instances in the data sample

Machine Learning (BSMC-GA 4439) Wenke Liu

Large Scale 3D Reconstruction by Structure from Motion

Statistics 202: Data Mining. c Jonathan Taylor. Week 8 Based in part on slides from textbook, slides of Susan Holmes. December 2, / 1

Privacy Preserving Probabilistic Record Linkage

Clustering. Robert M. Haralick. Computer Science, Graduate Center City University of New York

Extending Naive Bayes Classifier with Hierarchy Feature Level Information for Record Linkage. Yun Zhou. AMBN-2015, Yokohama, Japan 16/11/2015

k-means Clustering Todd W. Neller Gettysburg College Laura E. Brown Michigan Technological University

Introduction Entity Match Service. Step-by-Step Description

Clustering. Chapter 10 in Introduction to statistical learning

Object Identification with Attribute-Mediated Dependences

Identity Uncertainty and Citation Matching

Query-Time Entity Resolution

Machine Learning (BSMC-GA 4439) Wenke Liu

TABLE OF CONTENTS PAGE TITLE NO.

STATS306B STATS306B. Clustering. Jonathan Taylor Department of Statistics Stanford University. June 3, 2010

Cluster Analysis. Ying Shen, SSE, Tongji University

Semantic text features from small world graphs

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park

selection of similarity functions for

Developing Focused Crawlers for Genre Specific Search Engines

Rule-Based Method for Entity Resolution Using Optimized Root Discovery (ORD)

Overview of Record Linkage Techniques

CLUSTERING. JELENA JOVANOVIĆ Web:

Clustering. CS294 Practical Machine Learning Junming Yin 10/09/06

A Fast Linkage Detection Scheme for Multi-Source Information Integration

Query-time Entity Resolution

Foundations of Machine Learning CentraleSupélec Fall Clustering Chloé-Agathe Azencot

Text Categorization (I)

K-means clustering Based in part on slides from textbook, slides of Susan Holmes. December 2, Statistics 202: Data Mining.

Mean Field and Variational Methods finishing off

Introduction to Data Mining

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

vector space retrieval many slides courtesy James Amherst

A Comparison of Document Clustering Techniques

SOCIAL MEDIA MINING. Data Mining Essentials

Clustering. Supervised vs. Unsupervised Learning

Learning to Match and Cluster Large High-Dimensional Data Sets For Data Integration

TELCOM2125: Network Science and Analysis

Supervised vs. Unsupervised Learning

Plagiarism Detection Using FP-Growth Algorithm

Duplicate Record Detection: A Survey

Network Traffic Measurements and Analysis

Inferring Ongoing Activities of Workstation Users by Clustering

RELAIS: A Record Linkage Toolkit Training course on record linkage Monica Scannapieco Istat

Knowledge Graph Completion. Mayank Kejriwal (USC/ISI)

Information Integration of Partially Labeled Data

Knowledge Discovery and Data Mining 1 (VO) ( )

Understanding Clustering Supervising the unsupervised

Feature Selection for fmri Classification

Cluster analysis formalism, algorithms. Department of Cybernetics, Czech Technical University in Prague.

Package lga. R topics documented: February 20, 2015

Efficient Strategies for Improving Partitioning-Based Author Coreference by Incorporating Web Pages as Graph Nodes

Mean Field and Variational Methods finishing off

Machine Learning using MapReduce

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Unsupervised Duplicate Detection (UDD) Of Query Results from Multiple Web Databases. A Thesis Presented to

K-Means Clustering 3/3/17

A Survey on Removal of Duplicate Records in Database

Clustering (COSC 416) Nazli Goharian. Document Clustering.

A Combination Approach to Cluster Validation Based on Statistical Quantiles

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

Unsupervised Learning: Clustering

Olmo S. Zavala Romero. Clustering Hierarchical Distance Group Dist. K-means. Center of Atmospheric Sciences, UNAM.

Entity Linking. David Soares Batista. November 11, Disciplina de Recuperação de Informação, Instituto Superior Técnico

Cluster Analysis. Mu-Chun Su. Department of Computer Science and Information Engineering National Central University 2003/3/11 1

ECE521 Lecture 18 Graphical Models Hidden Markov Models

Record Linkage. BYU ScholarsArchive. Brigham Young University. Stasha Ann Bown Larsen Brigham Young University - Provo. All Theses and Dissertations

Criterion Functions for Document Clustering Experiments and Analysis

Data Linkage Techniques: Past, Present and Future

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme

Search Engines. Information Retrieval in Practice

Chapter 2. Related Work

Python With Data Science

Uninformed Search Methods. Informed Search Methods. Midterm Exam 3/13/18. Thursday, March 15, 7:30 9:30 p.m. room 125 Ag Hall

INF4820 Algorithms for AI and NLP. Evaluating Classifiers Clustering

Computational Statistics The basics of maximum likelihood estimation, Bayesian estimation, object recognitions

Unsupervised Learning

Hierarchical Clustering Lecture 9

Cluster Analysis for Microarray Data

CS371R: Final Exam Dec. 18, 2017

Stephen Scott.

DATA POSITION AND PROFILING IN DOMAIN-INDEPENDENT WAREHOUSE CLEANING

AUTHOR NAME DISAMBIGUATION AND CROSS SOURCE DOCUMENT COREFERENCE

Outline. Eg. 1: DBLP. Motivation. Eg. 2: ACM DL Portal. Eg. 2: DBLP. Digital Libraries (DL) often have many errors that negatively affect:

CPSC 340: Machine Learning and Data Mining. Probabilistic Classification Fall 2017

Cluster Analysis. Prof. Thomas B. Fomby Department of Economics Southern Methodist University Dallas, TX April 2008 April 2010

Mining the Heterogeneous Transformations between Data Sources to Aid Record Linkage

Constraint-Based Entity Matching

A Joint Model for Discovering and Linking Entities

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

Unsupervised Learning

CS490W. Text Clustering. Luo Si. Department of Computer Science Purdue University

CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows:

CS 231A Computer Vision (Fall 2012) Problem Set 3

Transcription:

, Clustering Author References Vlad Shchogolev vs299@columbia.edu May 1, 2007

Outline 1 What is? Motivation 2 Formal Definition Efficieny Considerations Measuring Text Similarity Other approaches 3 Clustering author references Learning distance function Clustering experiment

Set-Up What is? Motivation Given: a table of records, each record has a number of fields. Problem: decide which pairs of records refer to the same entity. Variation: given two tables, pair up matching records

What is? Motivation A new name for an old research area Record linkage Originally studied by Dunn, 1946 Formalized by Fellegi and Sunter, 1969 Merge/purge problem Data matching, object identity problem Coreference resolution, reference reconciliation, etc.

Why is this useful? What is? Motivation Historical research Medical research Government record-keeping Tracking terrorists Wal-Mart Yahoo Local, Google Local Bibliographic citations

Formal Definition Formal Definition Efficiency Considerations Measuring Text Similarity Other approaches Two datasets A and B, and let (a, b) A B. Also define M as the set of matched pairs (a = b) and U as the set of unmatched pairs. For a given pair, there is a list of comparison values for (c 1,... c n ) where c k is the comparison value for the k th field of the item.

Formal Definition Efficiency Considerations Measuring Text Similarity Other approaches Probabilistic Model, Fellegi and Sunter Define quantities m k = P{c k = 0 (a, b) M} and u k = P{c k = 0 (a, b) U} Assume independence assumption (!) Weight assigned to each component is: { log(m k /u k ) if c k = 0, w k = log(1 m k )/(1 u k )) if c k = 1. Equivalent to Naive Bayes

Canopies Formal Definition Efficiency Considerations Measuring Text Similarity Other approaches Need methods to find candidate pairs McCallum, Nigam, Ungar introduced canopies approach: clustering is performed in two stages, starting with a rough stage that divides data into overlapping subsets If clustering measures distance to cluster using cluster centroid, and canopy is larger than true cluster, nothing is lost.

Formal Definition Efficiency Considerations Measuring Text Similarity Other approaches Clustering bibliographic references Goal was to compute the citation graph for research papers First pass: used a fast TF-IDF approach Second pass: used an expensive string edit distance computation, combined with a HMM for field extraction Results in equally good accuracy, but orders of magnitude faster

Measuring Text Similarity Formal Definition Efficiency Considerations Measuring Text Similarity Other approaches Several methods exist to construct a similarity measure for text: edit distance (customizable costs) Jaro s algorithm (transpositions) character N-grams TF-IDF string kernels term-vector dot product soundex Research typically finds that no single method is best

Refinements Formal Definition Efficiency Considerations Measuring Text Similarity Other approaches Minton, Nanjo et al. introduced transformation graphs to handle higher level concepts: synonyms misspelling abbreviation acronym concatenation Again, a Naive Bayes approach is used to learn weights for transformations.

Domain-independent approach Formal Definition Efficiency Considerations Measuring Text Similarity Other approaches Monge and Elkan suggest a domain-independent approach Each record is treat as a single long string Similarity is measured using edit distance

Clustering author references Clustering authors Learning distance function Clustering experiment A related problem to bibliographic references Experiment run on a large biology research corpus Goal is to determine when two matching names (e.g. Smith J.) refer to the same person Extra structure: social network consisting of co-authorship edges

Learning distance function Clustering authors Learning distance function Clustering experiment Similar to distance metric learning paper, but not in Euclidean space Not domain-independent Domain knowledge used to pick from one of 3 comparison functions for each field: equality set intersection character N-gram similarity Most important feature: number of common co-authors

Clustering experiment Clustering authors Learning distance function Clustering experiment Clustering name references can be useful for judging co-authorship importance Tried experiments with simple Greedy Agglomerative Clustering. Measured within-cluster dispersion at each clustering step: Data is clustered into k clusters C 1,... C k Sum of pairwise distances: D r = i,j C r d i,j Dispersion measure: W k = k r=1 1 2n r D r

Gap Statistic Clustering authors Learning distance function Clustering experiment Can we estimate the true number of clusters? Define Gap Statistic (Tibshirani et al. 2000) Gap n (k) = En(log(W k )) log(w k ) Expectation is over a sample from reference distribution The quantity log(w k ) can be thought of as log-likelihood

Reference distribution Clustering authors Learning distance function Clustering experiment Gap n (k) = E n(log(w k )) log(w k ) For uniform distribution, expectation should decrease at the rate (2/p) log k. Better: a uniform distribution over a box align with the principal components. Goal is to produce evidence against the null model (single cluster)

Estimation Procedure Clustering authors Learning distance function Clustering experiment Sample B Monte Carlo reference datasets from reference distribution Cluster each sampled dataset Estimate the gap: Gap(k) = (1/B) b log(w kb ) log(w k) Pick smallest k such that Gap(k) > Gap(k + 1) s k+1

Estimation Procedure Clustering authors Learning distance function Clustering experiment Estimation performs poorly; reference distribution not appropriate? Euclidean distance metric not appropriate for high dimensions However: have evidence that W k graph has some signal

Clustering authors Learning distance function Clustering experiment Two within-cluster dispersion graphs

Rethink cluster estimation for this class of problems Distance metric learning works well, but... Should take advantage of additional structure