AUTOMATICALLY GENERATING DATA LINKAGES USING A DOMAIN-INDEPENDENT CANDIDATE SELECTION APPROACH


Dezhao Song and Jeff Heflin
SWAT Lab, Department of Computer Science and Engineering, Lehigh University

11/10/2011

Slide 2: Outline
- Introduction
- Related Work
- Algorithms
- Evaluation
- Conclusion and Future Work

Slide 3: Introduction
- What is data linkage generation?
  - Finding instances that represent the same real-world entity
- What is an instance?
  - Person, publication, geographical location
  - Name mentions: "Dezhao Song is a Ph.D. student at Lehigh University. He is in his fourth year at LU. Check Dezhao's website for more details."
  - Database records:

      Given name  Last name  Address             Zip    Email       Affiliation
      Dezhao      Song       Packard Lab         18015  des308      LU
      D.          Song       19 Memorial Dr. W.  18015  songdezhao  Lehigh

  - Ontology instances:
      http://www.informatik.uni-trier.de/~ley/db/indices/a-tree/s/song:dezhao.html
      http://data.semanticweb.org/person/dezhao-song/html
- Entity coreference: owl:sameAs

Slide 4: The General/Naive Approach
[Figure: instances A1, A2, A3, ..., Ai of dataset A and B1, B2, B3, ..., Bi of dataset B pass through Blocking/Candidate Selection. Example records:]

    Given name  Last name  Address             Zip    Email       Affiliation
    Dezhao      Song       Packard Lab         18015  des308      LU
    Dezhao      S.         19 Memorial Dr. W.  18015  songdezhao  Lehigh

- Comparing all instance pairs between datasets A and B
  - Could be expensive on large datasets
- Solution: reduce the search space by filtering out unlikely coreferent pairs
  - Choose the context used for filtering domain-independently
  - Maximally reduce pairs that are not coreferent
  - Retain as many true matches as possible
  - The filtering itself should be efficient
  - The entire entity coreference process should scale well after applying filtering, while maintaining good precision and recall

Slide 5: Related Work
Candidate selection: selecting a set of instance pairs
- Ed-Join. Xiao et al. (VLDB 2008)
- All-Pairs. Bayardo et al. (WWW 2007)
- ASN. Yan et al. (JCDL 2007)
- BSL. Michelson and Knoblock. (AAAI 2006, JAIR 2008)
- Best five. W. E. Winkler. (Tech Report, U.S. Census Bureau, 2005)
- Adaptive filtering. Gu and Baxter. (SIAM 2004)
- Marlin. Bilenko and Mooney. (SIGKDD 2003)

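Slide 6: System Framework
- Components: Blocking Properties Learner, Instances Indexer, Candidate Pair Selector
- Flow: Data -> Blocking Properties Learner -> Blocking Properties -> Instances Indexer -> Inverted index -> Candidate Pair Selector -> Candidate pairs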

Slide 7: Learning Blocking Properties from RDF Graphs
- Example properties: given name, last name
- Criteria for choosing candidate selection properties:
  - Sufficiently discriminating, to reduce non-coreferent pairs:
      Discriminability(p) = |distinct object values of a property| / |triples of a property|
  - Covering a majority of the instances, to increase coverage on true matches:
      Coverage(p) = |instances that have a value for a property| / |instances|
  - F_L = 2*Discriminability*Coverage/(Discriminability+Coverage)
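The two ratios and their harmonic mean F_L can be illustrated with a small sketch. This is not the authors' implementation: the function name `property_scores`, the (subject, property, object) triple layout and the toy data are assumptions made only for illustration.

```python
from collections import defaultdict

def property_scores(triples, instances):
    """Return {property: (discriminability, coverage, f_l)}.

    triples: iterable of (subject, property, object) tuples (assumed layout).
    instances: set of all instance identifiers in the dataset.
    """
    values = defaultdict(set)    # property -> distinct object values
    counts = defaultdict(int)    # property -> number of triples
    subjects = defaultdict(set)  # property -> instances that have a value
    for s, p, o in triples:
        values[p].add(o)
        counts[p] += 1
        subjects[p].add(s)

    scores = {}
    for p in counts:
        disc = len(values[p]) / counts[p]
        cov = len(subjects[p]) / len(instances)
        f_l = 2 * disc * cov / (disc + cov) if disc + cov else 0.0
        scores[p] = (disc, cov, f_l)
    return scores

# Toy data (hypothetical): four instances, two properties.
triples = [
    ("i1", "last_name", "Song"), ("i2", "last_name", "Song"),
    ("i3", "last_name", "Heflin"), ("i4", "last_name", "Smith"),
    ("i1", "zip", "18015"), ("i2", "zip", "18015"),
]
instances = {"i1", "i2", "i3", "i4"}
scores = property_scores(triples, instances)
print(scores["last_name"])  # discriminability 0.75, coverage 1.0
print(scores["zip"])        # discriminability 0.5, coverage 0.5
```

In this toy data, last_name scores higher than zip because it both separates instances better and is present on every instance.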

Slide 8: Combining Properties
- A single property may not be sufficient for candidate selection. E.g., last name + given name could do better than last name alone.
- Procedure:
  1. Compute F_L(P) for all properties.
  2. If Max(F_L(P)) > α OR |Properties| = 1, return ARGMAX_P F_L(P).
  3. Otherwise, let A = ARGMAX_P F_L(P); combine A with every other property; concatenate property values; update the property set; repeat from step 1.
- A set of tuples (instance, property, value) is formed.
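The loop on this slide can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: records are assumed to be single-valued dictionaries, a combined property is assumed to have a value whenever at least one component has one, and the names `f_l` and `learn_key` are hypothetical.

```python
def f_l(records, props):
    """F_L of a (possibly combined) property, given as a tuple of property names."""
    # Concatenate the component values of each instance (assumption: an instance
    # counts as covered if it has a value for at least one component).
    values = []
    for rec in records.values():
        parts = [rec[p] for p in props if p in rec]
        if parts:
            values.append(" ".join(parts))
    if not values:
        return 0.0
    disc = len(set(values)) / len(values)  # distinct values / values
    cov = len(values) / len(records)       # covered instances / instances
    return 2 * disc * cov / (disc + cov)

def learn_key(records, alpha):
    """Greedily combine properties until F_L exceeds alpha or one candidate is left."""
    candidates = sorted({(p,) for rec in records.values() for p in rec})
    while True:
        best = max(candidates, key=lambda c: f_l(records, c))
        if f_l(records, best) > alpha or len(candidates) == 1:
            return best
        # Combine the best property with every other one, dropping repeated names.
        candidates = [tuple(dict.fromkeys(best + c)) for c in candidates if c != best]

# Toy records (hypothetical).
records = {
    "i1": {"given": "Dezhao", "last": "Song"},
    "i2": {"given": "Jeff", "last": "Song"},
    "i3": {"given": "Jeff", "last": "Heflin"},
    "i4": {"given": "John", "last": "Smith"},
}
print(learn_key(records, alpha=0.9))  # neither property alone exceeds 0.9
```

Each round shrinks the candidate set by one, so the loop always terminates, at the latest when a single combined property remains.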


Slide 10: Indexing Instances
- Comparing all pairs of instances on the selected attributes can be expensive on datasets with millions of instances.
- Need an efficient look-up that only compares instances that have at least one token in common.
- Inverted index: issue disjunctive queries with tokenized attribute values.
- Example:

    RDF:
      #i  full name  "John Smith"
      #j  full name  "James Smith"

    Index (full name):
      John   -> #i
      James  -> #j
      Smith  -> #i, #j

    ... Pred n
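A token-level inverted index with disjunctive look-up might look like the sketch below. It is an in-memory stand-in rather than the index used in the system; whitespace tokenization, lowercasing, and the names `build_index` and `lookup` are assumptions.

```python
from collections import defaultdict

def build_index(tuples):
    """Map (property, token) -> set of instances whose value contains the token."""
    index = defaultdict(set)
    for instance, prop, value in tuples:
        for token in value.lower().split():  # assumed tokenization
            index[(prop, token)].add(instance)
    return index

def lookup(index, prop, value):
    """Disjunctive query: instances sharing at least one token with `value`."""
    hits = set()
    for token in value.lower().split():
        hits |= index.get((prop, token), set())
    return hits

# The example from the slide.
tuples = [
    ("#i", "full name", "John Smith"),
    ("#j", "full name", "James Smith"),
]
index = build_index(tuples)
print(lookup(index, "full name", "John Doe"))  # only #i shares a token
print(lookup(index, "full name", "J. Smith"))  # both #i and #j share "smith"
```

Only instances returned by a look-up are compared further, so pairs with no token in common are never examined.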


Slide 12: Selecting Candidate Instance Pairs
- For each tuple, the indexer returns a set of tuples whose values share at least one token on one predicate in the learned key.
- Too many candidate pairs are still returned.
- A second-level similarity measure is needed to further reduce the size of the candidate set:
  - Further reduce non-coreferent pairs
  - Be careful to maintain good recall

Slide 13: Alternative Similarity Measures
Given two tuples (a, prop1, value1) and (b, prop1, value2):
- Direct_Comp: directly compute the string similarity between their values.
- Token_Sim: check the percentage of shared, highly similar tokens.
- Character-level bigram: check the percentage of shared character-level bigrams. This is the one we chose.
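The character-level bigram measure can be sketched as below. The slide does not say what the shared bigrams are divided by, so normalizing by the smaller bigram set is an assumption here, as are the function names and the 0.5 threshold in the usage lines.

```python
def bigrams(s):
    """Set of character-level bigrams of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def bigram_similarity(a, b):
    """Fraction of shared bigrams, relative to the smaller bigram set (assumption)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return len(ba & bb) / min(len(ba), len(bb))

def filter_pairs(pairs, threshold):
    """Keep only candidate pairs whose values are similar enough."""
    return [(x, y) for x, y in pairs if bigram_similarity(x, y) >= threshold]

print(bigram_similarity("smith", "smyth"))  # shares "sm" and "th" out of 4 bigrams
kept = filter_pairs([("smith", "smyth"), ("smith", "jones")], threshold=0.5)
print(kept)
```

A lower threshold keeps more pairs and protects recall; a higher one removes more non-coreferent pairs.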

Slide 14: Evaluation
- Evaluation metrics for candidate selection:
  - Pairwise Completeness (PC) = |true matches in candidate set| / |true matches|
  - Reduction Ratio (RR) = 1 - |candidate set| / (M * N), where M and N are the sizes of the two instance sets, respectively
  - F_cs = 2*PC*RR/(PC+RR)
  - Runtime of the candidate selection process
- Evaluation metrics for entity coreference:
  - Precision = |correctly detected pairs| / |all detected pairs|
  - Recall = |correctly detected pairs| / |groundtruth coreferent pairs|
  - F1-score = 2*Precision*Recall/(Precision+Recall)
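As a worked check of the candidate selection metrics, the sketch below plugs in the Restaurant figures reported later in the deck (331*533 instances, 112 coreferent pairs, 182 pairs selected by bigram). The count of 110 true matches inside the candidate set is an assumption, inferred from the reported PC of 98.21%; the function name is hypothetical.

```python
def candidate_selection_metrics(num_candidates, matches_in_candidates,
                                num_true_matches, m, n):
    """Return (PC, RR, F_cs) as fractions in [0, 1]."""
    pc = matches_in_candidates / num_true_matches
    rr = 1 - num_candidates / (m * n)
    f_cs = 2 * pc * rr / (pc + rr) if pc + rr else 0.0
    return pc, rr, f_cs

# Restaurant dataset, bigram system; 110 is inferred, not reported.
pc, rr, f_cs = candidate_selection_metrics(
    num_candidates=182, matches_in_candidates=110,
    num_true_matches=112, m=331, n=533)
print(f"PC={pc:.2%} RR={rr:.2%} F_cs={f_cs:.2%}")
```

Under that assumption the three values round to the 98.21, 99.90 and 99.05 percent listed for the bigram system on the Restaurant dataset.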

Slide 15: Datasets

    Dataset          Size      Coreferent pairs  Properties                                   Type
    RKB Person       100,000   123,719           names, publications, affiliation, etc.       RDF
    RKB Publication  100,000   51,697            title, date, proceeding, journal, etc.       RDF
    SWAT Person      100,000   9,482             names, papers, etc.                          RDF
    Hotel            1125*132  1,028             name, rating, area, price and date           Segmented posts
    Restaurant       331*533   112               name, address, type and city                 Segmented posts
    Census           10,000    5,000             names, zip, date of birth, phone, age, etc.  CSV

Slide 16: Learned Candidate Selection Key

    Dataset          Learned Properties
    RKB Person       full-name, job, email, web-addr and phone
    RKB Publication  title
    SWAT Person      CiteSeer:name and FOAF:name
    Hotel            name
    Restaurant       name
    Census           date-of-birth, surname and address_1

Slide 17: Evaluation on RDF Datasets

    Dataset          System     Pairs      RR (%)  PC (%)  F_cs (%)  CS Time (s)  Coref F1 (%)  Total Time (s)
    RKB Person       bigram     14,024     99.97   99.33   99.65     13.32        94.48         25.45
                     All-Pairs  680,403    98.64   99.76   99.20     1.34         92.04         195.37
                     EdJoin     150,074    99.70   99.72   99.71     1.73         92.38         72.79
                     Naive      N/A        N/A     N/A     N/A       N/A          91.64         4,766
    RKB Publication  bigram     6,831      99.99   99.97   99.98     18.26        99.74         31.73
                     All-Pairs  1,527,656  96.64   97.95   97.44     3.93         98.59         877.80
                     EdJoin     2,579,333  94.84   98.57   96.67     409.08       99.04         1473.47
                     Naive      N/A        N/A     N/A     N/A       N/A          99.55         34,567
    SWAT Person      bigram     7,129      99.99   98.72   99.35     13.46        95.07         21.21
                     All-Pairs  508,505    98.98   99.91   99.44     1.00         95.06         108.89
                     EdJoin     228,830    99.54   99.79   99.66     409.08       95.01         51.66
                     Naive      N/A        N/A     N/A     N/A       N/A          95.02         12,140

Systems: All-Pairs (Bayardo et al., WWW 2007); EdJoin (Xiao et al., VLDB 2008); Naive (Song and Heflin, CIKM 2010).

Slide 18: Evaluation on Non-RDF Datasets

    Dataset     System     Pairs    RR (%)  PC (%)  F_cs (%)
    Restaurant  bigram     182      99.90   98.21   99.05
                All-Pairs  1,967    98.89   99.11   99.00
                EdJoin     6,715    96.19   96.43   96.31
                BSL        1,306    99.26   98.16   98.71
                ASN        N/A      N/A     <96     <98
                Marlin     78,773   55.35   100.00  71.26
    Hotel       bigram     4,142    97.21   94.26   95.71
                All-Pairs  6,953    95.32   95.91   95.62
                EdJoin     17,623   88.13   98.93   93.22
                BSL        27,383   81.56   99.79   89.76
    Census      bigram     166,844  99.67   97.76   98.70
                All-Pairs  5,231    99.99   100.00  99.99
                EdJoin     11,010   99.98   99.50   99.74
                BSL        939,906  98.12   99.85   98.98
                Best Five  239,976  99.52   99.16   99.34

- Candidate selection was applied to each of the three full datasets.
- We ran bigram, All-Pairs and EdJoin ourselves to measure their performance.
- The numbers of selected pairs for Marlin, BSL and Best Five were estimated from their reported RR.
- We reference reported results for the rest.

Slide 19: Scalability of Candidate Selection
- The purpose of a blocking algorithm is to help entity coreference algorithms scale to large datasets.
- A blocking algorithm itself needs to scale well to such datasets.
- Hardware: a single Sun workstation with 6GB memory and one eight-core Intel Xeon 2.93GHz processor.

    Dataset          Size
    RKB Person       1 million
    RKB Publication  1 million
    Census           1 million
    SWAT Person      500,000

Slide 20: Runtime of Candidate Selection

Slide 21: Runtime Speedup of the Entire Entity Coreference Process
- Compared to the previously developed entity coreference algorithm (Song and Heflin, CIKM 2010)
- Runtime Speedup Factor = runtime without candidate selection / runtime with candidate selection
[Figure: speedup factor (y-axis) versus number of instances (x-axis) for RKB Person, RKB Publication and SWAT Person]
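The ratio can be checked with a few lines of arithmetic. The inputs below are the Total Time (s) values of the Naive and bigram rows in the RDF evaluation table (100,000 instances each). They are not the points plotted in the figure on this slide, and the function name `speedup_factor` is hypothetical.

```python
def speedup_factor(runtime_without_cs, runtime_with_cs):
    """Runtime without candidate selection divided by runtime with it."""
    return runtime_without_cs / runtime_with_cs

# (Naive total time, bigram total time) in seconds, from the RDF evaluation table.
total_times = {
    "RKB Person": (4766, 25.45),
    "RKB Publication": (34567, 31.73),
    "SWAT Person": (12140, 21.21),
}
speedups = {name: speedup_factor(naive, bigram)
            for name, (naive, bigram) in total_times.items()}
for name, factor in speedups.items():
    print(f"{name}: {factor:.0f}x")
```

For these table values the factors fall between roughly 10^2 and 10^3.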

Slide 24: Conclusion and Future Work
- Developed a domain-independent candidate selection scheme for scaling entity coreference processes on structured data.
- Achieved both high PC (> 98%) and RR (> 99%) on RDF data.
- The blocking algorithm itself scales to large datasets (> 1 million instances).
- The entire entity coreference process scales well (speedup of 10^2 to 10^3).
- Significant improvements were achieved on the final results of the entire entity coreference process (1.84% on RKB Person).
- Future work:
  - Applying to even larger datasets
  - Applying to other entity coreference algorithms
  - Exploring appropriate methods for matching numeric data