
Transcription:

Effective and Scalable Solutions for Mixed and Split Citation Problems in Digital Libraries
Dongwon Lee, Byung-Won On (Penn State University, USA)
Jaewoo Kang (North Carolina State University, USA)
Sanghyun Park (Yonsei University, Korea)
IQIS 2005

Outline
- Motivation
- Mixed Citation (MC) Problem
- Split Citation (SC) Problem
- Problem Definition
- Our approach
- Preliminary Experimentation
- Summary

Motivation
Digital Libraries (DL) often have many errors that negatively affect:
- Quality of DL
- Query results
- User experiences
- Bibliometric research
We present specific problems that often occur in scientific literature DLs.

Eg. 1: DBLP
- Different authors' citations are mixed under the same name heading
- This is the Mixed Citation (MC) Problem

Eg. 2: ACM DL Portal
- Jeffrey D. Ullman @ Stanford Univ.
- The same author's citations are split into various name variants
- This is the Split Citation (SC) Problem

Eg. 3: CiteSeer & Google Scholar
- Redundant citations with different formats co-exist in DLs

Aftermath
- DO: how to automatically identify mixed and split citation cases
- DON'T: how to handle the identified cases
- Aftermath: e.g., once the DL system becomes aware of the name variants, it can:
  - Change the user's query proactively
    User: "give me all citations of Jeffrey D. Ullman"
    DL: "would you like to see the citations of J. D. Ullman too?"
  - Or simply consolidate the two citations in storage and update the index properly

1. Mixed Citation Problem
Given a collection of citations (C) by an author (ai), can we identify false citations by another author (aj), when ai and aj have identical name spellings (i.e., homonyms)?
Solution: Citation Labeling Algorithm

Citation Labeling Algorithm
Idea: for each citation in the collection, test whether the citation really belongs to the given collection
1. Remove an author ai from the citation ci
2. Guess back the removed author name using associated information
3. If the guessed name differs from the removed name, then the citation ci is a false citation

Citation Labeling Algorithm (example)
[Diagram: a citation by "J. Propp, Daniel Ullman" filed under J. D. Ullman is compared against candidate authors J. D. Ullman (X, Y, Z), Liwei Wang (X', Y', Z'), and Daniel Ullman (X'', Y'', Z''); the candidates are ranked 1..N by similarity, and the citation is labeled false when the top-ranked name (Daniel Ullman) differs from the removed name (J. D. Ullman).]
- X: token vectors of coauthors from all the citations of the author
- Y: token vectors of paper titles from all the citations of the author
- Z: token vectors of venues from all the citations of the author
- Similarity: measure the similarity between a citation c and an author a
  - cc: token vectors of coauthors of the citation c
  - ac: token vectors of coauthors from all citations of the author a
- Uses Gravano (2003)'s sampling-based join approximation
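The remove/guess/compare loop of the Citation Labeling Algorithm can be sketched as below. This is a minimal illustration and not the authors' implementation: it uses only coauthor tokens (titles and venues are omitted), ranks candidates with plain cosine similarity instead of the sampling-based join approximation named on the slide, and the functions `tokens`, `cosine`, `label_citation` and the toy author profiles are hypothetical.

```python
import math
from collections import Counter


def tokens(names):
    """Bag of lowercase tokens from a list of name strings."""
    return Counter(
        tok for name in names for tok in name.lower().replace(".", " ").split()
    )


def cosine(a, b):
    """Cosine similarity between two token bags (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def label_citation(remaining_coauthors, removed_author, author_profiles):
    """Guess the removed author from the remaining coauthors.

    Returns (guessed_author, is_false); the citation is flagged as false
    when the best-matching candidate is not the removed author.
    """
    cc = tokens(remaining_coauthors)
    guessed = max(author_profiles, key=lambda a: cosine(cc, author_profiles[a]))
    return guessed, guessed != removed_author


# Toy coauthor profiles (assumed data, loosely based on names on the slides).
profiles = {
    "J. D. Ullman": tokens(["D. Maier", "F. Sadri", "J. Hopcroft"]),
    "Daniel Ullman": tokens(["J. Propp", "R. Pemantle"]),
}

# A citation filed under "J. D. Ullman" whose other coauthor is J. Propp.
guessed, is_false = label_citation(["J. Propp"], "J. D. Ullman", profiles)
print(guessed, is_false)
```

In this toy run the citation's remaining coauthor matches Daniel Ullman's profile better than J. D. Ullman's, so the citation is flagged as false.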

Configuration
- Data sets
  - DBLP (real examples): Chen Li, Dongwon Lee, Prasenjit Mitra, Wei Liu, and Wei Wang
  - EconPapers: inject artificial false citations into each author's citation collection
- Types of token vectors
  - Set
  - Bag
- Evaluation metrics
  - Time (scalability)
  - Percentage/rank ratio (accuracy): measure what percentage of false citations are ranked in the bottom 10%, 20%, etc.

MC: Scalability (EconPapers)
[Figure: scalability results on EconPapers]

MC: Accuracy (EconPapers)
[Figure: accuracy results on EconPapers]

2. Split Citation (SC) Problem
Given two lists of author names, X and Y, for each author name x (in X), find a set of author names y1, y2, ..., yn (in Y) such that x and each yi (1 <= i <= n) are name variants.
- Treating each name as a tuple, this is the Database Join Problem
- Treating each name as a record, this is the Record Linkage Problem
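The percentage/rank-ratio accuracy metric from the Configuration slide can be computed roughly as follows. This is a sketch under assumptions: the function name `false_in_bottom`, the rounding of the cutoff, and the toy ranking are illustrative and not taken from the slides.

```python
def false_in_bottom(ranked_ids, false_ids, fraction):
    """Share of injected false citations found in the bottom `fraction`.

    ranked_ids: citation IDs ordered from most to least similar to the author.
    false_ids:  IDs of the artificially injected false citations.
    """
    false_ids = set(false_ids)
    if not false_ids:
        return 0.0
    # Size of the bottom slice, at least one citation (assumed rounding rule).
    cutoff = max(1, round(len(ranked_ids) * fraction))
    bottom = set(ranked_ids[-cutoff:])
    return len(bottom & false_ids) / len(false_ids)


# Toy ranking of ten citations; f1 and f2 are the injected false ones.
ranking = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "f1", "f2"]
bottom_10 = false_in_bottom(ranking, {"f1", "f2"}, 0.10)
bottom_20 = false_in_bottom(ranking, {"f1", "f2"}, 0.20)
print(bottom_10, bottom_20)
```

A good labeling method pushes false citations toward the end of the ranking, so the value for a small bottom fraction should already be close to 1.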

Naïve Solution
  For each author name x in X
    For each author name y in Y
      If x ~ y (i.e., dist(x, y) < t), name variant!
- DB: Nested Loop Join
- RL: Pair-wise Record Match
- Cost: O(|X||Y|)

Challenge 1: Scalability
- O(|X||Y|) is too costly
- Solutions
  - DB: Hashed Join
  - RL: Blocking
      For each name x in X
        Assign x to block b (in B)
      For each name y in Y
        Assign y to block b (in B)
      For each block b (in B)
        Do naïve solution
- O(|X| + |Y| + |B| a) is much smaller than O(|X||Y|)

  DL               Domain
  ISI/SCI          General Sciences
  CAS              Chemistry
  Medline/PubMed   Life Science
  CiteSeer         General Sciences/Engineering
  arXiv            Physics/Math
  SPIRES HEP       Physics
  DBLP             CompSci
  CSB              CompSci
  Size (in M): values garbled in the source (5 3 0 0.3 0.5.4)

Challenge 2: Distance
- Diverse name variations
  - Jeffrey D. Ullman vs. J. Ullman
  - Alon Y. Levy vs. Halevy, A.
  - W. Wang vs. X. Wang
  - Sean Engelson vs. Shlomo Argamon
- Solution: look at additional information of the author names, e.g.,
  - Coauthor list
  - Keywords used in title
  - Venues to submit
  - Year
  - Affiliation
- dist(x, y) ~ W_i * dist(c(x), c(y)) + W_j * dist(t(x), t(y)) + W_k * dist(v(x), v(y))

Name Disambiguation Algorithm
[Diagram: each name in X (1: Jeffrey Ullman, ..., m: Wei Wang) gets a block of candidate names from Y (e.g., W. Wang and Liwei Wang in Wei Wang's block; Jeffrey D. Ullman and J. D. Ullman in Jeffrey Ullman's block). Distances are measured within each block, producing a ranked list (Rank, ID, Name) per name.]

Step 1: Blocking
- Many blocking methods can be applied
  - Sorted Window
  - Token-based
  - N-gram
- We applied Gravano (2003)'s sampling-based join approximation algorithm as a blocking method
- Comparison with other blocking methods: see JCDL 2005

Step 2: Measuring Distance
- Supervised
  - Naïve Bayes Model: use Bayes' theorem to measure similarity between two names
  - Support Vector Machine: use SVM classifiers
- Unsupervised
  - String-based Distance Metrics
    - TFIDF/Jaccard (token-based)
    - Jaro/Jaro-Winkler (edit distances)
  - Vector-based Cosine Distance: cosine similarity

Policy Variations
- Data sets
- Blocking
- Measuring Distance

Configuration (e.g., DBLP case)
- Authors x in X and authors y in Y
- Prepare an artificial name variant x' for K randomly chosen x (eg, K=00):
  - Abbreviation of the first name (85%): "Ji-Woo K. Li" becomes "J. K. Li"
  - Typo (15%): "Ji-Woo K. Li" becomes "Ji-Woo K. Lee"
- x carries half of x's original citations; x' carries the other half
- Inject all x' into Y
- Test: for each author x in X, find the corresponding name variant x' in Y
- Evaluation metrics
  - Time
  - Accuracy

SC: Accuracy (DBLP)
[Figure: accuracy of TFIDF, Jaccard, and Jaro for the methods -N, -NN, -NC, -NH]
Varying error types gave consistent results. For instance (some digits were lost in extraction): Name Abbreviation: 30%, Name Alternation: 30%, First Name Misspelling: %, Last Name Misspelling: %, Contraction: %, Middle Name Initial Omission: 4%, Combination: 0%.

SC: Accuracy (All data sets)
[Figure: accuracy of NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, and JaroWin on DBLP, e-print, BioMed, and EconPapers]

Related Work
- Identity / Entity Matching
- Database Join
- Record Linkage
- Merge / Purge
- Ontology Matching
- Graph Matching
- Name Authority Control Problem in LIS
Please see the paper for details.
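The test-set construction described in the Configuration slide (create an artificial variant of a name and split the author's citations between the original and the variant) might look like the sketch below. The helper names, the random single-character typo rule, and the seed are assumptions; the slide only gives the example "Ji-Woo K. Li" and the abbreviation share of 85%.

```python
import random
import string


def abbreviate_first_name(name):
    """'Ji-Woo K. Li' becomes 'J. K. Li' (first token reduced to its initial)."""
    parts = name.split()
    parts[0] = parts[0][0] + "."
    return " ".join(parts)


def inject_typo(name, rng):
    """Replace one letter with a different one (hypothetical typo model)."""
    positions = [i for i, ch in enumerate(name) if ch.isalpha()]
    i = rng.choice(positions)
    choices = [c for c in string.ascii_lowercase if c != name[i].lower()]
    return name[:i] + rng.choice(choices) + name[i + 1:]


def make_variant(name, citations, rng, p_abbrev=0.85):
    """Return (variant_name, citations kept by x, citations moved to the variant)."""
    if rng.random() < p_abbrev:
        variant = abbreviate_first_name(name)
    else:
        variant = inject_typo(name, rng)
    shuffled = list(citations)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return variant, shuffled[:half], shuffled[half:]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
variant, kept, moved = make_variant("Ji-Woo K. Li", ["c1", "c2", "c3", "c4"], rng)
print(variant, kept, moved)
```

The variant and its half of the citations would then be injected into Y, and the test asks whether the method recovers the variant when queried with the original name.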

Future Work
- Using additional information of the author name is essentially token comparison
- Better way: treat coauthor information as a graph
  - Graph matching / partitioning
  - Sub-graph detection

Conclusion
- SC Problem
  - Using additional information (e.g., coauthors) beyond the name itself gives a better distance measure
  - -NC/-NH outperform -N/-NN
  - SVM or Cosine shows the best accuracy (90-93%)

Backup Slides

Token vectors
[Figure: coauthor strings are split into weighted token tables (ID, Token, Weight). One table holds the token vector of the coauthors of a citation Cj ("R Pemantle, D Ullman"); the other holds the token vector of coauthors from all citations of D. Ullman ("D Ullman, D Maier, F Sadri, J Hopcroft"). Joining the two tables (Rid, Sid, Weight) gives a ranking of citations Ci, ..., Cj by similarity of coauthor string (Rank, Citation ID, Similarity). The weight values are garbled in the source.]

Distance Metric: NBM
- Use Bayes' theorem to measure the similarity between two author names
- Example: to calculate the similarity between "Byung-Won On" and "On, B.-W."
  - Training: estimate the probability of each coauthor of "Byung-Won On" using Bayes' rule
  - Testing: calculate the posterior probability of "On, B.-W." using the coauthor probability values of "Byung-Won On"

Distance Metric: SVM
- Preprocessing: all coauthor information of an author is transformed into a vector-space representation
- Training: given training examples of author name pairs labeled either YES ("J. Ullman" & "Jeffrey D. Ullman") or NO ("J. Ullman" & "James Ullmann"), SVM creates a maximum-margin hyperplane that splits the YES/NO training examples
- Testing: SVM classifies vectors by mapping them via the kernel trick to a high-dimensional space (the two classes, equivalent pairs and different ones, are separated by a hyperplane)
- Kernel: Radial Basis Function (RBF kernel)
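The NBM training and testing steps can be illustrated with a small naive Bayes scorer over coauthor names. This is a sketch, not the authors' model: Laplace (add-one) smoothing, a uniform prior over candidate authors, and the toy citation data (including the candidate "A. Other") are assumptions introduced for the example.

```python
import math
from collections import Counter


def train(citation_coauthors, vocabulary_size):
    """Estimate log P(coauthor | author) with add-one smoothing (assumed)."""
    counts = Counter(c for coauthors in citation_coauthors for c in coauthors)
    total = sum(counts.values())

    def log_prob(coauthor):
        return math.log((counts[coauthor] + 1) / (total + vocabulary_size))

    return log_prob


def log_likelihood(log_prob, coauthors):
    """Log-likelihood of the variant's coauthors under one author's model."""
    return sum(log_prob(c) for c in coauthors)


# Toy training data: coauthor lists per citation for each candidate author.
training = {
    "Byung-Won On": [["Dongwon Lee", "Jaewoo Kang"],
                     ["Dongwon Lee", "Sanghyun Park"],
                     ["Dongwon Lee"]],
    "A. Other": [["P. Smith"], ["Q. Jones"]],
}
vocabulary = {c for cites in training.values() for cs in cites for c in cs}
models = {a: train(cites, len(vocabulary)) for a, cites in training.items()}

# Coauthors seen with the name variant "On, B.-W." (assumed data).
variant_coauthors = ["Dongwon Lee", "Jaewoo Kang"]

# With a uniform prior, the posterior ranking equals the likelihood ranking.
scores = {a: log_likelihood(m, variant_coauthors) for a, m in models.items()}
best = max(scores, key=scores.get)
print(best)
```

Because the variant shares coauthors with "Byung-Won On", that candidate receives the higher score and would be ranked first as the likely match.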

Distance Metric: String-based Distance Metrics
[Slide content not captured in the transcript]

Distance Metric: Vector-based Cosine Distance
[Slide content not captured in the transcript]

Accuracy w. error types [DBLP]
[Figure: accuracy of NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, and JaroWin for the Token and 4-gram blocking methods. Error types as printed (some digits were lost in extraction): Name Abbreviation: 30%, Name Alternation: 30%, First Name Misspelling: %, Last Name Misspelling: %, Contraction: %, Middle Name Initial Omission: 4%, Combination: 0%.]

SC: Scalability (DBLP with k=)
[Figure: time (sec) of TFIDF, Jaccard, and Jaro for the methods -N, -NN, -NC, -NH]

Distribution in top-10
[Figure: percentage by rank (1 to 10) for TFIDF, Jaccard, Jaro, JaroWin, and Cosine]

Step : Speed [DBLP]
[Figure: time (sec; logarithmic scale) of NBM, SVM, VSM, TFIDF, Jaccard, Jaro, and JaroWin for the Token and 4-gram blocking methods]

Overall: Accuracy [DBLP]
[Figure: accuracy of NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, and JaroWin for the Token and 4-gram blocking methods]

MC: Accuracy (DBLP)
[Figure: accuracy results on DBLP]

Naïve Solution
- The MC problem can be naturally cast as a K-clustering problem: cluster N data points into K clusters
- Issue: the typical citation collection of an author in DBLP is short (i.e., the author-citation graph exhibits a scale-free network), so N is too small to apply clustering

Name Disambiguation Algorithm
- Borrow solutions from the RL community
- Step 1: significantly reduce the number of author names via blocking
- Step 2: apply more expensive distance measures within each block

SC: Processing time for Step
[Figure: time (sec) of NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, and JaroWin on DBLP, e-print, BioMed, and EconPapers]