


Effective and Scalable Solutions for Mixed and Split Citation Problems in Digital Libraries
Dongwon Lee, Byung-Won On (Penn State University, USA)
Jaewoo Kang (North Carolina State University, USA)
Sanghyun Park (Yonsei University, Korea)
IQIS 2005

Outline
- Motivation
- Mixed Citation (MC) Problem
- Split Citation (SC) Problem
- Problem Definition
- Our approach
- Preliminary Experimentation
- Summary

Motivation
Digital Libraries (DL) often have many errors that negatively affect:
- Quality of DL
- Query results
- User experiences
- Bibliometric research
We present specific problems that often occur in scientific literature DLs.

Eg. 1: DBLP
Different authors' citations are mixed under the same name heading: the Mixed Citation (MC) Problem.

Eg. 2: ACM DL Portal
(Screenshot labels: Jeffrey D., Stanford Univ.)
The same author's citations are split into various name variants: the Split Citation (SC) Problem.


Eg. 3: CiteSeer & Google Scholar
Redundant citations with different formats co-exist in DLs.

Aftermath
- DO: how to automatically identify mixed and split citation cases
- DON'T: how to handle the identified cases
- Aftermath, e.g.: once the DL system becomes aware of the name variants, it can:
  - Change the user's query proactively
    User => give me all citations of Jeffrey D. Ullman
    DL => would you like to see the citations of J. D. Ullman too?
  - Or simply consolidate the two citations in storage and update the index properly

Mixed Citation Problem
Given a collection of citations (C) by an author (ai), can we identify false citations by another author (aj), when ai and aj have identical name spellings (i.e., homonyms)?

Solution: Citation Labeling Algorithm
Idea: for each citation in the collection, test whether the citation really belongs to the given collection.
1. Remove an author ai from the citation ci
2. Guess back the removed author name using associated information
3. If the guessed name <> removed name, then the citation ci is a false citation

Citation Labeling Algorithm
[Figure: a citation (J. Propp, Daniel Ullman) is compared against candidate authors J. D. Ullman (X, Y, Z), Liwei Wang (X`, Y`, Z`), and Daniel Ullman (X``, Y``, Z``). Gravano (2003)'s sampling-based join approximation is used in the pipeline, and the labeling step marks a false citation.]
Similarity: measure the similarity between a citation C and an author A.
- X: token vectors of coauthors from all the citations of the author
- Y: token vectors of paper titles from all the citations of the author
- Z: token vectors of venues from all the citations of the author
- cc: token vectors of coauthors of the citation c
- ac: token vectors of coauthors from all citations of the author a
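The remove-and-guess-back idea of the Citation Labeling Algorithm can be sketched as follows. This is only a minimal illustration under several assumptions that are not in the slides: it uses coauthor tokens alone (no titles or venues), plain term-frequency vectors with cosine similarity, a leave-one-out profile for the collection owner, and toy data. The function names `tokens`, `cosine`, and `label_false_citations` are hypothetical.

```python
import math
from collections import Counter


def tokens(names):
    """Lower-cased word tokens from a list of coauthor name strings."""
    return [t for name in names for t in name.lower().replace(".", " ").split()]


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def label_false_citations(author, citations, other_profiles):
    """Return indices of citations whose removed author is not guessed back.

    citations: list of coauthor lists, with `author` already removed.
    other_profiles: candidate author name -> Counter of coauthor tokens.
    """
    false_ids = []
    for i, coauthors in enumerate(citations):
        # Assumption: build the owner's profile from all OTHER citations,
        # so a citation cannot vote for itself.
        rest = [c for j, cs in enumerate(citations) if j != i for c in cs]
        profiles = dict(other_profiles)
        profiles[author] = Counter(tokens(rest))
        cc = Counter(tokens(coauthors))
        guessed = max(profiles, key=lambda a: cosine(cc, profiles[a]))
        if guessed != author:
            false_ids.append(i)
    return false_ids


# Toy data (illustrative only): the last citation really belongs to Daniel Ullman.
collection = [
    ["D. Maier", "F. Sadri"],
    ["D. Maier", "J. Hopcroft"],
    ["F. Sadri", "J. Hopcroft"],
    ["J. Propp"],
]
others = {"Daniel Ullman": Counter(tokens(["J. Propp", "R. Pemantle"]))}
print(label_false_citations("J. D. Ullman", collection, others))
```

Single-letter initials are kept as tokens here for simplicity; a real system would likely weight tokens (for example with TF-IDF) so that initials count for less.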

Configuration
- Data sets
  - DBLP (real examples): Chen Li, Dongwon Lee, Prasenjit Mitra, Wei Liu, and Wei Wang
  - EconPapers: inject artificial false citations into each author's citation collection
- Types of token vectors: Set, Bag
- Evaluation metrics
  - Time (Scalability)
  - Percentage/rank ratio (Accuracy): measure what percentage of false citations are ranked in the bottom N% of the ranking, for several cut-offs N

[Figure: MC: Scalability (EconPapers)]
[Figure: MC: Accuracy (EconPapers)]

Split Citation (SC) Problem
Given two lists of author names, X and Y, for each author name $x \in X$, find a set of author names $y_1, y_2, \ldots, y_n \in Y$ such that both $x$ and $y_i$ ($1 \le i \le n$) are variants of the same author.
- With names treated as tuples: = Database Join Problem
- With names treated as records: = Record Linkage Problem
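The percentage/rank ratio used for MC accuracy can be illustrated with a small helper. This is a sketch under assumptions: citations are ranked from most to least similar to the author, the cut-off is given as an integer percentage, and the name `bottom_ratio` is hypothetical.

```python
import math


def bottom_ratio(ranked_ids, false_ids, percent):
    """Share of the injected false citations that land in the bottom `percent`%.

    ranked_ids: citation ids ordered from most to least similar to the author.
    false_ids: ids of the artificially injected false citations.
    """
    false_set = set(false_ids)
    if not false_set:
        return 0.0
    cutoff = math.ceil(len(ranked_ids) * percent / 100)
    bottom = set(ranked_ids[len(ranked_ids) - cutoff:])
    return len(bottom & false_set) / len(false_set)


# Toy ranking of 10 citations with 3 injected false citations.
ranking = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9"]
injected = ["c3", "c8", "c9"]
for p in (20, 50, 70):
    print(p, bottom_ratio(ranking, injected, p))
```

A ratio close to 1.0 at a small cut-off means the labeling step pushes nearly all false citations to the end of the ranking.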

Naïve Solution
For each author name x in X
  For each author name y in Y
    If x ~ y (i.e., dist(x,y) < t), name variant!
- DB: Nested Loop Join
- RL: Pair-wise Record Match
- Cost: $O(|X| \cdot |Y|)$

Challenge 1: Scalability
$O(|X| \cdot |Y|)$ is too costly.
Solutions
- DB: Hashed Join
- RL: Blocking
For each name x in X
  Assign x to block b ($\in B$)
For each name y in Y
  Assign y to block b ($\in B$)
For each block b ($\in B$)
  Do naïve-solution
Cost: $O(|X| + |Y| + |B| \cdot a) \ll O(|X| \cdot |Y|)$

[Table: DL, Domain, Size (in M)]
DL               Domain
ISI/SCI          General Sciences
CAS              Chemistry
Medline/PubMed   Life Science
CiteSeer         General Sciences/Engineering
arXiv            Physics/Math
SPIRES HEP       Physics
DBLP             CompSci
CSB              CompSci

Challenge 2: Distance
Diverse name variations
- Jeffrey D. Ullman / J. Ullman
- Alon Y. Levy / Halevy, A.
- W. Wang / X. Wang
- Sean Engelson / Shlomo Argamon
Solution: look at additional information of the author names, e.g.:
- Coauthor list
- Keywords used in title
- Venues to submit
- Year
- Affiliation
$\mathrm{dist}(x,y) \approx W_i \cdot \mathrm{dist}(c(x),c(y)) + W_j \cdot \mathrm{dist}(t(x),t(y)) + W_k \cdot \mathrm{dist}(v(x),v(y))$

Name Disambiguation Algorithm
[Figure: author names 1..n (e.g., Jeffrey Ullman, Wei Wang, W. Wang, Jeffrey D. Ullman, Liwei Wang, J. D. Ullman) are grouped into blocks such as Wei Wang's block and Jeffrey Ullman's block; distances are then measured within each block and the names are ranked (columns: Rank, ID, Name).]

Step 1: Blocking
Many blocking methods can be applied:
- Sorted Window
- Token-based
- N-gram
We applied Gravano (2003)'s sampling-based join approximation algorithm as a blocking method.
Comparison with other blocking methods => JCDL 2005

Step 2: Measuring Distance
Supervised
- Naïve Bayes Model: use Bayes' theorem to measure similarity between two names
- Support Vector Machine: use SVM classifiers
Unsupervised
- String-based distance metrics: TFIDF/Jaccard (token-based), Jaro/JaroWinkler (edit distances)
- Vector-based cosine distance: cosine similarity
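The two-step scheme (block first, then run the naïve pairwise comparison only inside each block) can be sketched as below. Several choices here are assumptions for illustration and not the authors' configuration: the blocking key is the last-name token (a simple token-based blocking, not the sampling-based method the slides use), each component distance is a Jaccard distance over token sets, and the weights and threshold `t` are arbitrary. All names and records are toy data.

```python
from collections import defaultdict


def block_key(name):
    """Token-based blocking key: the lower-cased last name (an assumption)."""
    name = name.strip()
    last = name.split(",")[0] if "," in name else name.split()[-1]
    return last.lower()


def jaccard_dist(a, b):
    """Jaccard distance between two token sets; no evidence counts as far apart."""
    a, b = set(a), set(b)
    return 1.0 - len(a & b) / len(a | b) if a | b else 1.0


def dist(x, y, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of coauthor, title, and venue distances (weights are hypothetical)."""
    fields = ("coauthors", "titles", "venues")
    return sum(w * jaccard_dist(x[f], y[f]) for w, f in zip(weights, fields))


def find_variants(X, Y, t=0.6):
    """X, Y: name -> record with token sets. Compare only names sharing a block."""
    blocks = defaultdict(lambda: ([], []))
    for name in X:
        blocks[block_key(name)][0].append(name)
    for name in Y:
        blocks[block_key(name)][1].append(name)
    variants = defaultdict(list)
    for xs, ys in blocks.values():
        for x in xs:                      # naïve solution, restricted to one block
            for y in ys:
                if dist(X[x], Y[y]) < t:
                    variants[x].append(y)
    return dict(variants)


X = {"Jeffrey D. Ullman": {"coauthors": {"hopcroft", "maier"},
                           "titles": {"automata", "languages"},
                           "venues": {"pods"}}}
Y = {"J. D. Ullman": {"coauthors": {"hopcroft", "sadri"},
                      "titles": {"automata", "databases"},
                      "venues": {"pods"}},
     "Daniel Ullman": {"coauthors": {"propp"},
                       "titles": {"tilings"},
                       "venues": {"jcta"}}}
print(find_variants(X, Y))
```

Because both Ullman names in Y fall into the same block, the name string alone cannot separate them; the coauthor, title, and venue evidence does.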

Policy Variations
- Data sets
- Blocking
- Measuring Distance

Configuration (eg, DBLP case)
- Authors, x, in X and authors, y, in Y
- Prepare an artificial name variant x' for K randomly chosen x (eg, K=00):
  - Abbreviation of the first name (85%): Ji-Woo K. Li => J. K. Li
  - Typo (15%): Ji-Woo K. Li => Ji-Woo K. Lee
- x carries half of x's original citations; x' carries the other half
- Inject all x' into Y
- Test: for each author x in X, find the corresponding name variant x' in Y
- Evaluation metrics: Time, Accuracy

Varying error types gave consistent results. For instance:
- Name Abbreviation: 30%
- Name Alternation: 30%
- First Name Misspelling: %
- Last Name Misspelling: %
- Contraction: %
- Middle Name Initial Omission: 4%
- Combination: 0%

[Figure: SC: Accuracy (DBLP). Y-axis: Accuracy; series: TFIDF, Jaccard, Jaro; x-axis: Method (-N, -NN, -NC, -NH).]
[Figure: SC: Accuracy (All data sets). Y-axis: Accuracy; data sets: DBLP, e-print, BioMed, EconPapers; metrics: NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin.]

Related Work
- Identity / Entity Matching
- Database Join
- Record Linkage
- Merge / Purge
- Ontology Matching
- Graph Matching
- Name Authority Control Problem in LIS
Please see the paper for details.
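The test-set construction described in the configuration (pick K authors, create a name variant for each, and move half of the citations to the variant) might look like the sketch below. The 85% abbreviation share is from the slide; treating the remaining 15% as typos, the single-character typo model, and the function names are assumptions made for illustration. Collisions between generated variants are ignored.

```python
import random


def abbreviate_first_name(name):
    """'Ji-Woo K. Li' -> 'J. K. Li': replace the first name with its initial."""
    first, *rest = name.split()
    return " ".join([first[0] + "."] + rest)


def typo_last_name(name, rng):
    """Replace one random character of the last name (a stand-in typo model)."""
    *rest, last = name.split()
    i = rng.randrange(len(last))
    letters = [c for c in "abcdefghijklmnopqrstuvwxyz" if c != last[i].lower()]
    new = rng.choice(letters)
    if last[i].isupper():
        new = new.upper()
    return " ".join(rest + [last[:i] + new + last[i + 1:]])


def make_variants(X, k, rng, p_abbrev=0.85):
    """X: author name -> list of citations.

    Returns (X_after, injected, truth): the originals keeping the first half of
    their citations, the variants carrying the other half (to be injected into
    Y), and the ground-truth mapping original -> variant.
    """
    X_after, injected, truth = dict(X), {}, {}
    for name in rng.sample(sorted(X), k):
        if rng.random() < p_abbrev:
            variant = abbreviate_first_name(name)
        else:
            variant = typo_last_name(name, rng)
        cites = X[name]
        half = len(cites) // 2
        X_after[name] = cites[:half]
        injected[variant] = cites[half:]
        truth[name] = variant
    return X_after, injected, truth


rng = random.Random(0)  # fixed seed so the toy run is repeatable
X = {"Ji-Woo K. Li": ["c1", "c2", "c3", "c4"], "Wei Wang": ["c5", "c6"]}
X_after, injected, truth = make_variants(X, k=1, rng=rng)
print(truth, injected)
```

Accuracy can then be measured by checking, for each original name in `truth`, whether the disambiguation step returns the injected variant.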

Future Work
- Using additional information of author names is essentially token comparison
- Better way: coauthor information as a graph
  - Graph matching / partitioning
  - Sub-graph detection

Conclusion
SC Problem
- Using additional information (eg, coauthors) rather than the name itself is better in distance measure
  - -NC/-NH outperform -N/-NN
- SVM or Cosine shows the best accuracy (90-93%)

Backup Slides

- Token vectors of coauthors of a citation Cj: "R Pemantle, D Ullman"
- Token vectors of coauthors from all citations of D. Ullman: "D Ullman, D Maier, F Sadri, J Hopcroft"
[Tables: per-token weights for each string (columns: ID, Token, Weight); joined token pairs (columns: Rid, Sid, Weight); ranked result (columns: Rank, Citation ID, Similarity of coauthor string).]

Distance Metric: NBM
Use Bayes' theorem to measure the similarity between two author names. To calculate the similarity between "Byung-Won On" and "On, B.-W.":
- Training: estimate the probability per coauthor of "Byung-Won On" in terms of the Bayes rule.
- Testing: calculate the posterior probability of "On, B.-W." with the coauthors' probability values of "Byung-Won On".

Distance Metric: SVM
- Preprocessing: all coauthor information of an author is transformed into a vector-space representation.
- Training: given training examples of author names labeled either YES ("J. Ullman" & "Jeffrey D. Ullman") or NO ("J. Ullman" & "James Ullmann"), SVM creates a maximum-margin hyperplane that splits the YES/NO training examples.
- Testing: SVM classifies vectors by mapping them via the kernel trick to a high-dimensional space (the two classes, equivalent pairs and different ones, are separated by a hyperplane).
- Kernel: use the Radial Basis Function (RBF) kernel.
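The NBM training and testing steps can be illustrated with a small multinomial Naïve Bayes sketch over coauthor names. This is an assumption-laden illustration, not the authors' model: it uses add-alpha smoothing, a uniform prior over candidate authors, and toy coauthor lists; the function names `train` and `log_score` are hypothetical.

```python
import math
from collections import Counter


def train(coauthor_lists, vocabulary, alpha=1.0):
    """Estimate P(coauthor | author) with add-alpha smoothing (an assumption)."""
    counts = Counter(c for coauthors in coauthor_lists for c in coauthors)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {c: (counts[c] + alpha) / total for c in vocabulary}


def log_score(model, coauthors):
    """Log-likelihood of a name's coauthors under an author model.

    With a uniform prior over candidate authors, ranking by this score is the
    same as ranking by posterior probability.
    """
    return sum(math.log(model[c]) for c in coauthors if c in model)


# Toy data: citations of the known name and of an unrelated candidate.
vocab = {"D. Lee", "J. Kang", "S. Park", "J. Propp"}
model_on = train([["D. Lee", "J. Kang"], ["D. Lee", "S. Park"]], vocab)
model_other = train([["J. Propp"]], vocab)

# Coauthors observed with the variant "On, B.-W."
variant_coauthors = ["D. Lee", "J. Kang"]
print(log_score(model_on, variant_coauthors))
print(log_score(model_other, variant_coauthors))
```

The variant is attributed to the candidate whose model gives the higher score; here the coauthors of "On, B.-W." fit the "Byung-Won On" model better than the unrelated one.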

Distance Metric: String-based Distance Metrics
Distance Metric: Vector-based Cosine Distance

[Figure: Accuracy w. error types [DBLP]. Y-axis: Accuracy; series: TFIDF, Jaccard, Jaro; x-axis: Method (-N, -NN, -NC, -NH). Error types: Name Abbreviation: 30%; Name Alternation: 30%; First Name Misspelling: %; Last Name Misspelling: %; Contraction: %; Middle Name Initial Omission: 4%; Combination: 0%.]
[Figure: SC: Scalability (DBLP with k=). Y-axis: Time (sec); blocking: iffl, Token, 4-gram; metrics: NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin.]
[Figure: Distribution in top-0. Y-axis: %; x-axis: Rank; series: TFIDF, Jaccard, Jaro, JaroWin, Cosine.]
[Figure: Step : Speed [DBLP]. Y-axis: Time (sec; logarithmic scale); blocking: Token, 4-gram, iffl; metrics: NBM, SVM, VSM, TFIDF, Jaccard, Jaro, JaroWin.]

[Figure: Overall: Accuracy [DBLP]. Y-axis: Accuracy (%); blocking: iffl, Token, 4-gram; metrics: NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin.]
[Figure: MC: Accuracy (DBLP)]

Naïve Solution
- The MC problem can be naturally cast as a K-Clustering Problem: cluster N data points into K clusters
- Issue: the typical citation collection of an author in DBLP is short (ie, the author-citation graph exhibits a scale-free network), so N is too small to apply clustering

Name Disambiguation Algorithm
Borrow solutions from the RL community:
- Step 1: Significantly reduce author names via blocking
- Step 2: Apply more expensive distance measures within each block

[Figure: SC: Processing time for Step. Y-axis: Time (sec); data sets: DBLP, e-print, BioMed, EconPapers; metrics: NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin.]
