
Effective and Scalable Solutions for Mixed and Split Citation Problems in Digital Libraries

Dongwon Lee, Byung-Won On (Penn State University, USA)
Jaewoo Kang (North Carolina State University, USA)
Sanghyun Park (Yonsei University, Korea)

IQIS 2005

Outline
- Motivation
- Mixed Citation (MC) Problem
- Split Citation (SC) Problem
- Problem Definition
- Our approach
- Preliminary Experimentation
- Summary

Motivation
Digital Libraries (DL) often have many errors that negatively affect:
- Quality of DL
- Query results
- User experiences
- Bibliometric research
We present specific problems that often occur in scientific literature DLs.

Eg. 1: DBLP
Different authors' citations are mixed under the same name heading
=> Mixed Citation (MC) Problem

Eg. 2: ACM DL Portal
Jeffrey D. Ullman @ Stanford Univ.
The same author's citations are split into various name variants
=> Split Citation (SC) Problem

Eg. 3: CiteSeer & Google Scholar
Redundant citations with different formats co-exist in DLs.

Aftermath
- DO: how to automatically identify mixed and split citation cases
- DON'T: how to handle the identified cases
- Aftermath, eg: once the DL system is aware of the name variants, it can
  - Change the user's query proactively
    User => give me all citations of Jeffrey D. Ullman
    DL => would you like to see the citations of J. D. Ullman too?
  - Or simply consolidate the two citations in storage and update the index properly

Mixed Citation Problem
Given a collection of citations (C) by an author (ai), can we identify false citations by another author (aj), when ai and aj have identical name spellings (ie, homonyms)?
Solution: Citation Labeling Algorithm

Citation Labeling Algorithm
Idea: for each citation in the collection, test whether the citation really belongs to the given collection.
1. Remove an author ai from the citation ci
2. Guess back the removed author name using the associated information
3. If the guessed name <> the removed name, then the citation ci is a false citation

Citation Labeling Algorithm (example)
[Diagram: a citation with authors "J. Propp, Daniel Ullman" goes through name disambiguation against candidate authors J. D. Ullman (X, Y, Z), Liwei Wang (X', Y', Z'), and Daniel Ullman (X'', Y'', Z''). Candidates are obtained with Gravano (2003)'s sampling-based join approximation. The labeling step compares the guessed name with the removed name and flags a mismatch as a false citation.]
- Similarity: measure the similarity between a citation C and an author A
- X: token vectors of coauthors from all the citations of the author
- Y: token vectors of paper titles from all the citations of the author
- Z: token vectors of venues from all the citations of the author
- cc: token vectors of coauthors of the citation c
- ac: token vectors of coauthors from all citations of the author a
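The three steps of the Citation Labeling Algorithm above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses only coauthor tokens (the slides also use titles and venues), plain term counts with cosine similarity, and a hand-made candidate set in place of the sampling-based join. The function names and the small profiles are hypothetical.

```python
from collections import Counter
from math import sqrt

def tokens(names):
    """Bag of lowercased name tokens from a list of author-name strings."""
    return Counter(t for n in names for t in n.lower().replace(".", " ").split())

def cosine(a, b):
    """Cosine similarity between two token bags (Counters)."""
    dot = sum(w * b[t] for t, w in a.items())
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def label_citation(citation_authors, removed, profiles):
    """Remove one author, guess the name back, and compare.

    profiles maps each candidate author name to the token bag of coauthors
    taken from all of that author's citations (an assumed input format).
    Returns (is_false_citation, guessed_name).
    """
    rest = tokens([a for a in citation_authors if a != removed])
    guessed = max(profiles, key=lambda name: cosine(rest, profiles[name]))
    return guessed != removed, guessed

# Illustrative profiles only, not data from the experiments.
profiles = {
    "J. D. Ullman": tokens(["A. Aho", "J. Hopcroft", "H. Garcia-Molina"]),
    "Daniel Ullman": tokens(["J. Propp", "R. Pemantle"]),
}

# A citation filed under "J. D. Ullman" whose coauthor points elsewhere.
mixed = label_citation(["J. Propp", "J. D. Ullman"], "J. D. Ullman", profiles)
clean = label_citation(["J. Hopcroft", "J. D. Ullman"], "J. D. Ullman", profiles)
```

Here `mixed` flags the first citation as false because the guessed name is Daniel Ullman, while `clean` keeps the second citation under J. D. Ullman.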

Configuration
- Data sets
  - DBLP (real examples): Chen Li, Dongwon Lee, Prasenjit Mitra, Wei Liu, and Wei Wang
  - EconPapers: inject artificial false citations into each author's citation collection
- Types of token vectors
  - Set
  - Bag
- Evaluation metrics
  - Time (scalability)
  - Percentage/rank ratio (accuracy): measure what percentage of false citations are ranked in the bottom 10%, 20%, etc.

MC: Scalability (EconPapers)
[Figure]

MC: Accuracy (EconPapers)
[Figure]

Split Citation (SC) Problem
Given two lists of author names, X and Y, for each author name x (∈ X), find a set of author names y1, y2, ..., yn (∈ Y) such that x and each yi (1 ≤ i ≤ n) are variants of the same author name.
- Treating each name as a tuple: = Database Join Problem
- Treating each name as a record: = Record Linkage Problem
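The percentage/rank ratio metric listed in the MC configuration above could be computed roughly as follows. This is a sketch under assumptions: citations are taken to be ranked from most to least similar to the author, so false citations should sink to the bottom, and the function name and the ten-item ranking are hypothetical.

```python
def bottom_rank_ratio(ranked_ids, false_ids, fraction):
    """Share of false citations that fall in the bottom `fraction` of a ranking.

    ranked_ids: citation ids ordered from most to least similar to the author.
    false_ids:  ids of the injected false citations.
    fraction:   size of the bottom slice, eg 0.1 for the bottom 10%.
    """
    false_ids = set(false_ids)
    if not false_ids:
        return 0.0
    slice_size = int(len(ranked_ids) * fraction)
    bottom = set(ranked_ids[len(ranked_ids) - slice_size:])
    return len(bottom & false_ids) / len(false_ids)

# Hypothetical ranking of ten citations with three injected false ones.
ranking = ["c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"]
injected = ["c3", "c9", "c10"]
ratio_10 = bottom_rank_ratio(ranking, injected, 0.1)
ratio_20 = bottom_rank_ratio(ranking, injected, 0.2)
```

A value near 1.0 at a small fraction would mean nearly all false citations were pushed to the very bottom of the list.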

Naïve Solution
For each author name x in X
  For each author name y in Y
    If x ~ y (ie, dist(x,y) < t), name variant!
- DB: Nested Loop Join
- RL: Pair-wise Record Match
- Cost: O(|X||Y|)

Challenge 1: Scalability
O(|X||Y|) is too costly.
DL and domain:
- ISI/SCI: General Sciences
- CAS: Chemistry
- Medline/PubMed: Life Science
- CiteSeer: General Sciences/Engineering
- arxiv: Physics/Math
- SPIRES HEP: Physics
- DBLP: CompSci
- CSB: CompSci
Size (in M): 5 3 0 0.3 0.5.4
Solutions
- DB: Hashed Join
- RL: Blocking
For each name x in X
  Assign x to block b (∈ B)
For each name y in Y
  Assign y to block b (∈ B)
For each block b (∈ B)
  Do naïve-solution
O(|X| + |Y| + |B|a) << O(|X||Y|)

Challenge 2: Distance
Diverse name variations
- Jeffrey D. Ullman / J. Ullman
- Alon Y. Levy / Halevy, A.
- W. Wang / X. Wang
- Sean Engelson / Shlomo Argamon
Solution: look at additional information associated with the author names, eg:
- Coauthor list
- Keywords used in title
- Venues to submit
- Year
- Affiliation
dist(x,y) ~ Wi*dist(c(x),c(y)) + Wj*dist(t(x),t(y)) + Wk*dist(v(x),v(y))
where c(), t(), and v() are the coauthor, title, and venue information of a name.

Name Disambiguation Algorithm
[Diagram: each name in X (1: Jeffrey Ullman, ..., m: Wei Wang) gets its own block of candidate names from Y (eg, W. Wang; 50466: Jeffrey D. Ullman; 35455: Liwei Wang; n: J. D. Ullman). Distances are measured within Wei Wang's block and within Jeffrey Ullman's block, producing for each name a ranked list with columns Rank, ID, Name (eg, 50466 Jeffrey D. Ullman; n J. D. Ullman).]

Step 1: Blocking
- Many blocking methods can be applied
  - Sorted Window
  - Token-based
  - N-gram
- We applied Gravano (2003)'s sampling-based join approximation algorithm as a blocking method
- Comparison with other blocking methods => JCDL 2005

Step 2: Measuring Distance
- Supervised
  - Naïve Bayes Model: use Bayes' Theorem to measure similarity between two names
  - Support Vector Machine: use SVM classifiers
- Unsupervised
  - String-based Distance Metrics
    - TFIDF/Jaccard (token-based)
    - Jaro/JaroWinkler (edit distances)
  - Vector-based Cosine Distance
    - Cosine similarity
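The two steps above (blocking, then a more expensive distance inside each block) can be sketched as below. Several choices here are assumptions made for illustration only: the blocking key is the last name token (a simple token-based blocking, not the sampling-based join used in the talk), each component distance is a Jaccard distance, and the weights, threshold, and sample data are hypothetical.

```python
from collections import defaultdict

FIELDS = ("coauthors", "titles", "venues")

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| over two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def block_key(name):
    """Assumed blocking key: last token of the name, lowercased."""
    return name.replace(".", " ").split()[-1].lower()

def combined_distance(x, y, info, weights):
    """Weighted sum of coauthor, title, and venue distances."""
    return sum(w * jaccard_distance(info[x][f], info[y][f])
               for w, f in zip(weights, FIELDS))

def find_variants(X, Y, info, weights=(0.5, 0.3, 0.2), t=0.6):
    """Step 1: assign names to blocks. Step 2: compare only within a block."""
    blocks = defaultdict(lambda: ([], []))
    for x in X:
        blocks[block_key(x)][0].append(x)
    for y in Y:
        blocks[block_key(y)][1].append(y)
    variants = {}
    for xs, ys in blocks.values():
        for x in xs:
            matches = [y for y in ys
                       if combined_distance(x, y, info, weights) < t]
            if matches:
                variants[x] = matches
    return variants

# Hypothetical token sets per name, for illustration only.
info = {
    "Jeffrey D. Ullman": {"coauthors": ["hopcroft", "aho", "widom"],
                          "titles": ["database", "automata"],
                          "venues": ["sigmod", "vldb"]},
    "J. D. Ullman": {"coauthors": ["hopcroft", "aho"],
                     "titles": ["database", "compilers"],
                     "venues": ["sigmod"]},
    "Daniel Ullman": {"coauthors": ["propp", "pemantle"],
                      "titles": ["tilings", "graphs"],
                      "venues": ["jct"]},
    "Wei Wang": {"coauthors": ["yang", "yu"], "titles": ["mining"],
                 "venues": ["kdd"]},
    "Liwei Wang": {"coauthors": ["feng"], "titles": ["learning"],
                   "venues": ["nips"]},
}
result = find_variants(["Jeffrey D. Ullman", "Wei Wang"],
                       ["J. D. Ullman", "Daniel Ullman", "Liwei Wang"], info)
```

With this data only the pair Jeffrey D. Ullman / J. D. Ullman falls under the threshold. Names in different blocks are never compared, which is where the saving over the nested loop comes from.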

Policy Variations
- Data sets
- Blocking
- Measuring Distance

Configuration (eg, DBLP case)
- Authors x in X and authors y in Y
- Prepare an artificial name variant x' for K randomly chosen x (eg, K=00):
  - Abbreviation of the first name (85%): Ji-Woo K. Li => J. K. Li
  - Typo (15%): Ji-Woo K. Li => Ji-Woo K. Lee
  - x carries half of x's original citations
  - x' carries the other half
- Inject all x' into Y
- Test: for each author x in X, find the corresponding name variants x' in Y
- Evaluation metrics
  - Time
  - Accuracy
- Varying error types gave consistent results. For instance:
  - Name Abbreviation: 30%
  - Name Alternation: 30%
  - First Name Misspelling: %
  - Last Name Misspelling: %
  - Contraction: %
  - Middle Name Initial Omission: 4%
  - Combination: 0%

SC: Accuracy (DBLP)
[Figure: accuracy of TFIDF, Jaccard, and Jaro for each method (-N, -NN, -NC, -NH)]

SC: Accuracy (All data sets)
[Figure: accuracy of NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, and JaroWin on DBLP, e-print, BioMed, and EconPapers]

Related Work
- Identity / Entity Matching
- Database Join
- Record Linkage
- Merge / Purge
- Ontology Matching
- Graph Matching
- Name Authority Control Problem in LIS
Please see the paper for details.
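The artificial name variants described in the Configuration slide above could be generated along these lines. This is a hedged sketch: the 85% abbreviation share and the half/half citation split come from the slide, while the function names, the single-character substitution used as a "typo", and the seeded random generator are assumptions.

```python
import random
import string

def abbreviate_first_name(name):
    """Ji-Woo K. Li -> J. K. Li (first name reduced to its initial)."""
    first, _, rest = name.partition(" ")
    return f"{first[0]}. {rest}" if rest else name

def inject_typo(name, rng):
    """Replace one character of the last name with a different letter."""
    head, _, last = name.rpartition(" ")
    i = rng.randrange(len(last))
    choices = [c for c in string.ascii_lowercase if c != last[i].lower()]
    new = rng.choice(choices)
    if last[i].isupper():
        new = new.upper()
    typo = last[:i] + new + last[i + 1:]
    return f"{head} {typo}" if head else typo

def make_variant(name, citations, rng, p_abbrev=0.85):
    """Return (variant name, citations kept by x, citations moved to x')."""
    if rng.random() < p_abbrev:
        variant = abbreviate_first_name(name)
    else:
        variant = inject_typo(name, rng)
    half = len(citations) // 2
    return variant, citations[:half], citations[half:]

rng = random.Random(0)  # fixed seed so repeated runs give the same variants
variant, kept, moved = make_variant("Ji-Woo K. Li",
                                    ["c1", "c2", "c3", "c4"], rng)
```

The variant and its `moved` citations would then be injected into Y, and the test checks whether the original name in X finds it again.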

Future Work
- Using additional information of author names is essentially token comparison
- Better way: coauthor information as a graph
  - Graph matching / partitioning
  - Sub-graph detection

Conclusion
SC Problem
- Using additional information (eg, coauthors) rather than the name itself gives a better distance measure: -NC/-NH outperform -N/-NN
- SVM or Cosine shows the best accuracy (90-93%)

Backup Slides

Token vectors of coauthors of a citation Cj: R Pemantle, D Ullman
ID  Token     Weight
1   R         0.7
1   Pemantle  0.7
2   D         0.7
2   Ullman    0.7

Token vectors of coauthors from all citations of D. Ullman: D Ullman, D Maier, F Sadri, J Hopcroft
ID  Token     Weight
1   D         0.45
1   Ullman    0.89
2   D         0.45
2   Maier     0.89
3   F         0.7
3   Sadri     0.7
4   J         0.7
4   Hopcroft  0.7

[Join result table with columns Rid, Sid, Weight; surviving weight values: 0.95, 0.3]
[Ranking table with columns Rank, Citation ID, Similarity of coauthor string; citations Ci ... Cj ranked down to rank N]

Distance Metric: NBM
- Use Bayes' Theorem to measure the similarity between two author names
- Eg, to calculate the similarity between "Byung-Won On" and "On, B.-W.":
  - Training: estimate the probability of each coauthor of "Byung-Won On" using Bayes' rule
  - Testing: calculate the posterior probability of "On, B.-W." using the coauthor probability values of "Byung-Won On"

Distance Metric: SVM
- Preprocessing: all coauthor information of an author is transformed into a vector-space representation
- Training: given training examples of author-name pairs labeled either YES ("J. Ullman" & "Jeffrey D. Ullman") or NO ("J. Ullman" & "James Ullmann"), the SVM finds a maximum-margin hyperplane that separates the YES and NO examples
- Testing: the SVM classifies vectors by mapping them via the kernel trick to a high-dimensional space, where the two classes (equivalent pairs and different pairs) are separated by a hyperplane
- Kernel: Radial Basis Function (RBF)
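The training and testing steps on the NBM slide above might look like the following in code. This is a simplified sketch, not the model from the paper: it assumes whole coauthor names as features, add-one (Laplace) smoothing, and equal priors for all candidate authors. The second candidate author and the "Example" coauthors are hypothetical placeholders.

```python
from collections import Counter
from math import log

def train(coauthor_lists, vocab_size, alpha=1.0):
    """Estimate smoothed log P(coauthor | author) from an author's citations.

    coauthor_lists: one list of coauthor names per citation of the author.
    vocab_size:     number of distinct coauthor names across all authors.
    alpha:          Laplace smoothing constant (an assumed choice).
    """
    counts = Counter(c for coauthors in coauthor_lists for c in coauthors)
    total = sum(counts.values())

    def log_prob(coauthor):
        return log((counts[coauthor] + alpha) / (total + alpha * vocab_size))

    return log_prob

def score(model, coauthors):
    """Log-likelihood of a name's coauthors under one author's model."""
    return sum(model(c) for c in coauthors)

vocab = {"Dongwon Lee", "Jaewoo Kang", "Sanghyun Park",
         "P. Example", "Q. Example"}
models = {
    "Byung-Won On": train([["Dongwon Lee", "Jaewoo Kang"],
                           ["Dongwon Lee", "Sanghyun Park"]], len(vocab)),
    "Some Other Author": train([["P. Example", "Q. Example"]], len(vocab)),
}

# Coauthors observed with the variant "On, B.-W." (illustrative only).
test_coauthors = ["Dongwon Lee", "Sanghyun Park"]
best = max(models, key=lambda a: score(models[a], test_coauthors))
```

With equal priors, the author whose model gives the highest score is the most probable match for "On, B.-W.", here Byung-Won On.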

Distance Metric: String-based Distance Metrics
[Figure]

Distance Metric: Vector-based Cosine Distance
[Figure]

Accuracy w. error types [DBLP]
[Figure: accuracy of NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, and JaroWin for several blocking methods, including Token and 4-gram]
Error types:
- Name Abbreviation: 30%
- Name Alternation: 30%
- First Name Misspelling: %
- Last Name Misspelling: %
- Contraction: %
- Middle Name Initial Omission: 4%
- Combination: 0%

SC: Scalability (DBLP with k=)
[Figure: time (sec) of TFIDF, Jaccard, and Jaro for each method (-N, -NN, -NC, -NH)]

Distribution in top-10
[Figure: percentage (%) by rank for TFIDF, Jaccard, Jaro, JaroWin, and Cosine]

Step : Speed [DBLP]
[Figure: time (sec; logarithmic scale) of NBM, SVM, VSM, TFIDF, Jaccard, Jaro, and JaroWin for several blocking methods, including Token and 4-gram]

Overall: Accuracy [DBLP]
[Figure: accuracy of NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, and JaroWin for several blocking methods, including Token and 4-gram]

MC: Accuracy (DBLP)
[Figure]

Naïve Solution
- The MC problem can be naturally cast as a K-Clustering Problem: cluster N data points into K clusters
- Issue: the typical citation collection of an author in DBLP is short (ie, the author-citation graph exhibits a scale-free network), so N is too small to apply clustering

Name Disambiguation Algorithm
Borrow solutions from the RL community
- Step 1: Significantly reduce the number of candidate author names via blocking
- Step 2: Apply more expensive distance measures within each block

SC: Processing time for Step 2
[Figure: time (sec) of NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, and JaroWin on DBLP, e-print, BioMed, and EconPapers]