

Effective and Scalable Solutions for Mixed and Split Citation Problems in Digital Libraries
Dongwon Lee, Byung-Won On (Penn State University, USA); Jaewoo Kang (North Carolina State University, USA); Sanghyun Park (Yonsei University, Korea)
IQIS 2005

Outline
Motivation; Mixed Citation (MC) Problem; Split Citation (SC) Problem; Problem Definition; Our Approach; Preliminary Experimentation; Summary.

Motivation
Digital libraries (DLs) often contain many errors that negatively affect the quality of the DL, query results, user experience, and bibliometric research. We present specific problems that often occur in scientific-literature DLs.

Eg. 1: DBLP
Different authors' citations are mixed under the same name heading: the Mixed Citation (MC) Problem.

Eg. 2: ACM DL Portal; Eg. 2: DBLP
Jeffrey D. Stanford Univ.
The same author's citations are split into various name variants: the Split Citation (SC) Problem.

Eg. 3: CiteSeer & Google Scholar
Redundant citations with different formats co-exist in DLs.

Aftermath
DO: how to automatically identify mixed and split citation cases.
DON'T: how to handle the identified cases.
Aftermath, eg: the DL system becomes aware of the name variants so that it can:
  Change the user's query proactively.
    User => give me all citations of Jeffrey D. Ullman
    DL => would you like to see the citations of J. D. Ullman too?
  Or simply consolidate the two citations in storage and update the index properly.

Mixed Citation Problem
Given a collection of citations (C) by an author (ai), can we identify false citations by another author (aj), when ai and aj have identical name spellings (i.e., homonyms)?

Solution: Citation Labeling Algorithm
Idea: for each citation in the collection, test whether the citation really belongs to the given collection.
  Remove an author ai from the citation ci.
  Guess back the removed author name using associated information.
  If the guessed name <> removed name, then the citation ci is a false citation.

Citation Labeling Algorithm (example)
[Diagram: the citation "J. Propp, Daniel Ullman"; candidate authors J. D. Ullman (X, Y, Z), Liwei Wang (X`, Y`, Z`), and Daniel Ullman (X``, Y``, Z``); Gravano (2003)'s sampling-based join approximation; labeling; false citation.]
Similarity: measure the similarity between a citation C and an author A.
  X: token vectors of coauthors from all the citations of the author
  Y: token vectors of paper titles from all the citations of the author
  Z: token vectors of venues from all the citations of the author
  cc: token vectors of coauthors of the citation c
  ac: token vectors of coauthors from all citations of the author a
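The remove-and-guess-back idea of the Citation Labeling Algorithm can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the tokenizer, the unweighted sum of per-field cosine similarities, the dictionary layout of a citation (`coauthors`, `title`, `venue`, with the collection owner already removed from `coauthors`), and all function names are hypothetical. Candidate authors are simply every key of `collections`; the slides instead obtain candidates with Gravano (2003)'s sampling-based join approximation.

```python
from collections import Counter
from math import sqrt


def tokens(text):
    """Lower-case whitespace tokenizer (an assumption; the slides do not specify one)."""
    return text.lower().replace(",", " ").replace(".", " ").split()


def cosine(u, v):
    """Cosine similarity between two bag-of-token Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def profile(citations):
    """Aggregate coauthor (X), title (Y), and venue (Z) tokens over an author's citations."""
    prof = {"coauthors": Counter(), "title": Counter(), "venue": Counter()}
    for c in citations:
        for field in prof:
            prof[field].update(tokens(c[field]))
    return prof


def similarity(citation, prof):
    """Unweighted sum of per-field cosine similarities (equal weights are an assumption)."""
    return sum(cosine(Counter(tokens(citation[f])), prof[f]) for f in prof)


def label_false_citations(owner, collections):
    """Return the citations in owner's collection whose guessed author is not owner."""
    false = []
    for i, c in enumerate(collections[owner]):
        best_name, best_score = None, -1.0
        for name, cites in collections.items():
            # For the owner, leave the citation under test out of the profile.
            pool = [x for j, x in enumerate(cites) if not (name == owner and j == i)]
            score = similarity(c, profile(pool))
            if score > best_score:
                best_name, best_score = name, score
        if best_name != owner:
            false.append(c)
    return false


# Toy data, invented for illustration only.
collections = {
    "A": [
        {"coauthors": "hopcroft aho", "title": "automata theory", "venue": "stoc"},
        {"coauthors": "aho sethi", "title": "compiler automata", "venue": "stoc"},
        {"coauthors": "propp", "title": "combinatorial games", "venue": "jct"},
    ],
    "B": [
        {"coauthors": "propp pemantle", "title": "combinatorial tilings", "venue": "jct"},
        {"coauthors": "pemantle", "title": "random games", "venue": "jct"},
    ],
}
suspects = label_false_citations("A", collections)
```

In this toy collection the third citation under "A" shares coauthor, title, and venue tokens only with "B", so it is the one flagged as a false citation.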

Configuration
Data sets:
  DBLP (real examples): Chen Li, Dongwon Lee, Prasenjit Mitra, Wei Liu, and Wei Wang
  EconPapers: inject artificial false citations into each author's citation collection
Types of token vectors: set, bag
Evaluation metrics:
  Time (scalability)
  Percentage/rank ratio (accuracy): measure what percentage of false citations are ranked in the bottom 0%, 0%, etc.

[Figure: MC: Scalability (EconPapers)]
[Figure: MC: Accuracy (EconPapers)]

Split Citation (SC) Problem
Given two lists of author names, X and Y, for each author name x (∈ X), find a set of author names y1, y2, ..., yn (∈ Y) such that both x and yi (1 ≤ i ≤ n) are variants.
  Treating each name as a tuple: = Database Join Problem
  Treating each name as a record: = Record Linkage Problem
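The percentage/rank ratio metric from the Configuration slide can be illustrated with a small Python sketch. The function name, the argument layout, and the choice to rank citations by ascending similarity score are assumptions made for this example; the slides state only that the metric measures how many false citations land in the bottom portion of the ranking.

```python
def bottom_fraction(scores, is_false, fraction):
    """Share of false citations that fall in the lowest-scoring `fraction` of a collection.

    scores   -- similarity of each citation to the collection owner (higher = more likely genuine)
    is_false -- parallel list of booleans marking the injected false citations
    fraction -- size of the bottom slice, e.g. 0.2 for the bottom fifth
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])  # ascending similarity
    cutoff = int(len(scores) * fraction)
    bottom = set(order[:cutoff])
    total_false = sum(is_false)
    if total_false == 0:
        return 0.0
    hits = sum(1 for i in bottom if is_false[i])
    return hits / total_false


# Toy scores, invented for illustration: three false citations among ten.
scores = [0.9, 0.1, 0.8, 0.2, 0.7, 0.6, 0.5, 0.4, 0.3, 0.05]
is_false = [False, True, False, True, False, False, False, False, False, True]
ratio = bottom_fraction(scores, is_false, 0.2)
```

A value close to 1.0 for a small bottom slice means the similarity measure pushes nearly all false citations to the end of the ranking.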

Naïve Solution
  For each author name x in X
    For each author name y in Y
      If x ~ y (dist(x,y) < t), name variant!
DB: Nested Loop Join. RL: Pair-wise Record Match. Cost: O(|X||Y|).

Challenge 1: Scalability
O(|X||Y|) is too costly.
Solutions: DB: Hashed Join; RL: Blocking.
  For each name x in X
    Assign x to block b (∈ B)
  For each name y in Y
    Assign y to block b (∈ B)
  For each block b (∈ B)
    Do naïve-solution
O(|X| + |Y| + |B| a) << O(|X||Y|)

Table columns: DL, Domain, Size (in M)
  ISI/SCI: General Sciences
  CAS: Chemistry
  Medline/PubMed: Life Science
  CiteSeer: General Sciences/Engineering
  arXiv: Physics/Math
  SPIRES HEP: Physics
  DBLP: CompSci
  CSB: CompSci

Challenge 2: Distance
Diverse name variations:
  Jeffrey D. Ullman / J. Ullman
  Alon Y. Levy / Halevy, A.
  W. Wang / X. Wang
  Sean Engelson / Shlomo Argamon
Solution: look at additional information of the author names, eg, coauthor list, keywords used in titles, venues to submit, year, affiliation.
  dist(x,y) ~ W_i * dist(c(x),c(y)) + W_j * dist(t(x),t(y)) + W_k * dist(v(x),v(y))

Name Disambiguation Algorithm
[Diagram: names 1 (Jeffrey Ullman) through m (Wei Wang) each get a block of candidate names (Wei Wang's block includes W. Wang and Liwei Wang; Jeffrey Ullman's block includes Jeffrey D. Ullman and J. D. Ullman); distances are measured within each block to produce a ranked list (Rank, ID, Name).]

Step 1: Blocking
Many blocking methods can be applied: Sorted Window, Token-based, N-gram.
We applied Gravano (2003)'s sampling-based join approximation algorithm as a blocking method.
Comparison with other blocking methods => JCDL 2005

Step 2: Measuring Distance
Supervised:
  Naïve Bayes Model: use Bayes' theorem to measure similarity between two names
  Support Vector Machine: use SVM classifiers
Unsupervised:
  String-based distance metrics: TFIDF/Jaccard (token-based), Jaro/JaroWinkler (edit distances)
  Vector-based cosine distance: cosine similarity
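The contrast between the naïve nested loop and blocking, followed by the weighted distance over coauthors c(.), titles t(.), and venues v(.), can be sketched as follows. This is a minimal illustration with assumptions stated up front: it uses token-based blocking (one of the alternatives listed on the Step 1 slide) rather than the sampling-based join approximation the authors applied, it plugs Jaccard distance into every term, and the equal weights, the `info` dictionary layout, and all function names are hypothetical.

```python
from collections import defaultdict
from itertools import product


def block_keys(name):
    """Token-based blocking keys: every name token longer than one character."""
    cleaned = name.lower().replace(".", " ").replace(",", " ")
    return {t for t in cleaned.split() if len(t) > 1}


def naive_pairs(X, Y):
    """Nested-loop candidate generation: every (x, y) pair, O(|X||Y|)."""
    return list(product(X, Y))


def blocked_pairs(X, Y):
    """Compare x and y only when they share at least one block."""
    blocks = defaultdict(lambda: (set(), set()))
    for x in X:
        for key in block_keys(x):
            blocks[key][0].add(x)
    for y in Y:
        for key in block_keys(y):
            blocks[key][1].add(y)
    pairs = set()
    for xs, ys in blocks.values():
        pairs.update(product(xs, ys))  # the naive solution, run inside one block
    return pairs


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| over token sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def combined_distance(x, y, info, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of coauthor, title, and venue distances (equal weights assumed)."""
    w_c, w_t, w_v = weights
    return (w_c * jaccard_distance(info[x]["coauthors"], info[y]["coauthors"])
            + w_t * jaccard_distance(info[x]["title"], info[y]["title"])
            + w_v * jaccard_distance(info[x]["venue"], info[y]["venue"]))


X = ["Jeffrey D. Ullman", "Wei Wang"]
Y = ["J. D. Ullman", "W. Wang", "Liwei Wang"]
candidates = blocked_pairs(X, Y)
```

With these five names the naïve loop produces six pairs while blocking keeps three, and only those three would be passed to the more expensive `combined_distance`.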

Policy Variations
Data sets; Blocking; Measuring Distance.

Configuration (eg, DBLP case)
Authors x in X and authors y in Y.
Prepare an artificial name variant x' for K randomly chosen x (eg, K=00):
  Abbreviation of the first name (85%): Ji-Woo K. Li => J. K. Li
  Typo (5%): Ji-Woo K. Li => Ji-Woo K. Lee
x keeps half of its original citations; x' carries the other half.
Inject all x' into Y.
Test: for each author x in X, find the corresponding name variant x' in Y.
Evaluation metrics: time, accuracy.

Varying error types gave consistent results. For instance:
  Name Abbreviation: 30%
  Name Alternation: 30%
  First Name Misspelling: %
  Last Name Misspelling: %
  Contraction: %
  Middle Name Initial Omission: 4%
  Combination: 0%

[Figure: SC: Accuracy (DBLP). y-axis: Accuracy; series: TFIDF, Jaccard, Jaro; x-axis: Method (-N, -NN, -NC, -NH).]
[Figure: SC: Accuracy (All data sets). y-axis: Accuracy; data sets: DBLP, e-print, BioMed, EconPapers; methods: NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin.]

Related Work
Identity / Entity Matching; Database Join; Record Linkage; Merge / Purge; Ontology Matching; Graph Matching; Name Authority Control Problem in LIS.
Please see the paper for details.
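The test-set construction described in the Configuration slide (pick K authors, create a name variant, move half of the citations to the variant, inject the variant into Y) can be sketched like this. It is an illustration under assumptions: only the first-name abbreviation is modelled (typo variants are left out), the seed and function names are hypothetical, and X and Y are assumed to be dictionaries mapping an author name to a list of citation identifiers.

```python
import random


def abbreviate_first_name(name):
    """Abbreviate the first name to its initial, as in the slide's example."""
    first, *rest = name.split()
    return " ".join([first[0] + "."] + rest)


def split_citations(citations):
    """Split a citation list into two halves (the first half gets the smaller share)."""
    half = len(citations) // 2
    return citations[:half], citations[half:]


def inject_variants(X, Y, k, seed=0):
    """Create variants for k randomly chosen authors in X and inject them into Y.

    X and Y are modified in place. Returns the ground truth {original name: variant}.
    """
    rng = random.Random(seed)
    truth = {}
    for x in rng.sample(sorted(X), k):
        variant = abbreviate_first_name(x)
        kept, moved = split_citations(X[x])
        X[x] = kept                               # x keeps one half
        Y.setdefault(variant, []).extend(moved)   # the variant carries the other half
        truth[x] = variant
    return truth


# Toy data; "Ann B. Cho" and the citation ids are invented for illustration.
X = {"Ji-Woo K. Li": ["c1", "c2", "c3", "c4"], "Ann B. Cho": ["c5", "c6"]}
Y = {}
truth = inject_variants(X, Y, k=2)
```

The returned `truth` mapping is what an accuracy measurement would compare against: for each x in X, does the method rank the injected variant among its top candidates in Y?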

Future Work
Using additional information of the author name is essentially token comparison.
Better way: treat coauthor information as a graph.
  Graph matching / partitioning
  Sub-graph detection

Conclusion
SC Problem:
  Using additional information (eg, coauthors) rather than the name itself gives a better distance measure.
  -NC/-NH outperform -N/-NN.
  SVM or Cosine shows the best accuracy (90-93%).

Backup Slides

[Example: token vectors of the coauthors of a citation Cj ("R Pemantle, D Ullman") and token vectors of the coauthors from all citations of D. Ullman ("D Ullman, D Maier, F Sadri, J Hopcroft"), each stored as (ID, Token, Weight) rows; a (Rid, Sid, Weight) table; and a ranked list of (Rank, Citation ID, Similarity of coauthor string).]

Distance Metric: NBM
Use Bayes' theorem to measure the similarity between two author names. To calculate the similarity between "Byung-Won On" and "On, B.-W.":
  Training: estimate the probability per coauthor of "Byung-Won On" in terms of Bayes' rule.
  Testing: calculate the posterior probability of "On, B.-W." with the coauthor probability values of "Byung-Won On".

Distance Metric: SVM
  Preprocessing: all coauthor information of an author is transformed into a vector-space representation.
  Training: given training examples of author names labeled either YES ("J. Ullman" & "Jeffrey D. Ullman") or NO ("J. Ullman" & "James Ullmann"), the SVM creates a maximum-margin hyperplane that splits the YES/NO training examples.
  Testing: the SVM classifies vectors by mapping them via the kernel trick to a high-dimensional space (the two classes, equivalent pairs and different ones, are separated by a hyperplane).
  Kernel: Radial Basis Function (RBF) kernel.
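The NBM training and testing steps can be illustrated with a small multinomial Naïve Bayes over coauthor names. This is a sketch under assumptions, not the model from the paper: the add-one (Laplace) smoothing, the citation-count prior, the input layout (each author maps to a list of coauthor lists, one per citation, assumed non-empty), and every function name are choices made for this example.

```python
from collections import Counter
from math import log


def train(citations_by_author):
    """Estimate a prior and coauthor counts for every known author name."""
    vocab = {co for cites in citations_by_author.values() for cite in cites for co in cite}
    total = sum(len(cites) for cites in citations_by_author.values())
    model = {}
    for name, cites in citations_by_author.items():
        counts = Counter(co for cite in cites for co in cite)
        model[name] = (log(len(cites) / total), counts, sum(counts.values()))
    return model, vocab


def log_posterior(name, coauthors, model, vocab):
    """Unnormalised log posterior of `name` given the query's coauthors."""
    log_prior, counts, n = model[name]
    v = len(vocab)
    # Add-one smoothing keeps unseen coauthors from zeroing out the product.
    return log_prior + sum(log((counts[co] + 1) / (n + v)) for co in coauthors)


def most_likely(coauthors, model, vocab):
    """Known author name whose coauthor distribution best explains the query."""
    return max(model, key=lambda name: log_posterior(name, coauthors, model, vocab))


# Toy training data; coauthor labels c1..c4 and "Other Author" are invented.
training = {
    "Byung-Won On": [["c1", "c2"], ["c1"]],
    "Other Author": [["c3", "c4"], ["c4"]],
}
model, vocab = train(training)
# Coauthors observed on the citations listed under the variant "On, B.-W."
guess = most_likely(["c1", "c2"], model, vocab)
```

Here the variant's coauthors overlap only with those seen for "Byung-Won On", so that name receives the higher posterior.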

Distance Metric: String-based Distance Metrics

Distance Metric: Vector-based Cosine Distance

[Figure: Accuracy w. error types [DBLP]. y-axis: Accuracy; series: TFIDF, Jaccard, Jaro; x-axis: Method (N, -NN, -NC, -NH). Error types: Name Abbreviation: 30%, Name Alternation: 30%, First Name Misspelling: %, Last Name Misspelling: %, Contraction: %, Middle Name Initial Omission: 4%, Combination: 0%.]
[Figure: SC: Scalability (DBLP with k=). y-axis: Time (sec); blocking: Token, 4-gram; methods: NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin.]
[Figure: Distribution in top-0. Series: TFIDF, Jaccard, Jaro, JaroWin, Cosine; x-axis: Rank.]
[Figure: Step : Speed [DBLP]. y-axis: Time (sec; logarithmic scale); blocking: Token, 4-gram; methods: NBM, SVM, VSM, TFIDF, Jaccard, Jaro, JaroWin.]
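Two of the string-based metrics named in the deck, Jaccard (token-based) and Jaro, can be written out in plain Python as below. These follow the standard textbook definitions and are given only as a sketch; the tokenization of names into lower-cased whitespace tokens is an assumption, and the slides do not show which implementation the authors used.

```python
def jaccard_similarity(a, b):
    """Token-based Jaccard similarity: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def jaro_similarity(s, t):
    """Jaro similarity between two strings (1.0 = identical, 0.0 = nothing in common)."""
    if s == t:
        return 1.0
    ls, lt = len(s), len(t)
    if ls == 0 or lt == 0:
        return 0.0
    window = max(max(ls, lt) // 2 - 1, 0)  # how far apart two matching characters may be
    s_matched = [False] * ls
    t_matched = [False] * lt
    matches = 0
    for i in range(ls):
        lo = max(0, i - window)
        hi = min(lt, i + window + 1)
        for j in range(lo, hi):
            if not t_matched[j] and s[i] == t[j]:
                s_matched[i] = t_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched characters that appear in a different order in the two strings.
    k = 0
    out_of_order = 0
    for i in range(ls):
        if s_matched[i]:
            while not t_matched[k]:
                k += 1
            if s[i] != t[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (matches / ls + matches / lt + (matches - transpositions) / matches) / 3


name_tokens_x = "jeffrey d ullman".split()
name_tokens_y = "j d ullman".split()
token_score = jaccard_similarity(name_tokens_x, name_tokens_y)
char_score = jaro_similarity("MARTHA", "MARHTA")
```

Jaccard treats the names as unordered token sets, so an abbreviated first name costs a whole token, while Jaro works character by character and tolerates small transpositions such as "TH" versus "HT".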

[Figure: Overall: Accuracy [DBLP]. y-axis: Accuracy (%); blocking: Token, 4-gram; methods: NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin.]
[Figure: MC: Accuracy (DBLP)]

Naïve Solution
The MC problem can be naturally cast as a K-clustering problem: cluster N data points into K clusters.
Issue: the typical citation collection of an author in DBLP is short (ie, the author-citation graph exhibits a scale-free network), so N is too small to apply clustering.

Name Disambiguation Algorithm
Borrow solutions from the RL community.
  Step 1: significantly reduce the number of author names to compare via blocking.
  Step 2: apply more expensive distance measures within each block.

[Figure: SC: Processing time for Step . y-axis: Time (sec); data sets: DBLP, e-print, BioMed, EconPapers; methods: NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin.]


More information

Support Vector Machines + Classification for IR

Support Vector Machines + Classification for IR Support Vector Machines + Classification for IR Pierre Lison University of Oslo, Dep. of Informatics INF3800: Søketeknologi April 30, 2014 Outline of the lecture Recap of last week Support Vector Machines

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 9: Data Mining (3/4) March 8, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These slides

More information

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification

CAMCOS Report Day. December 9 th, 2015 San Jose State University Project Theme: Classification CAMCOS Report Day December 9 th, 2015 San Jose State University Project Theme: Classification On Classification: An Empirical Study of Existing Algorithms based on two Kaggle Competitions Team 1 Team 2

More information

SUPPORT VECTOR MACHINES

SUPPORT VECTOR MACHINES SUPPORT VECTOR MACHINES Today Reading AIMA 18.9 Goals (Naïve Bayes classifiers) Support vector machines 1 Support Vector Machines (SVMs) SVMs are probably the most popular off-the-shelf classifier! Software

More information

Jure Leskovec, Cornell/Stanford University. Joint work with Kevin Lang, Anirban Dasgupta and Michael Mahoney, Yahoo! Research

Jure Leskovec, Cornell/Stanford University. Joint work with Kevin Lang, Anirban Dasgupta and Michael Mahoney, Yahoo! Research Jure Leskovec, Cornell/Stanford University Joint work with Kevin Lang, Anirban Dasgupta and Michael Mahoney, Yahoo! Research Network: an interaction graph: Nodes represent entities Edges represent interaction

More information

Performance Improvement of Hardware-Based Packet Classification Algorithm

Performance Improvement of Hardware-Based Packet Classification Algorithm Performance Improvement of Hardware-Based Packet Classification Algorithm Yaw-Chung Chen 1, Pi-Chung Wang 2, Chun-Liang Lee 2, and Chia-Tai Chan 2 1 Department of Computer Science and Information Engineering,

More information

Classification of Tweets using Supervised and Semisupervised Learning

Classification of Tweets using Supervised and Semisupervised Learning Classification of Tweets using Supervised and Semisupervised Learning Achin Jain, Kuk Jang I. INTRODUCTION The goal of this project is to classify the given tweets into 2 categories, namely happy and sad.

More information

Lecture 10: Support Vector Machines and their Applications

Lecture 10: Support Vector Machines and their Applications Lecture 10: Support Vector Machines and their Applications Cognitive Systems - Machine Learning Part II: Special Aspects of Concept Learning SVM, kernel trick, linear separability, text mining, active

More information

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts.

Keywords APSE: Advanced Preferred Search Engine, Google Android Platform, Search Engine, Click-through data, Location and Content Concepts. Volume 5, Issue 3, March 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Advanced Preferred

More information

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS

TRIE BASED METHODS FOR STRING SIMILARTIY JOINS TRIE BASED METHODS FOR STRING SIMILARTIY JOINS Venkat Charan Varma Buddharaju #10498995 Department of Computer and Information Science University of MIssissippi ENGR-654 INFORMATION SYSTEM PRINCIPLES RESEARCH

More information

A Feature Selection Method to Handle Imbalanced Data in Text Classification

A Feature Selection Method to Handle Imbalanced Data in Text Classification A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University

More information

ART 알고리즘특강자료 ( 응용 01)

ART 알고리즘특강자료 ( 응용 01) An Adaptive Intrusion Detection Algorithm Based on Clustering and Kernel-Method ART 알고리즘특강자료 ( 응용 01) DB 및데이터마이닝연구실 http://idb.korea.ac.kr 2009 년 05 월 01 일 1 Introduction v Background of Research v In

More information

A hybrid method to categorize HTML documents

A hybrid method to categorize HTML documents Data Mining VI 331 A hybrid method to categorize HTML documents M. Khordad, M. Shamsfard & F. Kazemeyni Electrical & Computer Engineering Department, Shahid Beheshti University, Iran Abstract In this paper

More information

Support Vector Machines

Support Vector Machines Support Vector Machines About the Name... A Support Vector A training sample used to define classification boundaries in SVMs located near class boundaries Support Vector Machines Binary classifiers whose

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

Text Categorization (I)

Text Categorization (I) CS473 CS-473 Text Categorization (I) Luo Si Department of Computer Science Purdue University Text Categorization (I) Outline Introduction to the task of text categorization Manual v.s. automatic text categorization

More information

Column Stores vs. Row Stores How Different Are They Really?

Column Stores vs. Row Stores How Different Are They Really? Column Stores vs. Row Stores How Different Are They Really? Daniel J. Abadi (Yale) Samuel R. Madden (MIT) Nabil Hachem (AvantGarde) Presented By : Kanika Nagpal OUTLINE Introduction Motivation Background

More information

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17 Announcement CompSci 516 Database Systems Lecture 10 Query Evaluation and Join Algorithms Project proposal pdf due on sakai by 5 pm, tomorrow, Thursday 09/27 One per group by any member Instructor: Sudeepa

More information

Table Of Contents: xix Foreword to Second Edition

Table Of Contents: xix Foreword to Second Edition Data Mining : Concepts and Techniques Table Of Contents: Foreword xix Foreword to Second Edition xxi Preface xxiii Acknowledgments xxxi About the Authors xxxv Chapter 1 Introduction 1 (38) 1.1 Why Data

More information

Data Structures and Algorithms

Data Structures and Algorithms Data Structures and Algorithms Autumn 2018-2019 Outline Sorting Algorithms (contd.) 1 Sorting Algorithms (contd.) Quicksort Outline Sorting Algorithms (contd.) 1 Sorting Algorithms (contd.) Quicksort Quicksort

More information

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci

Part 12: Advanced Topics in Collaborative Filtering. Francesco Ricci Part 12: Advanced Topics in Collaborative Filtering Francesco Ricci Content Generating recommendations in CF using frequency of ratings Role of neighborhood size Comparison of CF with association rules

More information

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure

Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Document Clustering using Feature Selection Based on Multiviewpoint and Link Similarity Measure Neelam Singh neelamjain.jain@gmail.com Neha Garg nehagarg.february@gmail.com Janmejay Pant geujay2010@gmail.com

More information

Templates. for scalable data analysis. 2 Synchronous Templates. Amr Ahmed, Alexander J Smola, Markus Weimer. Yahoo! Research & UC Berkeley & ANU

Templates. for scalable data analysis. 2 Synchronous Templates. Amr Ahmed, Alexander J Smola, Markus Weimer. Yahoo! Research & UC Berkeley & ANU Templates for scalable data analysis 2 Synchronous Templates Amr Ahmed, Alexander J Smola, Markus Weimer Yahoo! Research & UC Berkeley & ANU Running Example Inbox Spam Running Example Inbox Spam Spam Filter

More information

Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning

Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning. Supervised vs. Unsupervised Learning Overview T7 - SVM and s Christian Vögeli cvoegeli@inf.ethz.ch Supervised/ s Support Vector Machines Kernels Based on slides by P. Orbanz & J. Keuchel Task: Apply some machine learning method to data from

More information

Support vector machine (II): non-linear SVM. LING 572 Fei Xia

Support vector machine (II): non-linear SVM. LING 572 Fei Xia Support vector machine (II): non-linear SVM LING 572 Fei Xia 1 Linear SVM Maximizing the margin Soft margin Nonlinear SVM Kernel trick A case study Outline Handling multi-class problems 2 Non-linear SVM

More information

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems Exploiting Context Analysis for Combining Multiple Entity Resolution Systems Zhaoqi Chen Microsoft Corporation Redmond, USA schen@microsoft.com Dmitri V. Kalashnikov Dept. of Computer Science University

More information

String Vector based KNN for Text Categorization

String Vector based KNN for Text Categorization 458 String Vector based KNN for Text Categorization Taeho Jo Department of Computer and Information Communication Engineering Hongik University Sejong, South Korea tjo018@hongik.ac.kr Abstract This research

More information

Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task. Junjun Wang 2013/4/22

Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task. Junjun Wang 2013/4/22 Understanding the Query: THCIB and THUIS at NTCIR-10 Intent Task Junjun Wang 2013/4/22 Outline Introduction Related Word System Overview Subtopic Candidate Mining Subtopic Ranking Results and Discussion

More information

The FreeSearch System

The FreeSearch System Wolfgang Nejdl 03/05/12 1 The FreeSearch System Search engine for digital libraries Simple to use interface Intuitive functionalities Easily scalable Now with focus on Duplicate detection and duplicate

More information

Correlation Based Feature Selection with Irrelevant Feature Removal

Correlation Based Feature Selection with Irrelevant Feature Removal Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 4, April 2014,

More information

A Learning Method for Entity Matching

A Learning Method for Entity Matching A Learning Method for Entity Matching Jie Chen Cheqing Jin Rong Zhang Aoying Zhou Shanghai Key Laboratory of Trustworthy Computing, Software Engineering Institute, East China Normal University, China 5500002@ecnu.cn,

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information