Effective and Scalable Solutions for Mixed and Split Citation Problems in Digital Libraries
Transcription
Effective and Scalable Solutions for Mixed and Split Citation Problems in Digital Libraries
Dongwon Lee, Byung-Won On (Penn State University, USA)
Jaewoo Kang (North Carolina State University, USA)
Sanghyun Park (Yonsei University, Korea)
IQIS 2005

Outline
- Motivation
- Mixed Citation (MC) Problem
- Split Citation (SC) Problem
- Problem Definition
- Our approach
- Preliminary Experimentation
- Summary

Motivation
Digital Libraries (DL) often have many errors that negatively affect:
- Quality of DL
- Query results
- User experiences
- Bibliometric research
We present specific problems that often occur in scientific literature DLs.

Eg. 1: DBLP
Different authors' citations are mixed under the same name heading: the Mixed Citation (MC) Problem.

Eg. 2: ACM DL Portal
[Figure: ACM DL Portal screenshot; legible text: "Jeffrey D.", "Stanford Univ."]
The same author's citations are split into various name variants: the Split Citation (SC) Problem.
Eg. 3: CiteSeer & Google Scholar
Redundant citations with different formats co-exist in DLs.

Aftermath
- DO: how to automatically identify mixed and split citation cases
- DON'T: how to handle the identified cases
- Aftermath, e.g.: the DL system becomes aware of the name variants so that it can:
  - Change the user's query proactively
    User => "Give me all citations of Jeffrey D. Ullman"
    DL => "Would you like to see the citations of J. D. Ullman too?"
  - Or simply consolidate the two citations in storage and update the index properly

Mixed Citation Problem
Given a collection of citations (C) by an author (a_i), can we identify false citations by another author (a_j), when a_i and a_j have identical name spellings (i.e., homonyms)?
Solution: Citation Labeling Algorithm

Citation Labeling Algorithm
Idea: for each citation in the collection, test whether the citation really belongs to the given collection.
1. Remove an author a_i from the citation c_i.
2. Guess back the removed author name using associated information.
3. If the guessed name <> the removed name, then the citation c_i is a false citation.

Citation Labeling Algorithm (example)
[Figure: the citation "J. Propp, Daniel Ullman" (labels "Name Disambiguation", "SIG 3") is compared against candidate authors J. D. Ullman (X, Y, Z), Liwei Wang (X', Y', Z'), and Daniel Ullman (X'', Y'', Z''); the labeling step contrasts Daniel Ullman with J. D. Ullman and marks a false citation.]
- X: token vectors of coauthors from all the citations of the author
- Y: token vectors of paper titles from all the citations of the author
- Z: token vectors of venues from all the citations of the author
- Similarity: measure the similarity between a citation C and an author A
  - c_c: token vectors of coauthors of the citation c
  - a_c: token vectors of coauthors from all citations of the author a
- Gravano (2003)'s sampling-based join approximation
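The remove-and-guess-back idea of the Citation Labeling Algorithm can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the authors' implementation: only coauthor tokens are used (titles and venues are omitted), raw token counts stand in for whatever weighting the slides use, every candidate is scored directly instead of being pre-filtered by the sampling-based join approximation, and the names `tokens`, `cosine`, `label_citation`, and `find_false_citations` are hypothetical.

```python
from collections import Counter
from math import sqrt


def tokens(names):
    """Lower-cased word tokens of a list of name strings (dots dropped)."""
    return [t for n in names for t in n.lower().replace(".", " ").split()]


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def label_citation(citation_coauthors, profiles):
    """Guess back the removed author: the candidate whose coauthor profile
    is most similar to the remaining coauthors of the citation."""
    c = Counter(tokens(citation_coauthors))
    return max(profiles, key=lambda author: cosine(c, profiles[author]))


def find_false_citations(owner, citations, profiles):
    """Indices of citations whose guessed author differs from the owner."""
    return [i for i, coauthors in enumerate(citations)
            if label_citation(coauthors, profiles) != owner]


# Toy data (invented for illustration): coauthor profiles per candidate author.
profiles = {
    "J. D. Ullman": Counter(tokens(["H. Garcia-Molina", "J. Widom", "A. Aho"])),
    "Daniel Ullman": Counter(tokens(["J. Propp", "R. Pemantle"])),
}
# Citations filed under "J. D. Ullman", each with that author already removed.
citations = [["J. Widom", "H. Garcia-Molina"], ["J. Propp"]]
suspects = find_false_citations("J. D. Ullman", citations, profiles)
```

In this toy run the second citation is labeled "Daniel Ullman" and is therefore reported as a false citation. A fuller version would also exclude the tested citation from the owner's own profile before scoring.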
Configuration
- Data sets
  - DBLP (real examples): Chen Li, Dongwon Lee, Prasenjit Mitra, Wei Liu, and Wei Wang
  - EconPapers: inject artificial false citations into each author's citation collection
- Types of token vectors: Set, Bag
- Evaluation metrics
  - Time (Scalability)
  - Percentage/rank ratio (Accuracy): measure what percentage of false citations are ranked in the bottom 0%, 0%, etc.

MC: Scalability (EconPapers)
[Figure: chart not recoverable]

MC: Accuracy (EconPapers)
[Figure: chart not recoverable]

Split Citation (SC) Problem
Given two lists of author names, X and Y, for each author name x (∈ X), find a set of author names y_1, y_2, ..., y_n (∈ Y) such that x and each y_i (1 ≤ i ≤ n) are variants.
- Treating each name as a tuple: = Database Join Problem
- Treating each name as a record: = Record Linkage Problem
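The percentage/rank ratio metric can be read as: rank every citation in a collection by its similarity to the owning author, then check how many of the injected false citations land near the bottom. The following Python sketch shows one way to compute that under this reading; the function name `bottom_fraction_recall`, the score dictionary, and the cut-off values are assumptions for illustration, not values from the slides.

```python
def bottom_fraction_recall(scores, false_ids, fraction):
    """Share of known false citations found in the lowest-scoring `fraction`
    of the ranked list.

    scores    -- dict: citation id -> similarity to the owning author
    false_ids -- set of citation ids known to be false (injected)
    fraction  -- size of the bottom slice, e.g. 0.2 for the bottom 20%
    """
    ranked = sorted(scores, key=scores.get)           # least similar first
    cutoff = max(1, int(len(ranked) * fraction))      # at least one citation
    bottom = set(ranked[:cutoff])
    return len(bottom & false_ids) / len(false_ids)


# Toy collection of ten citations; c8 and c9 are the injected false ones.
scores = {f"c{i}": 0.9 - 0.05 * i for i in range(8)}
scores.update({"c8": 0.10, "c9": 0.05})
injected = {"c8", "c9"}
recall_20 = bottom_fraction_recall(scores, injected, 0.2)
recall_10 = bottom_fraction_recall(scores, injected, 0.1)
```

Here both false citations fall in the bottom two positions, so the bottom-20% value is 1.0, while the bottom-10% slice holds only one citation and yields 0.5.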
Naïve Solution
For each author name x in X
  For each author name y in Y
    If x ~ y (i.e., dist(x, y) < t), name variant!
- DB: Nested Loop Join
- RL: Pair-wise Record Match
- Cost: O(|X||Y|)

Challenge 1: Scalability
O(|X||Y|) is too costly.

DL | Domain
ISI/SCI | General Sciences
CAS | Chemistry
Medline/PubMed | Life Science
CiteSeer | General Sciences/Engineering
arXiv | Physics/Math
SPIRES HEP | Physics
DBLP | CompSci
CSB | CompSci
Size (in M): [values missing]

Solutions
- DB: Hashed Join
- RL: Blocking
For each name x in X
  Assign x to block b (∈ B)
For each name y in Y
  Assign y to block b (∈ B)
For each block b (∈ B)
  Do naïve-solution
O(|X| + |Y| + |B|a) << O(|X||Y|)

Challenge 2: Distance
Diverse name variations:
- Jeffrey D. Ullman and J. Ullman
- Alon Y. Levy and Halevy, A.
- W. Wang and X. Wang
- Sean Engelson and Shlomo Argamon
Solution: look at additional information of the author names, e.g.:
- Coauthor list
- Keywords used in title
- Venues to submit
- Year
- Affiliation
dist(x, y) ≈ W_i * dist(c(x), c(y)) + W_j * dist(t(x), t(y)) + W_k * dist(v(x), v(y))
where c, t, and v denote the coauthor, title, and venue information of a name.

Name Disambiguation Algorithm
[Figure: names such as Jeffrey Ullman and Wei Wang each get a block (Jeffrey Ullman's block, Wei Wang's block) holding candidate names such as W. Wang, Liwei Wang, Jeffrey D. Ullman, and J. D. Ullman; distances are measured within each block and yield a ranked list (Rank, ID, Name).]

Step 1: Blocking
- Many blocking methods can be applied:
  - Sorted Window
  - Token-based
  - N-gram
- We applied Gravano (2003)'s sampling-based join approximation algorithm as a blocking method
- Comparison with other blocking methods => JCDL 2005

Step 2: Measuring Distance
- Supervised
  - Naïve Bayes Model: use Bayes' Theorem to measure similarity between two names
  - Support Vector Machine: use SVM classifiers
- Unsupervised
  - String-based Distance Metrics: TFIDF/Jaccard (token-based), Jaro/JaroWinkler (edit distances)
  - Vector-based Cosine Distance: cosine similarity
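The two steps, blocking followed by a weighted distance inside each block, can be sketched in Python as follows. This is only an illustration under assumptions: simple token-based blocking replaces the sampling-based join approximation used in the talk, Jaccard distance stands in for each per-field dist, and the weights, threshold, field names, and function names are all hypothetical.

```python
from collections import defaultdict
from itertools import product

FIELDS = ("coauthors", "titles", "venues")   # c(x), t(x), v(x)


def name_tokens(name):
    """Tokens of a name, ignoring punctuation and one-letter initials."""
    cleaned = name.lower().replace(",", " ").replace(".", " ")
    return {t for t in cleaned.split() if len(t) > 1}


def candidate_pairs(X, Y):
    """Step 1 (blocking): only names sharing a token are ever compared."""
    blocks = defaultdict(lambda: (set(), set()))
    for x in X:
        for t in name_tokens(x):
            blocks[t][0].add(x)
    for y in Y:
        for t in name_tokens(y):
            blocks[t][1].add(y)
    pairs = set()
    for xs, ys in blocks.values():
        pairs.update(product(xs, ys))        # naive solution inside a block
    return pairs


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b| for two token sets."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


def combined_distance(px, py, weights=(0.5, 0.3, 0.2)):
    """Step 2: weighted sum of per-field distances (weights are assumed)."""
    return sum(w * jaccard_distance(px[f], py[f])
               for w, f in zip(weights, FIELDS))


def find_variants(X, Y, info, threshold=0.5):
    """Map each x to the names y in its blocks with distance below threshold."""
    variants = defaultdict(list)
    for x, y in sorted(candidate_pairs(X, Y)):
        if combined_distance(info[x], info[y]) < threshold:
            variants[x].append(y)
    return dict(variants)


# Toy data, invented for illustration.
info = {
    "Jeffrey D. Ullman": {"coauthors": {"widom", "garcia-molina", "aho"},
                          "titles": {"database", "systems"}, "venues": {"sigmod"}},
    "J. D. Ullman": {"coauthors": {"widom", "aho"},
                     "titles": {"database", "compilers"}, "venues": {"sigmod"}},
    "Wei Wang": {"coauthors": {"yang", "yu"},
                 "titles": {"mining", "patterns"}, "venues": {"kdd"}},
    "W. Wang": {"coauthors": {"yang", "yu"},
                "titles": {"mining", "sequences"}, "venues": {"kdd"}},
    "Liwei Wang": {"coauthors": {"feng"},
                   "titles": {"learning"}, "venues": {"nips"}},
}
X = ["Jeffrey D. Ullman", "Wei Wang"]
Y = ["J. D. Ullman", "W. Wang", "Liwei Wang"]
result = find_variants(X, Y, info)
```

With this data only three pairs are compared instead of all six, and "Liwei Wang" is rejected as a variant of "Wei Wang" because none of the coauthor, title, or venue tokens overlap.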
Policy Variations
[Table: columns Data sets, Blocking, Measuring Distance; contents not recoverable]

Configuration (e.g., DBLP case)
- Authors, x, in X and authors, y, in Y
- Prepare an artificial name variant x' for K randomly chosen x (e.g., K=00):
  - Abbreviation of the first name (85%): Ji-Woo K. Li => J. K. Li
  - Typo (5%): Ji-Woo K. Li => Ji-Woo K. Lee
- x carries half of x's original citations; x' carries the other half
- Inject all x' into Y
- Test: for each author x in X, find the corresponding name variants x' in Y
- Evaluation metrics: Time, Accuracy

SC: Accuracy (DBLP)
[Figure: accuracy of TFIDF, Jaccard, and Jaro for the methods -N, -NN, -NC, -NH]
Varying error types gave consistent results. For instance:
- Name Abbreviation: 30%
- Name Alternation: 30%
- First Name Misspelling: %
- Last Name Misspelling: %
- Contraction: %
- Middle Name Initial Omission: 4%
- Combination: 0%

SC: Accuracy (All data sets)
[Figure: accuracy on DBLP, e-print, BioMed, and EconPapers for NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin]

Related Work
- Identity / Entity Matching
- Database Join
- Record Linkage
- Merge / Purge
- Ontology Matching
- Graph Matching
- Name Authority Control Problem in LIS
Please see the paper for details.
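The test-set construction described in the configuration (derive an artificial variant of a name, then split the author's citations between the original and the variant) can be sketched in Python as below. The helper names, the adjacent-letter swap used as a typo model, and the default probability are assumptions for illustration; the slides only give the two examples shown.

```python
import random


def abbreviate_first_name(name):
    """'Ji-Woo K. Li' -> 'J. K. Li': keep only the initial of the first token."""
    first, *rest = name.split()
    return " ".join([first[0] + "."] + rest)


def inject_typo(name, rng):
    """Hypothetical typo model: swap two adjacent letters of the last name.
    (Swapping two equal letters leaves the name unchanged.)"""
    *head, last = name.split()
    if len(last) < 2:
        return name
    i = rng.randrange(len(last) - 1)
    chars = list(last)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return " ".join(head + ["".join(chars)])


def split_citations(citations, rng):
    """Randomly split citations in two halves: one stays with x, one goes to x'."""
    shuffled = list(citations)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def make_variant(name, citations, rng, p_abbrev=0.85):
    """Return (variant name, citations kept by x, citations moved to x')."""
    if rng.random() < p_abbrev:
        variant = abbreviate_first_name(name)
    else:
        variant = inject_typo(name, rng)
    kept, moved = split_citations(citations, rng)
    return variant, kept, moved


rng = random.Random(0)   # fixed seed so the experiment is repeatable
variant, kept, moved = make_variant(
    "Ji-Woo K. Li", ["c1", "c2", "c3", "c4"], rng, p_abbrev=1.0)
```

Passing a seeded `random.Random` instance keeps the injected variants reproducible across runs, which matters when several distance metrics are compared on the same data.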
Future Work
- Using additional information of author name
  - Essentially, token comparison
  - Better way: coauthor information as a graph
- Graph matching / partitioning
- Sub-graph detection

Conclusion
- SC Problem
  - Using additional information (e.g., coauthors) beyond the name itself is better in distance measure: -NC/-NH outperform -N/-NN
  - SVM or Cosine shows the best accuracy (90-93%)

Backup Slides

Token vectors
- Token vectors of coauthors of a citation C_j: R Pemantle, D Ullman
- Token vectors of coauthors from all citations of D. Ullman: D Ullman, D Maier, F Sadri, J Hopcroft
[Tables: token weights per coauthor string (ID, Token, Weight), e.g., R 0.7, Pemantle 0.7, D 0.45, Ullman 0.89; joined weights (Rid, Sid, Weight); and the resulting ranking (Rank, Citation ID, Similarity of coauthor string). Remaining values are not recoverable.]

Distance Metric: NBM
- Use Bayes' Theorem to measure the similarity between two author names.
- To calculate the similarity between "Byung-Won On" and "On, B.-W.":
  - Training: estimate the probability per coauthor of "Byung-Won On" in terms of the Bayes rule.
  - Testing: calculate the posterior probability of "On, B.-W." with the coauthors' probability values of "Byung-Won On".

Distance Metric: SVM
- Preprocessing: all coauthor information of an author is transformed into a vector-space representation.
- Training: given training examples of author names labeled either YES ("J. Ullman" & "Jeffrey D. Ullman") or NO ("J. Ullman" & "James Ullmann"), SVM creates a maximum-margin hyperplane that splits the YES/NO training examples.
- Testing: SVM classifies vectors by mapping them via the kernel trick to a high-dimensional space (the two classes, equivalent pairs and different ones, are separated by a hyperplane).
- Kernel: Radial Basis Function (RBF kernel)
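The NBM training and testing steps can be sketched as a small naive Bayes scorer in Python. This is one plausible reading of the slide, not the authors' model: add-one (Laplace) smoothing, the citation-count prior, the function names, and the second candidate "B. Ong" are all assumptions made for illustration.

```python
from collections import Counter
from math import log


def train(citations_by_author):
    """Training: per-author coauthor counts plus a prior from citation counts."""
    vocab = {c for cits in citations_by_author.values()
             for cit in cits for c in cit}
    total = sum(len(cits) for cits in citations_by_author.values())
    model = {}
    for author, cits in citations_by_author.items():
        counts = Counter(c for cit in cits for c in cit)
        model[author] = {
            "log_prior": log(len(cits) / total),
            "counts": counts,
            "n": sum(counts.values()),
        }
    return model, vocab


def log_posterior(model, vocab, author, coauthors):
    """Testing: unnormalized log posterior of `author` given observed coauthors,
    with add-one smoothing so unseen coauthors do not zero out the score."""
    m = model[author]
    score = m["log_prior"]
    for c in coauthors:
        score += log((m["counts"][c] + 1) / (m["n"] + len(vocab)))
    return score


def most_likely_author(model, vocab, coauthors):
    """Candidate with the highest posterior for the given coauthor list."""
    return max(model, key=lambda a: log_posterior(model, vocab, a, coauthors))


# Toy training data; "B. Ong" is an invented second candidate.
training = {
    "Byung-Won On": [["D. Lee", "J. Kang"], ["D. Lee", "E. Elmacioglu"]],
    "B. Ong": [["A. Smith"], ["A. Smith", "C. Jones"]],
}
model, vocab = train(training)
# Coauthors observed on citations filed under the variant "On, B.-W."
guess = most_likely_author(model, vocab, ["D. Lee", "J. Kang"])
```

In this toy case the coauthors seen with "On, B.-W." are far more probable under "Byung-Won On", so the two names would be treated as variants of each other.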
Distance Metric: String-based Distance Metrics
[Figure: content not recoverable]

Distance Metric: Vector-based Cosine Distance
[Figure: content not recoverable]

Accuracy w. error types [DBLP]
[Figure: accuracy of NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin for the methods -N, -NN, -NC, -NH. Error types: Name Abbreviation: 30%, Name Alternation: 30%, First Name Misspelling: %, Last Name Misspelling: %, Contraction: %, Middle Name Initial Omission: 4%, Combination: 0%]

SC: Scalability (DBLP with k=)
[Figure: time (sec) for TFIDF, Jaccard, Jaro; series labels iffl, Token, 4-gram]

Distribution in top-0
[Figure: % by rank for TFIDF, Jaccard, Jaro, JaroWin, Cosine]

Step : Speed [DBLP]
[Figure: time (sec; logarithmic scale) for NBM, SVM, VSM, TFIDF, Jaccard, Jaro, JaroWin; series labels Token, 4-gram, iffl]
Overall: Accuracy [DBLP]
[Figure: accuracy (%) of NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin; series labels iffl, Token, 4-gram]

MC: Accuracy (DBLP)
[Figure: chart not recoverable]

Naïve Solution
- The MC problem can be naturally cast as a K-clustering problem: cluster N data points into K clusters.
- Issue: the typical citation collection of an author in DBLP is short (i.e., the author-citation graph exhibits a scale-free network), so N is too small to apply clustering.

Name Disambiguation Algorithm
- Borrow solutions from the RL community
- Step 1: significantly reduce the number of author names to compare via blocking
- Step 2: apply more expensive distance measures within each block

SC: Processing time for Step
[Figure: time (sec) on DBLP, e-print, BioMed, and EconPapers for NBM, SVM, Cosine, TFIDF, Jaccard, Jaro, JaroWin]
Practical maintenance of evolving metadata for digital preservation: algorithmic solution and system support
Int J Digit Libr (2007) 6:313 326 DOI 10.1007/s00799-007-0014-9 REGULAR PAPER Practical maintenance of evolving metadata for digital preservation: algorithmic solution and system support Dongwon Lee Published
More informationAUTHOR NAME DISAMBIGUATION AND CROSS SOURCE DOCUMENT COREFERENCE
The Pennsylvania State University The Graduate School College of Information Sciences and Technology AUTHOR NAME DISAMBIGUATION AND CROSS SOURCE DOCUMENT COREFERENCE A Thesis in Information Sciences and
More informationSystem Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach
System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach Yoojin Hong 1, Byung-Won On 1, and Dongwon Lee 2 1 Department of Computer Science and Engineering, The Pennsylvania
More informationThe Pennsylvania State University The Graduate School DATA CLEANING TECHNIQUES BY MEANS OF ENTITY RESOLUTION
The Pennsylvania State University The Graduate School DATA CLEANING TECHNIQUES BY MEANS OF ENTITY RESOLUTION A Thesis in Computer Science and Engineering by Byung-Won On c 2007 Byung-Won On Submitted in
More informationThe Pennsylvania State University. The Graduate School EFFECTIVE SOLUTIONS FOR NAME LINKAGE AND THEIR APPLICATIONS. A Thesis in
The Pennsylvania State University The Graduate School EFFECTIVE SOLUTIONS FOR NAME LINKAGE AND THEIR APPLICATIONS A Thesis in Computer Science and Engineering by Ergin Elmacioglu c 2008 Ergin Elmacioglu
More informationScalable Name Disambiguation using Multi-level Graph Partition
Scalable Name Disambiguation using Multi-level Graph Partition Byung-Won On Penn State University, USA on@cse.psu.edu Dongwon Lee Penn State University, USA dongwon@psu.edu Abstract When non-unique values
More informationSimilarity Joins in MapReduce
Similarity Joins in MapReduce Benjamin Coors, Kristian Hunt, and Alain Kaeslin KTH Royal Institute of Technology {coors,khunt,kaeslin}@kth.se Abstract. This paper studies how similarity joins can be implemented
More informationImproving Grouped-Entity Resolution using Quasi-Cliques
Improving Grouped-Entity Resolution using Quasi-Cliques Byung-Won On, Ergin Elmacioglu, Dongwon Lee Jaewoo Kang Jian Pei The Pennsylvania State University NCSU & Korea Univ. Simon Fraser Univ. {on,ergin,dongwon}@psu.edu
More informationScholarly Big Data: Leverage for Science
Scholarly Big Data: Leverage for Science C. Lee Giles The Pennsylvania State University University Park, PA, USA giles@ist.psu.edu http://clgiles.ist.psu.edu Funded in part by NSF, Allen Institute for
More informationSearch Engines. Information Retrieval in Practice
Search Engines Information Retrieval in Practice All slides Addison Wesley, 2008 Classification and Clustering Classification and clustering are classical pattern recognition / machine learning problems
More informationLink Mining & Entity Resolution. Lise Getoor University of Maryland, College Park
Link Mining & Entity Resolution Lise Getoor University of Maryland, College Park Learning in Structured Domains Traditional machine learning and data mining approaches assume: A random sample of homogeneous
More informationDetermine the Entity Number in Hierarchical Clustering for Web Personal Name Disambiguation
Determine the Entity Number in Hierarchical Clustering for Web Personal Name Disambiguation Jun Gong Department of Information System Beihang University No.37 XueYuan Road HaiDian District, Beijing, China
More informationCSE 158. Web Mining and Recommender Systems. Midterm recap
CSE 158 Web Mining and Recommender Systems Midterm recap Midterm on Wednesday! 5:10 pm 6:10 pm Closed book but I ll provide a similar level of basic info as in the last page of previous midterms CSE 158
More informationCollective Entity Resolution in Relational Data
Collective Entity Resolution in Relational Data I. Bhattacharya, L. Getoor University of Maryland Presented by: Srikar Pyda, Brett Walenz CS590.01 - Duke University Parts of this presentation from: http://www.norc.org/pdfs/may%202011%20personal%20validation%20and%20entity%20resolution%20conference/getoorcollectiveentityresolution
More informationTABLE OF CONTENTS PAGE TITLE NO.
TABLE OF CONTENTS CHAPTER PAGE TITLE ABSTRACT iv LIST OF TABLES xi LIST OF FIGURES xii LIST OF ABBREVIATIONS & SYMBOLS xiv 1. INTRODUCTION 1 2. LITERATURE SURVEY 14 3. MOTIVATIONS & OBJECTIVES OF THIS
More informationSurvey of String Similarity Join Algorithms on Large Scale Data
Survey of String Similarity Join Algorithms on Large Scale Data P.Selvaramalakshmi Research Scholar Dept. of Computer Science Bishop Heber College (Autonomous) Tiruchirappalli, Tamilnadu, India. Dr. S.
More informationA BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK
A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK Qing Guo 1, 2 1 Nanyang Technological University, Singapore 2 SAP Innovation Center Network,Singapore ABSTRACT Literature review is part of scientific
More informationAuthor Disambiguation using Error-driven Machine Learning with a Ranking Loss Function
Author Disambiguation using Error-driven Machine Learning with a Ranking Loss Function Aron Culotta, Pallika Kanani, Robert Hall, Michael Wick, Andrew McCallum Department of Computer Science University
More informationRepresentation Learning using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval
Representation Learning using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval Xiaodong Liu 12, Jianfeng Gao 1, Xiaodong He 1 Li Deng 1, Kevin Duh 2, Ye-Yi Wang 1 1
More informationBased on Raymond J. Mooney s slides
Instance Based Learning Based on Raymond J. Mooney s slides University of Texas at Austin 1 Example 2 Instance-Based Learning Unlike other learning algorithms, does not involve construction of an explicit
More informationVisualization and text mining of patent and non-patent data
of patent and non-patent data Anton Heijs Information Solutions Delft, The Netherlands http://www.treparel.com/ ICIC conference, Nice, France, 2008 Outline Introduction Applications on patent and non-patent
More informationAutomatic Identification of User Goals in Web Search [WWW 05]
Automatic Identification of User Goals in Web Search [WWW 05] UichinLee @ UCLA ZhenyuLiu @ UCLA JunghooCho @ UCLA Presenter: Emiran Curtmola@ UC San Diego CSE 291 4/29/2008 Need to improve the quality
More informationNUS-I2R: Learning a Combined System for Entity Linking
NUS-I2R: Learning a Combined System for Entity Linking Wei Zhang Yan Chuan Sim Jian Su Chew Lim Tan School of Computing National University of Singapore {z-wei, tancl} @comp.nus.edu.sg Institute for Infocomm
More informationTour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers
Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background
More informationCLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS
CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of
More informationClustering using Topic Models
Clustering using Topic Models Compiled by Sujatha Das, Cornelia Caragea Credits for slides: Blei, Allan, Arms, Manning, Rai, Lund, Noble, Page. Clustering Partition unlabeled examples into disjoint subsets
More informationAuthorship Disambiguation and Alias Resolution in Data
Authorship Disambiguation and Alias Resolution in Email Data Freek Maes Johannes C. Scholtes Department of Knowledge Engineering Maastricht University, P.O. Box 616, 6200 MD Maastricht Abstract Given a
More informationTowards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search.
Towards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search. Dmitri V. Kalashnikov Rabia Nuray-Turan Sharad Mehrotra Dept of Computer Science University of California, Irvine
More informationName Disambiguation Using Web Connection
Name Disambiguation Using Web nection Yiming Lu 1*, Zaiqing Nie, Taoyuan Cheng *, Ying Gao *, Ji-Rong Wen Microsoft Research Asia, Beijing, China 1 University of California, Irvine, U.S.A. Renmin University
More informationLocality-Sensitive Hashing
Locality-Sensitive Hashing & Image Similarity Search Andrew Wylie Overview; LSH given a query q (or not), how do we find similar items from a large search set quickly? Can t do all pairwise comparisons;
More informationRobust and Efficient Fuzzy Match for Online Data Cleaning. Motivation. Methodology
Robust and Efficient Fuzzy Match for Online Data Cleaning S. Chaudhuri, K. Ganjan, V. Ganti, R. Motwani Presented by Aaditeshwar Seth 1 Motivation Data warehouse: Many input tuples Tuples can be erroneous
More informationCS6375: Machine Learning Gautam Kunapuli. Mid-Term Review
Gautam Kunapuli Machine Learning Data is identically and independently distributed Goal is to learn a function that maps to Data is generated using an unknown function Learn a hypothesis that minimizes
More informationLink Prediction for Social Network
Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue
More informationDatabase Applications (15-415)
Database Applications (15-415) DBMS Internals- Part VII Lecture 15, March 17, 2014 Mohammad Hammoud Today Last Session: DBMS Internals- Part VI Algorithms for Relational Operations Today s Session: DBMS
More informationSemantic Scholar. ICSTI Towards a More Efficient Review of Research Literature 11 September
Semantic Scholar ICSTI Towards a More Efficient Review of Research Literature 11 September 2018 Allen Institute for Artificial Intelligence (https://allenai.org/) Non-profit Research Institute in Seattle,
More informationClustering & Classification (chapter 15)
Clustering & Classification (chapter 5) Kai Goebel Bill Cheetham RPI/GE Global Research goebel@cs.rpi.edu cheetham@cs.rpi.edu Outline k-means Fuzzy c-means Mountain Clustering knn Fuzzy knn Hierarchical
More informationCSE 417T: Introduction to Machine Learning. Lecture 22: The Kernel Trick. Henry Chai 11/15/18
CSE 417T: Introduction to Machine Learning Lecture 22: The Kernel Trick Henry Chai 11/15/18 Linearly Inseparable Data What can we do if the data is not linearly separable? Accept some non-zero in-sample
More informationBetter Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web
Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl
More informationData Mining in Bioinformatics Day 1: Classification
Data Mining in Bioinformatics Day 1: Classification Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institute Tübingen and Eberhard Karls
More informationNaming Disambiguation Based on Approximate String Matching for Co- Authorship Networks
Naming Disambig Ba on Approximate String Matching for Co- Authorship s Dr. V. Akila Dept. of Computer Science & Engg. akila@pec.edu Dr.V.Govindasamy Dept. of Information Technology, vgopu@pec.edu R. Kowsalya
More informationClassification. 1 o Semestre 2007/2008
Classification Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 Single-Class
More informationText Classification and Clustering Using Kernels for Structured Data
Text Mining SVM Conclusion Text Classification and Clustering Using, pgeibel@uos.de DGFS Institut für Kognitionswissenschaft Universität Osnabrück February 2005 Outline Text Mining SVM Conclusion 1 Text
More informationAutomatic Record Linkage using Seeded Nearest Neighbour and SVM Classification
Automatic Record Linkage using Seeded Nearest Neighbour and SVM Classification Peter Christen Department of Computer Science, ANU College of Engineering and Computer Science, The Australian National University,
More informationPANDA: A Platform for Academic Knowledge Discovery and Acquisition
PANDA: A Platform for Academic Knowledge Discovery and Acquisition Zhaoan Dong 1 ; Jiaheng Lu 2,1 ; Tok Wang Ling 3 1.Renmin University of China 2.University of Helsinki 3.National University of Singapore
More informationPERSONALIZED MOBILE SEARCH ENGINE BASED ON MULTIPLE PREFERENCE, USER PROFILE AND ANDROID PLATFORM
PERSONALIZED MOBILE SEARCH ENGINE BASED ON MULTIPLE PREFERENCE, USER PROFILE AND ANDROID PLATFORM Ajit Aher, Rahul Rohokale, Asst. Prof. Nemade S.B. B.E. (computer) student, Govt. college of engg. & research
More informationEntity Resolution, Clustering Author References
, Clustering Author References Vlad Shchogolev vs299@columbia.edu May 1, 2007 Outline 1 What is? Motivation 2 Formal Definition Efficieny Considerations Measuring Text Similarity Other approaches 3 Clustering
More informationDATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines
DATA MINING LECTURE 10B Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines NEAREST NEIGHBOR CLASSIFICATION 10 10 Illustrating Classification Task Tid Attrib1
More informationijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System
ijade Reporter An Intelligent Multi-agent Based Context Aware Reporting System Eddie C.L. Chan and Raymond S.T. Lee The Department of Computing, The Hong Kong Polytechnic University, Hung Hong, Kowloon,
More informationECG782: Multidimensional Digital Signal Processing
ECG782: Multidimensional Digital Signal Processing Object Recognition http://www.ee.unlv.edu/~b1morris/ecg782/ 2 Outline Knowledge Representation Statistical Pattern Recognition Neural Networks Boosting
More informationDeveloping Focused Crawlers for Genre Specific Search Engines
Developing Focused Crawlers for Genre Specific Search Engines Nikhil Priyatam Thesis Advisor: Prof. Vasudeva Varma IIIT Hyderabad July 7, 2014 Examples of Genre Specific Search Engines MedlinePlus Naukri.com
More informationScopus. Information literacy in Chemistry. J une 14, 2011
Information literacy in Chemistry Scopus J une 14, 2011 BIBLIOGRAPHIC DATABASE electronic archive of bibliographic records that refer to published academic literature the records are structured and organized
More informationSpam Classification Documentation
Spam Classification Documentation What is SPAM? Unsolicited, unwanted email that was sent indiscriminately, directly or indirectly, by a sender having no current relationship with the recipient. Objective:
More informationslide courtesy of D. Yarowsky Splitting Words a.k.a. Word Sense Disambiguation Intro to NLP - J. Eisner 1
Splitting Words a.k.a. Word Sense Disambiguation 600.465 - Intro to NLP - J. Eisner Representing Word as Vector Could average over many occurrences of the word... Each word type has a different vector
More informationFinding Topic-centric Identified Experts based on Full Text Analysis
Finding Topic-centric Identified Experts based on Full Text Analysis Hanmin Jung, Mikyoung Lee, In-Su Kang, Seung-Woo Lee, Won-Kyung Sung Information Service Research Lab., KISTI, Korea jhm@kisti.re.kr
More informationContents. Preface to the Second Edition
Preface to the Second Edition v 1 Introduction 1 1.1 What Is Data Mining?....................... 4 1.2 Motivating Challenges....................... 5 1.3 The Origins of Data Mining....................
More informationSPARK: Top-k Keyword Query in Relational Database
SPARK: Top-k Keyword Query in Relational Database Wei Wang University of New South Wales Australia 20/03/2007 1 Outline Demo & Introduction Ranking Query Evaluation Conclusions 20/03/2007 2 Demo 20/03/2007
More informationIntroduction to Automated Text Analysis. bit.ly/poir599
Introduction to Automated Text Analysis Pablo Barberá School of International Relations University of Southern California pablobarbera.com Lecture materials: bit.ly/poir599 Today 1. Solutions for last
More informationText classification II CE-324: Modern Information Retrieval Sharif University of Technology
Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2015 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationSUPERVISED TERM WEIGHTING METHODS FOR URL CLASSIFICATION
Journal of Computer Science 10 (10): 1969-1976, 2014 ISSN: 1549-3636 2014 doi:10.3844/jcssp.2014.1969.1976 Published Online 10 (10) 2014 (http://www.thescipub.com/jcs.toc) SUPERVISED TERM WEIGHTING METHODS
More informationCS229 Final Project: Predicting Expected Response Times
CS229 Final Project: Predicting Expected Email Response Times Laura Cruz-Albrecht (lcruzalb), Kevin Khieu (kkhieu) December 15, 2017 1 Introduction Each day, countless emails are sent out, yet the time
More informationBasic Problem Addressed. The Approach I: Training. Main Idea. The Approach II: Testing. Why a set of vocabularies?
Visual Categorization With Bags of Keypoints. ECCV,. G. Csurka, C. Bray, C. Dance, and L. Fan. Shilpa Gulati //7 Basic Problem Addressed Find a method for Generic Visual Categorization Visual Categorization:
More informationCSE 5243 INTRO. TO DATA MINING
CSE 53 INTRO. TO DATA MINING Locality Sensitive Hashing (LSH) Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan Parthasarathy @OSU MMDS Secs. 3.-3.. Slides
More informationDatabase Applications (15-415)
Database Applications (15-415) DBMS Internals- Part VI Lecture 14, March 12, 2014 Mohammad Hammoud Today Last Session: DBMS Internals- Part V Hash-based indexes (Cont d) and External Sorting Today s Session:
More informationData Linkage Methods: Overview of Computer Science Research
Data Linkage Methods: Overview of Computer Science Research Peter Christen Research School of Computer Science, ANU College of Engineering and Computer Science, The Australian National University, Canberra,
More informationJoint Entity Resolution
Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute
More informationBehavioral Data Mining. Lecture 10 Kernel methods and SVMs
Behavioral Data Mining Lecture 10 Kernel methods and SVMs Outline SVMs as large-margin linear classifiers Kernel methods SVM algorithms SVMs as large-margin classifiers margin The separating plane maximizes
More informationPublished by: PIONEER RESEARCH & DEVELOPMENT GROUP ( 1
Cluster Based Speed and Effective Feature Extraction for Efficient Search Engine Manjuparkavi A 1, Arokiamuthu M 2 1 PG Scholar, Computer Science, Dr. Pauls Engineering College, Villupuram, India 2 Assistant
More informationApproximate String Joins
Approximate String Joins Divesh Srivastava AT&T Labs-Research The Need for String Joins Substantial amounts of data in existing RDBMSs are strings There is a need to correlate data stored in different
More informationSupport Vector Machines + Classification for IR
Support Vector Machines + Classification for IR Pierre Lison University of Oslo, Dep. of Informatics INF3800: Søketeknologi April 30, 2014 Outline of the lecture Recap of last week Support Vector Machines
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)