Towards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search.

Similar documents
Towards Breaking the Quality Curse. A Web-Querying Approach to Web People Search.

SEARCHING for entities is a common activity in Internet

Exploiting Context Analysis for Combining Multiple Entity Resolution Systems

ACM Journal of Data and Information Quality (ACM JDIQ), 4(2), Adaptive Connection Strength Models for Relationship-based Entity Resolution

Adaptive Graphical Approach to Entity Resolution

Name Disambiguation Using Web Connection

Link Mining & Entity Resolution. Lise Getoor University of Maryland, College Park

IJMIE Volume 2, Issue 8 ISSN:

Assigning Vocation-Related Information to Person Clusters for Web People Search Results

Web People Search using Ontology Based Decision Tree Mrunal Patil 1, Sonam Khomane 2, Varsha Saykar, 3 Kavita Moholkar 4

Graph Mining and Social Network Analysis

Assigning Location Information to Display Individuals on a Map for Web People Search Results

Review: Identification of cell types from single-cell transcriptom. method

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity

TALP at WePS Daniel Ferrés and Horacio Rodríguez

Inferring User Search for Feedback Sessions

Entity Linking. David Soares Batista. November 11, Disciplina de Recuperação de Informação, Instituto Superior Técnico

Collective Entity Resolution in Relational Data

Jianyong Wang Department of Computer Science and Technology Tsinghua University

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Query-time Entity Resolution

RSDC 09: Tag Recommendation Using Keywords and Association Rules

A BFS-BASED SIMILAR CONFERENCE RETRIEVAL FRAMEWORK

Identifying and Ranking Possible Semantic and Common Usage Categories of Search Engine Queries

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Automatic Identification of User Goals in Web Search [WWW 05]

Diversification of Query Interpretations and Search Results

Evaluation Methods for Focused Crawling

PRIS at TAC2012 KBP Track

Information Retrieval. Lecture 9 - Web search basics

Dynamically Building Facets from Their Search Results

Text Categorization (I)

Enhanced Retrieval of Web Pages using Improved Page Rank Algorithm

A Novel Categorized Search Strategy using Distributional Clustering Neenu Joseph. M 1, Sudheep Elayidom 2

UMass at TREC 2006: Enterprise Track

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

LSI UNED at M-WePNaD: Embeddings for Person Name Disambiguation

Joint Entity Resolution

Bruno Martins. 1 st Semester 2012/2013

Outline. Possible solutions. The basic problem. How? How? Relevance Feedback, Query Expansion, and Inputs to Ranking Beyond Similarity

SCUBA DIVER: SUBSPACE CLUSTERING OF WEB SEARCH RESULTS

NUS-I2R: Learning a Combined System for Entity Linking

TREC 2016 Dynamic Domain Track: Exploiting Passage Representation for Retrieval and Relevance Feedback

Scholarly Big Data: Leverage for Science

Outline. Eg. 1: DBLP. Motivation. Eg. 2: ACM DL Portal. Eg. 2: DBLP. Digital Libraries (DL) often have many errors that negatively affect:

Entity Linking at TAC Task Description

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI

Title: Artificial Intelligence: an illustration of one approach.

Supervised Models for Multimodal Image Retrieval based on Visual, Semantic and Geographic Information

NTU Approaches to Subtopic Mining and Document Ranking at NTCIR-9 Intent Task

ijade Reporter An Intelligent Multi-agent Based Context Aware News Reporting System

IRCE at the NTCIR-12 IMine-2 Task

Rishiraj Saha Roy and Niloy Ganguly IIT Kharagpur India. Monojit Choudhury and Srivatsan Laxman Microsoft Research India India

CS230: Lecture 3 Various Deep Learning Topics

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

AN EFFECTIVE SEARCH ON WEB LOG FROM MOST POPULAR DOWNLOADED CONTENT

Measuring Semantic Similarity between Words Using Page Counts and Snippets

Artificial Intelligence. Programming Styles

PERSONALIZED MOBILE SEARCH ENGINE BASED ON MULTIPLE PREFERENCE, USER PROFILE AND ANDROID PLATFORM

Efficient Strategies for Improving Partitioning-Based Author Coreference by Incorporating Web Pages as Graph Nodes

Link Analysis and Web Search

Outlier Ensembles. Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY Keynote, Outlier Detection and Description Workshop, 2013

Evaluation. Evaluate what? For really large amounts of data... A: Use a validation set.

7. Mining Text and Web Data

SOURCERER: MINING AND SEARCHING INTERNET- SCALE SOFTWARE REPOSITORIES

Incorporating Hyperlink Analysis in Web Page Clustering

Categorizing Search Results Using WordNet and Wikipedia

Word Disambiguation in Web Search

Focused crawling: a new approach to topic-specific Web resource discovery. Authors

Automatically Constructing a Directory of Molecular Biology Databases

Clustering Results. Result List Example. Clustering Results. Information Retrieval

Accessing XML documents: The INEX initiative. Mounia Lalmas, Thomas Rölleke, Zoltán Szlávik, Tassos Tombros (+ Duisburg-Essen)

Approach Research of Keyword Extraction Based on Web Pages Document

Feature selection. LING 572 Fei Xia

Extracting Rankings for Spatial Keyword Queries from GPS Data

Introduction to Information Retrieval

GRAPE: A Graph-Based Framework for Disambiguating People Appearances in Web Search

Large Scale 3D Reconstruction by Structure from Motion

Automatic Summarization

Linking Entities in Chinese Queries to Knowledge Graph

UNIT-V WEB MINING. 3/18/2012 Prof. Asha Ambhaikar, RCET Bhilai.

Determine the Entity Number in Hierarchical Clustering for Web Personal Name Disambiguation

Gene Clustering & Classification

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

Text Analytics (Text Mining)

CS145: INTRODUCTION TO DATA MINING

Multimedia Event Detection for Large Scale Video. Benjamin Elizalde

Query Difficulty Prediction for Contextual Image Retrieval

Cluster Analysis (b) Lijun Zhang

Introduction to Web Clustering

CSE 3. How Is Information Organized? Searching in All the Right Places. Design of Hierarchies

IMPROVING INFORMATION RETRIEVAL BASED ON QUERY CLASSIFICATION ALGORITHM

An Adaptive Agent for Web Exploration Based on Concept Hierarchies

Hierarchical Location and Topic Based Query Expansion

Improvement of Web Search Results using Genetic Algorithm on Word Sense Disambiguation

International Journal of Advance Engineering and Research Development. Survey of Web Usage Mining Techniques for Web-based Recommendations

I. INTRODUCTION. Fig Taxonomy of approaches to build specialized search engines, as shown in [80].

DATA MINING - 1DL105, 1DL111

LET:Towards More Precise Clustering of Search Results

LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION

Transcription:

Towards Breaking the Quality Curse. AWebQuerying Web-Querying Approach to Web People Search. Dmitri V. Kalashnikov Rabia Nuray-Turan Sharad Mehrotra Dept of Computer Science University of California, Irvine 1

Searching for People on the Web Heterogeneity Approx. 5-10% of internet searches Andrew ACM: Items for look for specific people by their names [Guha&Garg 04] McCallum s Homepage Associate Professor Andrew Kachites McCallum Author: McCallum, R. Andrew Challenges: Heterogeneity of representations: Same entity represented multiple ways. Ambiguity in representations: Same representation of multiple entities. WePS systems address both these challenges. DBLP: Andrew McCallum UMass professor UMass professor Ambiguity Mental Floss Andrew McCallum Musician News: Andrew McCallum Australia CROSS president 2

Search Engines v WePS WePS system return clusters of web pages wherein pages of the same entity are clustered together Google results for Andrew McCallum WePS results for Andrew McCallum Umass professor photographer physician 3

Clustering Search Engines v WePS Examples of clustering search engines: Clusty http://www.clusty.com Kartoo http://www.kartoo.com Search results of Andrew McCallum using Clusty WePS system results of Andrew McCallum Differences: While both return clustered search results, Clustering search engines cluster similar/related pages based on broad topics (e.g., research, photos) while WePS system cluster based on disambiguating amongst various namesakes. 4

WePS System Architecture Classified based on where references are disambiguated Client side solution the client resolves references Client Disamb. Approach Query Returned pages Server Web Proxy Solution Query Proxy Proxy resolves references Server side solution The server resolves references Client Client clusters clusters Disamb. Approach Query Web Returned pages Server Web Server Disamb. Approach 5

Current WePS Approaches Commercial Systems: Zoominfo http://www.zoominfo.com Spock http://www.spock.com Multiple research proposals [Kamha et al. 04; Artiles et. al 07 & 05; Bekkerman et. al. 05; Bollegala et. al. 06;Matsou et. al. 06] etc. Web people search task included in Sem-Eval workshop in 2007 16 teams participated in the task WePS techniques differ from each other in: data analyzed for disambiguation & Clustering approach used Type of data analyzed Baseline exploit information in result pages only Keywords, named entities, hyperlinks, emails, [Bekkerman et.al. 05], [Li, et. al. 05][Elmachioglu, et. al. 07] exploit external information Ontologies, Encyclopedia, [Kalashnikov, et. al, 07] Exploit Web content Crawling [Kanani et. al. 07] Search engine Statistics 6

Contributions of this paper Exploiting Co-occurrence occurrence statistics from the search engine for WePS. A novel classification approach based on skylines to support reference resolution. Outperforms existing techniques on multiple data sets 7

Approach Overview Top-K Preprocessing Post processing Webpages 1.Extract NEs 2. Filter Initial Clustering Preprocessed pages 1. Rank clusters 2. Extract cluster sketches 3. Rank pages in clusters Final clusters Person1 Refining Clusters using Web Queries Person3 Person2 1. Compute similarity 1. Web Query using TF/IDF on NE 2. Feature Extraction 2. Cluster based on Initial clusters 3. Skyline based classifier TF/IDF similarity 4. Cluster refinement 8

Preprocessing: Named Entity Extraction Locate query name on web page Extract M neighboring names entities organization people Names Stanford Named Entity Recognizer & GATE used People NEs Lise Getoor Ben Taskar John McCallum Charles Organization UMASS ACM Queue Location Amherst 9

Preprocessing: Filtering Named Input: Extracted NEs Entities Location Filter Remove location NEs by looking up a dictionary Ambiguous name Filter Remove one-word person names, such as John Ambiguous Org. Filter Remove common English words Same last name Filter Remove the person names with the same last name. Output: Cleaned NEs 10

Initial Clustering Resolved references 0.1 0.9 0.1 0.9 0.2 Threshold h = 0.5 0.2 0.3 0.3 0.8 0.8 Intuition: If two web pages significantly overlap in terms of associated NEs, then the two pages refer to the same individual. Approach: Compute similarity between web pages based on TF/IDF over Named Entities Merge all pairs of Web pages whose TF/IDF similarity exceeds a threshold Threshold learnt over the training dataset. 11

Cluster Refinement Initial Clustering based on TF/IDF similarity: Accurate references resolved are usually correct. Conservative too many unresolved references Andrew McCallum UMASS 0.01 Andrew McCallum CMU Key Observation: 0.5 0.7 Co-occurrence Andrew statistics can McCallum Job: UMASS Phd: CMU Web may contain additional information that may further resolve unresolved references. Co-occurrenceoccurrence between contexts associated with unresolved references be learnt by querying the search engine 12

Using Co-occurrence to Disambiguate Query N Andrew McCallum Search Results D1 Andrew McCallum UMASS D2 Number of web pages containing Andrew McCallum AND UMASS AND CMU Andrew McCallum CMU Context associated with Search results C1 = UMASS C2 = CMU N. C. C CMU Overlap What between if context t 1NEs 2 " AndrewMcCa llum "". UMASS "". CMU " Ci and Cj are very general contexts N " Andrew McCallum " over the Web The co-occurrence counts could be high for all namesakes! Solution -- Normalize N Ci Cj using Dice similarity overlap between contexts associated with two pages of the same entity expected to be higher compared to pages of unrelated entities. Merge Web pages Di and Dj based on N.Ci.Cj / N? 13

Dice Similarity Two possible versions of dice: will be large for general context words 14

Dice2 Similarity -- Example Let an N O 1 O 2 query be: Q: Andrew McCallum AND Yahoo AND Google Let counts be: Andrew McCallum AND Yahoo AND Google 522 Andrew McCallum AND Yahoo 2470 Andrew McCallum AND Google 8950 Andrew McCallum 38500 Yahoo AND Google 73000000 Dice: Dice 1 = 0.05 Dice 2 = 7.15e-6 Dice 2 captures that Yahoo and Google are too unspecific Organization Entities to be a good context for Andrew McCallum Dice 2 leads to visibly better results in practice! 15

Web Queries to Gather Co-Occurrence Statistics Context People NEs Lise Getoor Ben Taskar Organization UMASS ACM People NEs Charles A. Sutton Ben Wellner Organization DBLP KDD Association between context (L.G. OR B.T.) (L.G. OR B.T.) (UMASS OR ACM) AND AND AND (C.A.S. OR B.W.) (DBLP OR KDD) (C.A.S. OR B.W.) (UMASS OR ACM) AND (DBLP OR KDD) Q1 Q2 Q3 Q4 (#97) (#2,570) (#720) (#2,190K) N AND Q1 N AND Q2 N AND Q3 N AND Q4 relevant to original query N (#79) (#49) (#63) (#10,100) Association amongst Context within pages Ex: Q1 = ( Lise Getoor OR Ben Taskar ) AND ( Charles A. Sutton OR Ben Wellner ) 8 web queries 16

Creating Features from Web Co-Occurrence Counts 8-Feature Vector 1: N P i P j = 79/152K = 0.00052 2: Dice(N, P i P j ) = 0.000918 3: N P i O j = 49/152K = 0.00032 4: Dice(N, P i O j j) ) = 0.000561 5: N O i P j = 63/152K = 0.00014 6: D(N, O i P j ) = 0.00073 7: N O i O j = 10,100 = 0.066 8: D(N, O i O j ) = 0.008552 17

Role of a Classifier Initial Clusters Edges with 8D Features Final Clusters Web Co-Occurrence Feature Generation Edge Classification Classifier used to mark edges as merge (+) or do not merge (-) resulting in the final cluster. Since the features are similarity based, the classifier should satisfy the dominance property 18

Dominance Property Let f and g be 8D feature vectors f =(f 1, f 2,,f 8 ) g = (g 1,g 2,,g 8 ) Notation (f up dominates g) if f 1 g 1, f 2 g 2,, f 8 g 8 (f down dominates g) if f 1 g 1, f 2 g 2,, f 8 g 8 Dominance Property for classifier Let f and g be the features associated with edges (d i, d j ) and (d m, d n ) respectively. If f g and the classifier decides to merge (d i, d j ), then it must also merge (d m, d n ) 19

Skyline-Based Classification A skyline classifier consists of an up-dominating classification skyline All points above or on it are declared merge ( + ) The rest are don t merge ( - ) Automatically takes care of dominance Our goal is to learn a classification skyline that supports a good final clusters of search results. 20

Learning Classification Skyline Initial CS Greedy Approach Initialize classification skyline with the point that t dominates the entire space. Pool of points to add to CS Down-dominating skyline on active + edges Maintain current clustering Clustering induced d by current CS Choosing a point from pool to CS that Stage1: Minimizes loss of quality due to adding new - edges Stage2: Maximizes overall quality Iterate until done Output best skyline discovered 21

Example: Adding a point to CS Ground Truth Clustering: Initial Clustering: Current CS = {q,d,c} Current Clustering: Point Pool = {f,e,u,h} 22

Example: Considering f f k a What if add f to CS? v b e s g l w q u m o i h d t p c r j Stage 1: Adding edge k Fp = 0.805 Stage 2: Adding edge f Fp = 0.860 (a) 23

Example: Considering e f k a What if add e to CS? v b e s g l w q u m o i h d t p c r j Stage 1: Adding edges k and v Fp = 0.718 Stage 2: Adding edge e Fp = 0.774 (a) 24

Example: Considering u What if add u to CS? f k a v b e s g l w q u m o i h d t p c r j Stage 1: Adding edge i Fp = 0.805 Stage 2: Adding edge u Fp = 0.833 (a) 25

Example: Considering h What if add h to CS? f k a v b e s g l w q u m o i h d t p c r j Stage 1: Adding edges i and t Fp = 0.625 Stage 2: Adding edge h Fp = 0.741 (a) 26

Example: Stage2 f k a After Stage1 only f and u remain v b e q s g l m u i d h t r j w o p c (b) Stage 1 (for f): Fp = 0.805 Stage 2 (for f): Fp = 0.860 Stage 1 (for u): Fp = 0.805 Stage 2 (for u): Fp = 0.833 Thus adding f to CS! 27

Why Skyline-based classifier? Specialized classifier Designed specifically for the problem at hand Is aware of the clustering underneath Fine-tunes itself to a given clustering quality metric Such as F 1, F P, F B Takes into account the dominance in 8D data Explained shortly Takes into account the transitivity of merges Classifying one edge as + might lead to classifying multiple l edges as + For instance when two large clusters are merged 28

Post Processing Page Ranking Top-K Use the original ranking from search engine Preprocessing Post processing Webpages Cluster Ranking 1.Extract NEs each cluster we simply select the highest 2Fil 2. ranked Filter page and use that as the order of the cluster. Preprocessed pages Cluster Sketches coalesce all pages in the cluster into a single page removing Initial Clustering the stop words we compute the TF/IDF 1. Compute of the remaining similarity words Select using set of TF/IDF terms on above NE a certain 2. Cluster threshold based (or on top N terms) TF/IDF similarity Initial clusters 1. Rank clusters 2. Extract cluster sketches 3. Rank pages in clusters Final clusters 1. Web Query 2. Feature Extraction 3. Skyline based classifier 4. Cluster refinement Person1 Refining Clusters using Web Queries Person3 Person2 29

Experiments Datasets WWW 05 By Bekkerman and McCallum 12 different Person Names all related to each other. Each Person name has different number of namesakes. SIGIR 05 By Artiles et al 9 different Person Names, each with different number of namesakes. WEPS For Semeval Workshop on WEPS task (by Artiles et al) Training dataset ( 49 Different Person Names) Test Dataset (30 Different Person Names) 30

Quality Measures F P : Harmonic mean of purity and inverse purity F B : Harmonic mean of B-cubed precision and recall. 31

Experiments Baselines Named Entity based Clustering (NE) TF/IDF merging scheme that uses the NEs in the Web Pages Web-based Co-Occurrence Clustering (WebDice) Computes the similarity between Web pages by summing up the individual dice similarities of People-People and People-Organization similarities of Web pages. 32

Experiments on WWW & SIGIR Datasets 33

Experiments on the WEPS Dataset 34

Effect of Initial Clustering 35

Conclusion and Future Work Web co-occurrence statistics can significantly improve disambiguation quality in WePS task. Skyline classifier is effective in making merge decisions. However, current approach has limitations: large number of search engine queries to collect statistics will work only as a server side solution given limitations today. Based on the assumption that context of different entities is very different. May not work as well on the difficult case when contexts of different entities significantly overlap. 36