Towards Breaking the Quality Curse: A Web-Querying Approach to Web People Search
Dmitri V. Kalashnikov, Rabia Nuray-Turan, Sharad Mehrotra
Dept. of Computer Science, University of California, Irvine
Searching for People on the Web
Approximately 5-10% of internet searches look for specific people by their names [Guha & Garg 04].
Challenges:
- Heterogeneity of representations: the same entity is represented in multiple ways (e.g., "Andrew McCallum's Homepage", "Associate Professor Andrew Kachites McCallum", "Author: McCallum, R. Andrew", DBLP: "Andrew McCallum" -- all the UMass professor).
- Ambiguity in representations: the same representation refers to multiple entities (e.g., Andrew McCallum the UMass professor, the Mental Floss musician, and the CROSS president in Australia).
WePS systems address both of these challenges.
Search Engines vs. WePS
A WePS system returns clusters of web pages in which pages about the same entity are grouped together. While Google returns a flat ranked list for "Andrew McCallum", a WePS system returns separate clusters for the UMass professor, the photographer, the physician, and so on.
Clustering Search Engines vs. WePS
Examples of clustering search engines: Clusty (http://www.clusty.com), Kartoo (http://www.kartoo.com).
Difference: while both return clustered search results, clustering search engines group similar/related pages by broad topic (e.g., research, photos), whereas a WePS system clusters by disambiguating among the various namesakes.
WePS System Architecture
WePS systems can be classified by where references are disambiguated:
- Client-side solution: the client issues the query, receives the returned pages from the server, and resolves the references itself.
- Proxy solution: a proxy sits between client and server; the proxy resolves references and returns clusters to the client.
- Server-side solution: the search server resolves references and returns clusters directly.
Current WePS Approaches
Commercial systems: Zoominfo (http://www.zoominfo.com), Spock (http://www.spock.com).
Multiple research proposals: [Kamha et al. 04; Artiles et al. 05 & 07; Bekkerman et al. 05; Bollegala et al. 06; Matsuo et al. 06], etc.
A web people search task was included in the SemEval workshop in 2007; 16 teams participated.
WePS techniques differ in the data analyzed for disambiguation and the clustering approach used. By type of data analyzed:
- Baseline: exploit information in the result pages only (keywords, named entities, hyperlinks, emails) [Bekkerman et al. 05; Li et al. 05; Elmacioglu et al. 07].
- Exploit external information: ontologies, encyclopedias [Kalashnikov et al. 07].
- Exploit Web content: crawling [Kanani et al. 07], search engine statistics.
Contributions of This Paper
- Exploiting co-occurrence statistics from the search engine for WePS.
- A novel classification approach based on skylines to support reference resolution.
- Outperforms existing techniques on multiple data sets.
Approach Overview
1. Retrieve the top-K web pages for the query.
2. Preprocessing: extract named entities (NEs) and filter them.
3. Initial clustering: compute similarity using TF/IDF on NEs and cluster pages by TF/IDF similarity.
4. Refining clusters using web queries: issue web queries, extract features, classify edges with the skyline-based classifier, and refine the clusters.
5. Postprocessing: rank clusters, extract cluster sketches, and rank pages within clusters.
The final output is one cluster of pages per person (Person1, Person2, Person3, ...).
Preprocessing: Named Entity Extraction
- Locate the query name on each web page.
- Extract the M neighboring named entities: people, organizations, locations.
- The Stanford Named Entity Recognizer and GATE are used.
Example -- People NEs: Lise Getoor, Ben Taskar, John McCallum, Charles; Organization NEs: UMASS, ACM, Queue; Location NEs: Amherst.
Preprocessing: Filtering Named Entities
Input: extracted NEs.
- Location filter: remove location NEs by looking them up in a dictionary.
- Ambiguous-name filter: remove one-word person names, such as "John".
- Ambiguous-organization filter: remove common English words.
- Same-last-name filter: remove person names that share a last name with the query name.
Output: cleaned NEs.
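The filtering step above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the dictionary and common-word list are tiny stand-ins, and the function name is hypothetical.

```python
# Sketch of the NE-filtering step from the slide. LOCATIONS and
# COMMON_WORDS are toy stand-ins for the real dictionaries.
LOCATIONS = {"amherst", "irvine"}
COMMON_WORDS = {"queue", "press", "group"}

def filter_nes(person_nes, org_nes, location_nes, query_name):
    query_last = query_name.split()[-1].lower()
    people = [
        p for p in person_nes
        if len(p.split()) > 1                      # drop one-word names ("Charles")
        and p.split()[-1].lower() != query_last    # drop same-last-name names
    ]
    orgs = [o for o in org_nes if o.lower() not in COMMON_WORDS]
    locs = [l for l in location_nes if l.lower() not in LOCATIONS]
    return people, orgs, locs

people, orgs, locs = filter_nes(
    ["Lise Getoor", "Ben Taskar", "Charles", "John McCallum"],
    ["UMASS", "ACM", "Queue"],
    ["Amherst"],
    "Andrew McCallum",
)
```

On the running example, "Charles" is dropped as a one-word name, "John McCallum" as a same-last-name match, "Queue" as a common English word, and "Amherst" as a location.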
Initial Clustering
Intuition: if two web pages significantly overlap in terms of their associated NEs, the two pages refer to the same individual.
Approach:
- Compute similarity between web pages based on TF/IDF over named entities.
- Merge all pairs of web pages whose TF/IDF similarity exceeds a threshold (e.g., h = 0.5).
- The threshold is learned over the training dataset.
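A minimal sketch of this step, under stated assumptions: TF/IDF weighting with cosine similarity over NE multisets, transitive merging via union-find, and toy pages with an illustrative threshold. The exact weighting scheme the authors use may differ.

```python
# Initial clustering sketch: TF/IDF vectors over each page's NEs,
# cosine similarity, merge pairs above a threshold via union-find.
import math
from collections import Counter

def tfidf_vectors(pages):
    n = len(pages)
    df = Counter(ne for nes in pages for ne in set(nes))
    return [{ne: tf * math.log(n / df[ne]) for ne, tf in Counter(nes).items()}
            for nes in pages]

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0.0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def initial_clusters(pages, threshold=0.5):
    vecs = tfidf_vectors(pages)
    parent = list(range(len(pages)))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(len(pages)):
        for j in range(i + 1, len(pages)):
            if cosine(vecs[i], vecs[j]) > threshold:
                parent[find(i)] = find(j)   # merge the two clusters
    clusters = {}
    for i in range(len(pages)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

pages = [
    ["Lise Getoor", "UMASS", "ACM"],                # namesake A
    ["Lise Getoor", "UMASS", "ACM", "Ben Taskar"],  # namesake A
    ["Mental Floss", "CROSS"],                      # namesake B
]
```

With these toy pages, the first two overlap heavily and merge, while the third stays alone.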
Cluster Refinement
Initial clustering based on TF/IDF similarity is:
- Accurate: the references it resolves are usually correct.
- Conservative: it leaves too many references unresolved.
Key observation: the Web may contain additional information that can further resolve the unresolved references. For example, a page for "Andrew McCallum, UMASS" and a page for "Andrew McCallum, CMU" may have negligible TF/IDF similarity (0.01), yet other pages ("Andrew McCallum, Job: UMASS, PhD: CMU") connect the two contexts. Co-occurrence between the contexts associated with unresolved references can be learned by querying the search engine.
Using Co-occurrence to Disambiguate
Given two unresolved pages D1 ("Andrew McCallum" with context C1 = UMASS) and D2 ("Andrew McCallum" with context C2 = CMU), query the search engine for N.C1.C2, the number of web pages containing "Andrew McCallum" AND "UMASS" AND "CMU"; N is the number of pages containing "Andrew McCallum" alone. Intuition: the overlap between contexts associated with two pages of the same entity is expected to be higher than for pages of unrelated entities.
Should we merge pages Di and Dj based on the raw count N.Ci.Cj? No: if Ci and Cj are very general contexts on the Web, the co-occurrence counts can be high for all namesakes. Solution: normalize N.Ci.Cj using Dice similarity.
Dice Similarity
Two possible versions of Dice similarity (reconstructed to be consistent with the counts on the Dice2 example slide):
  Dice1 = N.Ci.Cj / (N.Ci + N.Cj)
  Dice2 = N.Ci.Cj / (N + Ci.Cj)
Dice1 will be large for general context words, since it never accounts for how frequent the contexts themselves are on the Web; Dice2 penalizes contexts that are very common.
Dice2 Similarity -- Example
Let the query be Q: "Andrew McCallum" AND "Yahoo" AND "Google", with hit counts:
- "Andrew McCallum" AND "Yahoo" AND "Google": 522
- "Andrew McCallum" AND "Yahoo": 2,470
- "Andrew McCallum" AND "Google": 8,950
- "Andrew McCallum": 38,500
- "Yahoo" AND "Google": 73,000,000
Then Dice1 = 0.05 and Dice2 = 7.15e-6. Dice2 captures that Yahoo and Google are too unspecific as organization entities to be a good context for Andrew McCallum. Dice2 leads to visibly better results in practice.
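The slide's numbers can be reproduced directly from the reported hit counts. Note the formulas inferred here lack the conventional factor of 2 in the numerator of the textbook Dice coefficient; that is what the slide's values imply.

```python
# Reproducing the slide's Dice1/Dice2 values from the reported counts.
# Formulas are inferred from the slide's numbers (no factor of 2).
n_o1_o2 = 522         # "Andrew McCallum" AND Yahoo AND Google
n_o1 = 2470           # "Andrew McCallum" AND Yahoo
n_o2 = 8950           # "Andrew McCallum" AND Google
n = 38500             # "Andrew McCallum"
o1_o2 = 73_000_000    # Yahoo AND Google

dice1 = n_o1_o2 / (n_o1 + n_o2)   # ignores how common the contexts are
dice2 = n_o1_o2 / (n + o1_o2)     # penalizes very general contexts
```

The huge Yahoo-AND-Google count in the denominator is what drives Dice2 down to 7.15e-6.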
Web Queries to Gather Co-occurrence Statistics
For a pair of clusters, form context groups from their NEs: for one cluster, people P_i = (Lise Getoor OR Ben Taskar) and organizations O_i = (UMASS OR ACM); for the other, P_j = (Charles A. Sutton OR Ben Wellner) and O_j = (DBLP OR KDD). Four queries measure the association between the contexts:
- Q1 = P_i AND P_j (#97)
- Q2 = P_i AND O_j (#2,570)
- Q3 = O_i AND P_j (#720)
- Q4 = O_i AND O_j (#2,190K)
Four more queries restrict each of these to pages relevant to the original query name N:
- N AND Q1 (#79), N AND Q2 (#49), N AND Q3 (#63), N AND Q4 (#10,100)
In total, 8 web queries per cluster pair. Example: Q1 = ("Lise Getoor" OR "Ben Taskar") AND ("Charles A. Sutton" OR "Ben Wellner").
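The eight boolean query strings could be assembled as sketched below; the helper names are illustrative, not the authors' code, and the quoting follows common search-engine syntax.

```python
# Sketch: build the 8 boolean queries for a pair of clusters.
def or_group(terms):
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

def build_queries(name, people_i, orgs_i, people_j, orgs_j):
    groups = [
        (or_group(people_i), or_group(people_j)),  # Q1: people-people
        (or_group(people_i), or_group(orgs_j)),    # Q2: people-organization
        (or_group(orgs_i), or_group(people_j)),    # Q3: organization-people
        (or_group(orgs_i), or_group(orgs_j)),      # Q4: organization-organization
    ]
    queries = [f"{a} AND {b}" for a, b in groups]
    # The other four restrict each query to pages about the name N.
    queries += [f'"{name}" AND {q}' for q in queries[:4]]
    return queries

qs = build_queries(
    "Andrew McCallum",
    ["Lise Getoor", "Ben Taskar"], ["UMASS", "ACM"],
    ["Charles A. Sutton", "Ben Wellner"], ["DBLP", "KDD"],
)
```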
Creating Features from Web Co-occurrence Counts
From the 8 query counts, build an 8-dimensional feature vector: for each context pair (P_i P_j, P_i O_j, O_i P_j, O_i O_j), one normalized count and one Dice score. On the slide's example (with the count for N alone around 152K):
1: N.P_i.P_j = 79/152K = 0.00052
2: Dice(N, P_i.P_j) = 0.000918
3: N.P_i.O_j = 49/152K = 0.00032
4: Dice(N, P_i.O_j) = 0.000561
5: N.O_i.P_j = 63/152K = 0.00041
6: Dice(N, O_i.P_j) = 0.00073
7: N.O_i.O_j = 10,100/152K = 0.066
8: Dice(N, O_i.O_j) = 0.008552
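A sketch of turning the eight hit counts into the 8D feature vector: one normalized count and one Dice score per context pair. The Dice normalization here is the textbook 2|A∩B|/(|A|+|B|) form and is an assumption; the slide's exact normalization may differ slightly.

```python
# Sketch: 8D feature vector from the 8 co-occurrence hit counts.
def features(n_count, pair_counts, restricted_counts):
    """pair_counts[k] = |Qk|; restricted_counts[k] = |N AND Qk|."""
    feats = []
    for q, nq in zip(pair_counts, restricted_counts):
        feats.append(nq / n_count)              # normalized co-occurrence count
        feats.append(2 * nq / (n_count + q))    # Dice(N, Qk), textbook form
    return feats

f = features(
    152_000,                        # count for the name N alone (from slide)
    [97, 2570, 720, 2_190_000],     # |Q1| .. |Q4|
    [79, 49, 63, 10_100],           # |N AND Q1| .. |N AND Q4|
)
```

On this example, the organization-organization count feature dominates the vector, reflecting how much more often organizations co-occur on the Web than specific people.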
Role of a Classifier
Pipeline: initial clusters -> web co-occurrence feature generation -> edges with 8D features -> edge classification -> final clusters.
A classifier marks edges as merge (+) or do-not-merge (-), yielding the final clustering. Since the features are similarity-based, the classifier should satisfy the dominance property.
Dominance Property
Let f and g be 8D feature vectors, f = (f1, f2, ..., f8) and g = (g1, g2, ..., g8).
Notation:
- f up-dominates g if f1 >= g1, f2 >= g2, ..., f8 >= g8.
- f down-dominates g if f1 <= g1, f2 <= g2, ..., f8 <= g8.
Dominance property for a classifier: let f and g be the features associated with edges (d_i, d_j) and (d_m, d_n), respectively. If g up-dominates f and the classifier decides to merge (d_i, d_j), then it must also merge (d_m, d_n).
Skyline-Based Classification
A skyline classifier consists of an up-dominating classification skyline:
- All points above or on the skyline are declared merge (+).
- The rest are do-not-merge (-).
This construction automatically satisfies the dominance property. Our goal is to learn a classification skyline that yields a good final clustering of the search results.
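A minimal sketch of such a classifier, in the sense the slides describe: the skyline is a set of points, and an edge is labeled merge (+) iff its feature vector up-dominates (is >= componentwise to) at least one skyline point. The 2D example data are illustrative; the real feature space is 8D.

```python
# Skyline classifier sketch: merge iff the feature vector lies
# on or above some point of the classification skyline.
def up_dominates(f, g):
    """True if f >= g in every coordinate."""
    return all(fi >= gi for fi, gi in zip(f, g))

def skyline_classify(skyline, f):
    """Merge (+) iff f up-dominates at least one skyline point."""
    return any(up_dominates(f, s) for s in skyline)

skyline = [(0.4, 0.1), (0.2, 0.3)]        # toy 2D classification skyline
merged = skyline_classify(skyline, (0.5, 0.2))   # above (0.4, 0.1)
```

The dominance property holds by construction: any vector that up-dominates a merged vector also up-dominates the same skyline point, so it is merged too.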
Learning the Classification Skyline
Greedy approach:
- Initialize the classification skyline (CS) with the point that dominates the entire space.
- Maintain a pool of candidate points to add to the CS: the down-dominating skyline of the active + edges.
- Maintain the current clustering, i.e., the clustering induced by the current CS.
- Choose the point from the pool to add to the CS that:
  Stage 1: minimizes the loss of quality due to the additional edges the new point forces to merge;
  Stage 2: maximizes the overall quality once the point's own edge is added.
- Iterate until done; output the best skyline discovered.
Example: Adding a Point to the CS
Setup: a ground-truth clustering, an initial clustering, the current CS = {q, d, c}, the clustering induced by the current CS, and a pool of candidate points = {f, e, u, h}.
Example: Considering f
What if we add f to the CS? Stage 1: adding edge k gives F_P = 0.805. Stage 2: adding edge f gives F_P = 0.860.
Example: Considering e
What if we add e to the CS? Stage 1: adding edges k and v gives F_P = 0.718. Stage 2: adding edge e gives F_P = 0.774.
Example: Considering u
What if we add u to the CS? Stage 1: adding edge i gives F_P = 0.805. Stage 2: adding edge u gives F_P = 0.833.
Example: Considering h
What if we add h to the CS? Stage 1: adding edges i and t gives F_P = 0.625. Stage 2: adding edge h gives F_P = 0.741.
Example: Stage 2
After Stage 1, only f and u remain (both tied at F_P = 0.805). Stage 2 breaks the tie: F_P = 0.860 for f versus F_P = 0.833 for u. Thus f is added to the CS.
Why a Skyline-Based Classifier?
It is a specialized classifier, designed specifically for the problem at hand:
- It is aware of the clustering underneath.
- It fine-tunes itself to a given clustering quality metric, such as F1, F_P, or F_B.
- It takes into account the dominance in the 8D data (explained earlier).
- It takes into account the transitivity of merges: classifying one edge as + might lead to classifying multiple edges as +, for instance when two large clusters are merged.
Post Processing
- Page ranking: use the original ranking from the search engine.
- Cluster ranking: for each cluster, simply select the highest-ranked page and use its rank as the order of the cluster.
- Cluster sketches: coalesce all pages in a cluster into a single page; after removing stop words, compute the TF/IDF of the remaining words and select the set of terms above a certain threshold (or the top-N terms).
Experiments: Datasets
- WWW'05 (by Bekkerman and McCallum): 12 person names, all related to each other; each name has a different number of namesakes.
- SIGIR'05 (by Artiles et al.): 9 person names, each with a different number of namesakes.
- WEPS (for the SemEval workshop on the WePS task, by Artiles et al.): a training dataset (49 person names) and a test dataset (30 person names).
Quality Measures
- F_P: harmonic mean of purity and inverse purity.
- F_B: harmonic mean of B-cubed precision and recall.
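F_P can be computed as sketched below, assuming the standard definitions of purity and inverse purity (each output cluster votes for its best-matching ground-truth class, and vice versa); clusters and classes are represented as sets of page ids.

```python
# F_P sketch: harmonic mean of purity and inverse purity.
def purity(clusters, classes, n):
    # Each cluster contributes its overlap with its best-matching class.
    return sum(max(len(c & t) for t in classes) for c in clusters) / n

def f_p(clusters, classes):
    n = sum(len(c) for c in clusters)
    p = purity(clusters, classes, n)
    ip = purity(classes, clusters, n)   # inverse purity swaps the roles
    return 2 * p * ip / (p + ip)

clusters = [{1, 2, 3}, {4, 5}]          # system output (toy example)
classes = [{1, 2}, {3, 4, 5}]           # ground truth (toy example)
```

On the toy example both purity and inverse purity are 0.8, so F_P = 0.8.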
Experiments: Baselines
- Named-entity based clustering (NE): the TF/IDF merging scheme that uses the NEs in the web pages.
- Web-based co-occurrence clustering (WebDice): computes the similarity between web pages by summing the individual Dice similarities of the People-People and People-Organization contexts of the web pages.
Experiments on the WWW & SIGIR Datasets [results figure]
Experiments on the WEPS Dataset [results figure]
Effect of Initial Clustering [results figure]
Conclusion and Future Work
- Web co-occurrence statistics can significantly improve disambiguation quality in the WePS task.
- The skyline classifier is effective in making merge decisions.
- However, the current approach has limitations:
  - It requires a large number of search engine queries to collect statistics, so given today's constraints it will work only as a server-side solution.
  - It is based on the assumption that the contexts of different entities are very different; it may not work as well in the difficult case when the contexts of different entities significantly overlap.