Modeling User Behavior and Interactions Lecture 3: Improving Ranking with Behavior Data 1
Lecture 3 Plan: Automatic relevance labels; Enriching feature space; Dealing with data sparseness; Dealing with scale; Active learning; Ranking for diversity; Fun and games 2
Review: Learning to Rank 3
Ordinal Regression Approaches 4
Learning to Rank Summary 5
Approaches to Use Behavior Data 6
Recap: Available Behavior Data 7
Training Examples from Click Data [ Joachims 2002 ] 8
Loss Function [ Joachims 2002 ] 9
Learned Retrieval Function [ Joachims 2002 ] 10
Features [ Joachims 2002 ] 11
Results [Joachims 2002] Summary: the learned retrieval function outperforms all base methods in the experiment; learning from clickthrough data is possible; relative preferences are useful training data. 12
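The preference-extraction idea above can be sketched in a few lines. This is an illustrative implementation of the "a clicked result is preferred over unclicked results ranked above it" heuristic in the spirit of Joachims (KDD 2002); the function name and data format are assumptions, not the paper's code.

```python
# Extract pairwise preference examples from one result page's click record.
# Heuristic (assumed): a clicked doc is preferred over every unclicked doc
# that was ranked above it.

def preferences_from_clicks(ranking, clicked):
    """ranking: doc ids in the order shown; clicked: set of clicked doc ids.
    Returns (preferred, less_preferred) pairs."""
    pairs = []
    for i, doc in enumerate(ranking):
        if doc in clicked:
            for skipped in ranking[:i]:
                if skipped not in clicked:
                    pairs.append((doc, skipped))
    return pairs

# d3 was clicked while d1 and d2 above it were skipped:
pairs = preferences_from_clicks(["d1", "d2", "d3", "d4"], {"d3"})
print(pairs)  # [('d3', 'd1'), ('d3', 'd2')]
```

Note that the heuristic yields only relative judgments: a click on the top result produces no pairs at all.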
Extension: Query Chains [Radlinski & Joachims, KDD 2005] 13
Query Chains (Cont'd) [Radlinski & Joachims, KDD 2005] 14
Query Chains (Results) [Radlinski & Joachims, KDD 2005] 15
Lecture 3 Plan: Automatic relevance labels; Enriching feature space; Dealing with data sparseness; Dealing with scale; Active learning; Ranking for diversity; Fun and games 16
Incorporating Behavior for Static Rank [Richardson et al., WWW2006] Web Crawl Build Index Answer Queries Which pages to crawl Efficient index order Informs dynamic ranking Static Rank 17
fRank: Machine Learning Model [Richardson et al., WWW2006] (diagram: features extracted from the Web feed a machine learning model) 18
Features: Summary [Richardson et al., WWW2006] 19
Features: Popularity [Richardson et al., WWW2006]

Function   | Example
Exact URL  | cnn.com/2005/tech/wikipedia.html?v=mobile
No Params  | cnn.com/2005/tech/wikipedia.html
Page       | wikipedia.html
URL-1      | cnn.com/2005/tech
URL-2      | cnn.com/2005
Domain     | cnn.com
Domain+1   | cnn.com/2005

Eugene Agichtein, Emory University, RuSSIR 2009 (Petrozavodsk, Russia) 20
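The generalization levels in the table can be computed mechanically from a URL. The sketch below reproduces the table's levels; the function name and string handling are illustrative assumptions, not the paper's implementation.

```python
# Compute the popularity-feature generalization levels for a URL.

def url_variants(url):
    no_params = url.split("?")[0]          # strip query parameters
    segments = no_params.split("/")        # [domain, path segment, ...]
    variants = {
        "Exact URL": url,
        "No Params": no_params,
        "Page": segments[-1],              # last path segment
        "Domain": segments[0],
    }
    # URL-k: drop the last k path segments
    for k in (1, 2):
        if len(segments) - k >= 2:
            variants[f"URL-{k}"] = "/".join(segments[:len(segments) - k])
    # Domain+1: domain plus the first path segment
    if len(segments) > 1:
        variants["Domain+1"] = "/".join(segments[:2])
    return variants

v = url_variants("cnn.com/2005/tech/wikipedia.html?v=mobile")
print(v["URL-1"])     # cnn.com/2005/tech
print(v["Domain+1"])  # cnn.com/2005
```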
Features: Anchor, Page, Domain [Richardson et al., WWW2006] 21
Data [Richardson et al., WWW2006] 22
Becoming Query Independent [Richardson et al., WWW2006] 23
Measure [Richardson et al., WWW2006]: pairwise accuracy = |pairs the ranker orders correctly| / |pairs ordered by human judges| 24
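The pairwise-accuracy measure can be computed directly from judged and predicted scores. A small illustrative sketch; the function name and score format are assumptions:

```python
# Pairwise accuracy: fraction of human-ordered document pairs whose relative
# order the model's scores reproduce. Pairs the human ties are skipped.

def pairwise_accuracy(human_scores, model_scores):
    """Both args: dict doc -> score, over the same documents."""
    docs = list(human_scores)
    agree = total = 0
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            a, b = docs[i], docs[j]
            if human_scores[a] == human_scores[b]:
                continue  # human expresses no preference for this pair
            total += 1
            if (human_scores[a] > human_scores[b]) == (model_scores[a] > model_scores[b]):
                agree += 1
    return agree / total

acc = pairwise_accuracy({"d1": 3, "d2": 2, "d3": 1},
                        {"d1": 0.9, "d2": 0.1, "d3": 0.5})
print(acc)  # 2 of 3 ordered pairs agree -> 0.666...
```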
RankNet, Burges et al., ICML 2005 [Richardson et al., WWW2006] Feature Vector Label NN output Error is function of label and output 25
RankNet [Burges et al. 2005] [Richardson et al., WWW2006] Feature Vector1 Label1 NN output 1 26
RankNet [Burges et al. 2005] [Richardson et al., WWW2006] Feature Vector2 Label2 NN output 1 NN output 2 27
RankNet [Burges et al. 2005] [Richardson et al., WWW2006] NN output 1 NN output 2 Error is function of both outputs (Desire output1 > output2) 28
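The "error is a function of both outputs" idea is the heart of RankNet. Below is a minimal sketch of the RankNet pairwise loss (Burges et al., 2005): a cross-entropy between the target preference and a logistic of the score difference; the function name and default target are illustrative choices.

```python
import math

def ranknet_loss(s1, s2, p_target=1.0):
    """Cross-entropy between target P(doc1 > doc2) and sigmoid(s1 - s2).
    p_target=1.0 means doc1 is known to be preferred over doc2."""
    p = 1.0 / (1.0 + math.exp(-(s1 - s2)))
    return -p_target * math.log(p) - (1 - p_target) * math.log(1 - p)

# A correctly ordered pair (output1 > output2) incurs a lower loss:
print(ranknet_loss(2.0, 0.5) < ranknet_loss(0.5, 2.0))  # True
```

Gradients of this loss with respect to s1 and s2 are what get backpropagated through the shared network for each document in the pair.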
RankNet [Burges et al. 2005] [Richardson et al., WWW2006] Feature Vector1 NN output 29
Experimental Methodology [Richardson et al., WWW2006] 30
Accuracy of Each Feature Set [Richardson et al., WWW2006] 70 65 60 55 50 PageRank Popularity Anchor Page Domain (1) (14) (5) (8) (7) All Feature Set Accuracy (%) PageRank 56.70 Popularity 60.82 Anchor 59.09 Page 63.93 Domain 59.03 All Features 67.43 31
Qualitative Evaluation [Richardson et al., WWW2006] (example result lists: Technology Oriented vs. Consumer Oriented) 32
Behavior for Dynamic Ranking [Agichtein et al., SIGIR2006] Sample Behavior Features (from Lecture 2)

Presentation:
  ResultPosition     | Position of the URL in current ranking
  QueryTitleOverlap  | Fraction of query terms in result title
Clickthrough:
  DeliberationTime   | Seconds between query and first click
  ClickFrequency     | Fraction of all clicks landing on page
  ClickDeviation     | Deviation from expected click frequency
Browsing:
  DwellTime          | Result page dwell time
  DwellTimeDeviation | Deviation from expected dwell time for query

33
Feature Merging: Details [Agichtein et al., SIGIR2006]
Query: SIGIR, fake results w/ fake feature values

Result URL          | BM25 | PageRank | Clicks | DwellTime
sigir2007.org       | 2.4  | 0.5      | ?      | ?
sigir2006.org       | 1.4  | 1.1      | 150    | 145.2
acm.org/sigs/sigir/ | 1.2  | 2        | 60     | 23.5

34
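The "?" entries above are the practical problem: behavior features are missing for results that were rarely shown or clicked. The sketch below merges content and behavior features with a simple default-plus-indicator backoff; this handling, and all names, are illustrative assumptions rather than the paper's scheme.

```python
# Merge per-URL content features with (possibly missing) behavior features.

def merge_features(content, behavior, behavior_defaults):
    """content/behavior: dict url -> dict of features. URLs absent from the
    behavior log get default values plus a 'has_behavior' indicator."""
    merged = {}
    for url, feats in content.items():
        b = behavior.get(url)
        row = dict(feats)
        row["has_behavior"] = 1.0 if b else 0.0
        row.update(b if b else behavior_defaults)
        merged[url] = row
    return merged

content = {"sigir2007.org": {"bm25": 2.4, "pagerank": 0.5},
           "sigir2006.org": {"bm25": 1.4, "pagerank": 1.1}}
behavior = {"sigir2006.org": {"clicks": 150, "dwell": 145.2}}
m = merge_features(content, behavior, {"clicks": 0, "dwell": 0.0})
print(m["sigir2007.org"]["clicks"])  # 0 (no behavior data observed)
```

The indicator feature lets the learner distinguish "zero clicks observed" from "never shown", which a bare default value would conflate.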
Results for Incorporating Behavior into Ranking [Agichtein et al., SIGIR2006]
(Figure: NDCG at K = 1..10 for RN, Rerank-All, and RN+All)

Method   | MAP   | Gain
RN       | 0.270 |
RN+ALL   | 0.321 | 0.052 (19.13%)
BM25     | 0.236 |
BM25+ALL | 0.292 | 0.056 (23.71%)

35
Which Queries Benefit Most [Agichtein et al., SIGIR2006] (Figure: query frequency and average gain, bucketed by original ranking quality) Most gains are for queries with poor original ranking 36
Lecture 3 Plan Automatic relevance labels Enriching feature space Dealing with data sparseness Dealing with Scale Active learning Ranking for diversity Fun and games 37
Extension to Unseen Queries/Documents: Search Trails [Bilenko and White, WWW 2008] Trails start with a search engine query and continue until a terminating event: another search; a visit to an unrelated site (social networks, webmail); a timeout, the browser homepage, or the browser closing 38
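The trail definition above translates into a simple log-segmentation routine. The sketch below is an assumption-laden illustration: the event format, the 30-minute inactivity cutoff, and the unrelated-site list are all hypothetical, not Bilenko and White's actual pipeline.

```python
# Segment a per-user browsing log into search trails.

TIMEOUT = 30 * 60  # assumed inactivity cutoff, seconds
UNRELATED = {"webmail.example", "social.example"}  # hypothetical site list

def split_trails(events):
    """events: time-ordered list of (timestamp, kind, value),
    where kind is 'query' or 'visit'. Returns a list of trails."""
    trails, current, last_t = [], None, None
    for t, kind, value in events:
        timed_out = current is not None and last_t is not None and t - last_t > TIMEOUT
        if kind == "query" or timed_out or (kind == "visit" and value in UNRELATED):
            if current:                     # terminating event: close the trail
                trails.append(current)
            # a new query starts a new trail; other terminators leave no trail open
            current = [(t, kind, value)] if kind == "query" else None
        elif current is not None:           # in-trail page visit
            current.append((t, kind, value))
        last_t = t
    if current:
        trails.append(current)
    return trails

events = [(0, "query", "ir books"), (10, "visit", "site1.example"),
          (20, "visit", "webmail.example"), (30, "query", "sigir"),
          (40, "visit", "site2.example")]
print(len(split_trails(events)))  # 2: the webmail visit ends the first trail
```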
Probabilistic Model [Bilenko and White, WWW 2008] IR via language modeling [Zhai-Lafferty, Lavrenko] Query-term distribution gives more mass to rare terms: Term-website weights 39
Results: Learning to Rank [Bilenko and White, WWW 2008] Add the probabilistic model's score as a feature to RankNet (Figure: NDCG@1, NDCG@3, NDCG@10 for Baseline, Baseline+Heuristic, Baseline+Probabilistic, Baseline+Probabilistic+RW) 40
Scalability: (Peta?)bytes of Click Data BBM: Bayesian Browsing Model from Petabyte-scale Data, Liu et al, KDD 2009 query URL 1 URL 2 URL 3 URL 4 S 1 S 2 S 3 S 4 Relevance E 1 E 2 E 3 E 4 Examine Snippet C 1 C 2 C 3 C 4 ClickThroughs 41
Training BBM: One-Pass Counting BBM: Bayesian Browsing Model from Petabyte-scale Data, Liu et al, KDD 2009 Find R j 42
Training BBM on MapReduce BBM: Bayesian Browsing Model from Petabyte-scale Data, Liu et al, KDD 2009 Map: emit((q,u), idx) Reduce: construct the count vector 43
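The map/reduce shape named on the slide can be simulated on one machine. This sketch only illustrates that shape: map emits ((query, url), position) pairs and reduce builds per-(query, url) count vectors; Liu et al.'s actual one-pass counting for BBM is more involved, so the details below are assumptions.

```python
from collections import defaultdict

def map_phase(sessions):
    """sessions: list of (query, [urls shown in rank order]).
    Emit ((query, url), position) for every impression."""
    for query, urls in sessions:
        for idx, url in enumerate(urls):
            yield (query, url), idx

def reduce_phase(pairs, n_positions):
    """Group emitted pairs by key and build a per-position count vector."""
    counts = defaultdict(lambda: [0] * n_positions)
    for key, idx in pairs:
        counts[key][idx] += 1
    return dict(counts)

sessions = [("sigir", ["a.org", "b.org"]), ("sigir", ["b.org", "a.org"])]
counts = reduce_phase(map_phase(sessions), n_positions=2)
print(counts[("sigir", "a.org")])  # [1, 1]: shown once at each position
```

Because the reduce step only needs counts per key, the job parallelizes trivially over (query, url) pairs, which is what makes the petabyte-scale single-pass training feasible.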
Model Comparison on Efficiency BBM: Bayesian Browsing Model from Petabyte-scale Data, Liu et al, KDD 2009 57 times faster 44
Large-Scale Experiment BBM: Bayesian Browsing Model from Petabyte-scale Data, Liu et al, KDD 2009 Setup: 8 weeks of data, 8 jobs; job k takes the first k weeks of data. Experiment platform: SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets [Chaiken et al, VLDB 08] 45
Scalability of BBM BBM: Bayesian Browsing Model from Petabyte-scale Data, Liu et al, KDD 2009 Increasing computation load (more queries, more URLs, more impressions) with near-constant elapsed time of about 3 hours: scans 265 terabytes of data and computes full posteriors for 1.15 billion (query, URL) pairs (Figure: computation load vs. elapsed time on SCOPE) 46
Lecture 3 Plan Automatic relevance labels Enriching feature space Dealing with data sparseness Dealing with Scale Active learning Ranking for diversity 47
New Direction: Active Learning [Radlinski & Joachims, KDD 2007] Goal: Learn the relevances with as little training data as possible. Search involves a three-step process: 1. Given relevance estimates, present a ranking to users. 2. Given a ranking, users provide feedback: user clicks provide pairwise relevance judgments. 3. Given feedback, update the relevance estimates. 48
Overview of Approach [Radlinski & Joachims, KDD 2007] Available information: 1. Have an estimate of the relevance of each result. 2. Can obtain pairwise comparisons of the top few results. 3. Do not have absolute relevance information. Goal: Learn the document relevance quickly. Will address four questions: 1. How to represent knowledge about doc relevance. 2. How to maintain this knowledge as we collect data. 3. Given our knowledge, what is the best ranking? 4. What rankings do we show users to get useful data? 49
1: Representing Document Relevance [Radlinski & Joachims, KDD 2007] Given a query q, let M* = (µ*1, ..., µ*C) be the true relevance values of the documents. Model knowledge of M* with a Bayesian posterior: P(M | D) = P(D | M) P(M) / P(D). Assume P(M | D) is a spherical multivariate normal: P(M | D) = N(ν1, ..., νC; σ1², ..., σC²) 50
1: Representing Document Relevance [Radlinski & Joachims, KDD 2007] Given a fixed query, maintain knowledge about relevance as clicks are observed. This tells us which documents we are sure about, and which ones need more data. 51
2: Maintaining P(M | D) [Radlinski & Joachims, KDD 2007] Model noisy pairwise judgments with the Bradley-Terry model [Bradley & Terry 1952]. Adding a Gaussian prior, apply an off-the-shelf algorithm to maintain the posterior, commonly used for chess ratings [Glickman 1999] 52
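The pairwise machinery named above can be sketched concretely. Below, the Bradley-Terry model gives P(i beats j) from latent relevances, and a simple gradient step nudges the estimates after each observed click preference. The logistic form and the learning rate are illustrative assumptions; this is not Glickman's exact rating algorithm and omits the variance bookkeeping.

```python
import math

def p_win(mu_i, mu_j):
    """Bradley-Terry (logistic) probability that item i beats item j."""
    return 1.0 / (1.0 + math.exp(mu_j - mu_i))

def update(mu, winner, loser, lr=0.5):
    """One gradient step on the log-likelihood of 'winner beat loser'."""
    g = 1.0 - p_win(mu[winner], mu[loser])  # surprise: large for an upset
    mu[winner] += lr * g
    mu[loser] -= lr * g

mu = {"d1": 0.0, "d2": 0.0}
update(mu, winner="d2", loser="d1")  # a click preferring d2 over d1
print(mu["d2"] > mu["d1"])  # True
```

Upsets (wins by low-rated items) move the estimates more than expected wins, which is exactly the behavior that directs data collection toward uncertain documents.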
3: Ranking (Inference) [Radlinski & Joachims, KDD 2007] Want to assign relevances M' = (µ'1, ..., µ'C) such that the loss L(M', M*) is small, but M* is unknown. Minimize the expected pairwise loss. 53
4: Getting Useful Data [Radlinski & Joachims, KDD 2007] Naive approach: could present the ranking based on the best estimate of relevance. Then the data we get would always be about the documents already ranked highly. Instead, the ranking shown to users: 1. Pick the top two docs to minimize expected loss. 2. Append the current best-estimate ranking. 54
4: Exploration Strategies [Radlinski & Joachims, KDD 2007] 55
4: Loss Functions [Radlinski & Joachims, KDD 2007] 56
Results: TREC Data [Radlinski & Joachims, KDD 2007] Optimizing for ___ performs better than for ___ 57
Need for Diversity (in IR) [Predicting Diverse Subsets Using Structural SVMs, Y. Yue and Joachims, ICML 2008] Ambiguous Queries: users with different information needs issuing the same textual query ("Jaguar"). Informational (Exploratory) Queries: user interested in a specific detail or the entire breadth of knowledge available [Swaminathan et al., 2008]; want results with high information diversity 58
Optimizing for Diversity [Predicting Diverse Subsets Using Structural SVMs, Y. Yue and Joachims, ICML 2008] Long interest in the IR community. Requires modeling dependencies between results, which is not possible with current learning-to-rank methods. Problem: no consensus on how to measure diversity. Formulate as predicting diverse subsets. Experiment: use training data with explicitly labeled subtopics (TREC 6-8 Interactive Track); use a loss function to encode subtopic loss; train using structural SVMs [Tsochantaridis et al., 2005] 59
Representing Diversity Existing datasets with manual subtopic labels E.g., Use of robots in the world today Nanorobots Space mission robots Underwater robots [Predicting Diverse Subsets Using Structural SVMs, Y. Yue and Joachims, ICML 2008] Manual partitioning of the total information regarding a query Relatively reliable 60
Example [Predicting Diverse Subsets Using Structural SVMs, Y. Yue and Joachims, ICML 2008] Choose K documents with maximal information coverage. For K = 3, optimal set is {D1, D2, D10} 61
Maximizing Subtopic Coverage [Predicting Diverse Subsets Using Structural SVMs, Y. Yue and Joachims, ICML 2008] Select K documents which collectively cover as many subtopics as possible. Perfect selection takes (n choose K) time: a set cover problem. Greedy selection gives a (1-1/e)-approximation bound, a special case of Max Coverage (Khuller et al., 1997) 62
Weighted Word Coverage More distinct words = more information; weight word importance; does not depend on human labels. Select K documents which collectively cover as many distinct (weighted) words as possible [Predicting Diverse Subsets Using Structural SVMs, Y. Yue and Joachims, ICML 2008] Greedy selection also yields the (1-1/e) bound. Need to find a good weighting function (a learning problem). 63
Example [Predicting Diverse Subsets Using Structural SVMs, Y. Yue and Joachims, ICML 2008]

Document word counts (X = word appears in document):
     V1  V2  V3  V4  V5
D1            X   X   X
D2        X       X   X
D3    X   X   X   X

Word benefit: V1=1, V2=2, V3=3, V4=4, V5=5

Marginal benefit:
        D1  D2  D3  Best
Iter 1  12  11  10  D1

64
Example (Cont'd) [Predicting Diverse Subsets Using Structural SVMs, Y. Yue and Joachims, ICML 2008]

Document word counts (X = word appears in document):
     V1  V2  V3  V4  V5
D1            X   X   X
D2        X       X   X
D3    X   X   X   X

Word benefit: V1=1, V2=2, V3=3, V4=4, V5=5

Marginal benefit:
        D1  D2  D3  Best
Iter 1  12  11  10  D1
Iter 2  --   2   3  D3

65
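The greedy weighted-word-coverage selection in the worked example can be sketched directly. Word weights are V1..V5 = 1..5; the document-word incidence below (D1 covers V3-V5, D2 covers V2, V4, V5, D3 covers V1-V4) is inferred from the marginal-benefit numbers on the slide and should be read as an assumption.

```python
# Greedy weighted word coverage with a (1 - 1/e) approximation guarantee.

WEIGHTS = {"V1": 1, "V2": 2, "V3": 3, "V4": 4, "V5": 5}
DOCS = {"D1": {"V3", "V4", "V5"},
        "D2": {"V2", "V4", "V5"},
        "D3": {"V1", "V2", "V3", "V4"}}

def greedy_select(docs, weights, k):
    """Pick k documents, each maximizing the weight of newly covered words."""
    covered, picked = set(), []
    for _ in range(k):
        def marginal(d):
            return sum(weights[w] for w in docs[d] - covered)
        best = max((d for d in docs if d not in picked), key=marginal)
        picked.append(best)
        covered |= docs[best]
    return picked

print(greedy_select(DOCS, WEIGHTS, 2))  # ['D1', 'D3'] (benefits 12, then 3)
```

The run reproduces the slide's iterations: first D1 (marginal benefits 12, 11, 10), then D3 (remaining marginals 2 for D2, 3 for D3).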
12/4/1 train/valid/test split Approx 500 documents in training set [Predicting Diverse Subsets Using Structural SVMs Y. Yue and Joachims, ICML 2008] Permuted until all 17 queries were tested once Set K=5 (some queries have very few documents) SVM-div uses term frequency thresholds to define importance levels SVM-div2 in addition uses TFIDF thresholds 66
[Predicting Diverse Subsets Using Structural SVMs, Y. Yue and Joachims, ICML 2008]

Method           | Loss
Random           | 0.469
Okapi            | 0.472
Unweighted Model | 0.471
Essential Pages  | 0.434
SVM-div          | 0.349
SVM-div2         | 0.382

Methods                | W / T / L
SVM-div vs Ess. Pages  | 14 / 0 / 3 **
SVM-div2 vs Ess. Pages | 13 / 0 / 4
SVM-div vs SVM-div2    | 9 / 6 / 2

67
[Predicting Diverse Subsets Using Structural SVMs, Y. Yue and Joachims, ICML 2008] Can expect further benefit from having more training data. 68
Predicting Diverse Subsets Using Structural SVMs, Y. Yue and Joachims, ICML 2008 Formulated diversified retrieval as predicting diverse subsets; efficient training and prediction algorithms; used weighted word coverage as a proxy for information coverage; encode diversity criteria using the loss function (weighted subtopic loss) http://projects.yisongyue.com/svmdiv/ 69
Lecture 3 Summary Automatic relevance labels Enriching feature space Dealing with data sparseness Dealing with Scale Active learning Ranking for diversity 70
Key References and Further Reading
- Joachims, T. Optimizing Search Engines Using Clickthrough Data. KDD 2002.
- Radlinski, F. and Joachims, T. Query Chains: Learning to Rank from Implicit Feedback. KDD 2005.
- Agichtein, E., Brill, E., and Dumais, S. Improving Web Search Ranking by Incorporating User Behavior Information. SIGIR 2006.
- Richardson, M., Prakash, A., and Brill, E. Beyond PageRank: Machine Learning for Static Ranking. WWW 2006.
- Radlinski, F. and Joachims, T. Active Exploration for Learning Rankings from Clickthrough Data. KDD 2007.
- Bilenko, M. and White, R. Mining the Search Trails of Surfing Crowds: Identifying Relevant Websites from User Activity. WWW 2008.
- Yue, Y. and Joachims, T. Predicting Diverse Subsets Using Structural SVMs. ICML 2008.
- Liu, C., Guo, F., and Faloutsos, C. BBM: Bayesian Browsing Model from Petabyte-scale Data. KDD 2009.
71