Search Engines and Learning to Rank


1 Search Engines and Learning to Rank Joseph (Yossi) Keshet

2 [Diagram of search engine components: Crawler, Parser, Web graph, Indexer, Link analyzer, Forward index, Inverted index, Cache, Ranker, Query processor]

3 Representations

4 TF-IDF
To get an effective vector representation of the query and the documents, TF-IDF weighting has been widely used. The TF of a term t in a vector is defined as the normalized number of its occurrences in the document:

$$\mathrm{TF}(t, d) = f(t, d)$$

where f(t, d) is the frequency of term t in document d. The IDF of the term is defined as follows:

$$\mathrm{IDF}(t) = \log \frac{N}{|\{d \in D : t \in d\}|}$$

where N is the total number of documents in the collection and |{d ∈ D : t ∈ d}| is the number of documents containing term t.
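A minimal sketch of this weighting, assuming documents are already tokenized into lists of words. The function names, the toy corpus, and the choice to give unseen terms zero weight are illustrative assumptions, not part of the slides.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """f(t, d): raw number of occurrences of the term in the document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """log(N / number of documents containing the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # assumption: a term that appears nowhere gets zero weight
    return math.log(len(corpus) / n_containing)

def tfidf_vector(doc_tokens, corpus, vocabulary):
    """One TF-IDF weight per vocabulary term, in vocabulary order."""
    return [tf(t, doc_tokens) * idf(t, corpus) for t in vocabulary]

# Hypothetical toy collection of three tokenized documents.
corpus = [
    ["search", "engine", "ranking"],
    ["learning", "to", "rank"],
    ["search", "ranking", "ranking"],
]
vocabulary = ["search", "ranking", "learning"]
vec = tfidf_vector(corpus[2], corpus, vocabulary)
```

The same function can be applied to a tokenized query, so that query and documents live in one vector space.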

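5 BM25
Given a query q = (t_1, ..., t_M):

$$\mathrm{BM25}(d, q) = \sum_{i=1}^{M} \frac{\mathrm{IDF}(t_i) \cdot \mathrm{TF}(t_i, d) \cdot (k_1 + 1)}{\mathrm{TF}(t_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{\mathrm{len}(d)}{\mathrm{avdl}}\right)}$$

where len(d) is the length (number of words) of document d, avdl is the average document length in the text collection from which documents are drawn, and k_1 and b are free parameters.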
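The BM25 score can be sketched as follows. The defaults k1 = 1.2 and b = 0.75 are commonly used values and are an assumption here, as are the function name and the toy corpus; the IDF uses the plain log(N / document frequency) form.

```python
import math

def bm25(query_terms, doc_tokens, corpus, k1=1.2, b=0.75):
    """BM25 score of one tokenized document for a list of query terms."""
    n_docs = len(corpus)
    avdl = sum(len(d) for d in corpus) / n_docs  # average document length
    score = 0.0
    for t in query_terms:
        n_t = sum(1 for d in corpus if t in d)   # documents containing t
        if n_t == 0:
            continue  # assumption: terms absent from the collection contribute nothing
        idf = math.log(n_docs / n_t)
        tf = doc_tokens.count(t)
        length_norm = 1 - b + b * len(doc_tokens) / avdl
        score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
    return score

# Hypothetical toy collection; all documents have the same length.
corpus = [["a", "b", "c"], ["a", "a", "c"], ["b", "c", "c"]]
scores = [bm25(["a"], doc, corpus) for doc in corpus]
```

With equal document lengths, the document containing the query term twice scores higher than the one containing it once, and the document without it scores zero.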

6 LMIR
Language model for documents:

$$p(t_i \mid d) = (1 - \lambda) \frac{\mathrm{TF}(t_i, d)}{\mathrm{len}(d)} + \lambda \, p(t_i \mid C)$$

where p(t_i | C) is the background language model for term t_i and λ ∈ [0, 1] is a smoothing factor.
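A small sketch of this smoothed term probability, assuming the background model is estimated by relative frequency over the whole collection. The function name, the value lam = 0.1, and the toy data are illustrative assumptions.

```python
def smoothed_term_prob(term, doc_tokens, collection_tokens, lam=0.1):
    """(1 - lam) * TF(t, d) / len(d) + lam * p(t | C)."""
    p_doc = doc_tokens.count(term) / len(doc_tokens)
    # Assumption: background model = relative frequency in the collection.
    p_background = collection_tokens.count(term) / len(collection_tokens)
    return (1 - lam) * p_doc + lam * p_background

# Hypothetical toy data.
doc = ["a", "b"]
collection = ["a", "b", "c", "c"]
p_a = smoothed_term_prob("a", doc, collection)  # term present in the document
p_c = smoothed_term_prob("c", doc, collection)  # term absent from the document
```

The smoothing gives a nonzero probability to a term that is missing from the document but occurs elsewhere in the collection.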

7 PageRank
Goal: rank documents based on their own importance. PageRank uses the probability that a surfer randomly clicking on links will arrive at a particular webpage to rank the webpages:

$$PR(d_u) = \sum_{d_v \in B_u} \frac{PR(d_v)}{U(d_v)}$$

where PR(d_u) is the PageRank of document d_u, B_u is the set of all pages linking to document d_u, and U(d_v) is the number of outlinks from document d_v. With a damping factor α and N documents in total:

$$PR(d_u) = \alpha \sum_{d_v \in B_u} \frac{PR(d_v)}{U(d_v)} + \frac{1 - \alpha}{N}$$
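A rough sketch of computing PageRank by repeated application of the damped update. It assumes every page has at least one outlink and that every link target is itself a key of the graph; the damping value 0.85, the iteration count, and the toy graph are assumptions for illustration.

```python
def pagerank(outlinks, alpha=0.85, iterations=50):
    """Iterate PR(u) = alpha * sum(PR(v) / U(v)) + (1 - alpha) / N.

    outlinks maps each page to the list of pages it links to.
    Assumption: no dangling pages (every page has at least one outlink).
    """
    pages = list(outlinks)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_pr = {p: (1 - alpha) / n for p in pages}
        for v, targets in outlinks.items():
            share = alpha * pr[v] / len(targets)  # v splits its rank over its outlinks
            for u in targets:
                new_pr[u] += share
        pr = new_pr
    return pr

# Hypothetical three-page web graph.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
```

In this graph page c is linked from both a and b, so it ends up with a higher score than b, which only receives half of a's rank.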

8 Training data and features

9
ID | Feature description | Category
1 | $\sum_{q_i \in q \cap d} c(q_i, d)$ in body | Q-D
2 | $\sum_{q_i \in q \cap d} c(q_i, d)$ in anchor | Q-D
3 | $\sum_{q_i \in q \cap d} c(q_i, d)$ in title | Q-D
4 | $\sum_{q_i \in q \cap d} c(q_i, d)$ in URL | Q-D
5 | $\sum_{q_i \in q \cap d} c(q_i, d)$ in whole document | Q-D
6 | $\sum_{q_i \in q} idf(q_i)$ in body | Q
7 | $\sum_{q_i \in q} idf(q_i)$ in anchor | Q
8 | $\sum_{q_i \in q} idf(q_i)$ in title | Q
9 | $\sum_{q_i \in q} idf(q_i)$ in URL | Q
10 | $\sum_{q_i \in q} idf(q_i)$ in whole document | Q
11 | $\sum_{q_i \in q \cap d} c(q_i, d) \cdot idf(q_i)$ in body | Q-D
12 | $\sum_{q_i \in q \cap d} c(q_i, d) \cdot idf(q_i)$ in anchor | Q-D
13 | $\sum_{q_i \in q \cap d} c(q_i, d) \cdot idf(q_i)$ in title | Q-D
14 | $\sum_{q_i \in q \cap d} c(q_i, d) \cdot idf(q_i)$ in URL | Q-D
15 | $\sum_{q_i \in q \cap d} c(q_i, d) \cdot idf(q_i)$ in whole document | Q-D
16 | |d| of body | D
17 | |d| of anchor | D
18 | |d| of title | D
19 | |d| of URL | D
20 | |d| of whole document | D
21 | BM25 of body | Q-D
22 | BM25 of anchor | Q-D
23 | BM25 of title | Q-D
24 | BM25 of URL | Q-D
25 | BM25 of whole document | Q-D
26 | LMIR.ABS of body | Q-D
27 | LMIR.ABS of anchor | Q-D
28 | LMIR.ABS of title | Q-D
29 | LMIR.ABS of URL | Q-D
30 | LMIR.ABS of whole document | Q-D
31 | LMIR.DIR of body | Q-D
32 | LMIR.DIR of anchor | Q-D
33 | LMIR.DIR of title | Q-D
34 | LMIR.DIR of URL | Q-D
35 | LMIR.DIR of whole document | Q-D
36 | LMIR.JM of body | Q-D
37 | LMIR.JM of anchor | Q-D
38 | LMIR.JM of title | Q-D
39 | LMIR.JM of URL | Q-D
40 | LMIR.JM of whole document | Q-D
41 | Sitemap based term propagation | Q-D
42 | Sitemap based score propagation | Q-D
43 | Hyperlink based score propagation: weighted in-link | Q-D
44 | Hyperlink based score propagation: weighted out-link | Q-D
45 | Hyperlink based score propagation: uniform out-link | Q-D
46 | Hyperlink based feature propagation: weighted in-link | Q-D
47 | Hyperlink based feature propagation: weighted out-link | Q-D
48 | Hyperlink based feature propagation: uniform out-link | Q-D
49 | HITS authority | Q-D
50 | HITS hub | Q-D
51 | PageRank | D
52 | HostRank | D
53 | Topical PageRank | Q-D
54 | Topical HITS authority | Q-D
55 | Topical HITS hub | Q-D
56 | Inlink number | D
57 | Outlink number | D
58 | Number of slashes in URL | D
59 | Length of URL | D
60 | Number of child pages | D
61 | BM25 of extracted title | Q-D
62 | LMIR.ABS of extracted title | Q-D
63 | LMIR.DIR of extracted title | Q-D
64 | LMIR.JM of extracted title | Q-D

10 Microsoft (LETOR)

11 Evaluation metrics

12 Precision-Recall
Precision (P) is the fraction of retrieved documents that are relevant:

$$\text{Precision} = \frac{\#(\text{relevant items retrieved})}{\#(\text{retrieved items})} = P(\text{relevant} \mid \text{retrieved})$$

Recall (R) is the fraction of relevant documents that are retrieved:

$$\text{Recall} = \frac{\#(\text{relevant items retrieved})}{\#(\text{relevant items})} = P(\text{retrieved} \mid \text{relevant})$$

These notions can be made clear by examining the following contingency table:

              | Relevant             | Nonrelevant
Retrieved     | true positives (tp)  | false positives (fp)
Not retrieved | false negatives (fn) | true negatives (tn)

Then:

$$P = \frac{tp}{tp + fp}, \qquad R = \frac{tp}{tp + fn}$$
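The contingency-table definitions translate directly into set operations. This is a small sketch with a hypothetical function name and toy document IDs; returning 0.0 for an empty retrieved or relevant set is an assumption made to avoid division by zero.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for unordered sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)   # relevant and retrieved
    fp = len(retrieved - relevant)   # retrieved but not relevant
    fn = len(relevant - retrieved)   # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 documents retrieved, 3 relevant in total, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```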

13 F measure
A single measure that trades off precision versus recall is the F measure, which is the weighted harmonic mean of precision and recall:

$$F = \frac{1}{\alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R} \quad \text{where} \quad \beta^2 = \frac{1 - \alpha}{\alpha}$$

where α ∈ [0, 1] and thus β² ∈ [0, ∞]. The default balanced F measure equally weights precision and recall, which means making α = 1/2 or β = 1. It is commonly written as F_1, which is short for F_{β=1}, even though the formulation in terms of α more transparently exhibits the F measure as a weighted harmonic mean. When using β = 1, the formula on the right simplifies to:

$$F_{\beta=1} = \frac{2PR}{P + R}$$
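A short sketch of the β form of the F measure. The function name and the example values are hypothetical; returning 0.0 when both precision and recall are zero is an assumption for the undefined case.

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (beta form)."""
    if precision == 0 and recall == 0:
        return 0.0  # assumption: define F as 0 when both inputs are 0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

f1 = f_measure(0.5, 1.0)            # balanced F1 = 2PR / (P + R)
f2 = f_measure(0.5, 1.0, beta=2.0)  # beta > 1 puts more weight on recall
```

With β = 2 the corresponding α is 1 / (1 + β²) = 0.2, and the α form 1 / (α/P + (1 − α)/R) gives the same value.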

14 Evaluation of ranked retrieval results Precision, recall, and the F measure are set-based measures. They are computed using unordered sets of documents. We need to extend these measures (or to define new measures) if we are to evaluate the ranked retrieval results that are now standard with search engines.

15 Calculate precision and recall values at every rank position
[Figure: two example rankings, Ranking #1 and Ranking #2, with the relevant documents marked and the recall and precision values at each rank position]

16 Calculate the precision at rank position j
[Figure: Ranking #1 and Ranking #2 with recall and precision values, cut off at rank position j]
Why only precision? If the precision at position j is higher for one ranking than for another, its recall at that position will be higher as well.

17 Average precision
Average the precision values from the rank positions where a relevant document was retrieved.
Ranking #1: ( )/6 = 0.78
Ranking #2: ( )/6 = 0.52
Mean average precision (MAP): the mean of the average precision over many queries.
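Average precision and MAP can be sketched as below. Each ranking is given as a list of 0/1 relevance flags in rank order; the sketch assumes every relevant document appears somewhere in the ranking, so dividing by the number of relevant documents found equals dividing by the total number of relevant documents. The function names and the example rankings are hypothetical and are not the rankings from the slide.

```python
def average_precision(ranked_relevance):
    """Mean of the precision values at the ranks where a relevant document appears."""
    hits = 0
    precisions = []
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this rank position
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(rankings):
    """Mean of the average precision over the rankings of many queries."""
    return sum(average_precision(r) for r in rankings) / len(rankings)

# Hypothetical rankings for two queries (1 = relevant, 0 = not relevant).
ap_first = average_precision([1, 0, 1, 0, 0, 1])
ap_second = average_precision([0, 1, 0, 0, 1, 0])
map_score = mean_average_precision([[1, 0, 1, 0, 0, 1], [0, 1, 0, 0, 1, 0]])
```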

18 Discounted cumulative gain (DCG)

$$DCG_p = rel_1 + \sum_{i=2}^{p} \frac{rel_i}{\log_2 i}$$

where rel_i is the relevance level of the document at rank i, with 0 ≤ rel_i ≤ 5.
Example: consider the gain of the documents at each rank, then the discounted gain at each rank. The DCG at each rank is formed by accumulating these numbers.
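A minimal sketch of DCG in the form where the first gain is not discounted and the gain at rank i ≥ 2 is divided by log2(i). The function name and the relevance list are hypothetical.

```python
import math

def dcg(relevances, p=None):
    """DCG_p = rel_1 + sum over i = 2..p of rel_i / log2(i)."""
    rels = relevances if p is None else relevances[:p]
    if not rels:
        return 0.0
    discounted = sum(rel / math.log2(i) for i, rel in enumerate(rels[1:], start=2))
    return rels[0] + discounted

# Hypothetical relevance levels of the documents at ranks 1..5.
gains = [3, 2, 3, 0, 1]
dcg_at_each_rank = [dcg(gains, p) for p in range(1, len(gains) + 1)]
```

The list `dcg_at_each_rank` is non-decreasing, since each rank adds a non-negative discounted gain to the running total.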

19 Algorithms

20 Learning-to-rank framework

21 Approaches
The pointwise approach
  input: feature vector of a single document
  output: relevance degree
The pairwise approach
  input: pair of documents
  output: pairwise preference in {-1, +1}
The listwise approach
  input: a set of documents associated with a query
  output: ranked list (or permutation) of the documents

22 The pointwise approach
Regression-based
  predict the relevance using regression
Classification-based
  binary classification of relevant/non-relevant
  Multi-class Classification for Ranking
Ordinal regression-based
  PRank

23 PRank
Direction w, thresholds
[Figure: direction w with thresholds that split it into rank levels]

24 PRank
Direction w, thresholds
Rank a new instance x

25 PRank
Direction w, thresholds
Rank a new instance x
Get the correct rank y
[Figure: the correct rank interval]

26 PRank
Direction w, thresholds
Rank a new instance x
Get the correct rank y
Compute error set E

27 PRank update
Direction w, thresholds
Rank a new instance x
Get the correct rank y
Compute error set E
Update w and the thresholds

28 PRank update
Direction w, thresholds
Rank a new instance x
Get the correct rank y
Compute error set E
Update w and the thresholds

29 PRank: summary of update
Direction w, thresholds
Rank a new instance x
Get the correct rank y
Compute error set E
Update w and the thresholds

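30 The PRank Algorithm
[Flowchart] Maintain the direction and thresholds. Get an instance x. Predict its rank. Get the true rank y. Is the prediction correct? Yes: continue with the next instance. No: compute the error set and update.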
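The loop above can be sketched in code. This follows the PRank update as published by Crammer and Singer (predict the smallest rank whose threshold exceeds the score; on a mistake, move w and every threshold in the error set by one step), with k − 1 finite thresholds and an implicit top threshold of +infinity. The function names and the toy data are hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def prank_predict(w, b, x):
    """Smallest rank r with w.x < b_r; the top rank if no threshold exceeds the score."""
    score = dot(w, x)
    for r, threshold in enumerate(b, start=1):
        if score < threshold:
            return r
    return len(b) + 1

def prank_update(w, b, x, y):
    """One online step: keep (w, b) if x is ranked correctly, otherwise update both."""
    if prank_predict(w, b, x) == y:
        return w, b
    score = dot(w, x)
    taus = []
    for r, threshold in enumerate(b, start=1):
        y_r = -1 if y <= r else 1  # side of threshold r on which the score should fall
        # tau is nonzero only for thresholds in the error set
        taus.append(y_r if (score - threshold) * y_r <= 0 else 0)
    w = [wi + sum(taus) * xi for wi, xi in zip(w, x)]
    b = [br - tau for br, tau in zip(b, taus)]
    return w, b

# Hypothetical toy data: one feature, three rank levels.
data = [([1.0], 1), ([2.0], 2), ([3.0], 3)]
w, b = [0.0], [0.0, 0.0]
for _ in range(20):  # a few passes over the data
    for x, y in data:
        w, b = prank_update(w, b, x, y)
predictions = [prank_predict(w, b, x) for x, _ in data]
```

On this separable toy set the updates stop after a few passes and all three instances are ranked correctly.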

31 Mistake Bound
Given:
  an input sequence
  the norm of the instances is bounded
  the sequence is ranked correctly by a normalized ranker with margin > 0
Then:
  the number of mistakes PRank makes is bounded

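32 The pairwise approach
RankNet

$$P_{u,v}(f) = \frac{\exp(f(x_u) - f(x_v))}{1 + \exp(f(x_u) - f(x_v))}$$

$$L(f; x_u, x_v, y_{u,v}) = -\bar{P}_{u,v} \log P_{u,v}(f) - (1 - \bar{P}_{u,v}) \log\left(1 - P_{u,v}(f)\right)$$

where \bar{P}_{u,v} is the target probability that x_u is ranked above x_v.
RankBoost
Ranking SVM, SVM-Rank
LambdaRank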
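The RankNet pair probability and its cross-entropy loss can be sketched for a single pair of scores. The function names are hypothetical, and the sketch takes the model scores f(x_u) and f(x_v) as plain numbers rather than computing them from a network; it assumes a target probability strictly usable with the logarithm for the given scores.

```python
import math

def ranknet_prob(score_u, score_v):
    """exp(s_u - s_v) / (1 + exp(s_u - s_v)), written as a logistic function."""
    return 1.0 / (1.0 + math.exp(-(score_u - score_v)))

def ranknet_loss(score_u, score_v, target_prob):
    """Cross entropy between the target and the modeled pairwise probability."""
    p = ranknet_prob(score_u, score_v)
    return -target_prob * math.log(p) - (1 - target_prob) * math.log(1 - p)

# Target 1.0 means document u should be ranked above document v.
loss_tied = ranknet_loss(0.0, 0.0, 1.0)   # model is undecided
loss_right = ranknet_loss(2.0, 0.0, 1.0)  # model agrees with the target
loss_wrong = ranknet_loss(0.0, 2.0, 1.0)  # model disagrees with the target
```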

33 SVM-Rank
Each line holds a relevance label, a query ID, the feature values, and (after #) the document name:

  3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A
  2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B
  1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C
  1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D
  1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2A
  2 qid:2 1:1 2:0 3:1 4:0.4 5:0 # 2B
  1 qid:2 1:0 2:0 3:1 4:0.1 5:0 # 2C
  1 qid:2 1:0 2:0 3:1 4:0.2 5:0 # 2D
  2 qid:3 1:0 2:0 3:1 4:0.1 5:1 # 3A
  3 qid:3 1:1 2:1 3:0 4:0.3 5:0 # 3B
  4 qid:3 1:1 2:0 3:0 4:0.4 5:1 # 3C
  1 qid:3 1:0 2:1 3:1 4:0.5 5:0 # 3D

For example, "4:0.3" means the 4th feature is 0.3.
Resulting pairwise preferences: 1A>1B, 1A>1C, 1A>1D, 1B>1C, 1B>1D, 2B>2A, 2B>2C, 2B>2D, 3C>3A, 3C>3B, 3C>3D, 3B>3A, 3B>3D, 3A>3D
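The step from labeled lines to pairwise preferences can be sketched as follows: parse each line, then emit a preference for every two documents of the same query whose labels differ. The function names are hypothetical, and the parser assumes exactly the line layout shown on the slide (label, qid, feature:value fields, then a # comment with the document name).

```python
from itertools import combinations

def parse_line(line):
    """Split one line into (label, query ID, feature dict, document name)."""
    body, _, comment = line.partition("#")
    fields = body.split()
    label = int(fields[0])
    qid = fields[1].split(":")[1]
    features = {int(k): float(v) for k, v in (f.split(":") for f in fields[2:])}
    return label, qid, features, comment.strip()

def preference_pairs(lines):
    """(better, worse) document names for same-query pairs with different labels."""
    parsed = [parse_line(line) for line in lines]
    pairs = []
    for (la, qa, _, na), (lb, qb, _, nb) in combinations(parsed, 2):
        if qa != qb or la == lb:
            continue  # no preference across queries or between equal labels
        pairs.append((na, nb) if la > lb else (nb, na))
    return pairs

# The four lines of query 1 from the slide.
lines = [
    "3 qid:1 1:1 2:1 3:0 4:0.2 5:0 # 1A",
    "2 qid:1 1:0 2:0 3:1 4:0.1 5:1 # 1B",
    "1 qid:1 1:0 2:1 3:0 4:0.4 5:0 # 1C",
    "1 qid:1 1:0 2:0 3:1 4:0.3 5:0 # 1D",
]
pairs = preference_pairs(lines)
```

Documents 1C and 1D share the same label, so no preference is generated between them.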

34 The listwise approach
Loss-based minimization
  SoftRank
  SmoothRank
  SVM-MAP
  AdaRank
Minimization of Non-measure-Specific Loss
  ListMLE
  ListNet

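35 SVM-MAP

$$h(x; w) = \arg\max_{y \in \mathcal{Y}} F(x, y; w), \qquad F(x, y; w) = w^T \Psi(x, y)$$

where x is a query with its documents and y is their relevance scores ("labels").

$$\Psi(x, y) = \frac{1}{|C^x| \cdot |C^{\bar{x}}|} \sum_{i: d_i \in C^x} \sum_{j: d_j \in C^{\bar{x}}} \left[ y_{ij} \left( \phi(x, d_i) - \phi(x, d_j) \right) \right]$$

where φ(x, d_j) is the features of document d_j and the terms of query x.

$$w^* = \arg\min_w \frac{\lambda}{2} \|w\|^2 + \frac{1}{n} \sum_{i=1}^{n} \max_{\hat{y}} \left[ \Delta_{MAP}(y_i, \hat{y}) - w^T \Psi(x_i, y_i) + w^T \Psi(x_i, \hat{y}) \right]_+$$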
