Information Retrieval: Learning to Rank
Ilya Markov, i.markov@uva.nl
University of Amsterdam
Course overview
Offline: Data Acquisition, Data Processing, Data Storage
Online: Query Processing, Ranking, Evaluation
Advanced: Aggregated Search, Click Models, Present and Future of IR
This lecture
Offline: Data Acquisition, Data Processing, Data Storage
Online: Query Processing, Ranking, Evaluation
Advanced: Aggregated Search, Click Models, Present and Future of IR
Outline
1 Current trends in IR
2 Learning to rank
IR conferences
- ACM Conference on Research and Development in Information Retrieval (SIGIR)
- ACM Conference on Information and Knowledge Management (CIKM)
- ACM Conference on Web Search and Data Mining (WSDM)
- European Conference on Information Retrieval (ECIR)
IR journals
- ACM Transactions on Information Systems (TOIS)
- Information Retrieval Journal (IRJ)
- Information Processing and Management (IPM)
Surveys
- Foundations and Trends in Information Retrieval (FnTIR)
- Synthesis Lectures on Information Concepts, Retrieval, and Services (Morgan & Claypool Publishers)
SIGIR 2016
- Evaluation
- Efficiency
- Retrieval models, learning to rank, web search
- Users, user needs, search behavior
- Novelty and diversity
- Speech, conversation systems, question answering
- Recommendation systems
- Entities and knowledge graphs
WSDM 2016
- Communities, social interaction, social networks
- Search and semantics
- Observing users, leveraging users
- Big data algorithms
- Entities and structure
Ranking methods
1 Content-based
  - Term-based
  - Semantic
2 Link-based (web search)
3 Learning to rank
Outline
1 Current trends in IR
2 Learning to rank
  - Machine learning
  - Features
  - LTR approaches
  - Experimental comparison
  - Summary
Machine learning
Traditional ML solves a prediction problem (classification or regression) on a single instance at a time.
- Input: {x_i}_{i=1}^n
- Output: {y_i}_{i=1}^n
- Learn a model h(x) that optimizes a loss function L(h(x), y)
- For a new instance x_new, predict the output y = h(x_new)
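A toy illustration of this setup (the data and model here are made up, not from the slides): fit a one-parameter linear model h(x) = w·x by least squares, then predict the output for a new instance.

```python
# Minimal supervised-learning sketch: learn h(x) = w * x from (x_i, y_i)
# pairs, then predict y for a new instance x_new.

def fit_linear(xs, ys):
    """Closed-form least-squares slope for h(x) = w * x (no intercept)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]     # roughly y = 2x, with noise

w = fit_linear(xs, ys)

def h(x):
    return w * x

y_new = h(5.0)                # prediction for a new instance x_new = 5.0
```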
Learning to rank (LTR)
The aim of LTR is to come up with an optimal ordering of items, where the relative ordering among the items is more important than the exact score that each item gets.
Figure: the learning-to-rank framework (T.-Y. Liu, Learning to Rank for Information Retrieval, Fig. 1.1)
Outline
2 Learning to rank
  - Machine learning
  - Features
  - LTR approaches
  - Experimental comparison
  - Summary
Machine learning
- Input: {x_i}_{i=1}^n, output: {y_i}_{i=1}^n
- Learn a model h(x) that optimizes a loss function L(h(x), y)
Examples
- Linear model: h(x_i) = w^T x_i = Σ_{k=1}^l w_k x_{ik}
- Quadratic loss function: L(h(x_i), y_i) = (h(x_i) - y_i)^2
How to learn the model h(x), i.e., how to estimate its parameters?
Learning the model h(x)
If there is a closed-form solution for the parameters of h(x):
1 Compute the derivative of the loss function L with respect to each parameter w_k
2 Equate this derivative to zero: ∂L/∂w_k = 0
3 Solve for the optimal value of the parameter w_k
If there is no closed-form solution, use gradient descent:
1 Compute or approximate the gradient of the loss function ∇L = [∂L/∂w_1, ..., ∂L/∂w_l] using the current values of the parameters
2 Update the model parameters by taking a small step in the opposite direction of the gradient: w ← w - η∇L
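The gradient-descent recipe above can be sketched in a few lines for the linear model with quadratic loss (the data, step size, and iteration count below are illustrative assumptions):

```python
# Gradient descent for L = sum_i (h(x_i) - y_i)^2 with h(x) = w^T x.

def predict(w, x):
    return sum(wk * xk for wk, xk in zip(w, x))

def grad(w, X, Y):
    """Gradient of the summed quadratic loss wrt each parameter w_k."""
    g = [0.0] * len(w)
    for x, y in zip(X, Y):
        err = predict(w, x) - y
        for k, xk in enumerate(x):
            g[k] += 2.0 * err * xk        # dL/dw_k = 2 (h(x) - y) x_k
    return g

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
Y = [1.0, 2.0, 3.0]                       # exactly consistent with w* = (1, 2)

w, eta = [0.0, 0.0], 0.1
for _ in range(500):                      # w <- w - eta * grad L
    g = grad(w, X, Y)
    w = [wk - eta * gk for wk, gk in zip(w, g)]
```

With this small, consistent dataset the iterates converge to the exact least-squares solution w = (1, 2).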
Gradient descent
Picture taken from https://en.wikipedia.org/wiki/gradient_descent
Outline
2 Learning to rank
  - Machine learning
  - Features
  - LTR approaches
  - Experimental comparison
  - Summary
Figure: the learning-to-rank framework (T.-Y. Liu, Learning to Rank for Information Retrieval, Fig. 1.1)
Query-document representation
Each query-document pair (q^(n), x_i) is represented as a vector of features:
x_i^(n) = [x_i1^(n), x_i2^(n), ..., x_il^(n)]
Features:
- Content-based
- Link-based
- User-based
Content-based features
Learning features of TREC (T.-Y. Liu, Learning to Rank for Information Retrieval, Table 6.2):
- Term frequency (TF)
- Inverse document frequency (IDF)
- TF*IDF
- Document length (DL)
- BM25
- Language-model scores (LMIR.ABS)
each computed for five fields: body, anchor, title, URL, and the whole document.
Link-based features
Table 6.2 (continued), T.-Y. Liu, Learning to Rank for Information Retrieval:
- LMIR.JM of whole document
- Sitemap-based term propagation, sitemap-based score propagation
- Hyperlink-based score propagation: weighted in-link, weighted out-link, uniform out-link
- Hyperlink-based feature propagation: weighted in-link, weighted out-link, uniform out-link
- HITS authority, HITS hub
- PageRank, HostRank
- Topical PageRank, topical HITS authority, topical HITS hub
- Inlink number, outlink number
- Number of slashes in URL, length of URL
- Number of child pages
- BM25, LMIR.ABS, LMIR.DIR, LMIR.JM of extracted title
User-based features
Clicks:
- Click-through rate for (q^(n), x_i)
- Avg. click rank for (q^(n), x_i)
Time:
- Avg. dwell time for (q^(n), x_i)
- Avg. time to first click, when this click is on x_i
- Avg. time to last click, when this click is on x_i
Queries:
- Number of reformulations before/after q^(n)
- Number of times q^(n) is abandoned
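As a sketch of how two of these features could be computed, the snippet below derives click-through rate and average click rank for a (query, document) pair from a toy click log; the log format and values are invented for illustration.

```python
# Toy click log: (query, document shown, rank at which it was shown, clicked?)
log = [
    ("ir course", "d1", 1, True),
    ("ir course", "d1", 2, False),
    ("ir course", "d1", 1, True),
    ("ir course", "d2", 3, True),
]

def user_features(log, q, d):
    """Click-through rate and average click rank for pair (q, d)."""
    shown = [(r, c) for (qq, dd, r, c) in log if qq == q and dd == d]
    clicks = [r for (r, c) in shown if c]
    ctr = len(clicks) / len(shown)
    avg_click_rank = sum(clicks) / len(clicks) if clicks else None
    return ctr, avg_click_rank

ctr, avg_rank = user_features(log, "ir course", "d1")
```

Here d1 was shown three times and clicked twice (both times at rank 1), so its CTR is 2/3 and its average click rank is 1.0.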
Outline
2 Learning to rank
  - Machine learning
  - Features
  - LTR approaches
  - Experimental comparison
  - Summary
Figure: the learning-to-rank framework (T.-Y. Liu, Learning to Rank for Information Retrieval, Fig. 1.1)
LTR approaches
Figure: taxonomy of learning-to-rank approaches, including LambdaRank and LambdaMART (T.-Y. Liu, Learning to Rank for Information Retrieval)
Pointwise LTR
Figure: pointwise LTR — for a given query, a model h(·) scores each document independently (h(blue), h(yellow), h(green), ...) and each score is compared to that document's relevance label (Rel(red), Rel(gray), Rel(orange), ...).
Pointwise LTR
Reduces to traditional ML:
- Input: query-document feature vectors x_i^(n) = [x_i1^(n), x_i2^(n), ..., x_il^(n)]
- Output: relevance labels y_i
- Objective: learn a model h(x) that correctly predicts the labels y
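A minimal pointwise sketch: score each document independently with a learned model h and sort by score. The feature values and weights below are made-up placeholders (imagine, say, a BM25 and a PageRank feature).

```python
# Pointwise LTR: rank documents for a query by an independently applied
# scoring function h(x). Features and weights are illustrative.

docs = {                       # doc -> feature vector [feature_1, feature_2]
    "d1": [0.2, 0.9],
    "d2": [0.8, 0.1],
    "d3": [0.5, 0.5],
}
w = [1.0, 0.5]                 # assume these parameters were already learned

def h(x):
    """Pointwise scoring function h(x) = w^T x."""
    return sum(wk * xk for wk, xk in zip(w, x))

ranking = sorted(docs, key=lambda d: h(docs[d]), reverse=True)
```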
Regression
Picture taken from https://en.wikipedia.org/wiki/regression_analysis
Classification
Picture taken from https://en.wikipedia.org/wiki/statistical_classification
Ordinal regression
Scores are mapped to ordered relevance grades via thresholds b_k. An explicit constraint on the thresholds simply takes the form b_{k-1} ≤ b_k, while an implicit constraint uses redundant training examples to guarantee the ordinal relationship among thresholds.
Figure: sum-of-margin strategy (T.-Y. Liu, Learning to Rank for Information Retrieval)
Pointwise LTR
Pros
+ Intuitive interpretation of relevance
+ Clear how to get relevance judgements
Cons
- Has a different optimization objective than IR (e.g., predicting a correct class rather than producing a correct ranking)
Pairwise LTR
Figure: pairwise LTR — for a given query, a model h(·) predicts preferences between pairs of documents, e.g., Pref(red > gray), Pref(gray > green), Pref(green > red).
RankNet
- Pointwise scoring function f(x_i) with parameters {w_k}_{k=1}^l
- Pairwise ground truth: P̄_ij = I(x_i ≻ x_j)
- The probability of x_i ≻ x_j is modeled using logistic regression:
  P_ij = P(x_i ≻ x_j) = 1 / (1 + e^{-σ(f_i - f_j)})
- Pairwise loss function (cross entropy):
  C = -P̄_ij log P_ij - (1 - P̄_ij) log(1 - P_ij)
    = (1 - P̄_ij) σ(f_i - f_j) + log(1 + e^{-σ(f_i - f_j)})
RankNet optimizes the total number of pairwise errors.
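The pairwise probability and cost above are easy to sanity-check numerically (the scores f_i, f_j below are arbitrary example values):

```python
import math

# RankNet building blocks: P_ij = 1 / (1 + exp(-sigma (f_i - f_j))) and the
# cross-entropy cost C from the slide.

def rank_prob(fi, fj, sigma=1.0):
    return 1.0 / (1.0 + math.exp(-sigma * (fi - fj)))

def ranknet_cost(fi, fj, pbar, sigma=1.0):
    # C = (1 - Pbar) sigma (f_i - f_j) + log(1 + exp(-sigma (f_i - f_j)))
    return (1.0 - pbar) * sigma * (fi - fj) + math.log1p(math.exp(-sigma * (fi - fj)))

# If x_i should be ranked above x_j (Pbar = 1), the cost is lower when the
# model agrees (f_i > f_j) than when it disagrees:
c_bad = ranknet_cost(0.0, 1.0, pbar=1.0)    # model disagrees with ground truth
c_good = ranknet_cost(1.0, 0.0, pbar=1.0)   # model agrees with ground truth
```

For P̄_ij = 1 the cost reduces to -log P_ij, the usual cross entropy against a "certain" preference.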
RankNet (cont'd)
Optimize the cost C:
∂C/∂f_i = σ ((1 - P̄_ij) - 1 / (1 + e^{σ(f_i - f_j)})) = -∂C/∂f_j
Update each parameter w_k of the function f(x_i):
w_k ← w_k - η (∂C/∂f_i · ∂f_i/∂w_k + ∂C/∂f_j · ∂f_j/∂w_k)
Speeding up RankNet training
Define λ_ij as
λ_ij = ∂C/∂f_i = σ ((1 - P̄_ij) - 1 / (1 + e^{σ(f_i - f_j)}))
Let I denote the set of pairs of indices {i, j} for which x_i should be ranked differently from x_j for a given query q:
I = {{i, j} | x_i ≻ x_j}
Sum all contributions to update parameter w_k:
δw_k = -η Σ_{{i,j}∈I} (λ_ij ∂f_i/∂w_k - λ_ij ∂f_j/∂w_k)
     = -η Σ_i (Σ_{j:{i,j}∈I} λ_ij - Σ_{j:{j,i}∈I} λ_ij) ∂f_i/∂w_k
     = -η Σ_i λ_i ∂f_i/∂w_k
Interpreting λ's
λ_i = Σ_{j:{i,j}∈I} λ_ij - Σ_{j:{j,i}∈I} λ_ij
λ_i is the sum of forces applied to document x_i shown for query q:
- All documents x_j that should be ranked below x_i push it up with force λ_ij
- All documents x_j that should be ranked above x_i push it down with force λ_ij
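The λ bookkeeping can be sketched directly. For a pair {i, j} ∈ I we have P̄_ij = 1, so λ_ij reduces to -σ / (1 + e^{σ(f_i - f_j)}); each pair contributes +λ_ij to λ_i and -λ_ij to λ_j. The scores below are illustrative.

```python
import math

def lambda_ij(fi, fj, sigma=1.0):
    """lambda_ij for a pair with Pbar_ij = 1 (x_i should outrank x_j)."""
    return -sigma / (1.0 + math.exp(sigma * (fi - fj)))

def aggregate(f, I):
    """lambda_i = sum_{j:{i,j} in I} lambda_ij - sum_{j:{j,i} in I} lambda_ij."""
    lam = [0.0] * len(f)
    for i, j in I:                 # pair (i, j): x_i should be ranked above x_j
        l = lambda_ij(f[i], f[j])
        lam[i] += l
        lam[j] -= l
    return lam

# Doc 0 should outrank doc 1 but currently scores lower:
f = [0.0, 1.0]
lam = aggregate(f, [(0, 1)])
```

Under the update w ← w - η Σ_i λ_i ∂f_i/∂w, the negative λ_0 pushes f_0 up and the positive λ_1 pushes f_1 down, as the slide describes.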
Pairwise LTR
Pros
+ Easy to get preference judgements
+ Comes closer to optimizing the ranking
Cons
- Still does not optimize the whole ranking
- Higher computational complexity than pointwise LTR
Listwise LTR
Figure: listwise LTR — the model h(·) produces a whole ranking per query (q1, q2, q3, ...), and each ranking is evaluated with a list-level metric such as NDCG(q1), NDCG(q2), NDCG(q3).
From RankNet to LambdaRank
Figure (C. Burges, From RankNet to LambdaRank to LambdaMART: An Overview, Fig. 1): a set of URLs ordered for a given query using a binary relevance measure; light gray bars are URLs that are not relevant to the query, dark blue bars are relevant ones. By moving the top relevant URL down three rank levels and the bottom relevant URL up five, the total number of pairwise errors is reduced from 13 to 11, so the RankNet cost decreases — but for IR measures like NDCG and ERR that emphasize the top few results, the actual ranking gets worse. The black arrows denote the RankNet gradients (which increase with the number of pairwise errors); the red arrows show what we would actually like to see.
LambdaRank
λ_ij in RankNet:
λ_ij = σ ((1 - P̄_ij) - 1 / (1 + e^{σ(f_i - f_j)}))
λ_ij in LambdaRank:
λ_ij = -σ / (1 + e^{σ(f_i - f_j)}) · |ΔNDCG|
|ΔNDCG| = |NDCG(orig. ranking) - NDCG(ranking with x_i and x_j swapped)|
LambdaRank directly uses the ranking to compute the gradients (i.e., the λ_ij's) instead of computing and optimizing a cost function.
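A sketch of the LambdaRank λ with the |ΔNDCG| factor. The DCG gain (2^rel - 1) / log2(rank + 1) is one common choice and is an assumption here; the relevance labels and scores are illustrative.

```python
import math

def dcg(rels):
    """DCG with gain (2^rel - 1) and log2(rank + 1) discount (an assumption)."""
    return sum((2 ** r - 1) / math.log2(pos + 2) for pos, r in enumerate(rels))

def delta_ndcg(rels, i, j):
    """|delta NDCG| from swapping the documents at positions i and j."""
    ideal = dcg(sorted(rels, reverse=True))
    swapped = rels[:]
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(rels) - dcg(swapped)) / ideal

def lambdarank_lambda(fi, fj, rels, i, j, sigma=1.0):
    # RankNet lambda for a pair with Pbar_ij = 1, scaled by |delta NDCG|
    return -sigma / (1.0 + math.exp(sigma * (fi - fj))) * delta_ndcg(rels, i, j)

# Current ranking has the relevant document at position 2 (labels by position):
rels = [0, 1]
d = delta_ndcg(rels, 0, 1)
```

Swapping the two documents would raise NDCG by about 0.369 here, so the pair's λ is scaled up accordingly.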
LambdaRank (cont'd)
Proceed similarly to RankNet:
- Sum all λ_ij's for document x_i and query q:
  λ_i = Σ_{j:{i,j}∈I} λ_ij - Σ_{j:{j,i}∈I} λ_ij
- Update parameter w_k of the function f(x_i):
  w_k ← w_k - η Σ_i λ_i ∂f_i/∂w_k
LambdaMART
- Multiple Additive Regression Trees (MART)
- MART does not need an explicit cost function, only its gradients
- Adopts the gradients (the λ_ij's) from LambdaRank
- Hence the name: Lambda + MART
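To give a flavor of the tree side (this is an illustrative sketch of one boosting step, not the full LambdaMART algorithm): a regression stump is fitted to the per-document λ's, so trees play the role that the gradient step plays in LambdaRank. Feature values and λ's below are made up.

```python
# One MART-style step: fit a regression stump (single-feature threshold
# split) to the lambda pseudo-gradients by minimizing squared error.

def fit_stump(xs, targets):
    """Best threshold split of a 1-D feature minimizing squared error."""
    best = None
    for t in sorted(set(xs)):
        left = [g for x, g in zip(xs, targets) if x <= t]
        right = [g for x, g in zip(xs, targets) if x > t]
        if not left or not right:
            continue
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        err = (sum((g - lmean) ** 2 for g in left)
               + sum((g - rmean) ** 2 for g in right))
        if best is None or err < best[0]:
            best = (err, t, lmean, rmean)
    return best[1:]                       # (threshold, left value, right value)

xs = [0.1, 0.2, 0.8, 0.9]                 # one feature value per document
lambdas = [-0.5, -0.4, 0.4, 0.5]          # pseudo-gradients from LambdaRank

t, lval, rval = fit_stump(xs, lambdas)    # the stump added to the ensemble
```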
Listwise LTR
Pros
+ Directly optimizes the whole ranking
Cons
- Needs many judgements
- High computational complexity
Outline
2 Learning to rank
  - Machine learning
  - Features
  - LTR approaches
  - Experimental comparison
  - Summary
LEarning TO Rank datasets (LETOR)
- Query-document pairs with precomputed feature vectors
- Relevance judgements
http://research.microsoft.com/en-us/um/beijing/projects/letor/
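LETOR files store one query-document pair per line in an SVMlight-like format, roughly `<relevance> qid:<id> 1:<val> 2:<val> ... # comment`. A minimal parser sketch (this layout is an assumption; the exact format varies across LETOR versions):

```python
def parse_letor_line(line):
    """Parse one '<rel> qid:<id> k:<val> ... # comment' line into
    (relevance label, query id, {feature index: value})."""
    line = line.split("#")[0].strip()     # drop the trailing comment
    parts = line.split()
    rel = int(parts[0])
    qid = parts[1].split(":")[1]
    feats = {int(k): float(v) for k, v in (p.split(":") for p in parts[2:])}
    return rel, qid, feats

rel, qid, feats = parse_letor_line("2 qid:10 1:0.5 2:0.031 3:0.8 # doc=GX001")
```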
Historical LTR datasets
TREC 2003, 2004 Web track
- The Gov corpus with 1,053,110 pages
- Tasks: topic distillation (TD), homepage finding (HP), named page finding (NP)
The OHSUMED corpus
- A subset of MEDLINE, a database on medical publications
- 348,566 records (out of over 7 million) from 270 medical journals, 1987-1991
Number of queries in the TREC Web track:
Task                 TREC 2003   TREC 2004
Topic distillation       50          75
Homepage finding        150          75
Named page finding      150          75
T.-Y. Liu, Learning to Rank for Information Retrieval
Results on NP2003
Algorithm    N@1    N@3    N@10   P@1    P@3    P@10   MAP
Regression   0.447  0.614  0.665  0.447  0.220  0.081  0.564
RankSVM      0.580  0.765  0.800  0.580  0.271  0.092  0.696
RankBoost    0.600  0.764  0.807  0.600  0.269  0.094  0.707
FRank        0.540  0.726  0.776  0.540  0.253  0.090  0.664
ListNet      0.567  0.758  0.801  0.567  0.267  0.092  0.690
AdaRank      0.580  0.729  0.764  0.580  0.251  0.086  0.678
SVMmap       0.560  0.767  0.798  0.560  0.269  0.089  0.687
T.-Y. Liu, Learning to Rank for Information Retrieval (Table 6.7)
Results on HP2004
Algorithm    N@1    N@3    N@10   P@1    P@3    P@10   MAP
Regression   0.387  0.575  0.646  0.387  0.213  0.080  0.526
RankSVM      0.573  0.715  0.768  0.573  0.267  0.096  0.668
RankBoost    0.507  0.699  0.743  0.507  0.253  0.092  0.625
FRank        0.600  0.729  0.761  0.600  0.262  0.089  0.682
ListNet      0.600  0.721  0.784  0.600  0.271  0.098  0.690
AdaRank      0.613  0.816  0.832  0.613  0.298  0.094  0.722
SVMmap       0.627  0.754  0.806  0.627  0.280  0.096  0.718
T.-Y. Liu, Learning to Rank for Information Retrieval (Table 6.10)
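The column headers in these tables are NDCG at cutoff k (N@k), precision at k (P@k), and mean average precision (MAP). A sketch of the three metrics for a single ranked list with binary relevance labels (the labels below are illustrative; the (2^rel - 1) DCG gain is one common convention):

```python
import math

def precision_at_k(rels, k):
    """Fraction of the top-k documents that are relevant."""
    return sum(rels[:k]) / k

def ndcg_at_k(rels, k):
    """DCG of the top k, normalized by the DCG of the ideal ordering."""
    dcg = sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    idcg = sum((2 ** r - 1) / math.log2(i + 2)
               for i, r in enumerate(sorted(rels, reverse=True)[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def average_precision(rels):
    """Mean of precision@rank over the ranks of relevant documents."""
    hits, total = 0, 0.0
    for i, r in enumerate(rels):
        if r:
            hits += 1
            total += hits / (i + 1)
    return total / sum(rels) if sum(rels) else 0.0

rels = [1, 0, 1, 0]       # relevance of the documents at ranks 1..4
```

MAP is this average precision averaged over all queries in the test set.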
Experimental comparison
- Listwise ranking algorithms perform very well on most datasets; ListNet seems to be better than the others
- Pairwise ranking algorithms obtain good ranking accuracy on some (although not all) datasets
- Linear regression performs worse than the pairwise and listwise ranking algorithms
T.-Y. Liu, Learning to Rank for Information Retrieval
Outline
2 Learning to rank
  - Machine learning
  - Features
  - LTR approaches
  - Experimental comparison
  - Summary
Summary
Features:
- Content-based
- Link-based
- User-based
Approaches:
- Pointwise (regression, classification, ordinal regression)
- Pairwise (RankNet)
- Listwise (LambdaRank, LambdaMART)
Materials
- Tie-Yan Liu. Learning to Rank for Information Retrieval. Foundations and Trends in Information Retrieval, 2009
- Christopher J.C. Burges. From RankNet to LambdaRank to LambdaMART: An Overview. Microsoft Research Technical Report, 2010
Course overview
Offline: Data Acquisition, Data Processing, Data Storage
Online: Query Processing, Ranking, Evaluation
Advanced: Aggregated Search, Click Models, Present and Future of IR
See you tomorrow!
Offline: Data Acquisition, Data Processing, Data Storage
Online: Query Processing, Ranking, Evaluation
Advanced: Aggregated Search, Click Models, Present and Future of IR