Learning to Rank: from heuristics to theoretic approaches. Hongning Wang

Congratulations!
Job offer from the Bing Core Ranking team: design the ranking module for Bing.com.
CS 6501: Information Retrieval

How should I rank documents? Answer: Rank by relevance!

Relevance?!

The Notion of Relevance
Similarity between Rep(q) and Rep(d), with different representations and similarity functions: vector space model (Salton et al., 75); prob. distr. model (Wong & Yao, 89)
Probability of relevance, P(r=1|q,d), r ∈ {0,1}:
  Regression model (Fuhr 89); learning to rank (Joachims 02, Burges et al. 05)
  Generative model, doc generation: classical prob. model (Robertson & Sparck Jones, 76)
  Generative model, query generation: LM approach (Ponte & Croft, 98) (Lafferty & Zhai, 01a)
Probabilistic inference, P(d→q) or P(q→d), with different inference systems: prob. concept space model (Wong & Yao, 95); inference network model (Turtle & Croft, 91)
Also: relevance constraints [Fang et al. 04]; divergence from randomness (Amati & Rijsbergen 02)

Relevance Estimation
Query matching: language model, BM25, vector space cosine similarity
Document importance: PageRank, HITS

Did I do a good job of ranking documents? (PageRank, BM25)

Did I do a good job of ranking documents?
IR evaluation metrics: Precision@K, MAP, NDCG

Take advantage of different relevance estimators? Ensemble the cues
Linear? $\alpha_1 \cdot \text{BM25} + \alpha_2 \cdot \text{LM} + \alpha_3 \cdot \text{PageRank} + \alpha_4 \cdot \text{HITS}$
  {$\alpha_1$ = 0.4, $\alpha_2$ = 0.2, $\alpha_3$ = 0.1, $\alpha_4$ = 0.1} gives {MAP = 0.20, NDCG = 0.6}
  {$\alpha_1$ = 0.2, $\alpha_2$ = 0.2, $\alpha_3$ = 0.3, $\alpha_4$ = 0.1} gives {MAP = 0.12, NDCG = 0.5}
  {$\alpha_1$ = 0.1, $\alpha_2$ = 0.1, $\alpha_3$ = 0.5, $\alpha_4$ = 0.2} gives {MAP = 0.18, NDCG = 0.7}
Non-linear? Decision tree-like:
  BM25 > 0.5? True: LM > 0.1? (True: r = 1.0; False: r = 0.7). False: PageRank > 0.3? (True: r = 0.4; False: r = 0.1)
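The two ways of combining cues on this slide can be sketched in a few lines of Python. This is only an illustration: the function names and the example document's feature values are hypothetical, and the tree layout (LM test on the True branch, PageRank test on the False branch) is an assumption about how the slide's figure was arranged.

```python
def linear_score(feats, alphas):
    """Linear ensemble: weighted sum of the relevance estimators."""
    return sum(alphas[name] * feats[name] for name in alphas)


def tree_score(feats):
    """Decision-tree-like ensemble using the thresholds shown on the slide."""
    if feats["BM25"] > 0.5:
        return 1.0 if feats["LM"] > 0.1 else 0.7
    return 0.4 if feats["PageRank"] > 0.3 else 0.1


# First weight setting from the slide.
alphas = {"BM25": 0.4, "LM": 0.2, "PageRank": 0.1, "HITS": 0.1}

# Hypothetical feature values for one (query, document) pair.
doc = {"BM25": 0.8, "LM": 0.05, "PageRank": 0.6, "HITS": 0.3}

print(linear_score(doc, alphas))
print(tree_score(doc))
```

Choosing the weights or the tree by hand, as here, is exactly the step that learning to rank automates.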

What if we have thousands of features?
How to determine those $\alpha$s? Where to find those tree structures?
Is there any way I can do better? Optimizing the metrics automatically!

Rethink the task
Given: (query, document) pairs represented by a set of relevance estimators, a.k.a. features
  DocID  BM25  LM   PageRank  Label
  0001   1.6   1.1  0.9       0
  0002   2.7   1.9  0.2       1
Needed: a way of combining the estimators, $f(q, \{d_i\}_{i=1}^{D}) \rightarrow \text{ordered } \{d_i\}_{i=1}^{D}$
Criterion: optimize IR metrics (P@k, MAP, NDCG, etc.). Key!

Machine Learning
Input: $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)\}$, where $X_i \in \mathbb{R}^N$, $Y_i \in \mathbb{R}^M$
Objective function: $O(Y', Y)$
Output: $f(X) \rightarrow Y$, such that $f = \arg\max_{f' \in F} O(f'(X), Y)$
NOTE: We will only talk about supervised learning.
Classification: $O(Y', Y) = \delta(Y' = Y)$ (figure source: http://en.wikipedia.org/wiki/statistical_classification)
Regression: $O(Y', Y) = -\|Y' - Y\|$ (figure source: http://en.wikipedia.org/wiki/regression_analysis)

Learning to Rank
General solution in optimization framework
Input: $\{((q_i, d_1), y_1), ((q_i, d_2), y_2), \ldots, ((q_i, d_n), y_n)\}$, where $d_n \in \mathbb{R}^N$, $y_i \in \{0, \ldots, L\}$
Objective: O = {P@k, MAP, NDCG}
Output: $f(q, d) \rightarrow Y$, s.t. $f = \arg\max_{f' \in F} O(f'(q, d), Y)$
  DocID  BM25  LM   PageRank  Label
  0001   1.6   1.1  0.9       0
  0002   2.7   1.9  0.2       1

Challenge: how to optimize?
Evaluation metric recap: Average Precision, DCG
Order is essential! f → order → metric
Not continuous with respect to f(X)!
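A small sketch of why the metric is not continuous in the scores: nudging a score changes nothing until two documents swap places, at which point the metric jumps. The DCG variant used here (gain $2^{rel} - 1$, discount $\log_2(\text{rank} + 1)$) is one common definition and is an assumption, as are the scores and labels.

```python
import math


def dcg(labels_in_rank_order):
    # Assumed DCG variant: gain 2^rel - 1, discount log2(rank + 1).
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(labels_in_rank_order, start=1))


def dcg_of_scores(scores, labels):
    # f -> order -> metric: sort documents by score, then evaluate the order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return dcg([labels[i] for i in order])


labels = [2, 0, 1]  # hypothetical graded relevance of three documents

base = dcg_of_scores([0.9, 0.5, 0.40], labels)
nudged = dcg_of_scores([0.9, 0.5, 0.49], labels)   # score moved, order unchanged
flipped = dcg_of_scores([0.9, 0.5, 0.51], labels)  # last two documents swap

print(base, nudged, flipped)
```

Because the metric is flat almost everywhere and jumps at swaps, its gradient with respect to the scores is zero or undefined, which is what motivates the surrogate objectives that follow.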

Approximating the objective function!
Pointwise: fit the relevance labels individually
Pairwise: fit the relative orders
Listwise: fit the whole order

Pointwise Learning to Rank
Ideally, perfect relevance prediction leads to perfect ranking: f → score → order → metric
Reducing the ranking problem to
  Regression: $O(f(Q, D), Y) = -\sum_i \|f(q_i, d_i) - y_i\|$ (Subset Ranking using Regression, D. Cossock and T. Zhang, COLT 2006)
  (Multi-)classification: $O(f(Q, D), Y) = \sum_i \delta(f(q_i, d_i) = y_i)$ (Ranking with Large Margin Principles, A. Shashua and A. Levin, NIPS 2002)
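As a minimal sketch of the regression reduction, the snippet below fits a linear scoring function to the labels by ordinary least squares with NumPy and then ranks by predicted score. It is not the weighted scheme of Cossock and Zhang; the first two rows are the example rows from the earlier feature table, and the other two rows are hypothetical.

```python
import numpy as np

# Columns: BM25, LM, PageRank.
X = np.array([
    [1.6, 1.1, 0.9],   # DocID 0001 from the slide, label 0
    [2.7, 1.9, 0.2],   # DocID 0002 from the slide, label 1
    [0.5, 0.3, 0.1],   # hypothetical document, label 0
    [3.0, 2.2, 0.7],   # hypothetical document, label 1
])
y = np.array([0.0, 1.0, 0.0, 1.0])

# Pointwise: fit each relevance label individually (least squares, no bias term).
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Rank documents by predicted relevance, highest first.
scores = X @ w
ranking = [int(i) for i in np.argsort(-scores)]
print(ranking)
```

Each document contributes to the loss independently, so nothing in the fit knows where a document will end up in the final ranking.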

Subset Ranking using Regression. D. Cossock and T. Zhang, COLT 2006
Fit relevance labels via regression
Emphasize more on relevant documents: weights on each document; most positive document
Figure source: http://en.wikipedia.org/wiki/regression_analysis

Ranking with Large Margin Principles. A. Shashua and A. Levin, NIPS 2002
Goal: correctly place the documents in the corresponding category and maximize the margin
Figure: documents with labels Y=0, Y=1, Y=2 separated by margins; reduce the violations

Ranking with Large Margin Principles. A. Shashua and A. Levin, NIPS 2002
Ranking loss decreases consistently with more training data

What did we learn
Machine learning helps!
Derive something optimizable
More efficient and guided

Deficiency
Cannot directly optimize IR metrics: predicting (0 → 1, 2 → 0) is worse for ranking than (0 → -2, 2 → 4), even though its regression error is smaller
Positions of documents are ignored: the penalty on documents at higher positions should be larger
Favors the queries with more documents
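The mismatch can be checked numerically. The sketch below reads the slide's example as two documents with true labels 0 and 2, predicted either as (1, 0) or as (-2, 4); that reading of the garbled example is an assumption, as are the helper names.

```python
def squared_error(pred, truth):
    """Pointwise regression loss summed over documents."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth))


def misordered_pairs(pred, truth):
    """Number of pairs where the more relevant document gets the lower score."""
    n = len(truth)
    return sum(1 for i in range(n) for j in range(n)
               if truth[i] > truth[j] and pred[i] < pred[j])


truth = [0, 2]
pred_a = [1, 0]     # 0 -> 1, 2 -> 0
pred_b = [-2, 4]    # 0 -> -2, 2 -> 4

print(squared_error(pred_a, truth), misordered_pairs(pred_a, truth))
print(squared_error(pred_b, truth), misordered_pairs(pred_b, truth))
```

The first prediction has the lower regression loss but reverses the order; the second has the higher loss but ranks perfectly.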

Pairwise Learning to Rank
Ideally, perfect partial order leads to perfect ranking: f → partial order → order → metric
Ordinal regression: $O(f(Q, D), Y) = -\sum_{i \neq j} \delta(y_i > y_j)\, \delta(f(q_i, d_i) < f(q_j, d_j))$
Relative ordering between different documents is significant
E.g., (0 → -2, 2 → 4) is better than (0 → 1, 2 → 0)
Large body of research

Optimizing Search Engines using Clickthrough Data. Thorsten Joachims, KDD 02
Minimizing the number of mis-ordered pairs, e.g., $y_1 > y_2$, $y_2 > y_3$, $y_1 > y_4$
Linear combination of features: $f(q, d) = w^T X_{q,d}$
Keep the relative orders: for $y_1 > y_2$, require $f(q, d_1) > f(q, d_2)$
RankSVM
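A rough sketch of the pairwise idea behind RankSVM: each preference $y_i > y_j$ becomes a difference vector $X_i - X_j$, and a linear $w$ is trained so that $w^T (X_i - X_j)$ clears a margin of 1. This uses plain stochastic subgradient descent on an L2-regularized hinge loss, not the SVM-light solver from the paper; the data, learning rate, and regularization constant are all hypothetical.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise_hinge(X, y, lr=0.1, reg=0.01, epochs=200):
    """Learn w so that w.(x_i - x_j) >= 1 whenever y_i > y_j (soft constraint)."""
    # One difference vector per preference pair.
    diffs = [[a - b for a, b in zip(X[i], X[j])]
             for i in range(len(X)) for j in range(len(X)) if y[i] > y[j]]
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for d in diffs:
            # L2 regularization step (shrink the weights).
            w = [wi * (1.0 - lr * reg) for wi in w]
            # Hinge subgradient step when the pair violates the margin.
            if dot(w, d) < 1.0:
                w = [wi + lr * di for wi, di in zip(w, d)]
    return w


# Hypothetical documents for one query: two features each, graded labels.
X = [[3.0, 0.2], [2.0, 0.9], [1.0, 0.1], [0.5, 0.8]]
y = [3, 2, 1, 0]

w = train_pairwise_hinge(X, y)
ranking = sorted(range(len(X)), key=lambda i: dot(w, X[i]), reverse=True)
print(w, ranking)
```

At query time only the scores $w^T X_{q,d}$ are needed; the pairs are used during training only.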

Optimizing Search Engines using Clickthrough Data. Thorsten Joachims, KDD 02
How to use it? f → score → order: $y_1 > y_2 > y_3 > y_4$

Optimizing Search Engines using Clickthrough Data. Thorsten Joachims, KDD 02
What did it learn from the data? Linear correlations: positively correlated features and negatively correlated features

Optimizing Search Engines using Clickthrough Data. Thorsten Joachims, KDD 02
How good is it? Test on a real system

An Efficient Boosting Algorithm for Combining Preferences. Y. Freund, R. Iyer, et al., JMLR 2003
Smooth the loss on mis-ordered pairs: $\sum_{y_i > y_j} \Pr(d_i, d_j) \exp[f(q, d_j) - f(q, d_i)]$
Figure: square loss, hinge loss (RankSVM), 0/1 loss, and exponential loss (from Pattern Recognition and Machine Learning, p. 337)
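The four losses named on this slide can be compared as functions of the pairwise margin $m = f(q, d_i) - f(q, d_j)$ for a pair with $y_i > y_j$, so that the exponential loss above is $\exp(-m)$. The sketch below uses the usual textbook forms of the hinge and square losses (an assumption) and counts ties as mis-ordered.

```python
import math


def zero_one(m):
    # Counts a pair as mis-ordered when the margin is not positive.
    return 1.0 if m <= 0 else 0.0


def hinge(m):
    return max(0.0, 1.0 - m)


def exponential(m):
    return math.exp(-m)


def square(m):
    return (1.0 - m) ** 2


margins = [x / 10.0 for x in range(-30, 31)]
for m in (-1.0, 0.0, 0.5, 2.0):
    print(m, zero_one(m), hinge(m), exponential(m), square(m))
```

Each smooth or convex loss sits on or above the 0/1 loss, so driving it down also bounds the number of mis-ordered pairs.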

An Efficient Boosting Algorithm for Combining Preferences. Y. Freund, R. Iyer, et al., JMLR 2003
RankBoost: optimize via boosting
Vote by a committee of ranking features (BM25, PageRank, Cosine)
Updating $\Pr(d_i, d_j)$
Credibility of each committee member (ranking feature)
Figure from Pattern Recognition and Machine Learning, p. 658

An Efficient Boosting Algorithm for Combining Preferences. Y. Freund, R. Iyer, et al., JMLR 2003
How good is it?

A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments. Zheng et al., SIGIR 07
Non-linear ensemble of features
Objective: $\sum_{y_i > y_j} \max(0, f(q, d_j) - f(q, d_i))^2$
Gradient descent boosting tree
Boosting tree: using a regression tree to minimize the residuals, $r_t(q, d, y) = O_t(q, d, y) - f_{t-1}(q, d, y)$
Figure (regression tree): BM25 > 0.5? True: LM > 0.1? (True: r = 1.0; False: r = 0.7). False: PageRank > 0.3? (True: r = 0.4; False: r = 0.1)

A Regression Framework for Learning Ranking Functions Using Relative Relevance Judgments. Zheng et al., SIGIR 07
Non-linear vs. linear: comparing with RankSVM

Where do we get the relative orders?
Human annotations: small scale, expensive to acquire
Clickthroughs: large amount, easy to acquire

What did we learn
Predicting relative order: getting closer to the nature of ranking
Promising performance in practice
Pairwise preferences from click-throughs

Listwise Learning to Rank
Can we directly optimize the ranking? f → order → metric
Tackle the challenge: optimization without gradient

From RankNet to LambdaRank to LambdaMART: An Overview. Christopher J.C. Burges, 2010
Minimizing mis-ordered pairs => maximizing IR metrics?
Ranking 1: mis-ordered pairs: 6; AP: 5/8; DCG: 1.333
Ranking 2: mis-ordered pairs: 4; AP: 5/12; DCG: 0.931
Position is crucial!
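The figure for this slide did not survive, so the two rankings below are hypothetical: eight documents with two relevant ones, placed at ranks 1 and 8 in the first ranking and at ranks 3 and 4 in the second. They were chosen because they reproduce the pair counts and AP values quoted above; DCG is left out because its exact definition on the slide is not recoverable.

```python
from fractions import Fraction


def average_precision(ranking):
    """AP for a binary relevance list given in rank order."""
    hits, total = 0, Fraction(0)
    for rank, rel in enumerate(ranking, start=1):
        if rel:
            hits += 1
            total += Fraction(hits, rank)
    return total / hits if hits else Fraction(0)


def misordered_pairs(ranking):
    """Number of (irrelevant, relevant) pairs with the irrelevant one ranked higher."""
    count, irrelevant_seen = 0, 0
    for rel in ranking:
        if rel:
            count += irrelevant_seen
        else:
            irrelevant_seen += 1
    return count


ranking_1 = [1, 0, 0, 0, 0, 0, 0, 1]  # relevant at ranks 1 and 8
ranking_2 = [0, 0, 1, 1, 0, 0, 0, 0]  # relevant at ranks 3 and 4

print(misordered_pairs(ranking_1), average_precision(ranking_1))
print(misordered_pairs(ranking_2), average_precision(ranking_2))
```

The second ranking has fewer mis-ordered pairs yet a lower AP, because nothing relevant appears at the top.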

From RankNet to LambdaRank to LambdaMART: An Overview. Christopher J.C. Burges, 2010
Weight the mis-ordered pairs?
Some pairs are more important to be placed in the right order
Inject into objective function: $\sum_{y_i > y_j} \Omega(d_i, d_j) \exp[-w\, \delta\phi(q_n, d_i, d_j)]$, where $\Omega(d_i, d_j)$ depends on the ranking of documents i, j in the whole list
Inject into gradient: $\lambda_{ij} = \frac{\partial O_{approx}}{\partial w} \Delta O$
  $\frac{\partial O_{approx}}{\partial w}$: gradient with respect to the approximated objective, i.e., exponential loss on mis-ordered pairs
  $\Delta O$: change in the original objective, e.g., NDCG, if we switch documents i and j, leaving the other documents unchanged
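The $\Delta O$ factor can be computed directly: swap two documents in the current ranking, keep everything else fixed, and measure how much NDCG changes. The sketch below assumes one common DCG definition (gain $2^{rel} - 1$, discount $\log_2(\text{rank} + 1)$); the labels and function names are hypothetical.

```python
import math


def dcg(labels):
    # Assumed DCG variant: gain 2^rel - 1, discount log2(rank + 1).
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(labels, start=1))


def delta_ndcg(labels, i, j):
    """|Change in NDCG| if the documents at 0-based ranks i and j are swapped."""
    ideal = dcg(sorted(labels, reverse=True))
    if ideal == 0:
        return 0.0
    swapped = list(labels)
    swapped[i], swapped[j] = swapped[j], swapped[i]
    return abs(dcg(swapped) - dcg(labels)) / ideal


# Hypothetical relevance labels in the current ranked order.
labels = [0, 2, 1, 0, 2]

top = delta_ndcg(labels, 0, 1)      # mis-ordered pair near the top
bottom = delta_ndcg(labels, 3, 4)   # same labels, near the bottom
print(top, bottom)
```

The same label mistake costs more near the top of the list, which is why pairs there receive larger weights.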

Lambda functions. From RankNet to LambdaRank to LambdaMART: An Overview. Christopher J.C. Burges, 2010
Gradient? Yes, it meets the sufficient and necessary condition of being a partial derivative
Lead to the optimal solution of the original problem? Empirically

Evolution. From RankNet to LambdaRank to LambdaMART: An Overview. Christopher J.C. Burges, 2010
Objective: RankNet: cross entropy over the pairs; LambdaRank: unknown; LambdaMART: unknown
Gradient ($\lambda$ function): RankNet: gradient of cross entropy; LambdaRank: gradient of cross entropy times pairwise change in target metric; LambdaMART: gradient of cross entropy times pairwise change in target metric
Optimization method: RankNet: neural network; LambdaRank: stochastic gradient descent; LambdaMART: Multiple Additive Regression Trees (MART)
Notes: as we discussed in RankBoost; optimize solely by gradient; non-linear combination

A Lambda tree. From RankNet to LambdaRank to LambdaMART: An Overview. Christopher J.C. Burges, 2010
Figure: splitting; combination of features

A Support Vector Method for Optimizing Average Precision. Yisong Yue, et al., SIGIR 07
RankSVM: minimizing the pairwise loss; loss defined on the number of mis-ordered document pairs
SVM-MAP: minimizing the structural loss (MAP difference); loss defined on the quality of the whole list of ordered documents

A Support Vector Method for Optimizing Average Precision. Yisong Yue, et al., SIGIR 07
Max margin principle: push the ground truth far away from any mistakes one might make
Finding the most likely violated constraints

A Support Vector Method for Optimizing Average Precision. Yisong Yue, et al., SIGIR 07
Finding the most violated constraints
MAP is invariant to permutation of (ir)relevant documents
Maximize MAP over a series of swaps between relevant and irrelevant documents (right-hand side of constraints)
Greedy solution: start from the reverse order of the ideal ranking

A Support Vector Method for Optimizing Average Precision. Yisong Yue, et al., SIGIR 07
Experiment results

Other listwise solutions
Soften the metrics to make them differentiable: Michael Taylor et al., SoftRank: optimizing non-smooth rank metrics, WSDM'08
Minimize a loss function defined on permutations: Zhe Cao et al., Learning to rank: from pairwise approach to listwise approach, ICML'07

What did we learn
Taking a list of documents as a whole
Positions are visible to the learning algorithm
Directly optimizing the target metric
Limitation: the search space is huge!

Summary
Learning to rank: automatic combination of ranking features for optimizing IR evaluation metrics
Approaches
  Pointwise: fit the relevance labels individually
  Pairwise: fit the relative orders
  Listwise: fit the whole order

Experimental Comparisons
My experiments: 1.2K queries, 45.5K documents with 1890 features; 800 queries for training, 400 queries for testing
  Method      MAP     P@1     ERR     MRR     NDCG@5
  ListNET     0.2863  0.2074  0.1661  0.3714  0.2949
  LambdaMART  0.4644  0.4630  0.2654  0.6105  0.5236
  RankNET     0.3005  0.2222  0.1873  0.3816  0.3386
  RankBoost   0.4548  0.4370  0.2463  0.5829  0.4866
  RankSVM     0.3507  0.2370  0.1895  0.4154  0.3585
  AdaRank     0.4321  0.4111  0.2307  0.5482  0.4421
  plogistic   0.4519  0.3926  0.2489  0.5535  0.4945
  Logistic    0.4348  0.3778  0.2410  0.5526  0.4762

Connection with Traditional IR
People foresaw this topic a long time ago
Recall: probabilistic ranking principle

Conditional models for P(R=1|Q,D)
Basic idea: relevance depends on how well a query matches a document
P(R=1|Q,D) = g(Rep(Q,D)|θ), where g is a functional form
Rep(Q,D): feature representation of the query-doc pair, e.g., #matched terms, highest IDF of a matched term, doclen
Use training data (with known relevance judgments) to estimate the parameter θ
Apply the model to rank new documents
Special case: logistic regression
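For the logistic regression special case, a minimal sketch looks like this: g is the sigmoid of a linear function of Rep(Q,D), and θ is estimated by stochastic gradient ascent on the log-likelihood. The training rows, the scaling of doclen to thousands of terms, and the hyperparameters are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def prob_relevant(theta, bias, rep):
    """P(R=1|Q,D) = g(Rep(Q,D)|theta) with g chosen as the logistic function."""
    return sigmoid(bias + sum(t * r for t, r in zip(theta, rep)))


def fit_logistic(reps, labels, lr=0.1, epochs=500):
    """Estimate theta by stochastic gradient ascent on the log-likelihood."""
    theta, bias = [0.0] * len(reps[0]), 0.0
    for _ in range(epochs):
        for rep, r in zip(reps, labels):
            err = r - prob_relevant(theta, bias, rep)
            theta = [t + lr * err * x for t, x in zip(theta, rep)]
            bias += lr * err
    return theta, bias


# Hypothetical Rep(Q,D): [#matched terms, highest IDF of a matched term, doclen / 1000].
reps = [[3, 2.0, 0.4], [2, 3.0, 0.8], [4, 2.5, 0.3],
        [0, 0.0, 0.5], [1, 0.5, 0.9], [1, 1.0, 0.2]]
labels = [1, 1, 1, 0, 0, 0]

theta, bias = fit_logistic(reps, labels)

# Apply the model to rank new (hypothetical) documents.
p_match = prob_relevant(theta, bias, [3, 2.5, 0.5])
p_no_match = prob_relevant(theta, bias, [0, 0.0, 0.5])
print(p_match, p_no_match)
```

This is the pointwise approach in its classical form: documents are ranked by the estimated probability of relevance.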

Broader Notion of Relevance
Figure: beyond query-document matching (BM25, language model, cosine), relevance can also draw on query relations, linkage structure, visual structure, click/view behavior, and social network signals (likes)

Future
Tighter bounds
Faster solution
Larger scale
Wider application scenario

Resources
Books
  Liu, Tie-Yan. Learning to rank for information retrieval. Vol. 13. Springer, 2011.
  Li, Hang. "Learning to rank for information retrieval and natural language processing." Synthesis Lectures on Human Language Technologies 4.1 (2011): 1-113.
Helpful pages
  http://en.wikipedia.org/wiki/learning_to_rank
Packages
  RankingSVM: http://svmlight.joachims.org/
  RankLib: http://people.cs.umass.edu/~vdang/ranklib.html
Data sets
  LETOR: http://research.microsoft.com/en-us/um/beijing/projects/letor//
  Yahoo! Learning to rank challenge: http://learningtorankchallenge.yahoo.com/

References
Liu, Tie-Yan. "Learning to rank for information retrieval." Foundations and Trends in Information Retrieval 3.3 (2009): 225-331.
Cossock, David, and Tong Zhang. "Subset ranking using regression." Learning theory (2006): 605-619.
Shashua, Amnon, and Anat Levin. "Ranking with large margin principle: Two approaches." Advances in neural information processing systems 15 (2003): 937-944.
Joachims, Thorsten. "Optimizing search engines using clickthrough data." Proceedings of the eighth ACM SIGKDD. ACM, 2002.
Freund, Yoav, et al. "An efficient boosting algorithm for combining preferences." The Journal of Machine Learning Research 4 (2003): 933-969.
Zheng, Zhaohui, et al. "A regression framework for learning ranking functions using relative relevance judgments." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.

References
Joachims, Thorsten, et al. "Accurately interpreting clickthrough data as implicit feedback." Proceedings of the 28th annual international ACM SIGIR. ACM, 2005.
Burges, Christopher J.C. "From RankNet to LambdaRank to LambdaMART: An overview." Microsoft Research Technical Report MSR-TR-2010-82, 2010.
Xu, Jun, and Hang Li. "AdaRank: a boosting algorithm for information retrieval." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.
Yue, Yisong, et al. "A support vector method for optimizing average precision." Proceedings of the 30th annual international ACM SIGIR. ACM, 2007.
Taylor, Michael, et al. "SoftRank: optimizing non-smooth rank metrics." Proceedings of the international conference WSDM. ACM, 2008.
Cao, Zhe, et al. "Learning to rank: from pairwise approach to listwise approach." Proceedings of the 24th ICML. ACM, 2007.

Ranking with Large Margin Principles. A. Shashua and A. Levin, NIPS 2002
Maximizing the sum of margins
Figure: documents with labels Y=0, Y=1, Y=2 separated by margins

AdaRank: a boosting algorithm for information retrieval. Jun Xu & Hang Li, SIGIR 07
Loss defined by IR metrics: $\sum_{q \in Q} \Pr(q) \exp[-O(q)]$
Optimizing by boosting, with a committee of ranking features (BM25, PageRank, Cosine)
Target metrics: MAP, NDCG, MRR
Updating $\Pr(q)$
Credibility of each committee member (ranking feature)
Figure from Pattern Recognition and Machine Learning, p. 658

Analysis of the Approaches
What are they really optimizing?
Relation with IR metrics: gap

Pointwise Approaches
Regression based (annotated terms: discount coefficients in DCG; regression loss)
Classification based (annotated terms: discount coefficients in DCG; classification loss)

Discount coefficients in DCG

Listwise Approaches
No general analysis: method dependent
Directness and consistency