Learning to Rank. Tie-Yan Liu, Microsoft Research Asia. CCIR 2011, Jinan, October 2011.


History of Web Search. Traditional text retrieval engines; search engines powered by link analysis.

Typical Search Engine Structure (diagram). Online part: user interface, query, query-time computing, caching, ranking, and the inverted index. Offline part: crawler, web page parser (pages, links and anchors), index builder, cached pages, link graph builder, link graph, link analysis producing page authority, and page and site statistics.

Challenges to New Search Engines. The same structure is shared by many search engines; which one can succeed? Search engines with a longer history have accumulated much experience in system tuning and many heuristics in ranking. It is hard for newly-born search engines to compete with the market leaders, because of this lack of experience and domain knowledge.

Challenges to New Search Engines. Question: can a new search engine obtain effective ranking heuristics and tune its system well without going through that long history? Answer: heuristics can be accumulated manually, but effective ranking models can also be learned from examples; systems can be tuned manually, but they can also be optimized automatically, using machine learning technologies. The new ranking mechanism: learning to rank.

Many Search Engines Employ Learning to Rank Technologies! Started from 2003: the ranking model is trained using a machine learning method called RankNet (LambdaRank and LambdaMART later on). Bing is catching up with Google very quickly; by 2011, Bing has gained about 30% market share.

History of Web Search. Traditional text retrieval engines; search engines powered by link analysis; search engines powered by learning to rank.

Outline. What is learning to rank; what is unique in learning to rank; the future of learning to rank.

Learning to Rank. General sense: any machine learning technology that can be used to learn a ranking model. Narrow sense: in most recent work, learning to rank is defined as the methodology that learns how to combine features by means of discriminative training. Discriminative training is also demanded in practice: search engines receive a lot of user feedback every day, and while it is hard to describe this feedback in a generative manner, it is definitely important to learn from it and constantly improve the ranking mechanism. The capability of combining a large number of features is also very promising: any new progress on retrieval models can easily be incorporated by including the output of the model as a feature.

Learning to Rank. Collect training data (queries and their labeled documents); extract features for query-document pairs; learn the ranking model by minimizing a loss function on the training data; use the model to answer online queries.

Learning to Rank Algorithms. Least Square Retrieval Function (TOIS 1989), Learning to retrieve information (SCC 1995), Learning to order things (NIPS 1998), Ranking SVM (ICANN 1999), Pranking (NIPS 2002), Large margin ranker (NIPS 2002), RankBoost (JMLR 2003), OAP-BPM (ICML 2003), Round robin ranking (ECML 2003), Discriminative model for IR (SIGIR 2004), LDM (SIGIR 2005), RankNet (ICML 2005), Constraint Ordinal Regression (ICML 2005), SVM Structure (JMLR 2005), Subset Ranking (COLT 2006), Nested Ranker (SIGIR 2006), IRSVM (SIGIR 2006), LambdaRank (NIPS 2006), ListNet (ICML 2007), MPRank (ICML 2007), FRank (SIGIR 2007), MHR (SIGIR 2007), SVM-MAP (SIGIR 2007), GBRank (SIGIR 2007), AdaRank (SIGIR 2007), CCA (SIGIR 2007), McRank (NIPS 2007), QBRank (NIPS 2007), GPRank (LR4IR 2007), SoftRank (LR4IR 2007), RankCosine (IP&M 2007), Supervised Rank Aggregation (WWW 2007), Query refinement (WWW 2008), Relational ranking (WWW 2008), ListMLE (ICML 2008).

Learning to Rank Algorithms Revisited. Much early work on learning to rank regarded ranking as an application and tried to adapt existing machine learning algorithms to solve it: regression (treat relevance degrees as real values), classification (treat relevance degrees as categories), and pairwise classification (reduce ranking to classifying the order between each pair of documents).

Example: Subset Ranking (D. Cossock and T. Zhang, COLT 2006). Regard the relevance degree as a real number and use regression to learn the ranking function, with the loss $L(f; x_j, y_j) = (f(x_j) - y_j)^2$. Regression-based.
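A minimal sketch of this regression-based reduction (illustrative only, not the original Subset Ranking implementation; the toy features, labels, and the choice of a linear model are assumptions):

```python
# Fit f by least squares on (query-document feature vector, graded relevance)
# pairs, then rank the documents of a query by their predicted scores.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 4 documents of one query, 3 features each,
# graded relevance labels in {0, 1, 2}.
X = np.array([[0.9, 0.2, 0.4],
              [0.1, 0.8, 0.3],
              [0.5, 0.5, 0.9],
              [0.2, 0.1, 0.1]])
y = np.array([2, 1, 2, 0])

f = LinearRegression().fit(X, y)      # minimizes sum_j (f(x_j) - y_j)^2
ranking = np.argsort(-f.predict(X))   # sort documents by predicted score
print(ranking)
```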

Example: McRank (P. Li, et al., NIPS 2007). Multi-class classification is used to learn the ranking function. For document $x_j$, the output of the classifier is $\hat{y}_j$, and the loss function is a surrogate of the classification error $I_{\{\hat{y}_j \neq y_j\}}$. The ranking is produced by combining the outputs of the classifiers: with $\hat{p}_{j,k} = P(\hat{y}_j = k)$, the score is $f(x_j) = \sum_{k=1}^{K} k\,\hat{p}_{j,k}$. Classification-based.
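A sketch in the same classification-based spirit (illustrative: the original McRank uses boosted trees, whereas a logistic model and toy data are used here for brevity):

```python
# Learn P(y = k | x) with a multi-class classifier and score each document
# by its expected relevance grade, then sort by score.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.9, 0.2, 0.4],
              [0.1, 0.8, 0.3],
              [0.5, 0.5, 0.9],
              [0.2, 0.1, 0.1]])
y = np.array([2, 1, 2, 0])                    # relevance grades as classes

clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)                  # probs[j, k] = P(y_j = k)
grades = clf.classes_                         # e.g. array([0, 1, 2])
scores = probs @ grades                       # f(x_j) = sum_k k * P(y_j = k)
ranking = np.argsort(-scores)
print(scores, ranking)
```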

Example: Ranking SVM (R. Herbrich, et al., Advances in Large Margin Classifiers, 2000; T. Joachims, KDD 2002). Ranking SVM is rooted in the framework of SVM, and kernel tricks can also be applied to Ranking SVM so as to handle complex non-linear problems. The optimization problem is

$$\min_{w}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n}\ \sum_{u,v:\ y^{(i)}_{u,v}=1} \xi^{(i)}_{u,v}$$

subject to $w^{T}(x^{(i)}_u - x^{(i)}_v) \ge 1 - \xi^{(i)}_{u,v}$ and $\xi^{(i)}_{u,v} \ge 0$ whenever $y^{(i)}_{u,v} = 1$, for $i = 1, \dots, n$. In other words, use $x_u - x_v$ as a positive instance of learning, and use SVM to perform binary classification on these instances to learn the model parameter $w$. Pairwise classification-based.
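A sketch of the pairwise reduction using a generic linear SVM (illustrative; the toy data and the use of scikit-learn's LinearSVC are assumptions, not the original implementation):

```python
# For each preference pair within a query (u more relevant than v), use
# x_u - x_v as a positive instance and x_v - x_u as a negative instance,
# then train a linear SVM without intercept on the difference vectors.
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0.9, 0.2, 0.4],
              [0.1, 0.8, 0.3],
              [0.5, 0.5, 0.9],
              [0.2, 0.1, 0.1]])
y = np.array([2, 1, 2, 0])                    # graded relevance for one query

pairs, labels = [], []
for u in range(len(y)):
    for v in range(len(y)):
        if y[u] > y[v]:                       # u should be ranked above v
            pairs.append(X[u] - X[v]); labels.append(+1)
            pairs.append(X[v] - X[u]); labels.append(-1)

svm = LinearSVC(fit_intercept=False, C=1.0).fit(np.array(pairs), np.array(labels))
w = svm.coef_.ravel()                         # ranking model: score = w . x
print(np.argsort(-(X @ w)))                   # ranked document indices
```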

Are They the Right Approaches? These reductions do not reflect the full nature of ranking. In ranking, one cares about the order among documents, not about absolute scores or categories; top positions in the ranked list are more important; and the notion of query plays an important role: only documents associated with the same query can be compared to each other and ranked one after another, and each query contributes equally to the overall evaluation measure (see the definitions of MAP and NDCG).

New Research is Needed. New algorithms, to capture the unique properties of ranking (relative order, position, query, etc.) in a principled manner: the listwise approach to learning to rank. New theorems, to understand the theoretical nature of learning to rank algorithms and to guarantee their performance: statistical learning theory for ranking.

The Listwise Approach

Defining Ranking Loss is Non-trivial! An example. Model f: f(A)=3, f(B)=0, f(C)=1, giving the ranking A > C > B. Model h: h(A)=4, h(B)=6, h(C)=3, giving B > A > C. Ground truth g: g(A)=6, g(B)=4, g(C)=3, giving A > B > C. Question: which model is better (closer to the ground truth)? Based on Euclidean distance between the scores: sim(f,g) < sim(h,g). Based on pairwise comparisons: sim(f,g) = sim(h,g). However, according to NDCG, f should be closer to g!
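A quick worked check of this example (assuming linear gains equal to the ground-truth scores and the usual 1/log2(rank+1) position discount; both choices are assumptions, since the slide does not fix them):

```python
import math

g = {"A": 6, "B": 4, "C": 3}                  # ground-truth scores
rank_f = ["A", "C", "B"]                      # order induced by f
rank_h = ["B", "A", "C"]                      # order induced by h
ideal  = ["A", "B", "C"]                      # order induced by g

def dcg(order):
    return sum(g[d] / math.log2(i + 2) for i, d in enumerate(order))

ndcg_f = dcg(rank_f) / dcg(ideal)             # ~0.99
ndcg_h = dcg(rank_h) / dcg(ideal)             # ~0.93
print(ndcg_f, ndcg_h)                         # f is indeed closer to g
```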

Listwise Loss Functions. From a ranked list to a permutation probability distribution P(π | f): permutations are a more informative representation of the ranked list, and permutations and ranked lists are in one-to-one correspondence.

Defining Permutation Probability. The probability of a permutation π is defined with the Plackett-Luce model:

$$P_{PL}(\pi \mid f) = \prod_{j=1}^{m} \frac{\exp\big(f(x_{\pi(j)})\big)}{\sum_{k=j}^{m} \exp\big(f(x_{\pi(k)})\big)}$$

Example:

$$P_{PL}(A \succ B \succ C \mid f) = \frac{\exp f(A)}{\exp f(A) + \exp f(B) + \exp f(C)} \cdot \frac{\exp f(B)}{\exp f(B) + \exp f(C)} \cdot \frac{\exp f(C)}{\exp f(C)}$$

that is, P(A ranked No.1) × P(B ranked No.2 | A ranked No.1) × P(C ranked No.3 | A ranked No.1, B ranked No.2), where P(B ranked No.2 | A ranked No.1) = P(B ranked No.1) / (1 − P(A ranked No.1)).
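A small sketch of this definition in code (the score values are taken from the earlier example and are illustrative):

```python
import math

scores = {"A": 3.0, "B": 0.0, "C": 1.0}       # the model f from the earlier example

def plackett_luce(perm, scores):
    """P_PL(perm | f): product over positions of exp(score) divided by the
    exp-scores of the documents not yet ranked."""
    remaining, prob = list(perm), 1.0
    for doc in perm:
        z = sum(math.exp(scores[d]) for d in remaining)
        prob *= math.exp(scores[doc]) / z
        remaining.remove(doc)
    return prob

print(plackett_luce(("A", "C", "B"), scores))  # the most likely permutation under f
print(plackett_luce(("B", "A", "C"), scores))  # a much less likely permutation
```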

Distance between Ranked Lists. Using KL divergence to measure the difference between the distributions: dis(f,g) = 0.46, dis(g,h) = 2.56.

K-L Divergence Loss. ListNet (ICML 2007): $L(f; x, g) = D\big(P_g \,\|\, P_{PL}(f(x))\big)$. ListMLE (ICML 2008), an efficient variant of ListNet: take $P_g(\pi) = 1$ if $\pi = \pi_g$ and $0$ otherwise, so that the loss becomes $L(f; x, g) = -\log P_{PL,\pi_g}(f(x))$. ListNet and ListMLE are regarded as among the most effective learning to rank algorithms.
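A sketch of the ListMLE loss under these definitions (illustrative score values; the ground-truth permutation is the one induced by g in the earlier example):

```python
# ListMLE loss: negative log Plackett-Luce likelihood of the ground-truth
# permutation pi_g under the model's scores.
import math

def listmle_loss(pi_g, scores):
    """L(f; x, g) = -log P_PL(pi_g | f)."""
    remaining, log_p = list(pi_g), 0.0
    for doc in pi_g:
        z = sum(math.exp(scores[d]) for d in remaining)
        log_p += scores[doc] - math.log(z)
        remaining.remove(doc)
    return -log_p

pi_g = ("A", "B", "C")                               # ground-truth order from g
print(listmle_loss(pi_g, {"A": 3, "B": 0, "C": 1}))  # model f: loss ~ 1.5
print(listmle_loss(pi_g, {"A": 4, "B": 6, "C": 3}))  # model h: loss ~ 2.2
```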

Other Work on Listwise Ranking. Listwise loss functions: AdaRank, a boosting approach to listwise ranking (SIGIR 2007); PermuRank, a structured SVM approach to listwise ranking (SIGIR 2008). Listwise ranking functions: C-CRF, which defines a listwise ranking function using conditional random fields (NIPS 2008); R-RSVM, which defines a listwise ranking function using a relational SVM (WWW 2008).

Listwise Ranking Has Become an Important Branch of Learning to Rank.

Statistical Learning Theory for Ranking

Why Theory? In practice, one can only observe experimental results on relatively small datasets. Such empirical results might not be reliable: a small training set cannot fully realize the potential of a learning algorithm, and a small test set cannot reflect the true performance of an algorithm, since the real query space is huge. Statistical learning theory analyzes the performance of an algorithm when the training data approaches infinity and the test data is randomly sampled.

Generalization Analysis. In the training phase, one learns a model by minimizing the empirical risk on the training data. In the test phase, one evaluates the expected risk of the model on any sample. Generalization analysis is concerned with bounding the difference between the expected and the empirical risk as the number of training examples approaches infinity.

Generalization in Learning to Rank (diagram). Training process: minimize a loss on finite data (e.g., the likelihood loss) over n training queries, each with its labeled documents (Doc 1, Label 1, ..., Doc m, Label m), to obtain a ranking model. Test process: the model is evaluated by a measure on infinite data, e.g., (1-NDCG), over queries and web documents. Can this process generalize? That is, does Test Measure ≤ Training Loss + ε(n, m, F) hold?

How to Get There. The target bound, Test Measure ≤ Training Loss + ε(n, m, F), is obtained in two steps: (1) Test Loss ≤ Training Loss + ε, and (2) Test Measure ≤ Test Loss.

(1) Test Loss ≤ Training Loss + ε(n, m, F)? This is generalization in terms of the loss. To perform this generalization analysis, we need to make probabilistic assumptions on the data generation.

Previous Assumptions: Document Ranking (Agarwal et al., 2005; Clemencon et al., 2007). Documents and their labels (Doc 1, Label 1, ..., Doc m, Label m) are assumed to be sampled directly, with no notion of query. But the test is conducted at the query level in learning to rank, and under this assumption deep and shallow training sets correspond to the same generalization ability.

Previous Assumptions: Subset Ranking (Lan et al., 2008; Lan et al., 2009). Queries are sampled, and each query is represented by a deterministic subset of m documents and their labels. But the document subsets are deterministic, whereas in practice training documents are sampled and different numbers of training documents lead to different performance of the ranking model; under this assumption, more training documents will not enhance, and may even hurt, the generalization ability.

Two-layer Sampling (NIPS 2010). Different from document ranking, there is sampling of queries, and the documents associated with different queries are sampled according to different distributions. Different from subset ranking, the sampling of documents (feature vectors and labels) for each query is considered. Elements in two-layer sampling are neither independent nor identically distributed.

Two-layer Generalization Bound (diagram): Test Loss ≤ Training Loss + ε(n, m, F). Decomposition: the two-layer error is split into a query-layer error and a doc-layer error, the latter conditioned on the query sample. Concentration: introducing ghost query samples with fixed-size pseudo document samples, and a ghost document sample for each query, bounds these errors via doc-layer and query-layer reduced two-layer Rademacher averages, which together control the two-layer Rademacher average.

Discussion: Deep or Shallow? With a budget to label only C documents, there is an optimal tradeoff between the number of queries n and the number of documents per query m. For example, when the ranking function class satisfies certain complexity conditions, the optimal tradeoff can be given in closed form.

(2) Ranking Measure ≤ Loss Function?

Loss Function vs. Ranking Measure. The loss function in ListMLE, $L(f; x, \pi_y) = -\log P_{PL,\pi_y}(f(x))$, is based on the scores produced by the ranking model. In contrast, (1-NDCG), where NDCG (Normalized Discounted Cumulative Gain) combines cumulated gains, position discounts, and a normalization, is based on the ranked list obtained by sorting the scores.
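For reference, one standard form of NDCG at position k (the exact gain and discount functions vary across papers; this particular choice is an assumption, not taken from the slide):

$$\mathrm{NDCG@}k = \frac{1}{Z_k} \sum_{j=1}^{k} \frac{2^{y_{\pi(j)}} - 1}{\log_2(j+1)},$$

where $\pi$ is the ranked list obtained by sorting the scores, $2^{y_{\pi(j)}} - 1$ is the gain of the document at position $j$ (with $y_{\pi(j)}$ its relevance grade), $1/\log_2(j+1)$ is the position discount, and $Z_k$ normalizes by the DCG of the ideal ranking.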

Challenge. The relationship between the loss function and the ranking measure is unclear, due to their different mathematical forms. In contrast, for classification both the loss functions and the evaluation measures are defined with respect to individual documents, and their relationship is clear.

Essential Loss for Ranking (NIPS 2009). Model ranking as a sequence of classifications. With ground-truth permutation y = (A, B, C, D), the prediction of the ranking function f proceeds step by step: output the document with the largest ranking score from {A, B, C, D}, then from {B, C, D}, then from {C, D}. The essential loss is the weighted classification error accumulated over the steps in this sequence.

Essential Loss vs. Ranking Measures. 1) Both (1-NDCG) and (1-MAP) are upper bounded by the essential loss. 2) The zero value of the essential loss is a necessary and sufficient condition for the zero values of (1-NDCG) and (1-MAP).

Essential Loss vs. Surrogate Losses. 1) Many pairwise and listwise loss functions are upper bounds of the essential loss. 2) Therefore, the pairwise and listwise loss functions are also upper bounds of (1-NDCG) and (1-MAP).

Learning Theory for Ranking. (1) + (2) build the foundation of a statistical learning theory for ranking: a guarantee on the test performance (in terms of the ranking measure) given the training performance (in terms of the loss function). Many people have started to look into this important field, inspired by our work.

Summary and Outlook

Learning to Rank is Really Hot! Hundreds of publications at SIGIR, ICML, NIPS, etc.; several benchmark datasets released; one or two sessions at SIGIR every year in recent years; several workshops at SIGIR, ICML, NIPS, etc.; several tutorials at SIGIR, WWW, ACL, etc.; a special issue of the IR Journal; the Yahoo! Learning to Rank Challenge; several books published on the topic.

Wide Applications of Learning to Rank. Document retrieval, question answering, multimedia retrieval, text summarization, online advertising, collaborative filtering, machine translation.

Future Work: Challenges of Theories. Tighter generalization bounds and convergence rates; statistical consistency; coverage of more learning to rank algorithms; sample selection bias.

Future Work: Challenges from Real Applications. Large-scale learning to rank; robust learning to rank; online, incremental, and active learning to rank; transfer learning to rank; structural learning to rank (diversity, whole-page relevance).

References

tyliu@microsoft.com  http://research.microsoft.com/people/tyliu/  http://weibo.com/tieyanliu/