Learning to Rank. Tie-Yan Liu, Microsoft Research Asia. CCIR 2011, Jinan, October 2011.


History of Web Search. Traditional text retrieval engines; search engines powered by link analysis.

Typical Search Engine Structure (diagram). Online part: user interface, query, query-time computing, caching, ranking, and the inverted index. Offline part: crawler, web page parser (pages, links and anchors), index builder, cached pages, link graph builder, link graph, link analysis producing page authority, and page and site statistics.

Challenges to New Search Engines. The same structure is shared by many search engines; which one can succeed? Search engines with a longer history have accumulated much experience in system tuning and many heuristics in ranking. It is hard for newly-born search engines to compete with the market leaders, because of this lack of experience and domain knowledge.

Challenges to New Search Engines. Question: can a new search engine obtain effective ranking heuristics and tune its system well without going through that long history? Answer: heuristics can be accumulated manually, but effective ranking models can also be learned from examples; systems can be tuned manually, but they can also be optimized automatically, using machine learning technologies. The new ranking mechanism: learning to rank.

Many Search Engines Employ Learning to Rank Technologies! Started from 2003: the ranking model is trained using a machine learning method called RankNet (LambdaRank and LambdaMART later on). Bing is catching up with Google very quickly; by 2011, Bing has gained about 30% market share.

History of Web Search. Traditional text retrieval engines; search engines powered by link analysis; search engines powered by learning to rank.

Outline. What is learning to rank; what is unique in learning to rank; the future of learning to rank.

Learning to Rank. General sense: any machine learning technology that can be used to learn a ranking model. Narrow sense: in most recent work, learning to rank is defined as the methodology that learns how to combine features by means of discriminative training. Discriminative training is also demanded in practice: search engines receive a lot of user feedback every day, and while it is hard to describe this feedback in a generative manner, it is definitely important to learn from it and constantly improve the ranking mechanism. The capability of combining a large number of features is also very promising: any new progress on retrieval models can easily be incorporated by including the output of the model as a feature.

Learning to Rank. Collect training data (queries and their labeled documents); extract features for query-document pairs; learn the ranking model by minimizing a loss function on the training data; use the model to answer online queries.

Learning to Rank Algorithms. Least Square Retrieval Function (TOIS 1989), Learning to retrieve information (SCC 1995), Learning to order things (NIPS 1998), Ranking SVM (ICANN 1999), Pranking (NIPS 2002), Large margin ranker (NIPS 2002), RankBoost (JMLR 2003), OAP-BPM (ICML 2003), Round robin ranking (ECML 2003), Discriminative model for IR (SIGIR 2004), LDM (SIGIR 2005), RankNet (ICML 2005), Constraint Ordinal Regression (ICML 2005), SVM Structure (JMLR 2005), Subset Ranking (COLT 2006), Nested Ranker (SIGIR 2006), IRSVM (SIGIR 2006), LambdaRank (NIPS 2006), ListNet (ICML 2007), MPRank (ICML 2007), FRank (SIGIR 2007), MHR (SIGIR 2007), SVM-MAP (SIGIR 2007), GBRank (SIGIR 2007), AdaRank (SIGIR 2007), CCA (SIGIR 2007), McRank (NIPS 2007), QBRank (NIPS 2007), GPRank (LR4IR 2007), SoftRank (LR4IR 2007), RankCosine (IP&M 2007), Supervised Rank Aggregation (WWW 2007), Query refinement (WWW 2008), Relational ranking (WWW 2008), ListMLE (ICML 2008).

Learning to Rank Algorithms Revisited. Much early work on learning to rank regarded ranking as an application and tried to adapt existing machine learning algorithms to solve it: regression (treat relevance degrees as real values), classification (treat relevance degrees as categories), and pairwise classification (reduce ranking to classifying the order between each pair of documents).

Example: Subset Ranking (D. Cossock and T. Zhang, COLT 2006). Regard the relevance degree as a real number and use regression to learn the ranking function, with the loss $L(f; x_j, y_j) = (f(x_j) - y_j)^2$. Regression-based.
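A minimal sketch of this regression-based reduction (illustrative only, not the original Subset Ranking implementation; the toy features, labels, and the choice of a linear model are assumptions):

```python
# Fit f by least squares on (query-document feature vector, graded relevance)
# pairs, then rank the documents of a query by their predicted scores.
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: 4 documents of one query, 3 features each,
# graded relevance labels in {0, 1, 2}.
X = np.array([[0.9, 0.2, 0.4],
              [0.1, 0.8, 0.3],
              [0.5, 0.5, 0.9],
              [0.2, 0.1, 0.1]])
y = np.array([2, 1, 2, 0])

f = LinearRegression().fit(X, y)      # minimizes sum_j (f(x_j) - y_j)^2
ranking = np.argsort(-f.predict(X))   # sort documents by predicted score
print(ranking)
```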

Example: McRank (P. Li, et al., NIPS 2007). Multi-class classification is used to learn the ranking function. For document $x_j$, the output of the classifier is $\hat{y}_j$, and the loss function is a surrogate of the classification error $I_{\{\hat{y}_j \neq y_j\}}$. The ranking is produced by combining the outputs of the classifiers: with $\hat{p}_{j,k} = P(\hat{y}_j = k)$, the score is $f(x_j) = \sum_{k=1}^{K} k\,\hat{p}_{j,k}$. Classification-based.
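A sketch in the same classification-based spirit (illustrative: the original McRank uses boosted trees, whereas a logistic model and toy data are used here for brevity):

```python
# Learn P(y = k | x) with a multi-class classifier and score each document
# by its expected relevance grade, then sort by score.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[0.9, 0.2, 0.4],
              [0.1, 0.8, 0.3],
              [0.5, 0.5, 0.9],
              [0.2, 0.1, 0.1]])
y = np.array([2, 1, 2, 0])                    # relevance grades as classes

clf = LogisticRegression(max_iter=1000).fit(X, y)
probs = clf.predict_proba(X)                  # probs[j, k] = P(y_j = k)
grades = clf.classes_                         # e.g. array([0, 1, 2])
scores = probs @ grades                       # f(x_j) = sum_k k * P(y_j = k)
ranking = np.argsort(-scores)
print(scores, ranking)
```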

Example: Ranking SVM (R. Herbrich, et al., Advances in Large Margin Classifiers, 2000; T. Joachims, KDD 2002). Ranking SVM is rooted in the framework of SVM, and kernel tricks can also be applied to Ranking SVM so as to handle complex non-linear problems. The optimization problem is

$$\min_{w}\ \tfrac{1}{2}\|w\|^2 + C \sum_{i=1}^{n}\ \sum_{u,v:\ y^{(i)}_{u,v}=1} \xi^{(i)}_{u,v}$$

subject to $w^{T}(x^{(i)}_u - x^{(i)}_v) \ge 1 - \xi^{(i)}_{u,v}$ and $\xi^{(i)}_{u,v} \ge 0$ whenever $y^{(i)}_{u,v} = 1$, for $i = 1, \dots, n$. In other words, use $x_u - x_v$ as a positive instance of learning, and use SVM to perform binary classification on these instances to learn the model parameter $w$. Pairwise classification-based.
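A sketch of the pairwise reduction using a generic linear SVM (illustrative; the toy data and the use of scikit-learn's LinearSVC are assumptions, not the original implementation):

```python
# For each preference pair within a query (u more relevant than v), use
# x_u - x_v as a positive instance and x_v - x_u as a negative instance,
# then train a linear SVM without intercept on the difference vectors.
import numpy as np
from sklearn.svm import LinearSVC

X = np.array([[0.9, 0.2, 0.4],
              [0.1, 0.8, 0.3],
              [0.5, 0.5, 0.9],
              [0.2, 0.1, 0.1]])
y = np.array([2, 1, 2, 0])                    # graded relevance for one query

pairs, labels = [], []
for u in range(len(y)):
    for v in range(len(y)):
        if y[u] > y[v]:                       # u should be ranked above v
            pairs.append(X[u] - X[v]); labels.append(+1)
            pairs.append(X[v] - X[u]); labels.append(-1)

svm = LinearSVC(fit_intercept=False, C=1.0).fit(np.array(pairs), np.array(labels))
w = svm.coef_.ravel()                         # ranking model: score = w . x
print(np.argsort(-(X @ w)))                   # ranked document indices
```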

Are They the Right Approaches? These reductions do not reflect the full nature of ranking. In ranking, one cares about the order among documents, not about absolute scores or categories; top positions in the ranked list are more important; and the notion of query plays an important role: only documents associated with the same query can be compared to each other and ranked one after another, and each query contributes equally to the overall evaluation measure (see the definitions of MAP and NDCG).

New Research is Needed. New algorithms, to capture the unique properties of ranking (relative order, position, query, etc.) in a principled manner: the listwise approach to learning to rank. New theorems, to understand the theoretical nature of learning to rank algorithms and to guarantee their performance: statistical learning theory for ranking.

The Listwise Approach

Defining Ranking Loss is Non-trivial! An example. Model f: f(A)=3, f(B)=0, f(C)=1, giving the ranking A > C > B. Model h: h(A)=4, h(B)=6, h(C)=3, giving B > A > C. Ground truth g: g(A)=6, g(B)=4, g(C)=3, giving A > B > C. Question: which model is better (closer to the ground truth)? Based on Euclidean distance between the scores: sim(f,g) < sim(h,g). Based on pairwise comparisons: sim(f,g) = sim(h,g). However, according to NDCG, f should be closer to g!
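A quick worked check of this example (assuming linear gains equal to the ground-truth scores and the usual 1/log2(rank+1) position discount; both choices are assumptions, since the slide does not fix them):

```python
import math

g = {"A": 6, "B": 4, "C": 3}                  # ground-truth scores
rank_f = ["A", "C", "B"]                      # order induced by f
rank_h = ["B", "A", "C"]                      # order induced by h
ideal  = ["A", "B", "C"]                      # order induced by g

def dcg(order):
    return sum(g[d] / math.log2(i + 2) for i, d in enumerate(order))

ndcg_f = dcg(rank_f) / dcg(ideal)             # ~0.99
ndcg_h = dcg(rank_h) / dcg(ideal)             # ~0.93
print(ndcg_f, ndcg_h)                         # f is indeed closer to g
```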

Listwise Loss Functions. From a ranked list to a permutation probability distribution P(π | f): permutations are a more informative representation of the ranked list, and permutations and ranked lists are in one-to-one correspondence.

Defining Permutation Probability. The probability of a permutation π is defined with the Plackett-Luce model:

$$P_{PL}(\pi \mid f) = \prod_{j=1}^{m} \frac{\exp\big(f(x_{\pi(j)})\big)}{\sum_{k=j}^{m} \exp\big(f(x_{\pi(k)})\big)}$$

Example:

$$P_{PL}(A \succ B \succ C \mid f) = \frac{\exp f(A)}{\exp f(A) + \exp f(B) + \exp f(C)} \cdot \frac{\exp f(B)}{\exp f(B) + \exp f(C)} \cdot \frac{\exp f(C)}{\exp f(C)}$$

that is, P(A ranked No.1) × P(B ranked No.2 | A ranked No.1) × P(C ranked No.3 | A ranked No.1, B ranked No.2), where P(B ranked No.2 | A ranked No.1) = P(B ranked No.1) / (1 − P(A ranked No.1)).
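A small sketch of this definition in code (the score values are taken from the earlier example and are illustrative):

```python
import math

scores = {"A": 3.0, "B": 0.0, "C": 1.0}       # the model f from the earlier example

def plackett_luce(perm, scores):
    """P_PL(perm | f): product over positions of exp(score) divided by the
    exp-scores of the documents not yet ranked."""
    remaining, prob = list(perm), 1.0
    for doc in perm:
        z = sum(math.exp(scores[d]) for d in remaining)
        prob *= math.exp(scores[doc]) / z
        remaining.remove(doc)
    return prob

print(plackett_luce(("A", "C", "B"), scores))  # the most likely permutation under f
print(plackett_luce(("B", "A", "C"), scores))  # a much less likely permutation
```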

Distance between Ranked Lists. Using KL divergence to measure the difference between the distributions: dis(f,g) = 0.46, dis(g,h) = 2.56.

K-L Divergence Loss. ListNet (ICML 2007): $L(f; x, g) = D\big(P_g \,\|\, P_{PL}(f(x))\big)$. ListMLE (ICML 2008), an efficient variant of ListNet: take $P_g(\pi) = 1$ if $\pi = \pi_g$ and $0$ otherwise, so that the loss becomes $L(f; x, g) = -\log P_{PL,\pi_g}(f(x))$. ListNet and ListMLE are regarded as among the most effective learning to rank algorithms.
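A sketch of the ListMLE loss under these definitions (illustrative score values; the ground-truth permutation is the one induced by g in the earlier example):

```python
# ListMLE loss: negative log Plackett-Luce likelihood of the ground-truth
# permutation pi_g under the model's scores.
import math

def listmle_loss(pi_g, scores):
    """L(f; x, g) = -log P_PL(pi_g | f)."""
    remaining, log_p = list(pi_g), 0.0
    for doc in pi_g:
        z = sum(math.exp(scores[d]) for d in remaining)
        log_p += scores[doc] - math.log(z)
        remaining.remove(doc)
    return -log_p

pi_g = ("A", "B", "C")                               # ground-truth order from g
print(listmle_loss(pi_g, {"A": 3, "B": 0, "C": 1}))  # model f: loss ~ 1.5
print(listmle_loss(pi_g, {"A": 4, "B": 6, "C": 3}))  # model h: loss ~ 2.2
```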

Other Work on Listwise Ranking. Listwise loss functions: AdaRank, a boosting approach to listwise ranking (SIGIR 2007); PermuRank, a structured SVM approach to listwise ranking (SIGIR 2008). Listwise ranking functions: C-CRF, which defines a listwise ranking function using conditional random fields (NIPS 2008); R-RSVM, which defines a listwise ranking function using a relational SVM (WWW 2008).

Listwise Ranking Has Become an Important Branch of Learning to Rank.

Statistical Learning Theory for Ranking

Why Theory? In practice, one can only observe experimental results on relatively small datasets. Such empirical results might not be reliable: a small training set cannot fully realize the potential of a learning algorithm, and a small test set cannot reflect the true performance of an algorithm, since the real query space is huge. Statistical learning theory analyzes the performance of an algorithm when the training data approaches infinity and the test data is randomly sampled.

Generalization Analysis. In the training phase, one learns a model by minimizing the empirical risk on the training data. In the test phase, one evaluates the expected risk of the model on any sample. Generalization analysis is concerned with bounding the difference between the expected and the empirical risk as the number of training examples approaches infinity.

Generalization in Learning to Rank (diagram). Training process: minimize a loss on finite data (e.g., the likelihood loss) over n training queries, each with its labeled documents (Doc 1, Label 1, ..., Doc m, Label m), to obtain a ranking model. Test process: the model is evaluated by a measure on infinite data, e.g., (1-NDCG), over queries and web documents. Can this process generalize? That is, does Test Measure ≤ Training Loss + ε(n, m, F) hold?

How to Get There. The target bound, Test Measure ≤ Training Loss + ε(n, m, F), is obtained in two steps: (1) Test Loss ≤ Training Loss + ε, and (2) Test Measure ≤ Test Loss.

(1) Test Loss ≤ Training Loss + ε(n, m, F)? This is generalization in terms of the loss. To perform this generalization analysis, we need to make probabilistic assumptions on the data generation.

Previous Assumptions: Document Ranking (Agarwal et al., 2005; Clemencon et al., 2007). Documents and their labels (Doc 1, Label 1, ..., Doc m, Label m) are assumed to be sampled directly, with no notion of query. But the test is conducted at the query level in learning to rank, and under this assumption deep and shallow training sets correspond to the same generalization ability.

Previous Assumptions: Subset Ranking (Lan et al., 2008; Lan et al., 2009). Queries are sampled, and each query is represented by a deterministic subset of m documents and their labels. But the document subsets are deterministic, whereas in practice training documents are sampled and different numbers of training documents lead to different performance of the ranking model; under this assumption, more training documents will not enhance, and may even hurt, the generalization ability.

Two-layer Sampling (NIPS 2010). Different from document ranking, there is sampling of queries, and the documents associated with different queries are sampled according to different distributions. Different from subset ranking, the sampling of documents (feature vectors and labels) for each query is considered. Elements in two-layer sampling are neither independent nor identically distributed.

Two-layer Generalization Bound (diagram): Test Loss ≤ Training Loss + ε(n, m, F). Decomposition: the two-layer error is split into a query-layer error and a doc-layer error, the latter conditioned on the query sample. Concentration: introducing ghost query samples with fixed-size pseudo document samples, and a ghost document sample for each query, bounds these errors via doc-layer and query-layer reduced two-layer Rademacher averages, which together control the two-layer Rademacher average.

Discussion: Deep or Shallow? With a budget to label only C documents, there is an optimal tradeoff between the number of queries n and the number of documents per query m. For example, when the ranking function class satisfies certain complexity conditions, the optimal tradeoff can be given in closed form.

(2) Ranking Measure ≤ Loss Function?

Loss Function vs. Ranking Measure. The loss function in ListMLE, $L(f; x, \pi_y) = -\log P_{PL,\pi_y}(f(x))$, is based on the scores produced by the ranking model. In contrast, (1-NDCG), where NDCG (Normalized Discounted Cumulative Gain) combines cumulated gains, position discounts, and a normalization, is based on the ranked list obtained by sorting the scores.
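For reference, one standard form of NDCG at position k (the exact gain and discount functions vary across papers; this particular choice is an assumption, not taken from the slide):

$$\mathrm{NDCG@}k = \frac{1}{Z_k} \sum_{j=1}^{k} \frac{2^{y_{\pi(j)}} - 1}{\log_2(j+1)},$$

where $\pi$ is the ranked list obtained by sorting the scores, $2^{y_{\pi(j)}} - 1$ is the gain of the document at position $j$ (with $y_{\pi(j)}$ its relevance grade), $1/\log_2(j+1)$ is the position discount, and $Z_k$ normalizes by the DCG of the ideal ranking.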

Challenge. The relationship between the loss function and the ranking measure is unclear, due to their different mathematical forms. In contrast, for classification both the loss functions and the evaluation measures are defined with respect to individual documents, and their relationship is clear.

Essential Loss for Ranking (NIPS 2009). Model ranking as a sequence of classifications. With ground-truth permutation y = (A, B, C, D), the prediction of the ranking function f proceeds step by step: output the document with the largest ranking score from {A, B, C, D}, then from {B, C, D}, then from {C, D}. The essential loss is the weighted classification error accumulated over the steps in this sequence.

Essential Loss vs. Ranking Measures. 1) Both (1-NDCG) and (1-MAP) are upper bounded by the essential loss. 2) The zero value of the essential loss is a necessary and sufficient condition for the zero values of (1-NDCG) and (1-MAP).

Essential Loss vs. Surrogate Losses. 1) Many pairwise and listwise loss functions are upper bounds of the essential loss. 2) Therefore, the pairwise and listwise loss functions are also upper bounds of (1-NDCG) and (1-MAP).

Learning Theory for Ranking. (1) + (2) build the foundation of a statistical learning theory for ranking: a guarantee on the test performance (in terms of the ranking measure) given the training performance (in terms of the loss function). Many people have started to look into this important field, inspired by our work.

Summary and Outlook

Learning to Rank is Really Hot! Hundreds of publications at SIGIR, ICML, NIPS, etc.; several benchmark datasets released; one or two sessions at SIGIR every year in recent years; several workshops at SIGIR, ICML, NIPS, etc.; several tutorials at SIGIR, WWW, ACL, etc.; a special issue of the IR Journal; the Yahoo! Learning to Rank Challenge; several books published on the topic.

Wide Applications of Learning to Rank. Document retrieval, question answering, multimedia retrieval, text summarization, online advertising, collaborative filtering, machine translation.

Future Work: Challenges of Theories. Tighter generalization bounds and convergence rates; statistical consistency; coverage of more learning to rank algorithms; sample selection bias.

Future Work: Challenges from Real Applications. Large-scale learning to rank; robust learning to rank; online, incremental, and active learning to rank; transfer learning to rank; structural learning to rank (diversity, whole-page relevance).

References

tyliu@microsoft.com  http://research.microsoft.com/people/tyliu/  http://weibo.com/tieyanliu/