Ranking
Florida State University
November 17, 2017
Framework
1. Ranking in our life
2. Early work: the Bradley-Terry model, with examples
3. Web page search: ranking models, data structure, and data analysis with machine learning algorithms
Part I: Ranking in our life
Rankings in our life: Wine
If I have K brands of wine of some type, like sauvignon blanc, how do I rank them?
Ranking in sports: Tennis
Who is the best tennis player this year? ATP rankings, WTA rankings.
Tennis players [figure slide]
Web page search [figure slide]
Ranking models: overview
Machine learning and statistical models for ranking.
Bradley-Terry model: ranking based on pairwise evaluations.
Web search model: ranking the web pages/documents according to their relevance to a query.
Part II: The Bradley-Terry model
The Bradley-Terry model
Let P_ab denote the probability that a is preferred to b. Suppose P_ab + P_ba = 1 for all pairs; that is, we assume a tie cannot occur.

The Bradley-Terry model:
    log(P_ab / P_ba) = log(P_ab / (1 - P_ab)) = β_a - β_b,
or, equivalently,
    P_ab = exp(β_a) / (exp(β_a) + exp(β_b)).
Thus P_ab = 1/2 when β_a = β_b, and P_ab > 1/2 when β_a > β_b.
With I items, the residual df is C(I, 2) - (I - 1).
The Bradley-Terry model: fitting
Assumption: the repeated evaluations of each pair are independent, and the evaluations of different pairs are also independent. We can then use logistic regression methods to fit the model.
Example 1: Major League Baseball rankings
Table: Results of 2011 Season for American League (Eastern Division) Baseball Teams

                          Losing Team
Winning Team   Boston  New York  Tampa Bay  Toronto  Baltimore
Boston            -       12         6        10        10
New York          6        -         9        11        13
Tampa Bay        12        9         -        12         9
Toronto           8        7         6         -        12
Baltimore         8        5         9         6         -

Data source: Agresti, Alan, and Maria Kateri. Categorical Data Analysis.
Fitting the Bradley-Terry model
Table: Results of Fitting Bradley-Terry Model to Baseball Data

Team        Winning Percentage   β̂_a     SE
Boston            52.8           0.454   0.304
New York          54.2           0.499   0.305
Tampa Bay         58.3           0.635   0.307
Toronto           45.8           0.229   0.303
Baltimore         38.9           0.000     -
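These maximum-likelihood estimates can be reproduced with a short NumPy sketch. The fitting method here (plain gradient ascent on the Bradley-Terry log-likelihood, with Baltimore as the reference team) is one simple choice; a logistic-regression routine would give the same fit.

```python
import numpy as np

teams = ["Boston", "New York", "Tampa Bay", "Toronto", "Baltimore"]
# W[i, j] = number of times team i beat team j (2011 AL East data from the table)
W = np.array([
    [ 0, 12,  6, 10, 10],
    [ 6,  0,  9, 11, 13],
    [12,  9,  0, 12,  9],
    [ 8,  7,  6,  0, 12],
    [ 8,  5,  9,  6,  0],
], dtype=float)

def fit_bradley_terry(W, lr=0.01, n_iter=5000):
    """Maximum-likelihood fit of the Bradley-Terry strengths by gradient ascent."""
    n = W.shape[0]
    beta = np.zeros(n)
    for _ in range(n_iter):
        expb = np.exp(beta)
        # P[i, j] = P(i beats j) under the current parameters
        P = expb[:, None] / (expb[:, None] + expb[None, :])
        N = W + W.T                      # games played between each pair
        grad = (W - N * P).sum(axis=1)   # score function for each beta_i
        beta += lr * grad
    return beta - beta[-1]               # identifiability: set Baltimore's beta to 0

beta_hat = fit_bradley_terry(W)
for t, b in zip(teams, beta_hat):
    print(f"{t:10s} {b:6.3f}")
```

The shift at the end fixes the usual identifiability issue: only differences β_a − β_b are determined, so one β is pinned to 0, matching the table.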
R output for the baseball data [screenshot slide]
Example 2: Tennis player rankings
Table: Head-to-head records of players in the ATP top 20 (updated to 10/30/2017)

                                 Loser
Winner              Nadal  Federer  Murray  Djokovic  Wawrinka
Rafael Nadal          -      23       17       24        16
Roger Federer        15       -       14       22        20
Andy Murray           7      11        -       11        10
Novak Djokovic       26      23       25        -        20
Stanislas Wawrinka    3       3        8        5         -

Data source: http://www.tennisabstract.com/
R output for the tennis data [screenshot slide]
Model extension: home-team advantage
Let P_ab denote the probability that team a beats team b when a is the home team. Consider the logistic model
    log(P_ab / (1 - P_ab)) = α + β_a - β_b.
When α > 0, a home-field advantage exists.
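Inverting the logit above gives the home team's win probability directly; a minimal sketch (the α and β values below are made up for illustration):

```python
import math

def p_home_win(alpha, beta_a, beta_b):
    """P(a beats b | a at home) under log(P/(1-P)) = alpha + beta_a - beta_b."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta_a - beta_b)))

# Equally strong teams: any edge comes entirely from the home advantage alpha.
print(p_home_win(0.3, 0.5, 0.5))   # > 0.5 since alpha > 0
print(p_home_win(0.0, 0.5, 0.5))   # exactly 0.5: no home advantage
```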
Part III: Web page search ranking
Outline:
- Problem description and model introduction
- Global vs. subset ranking
- Data structure: query-webpage pairs and feature vectors
- Example: a learning-to-rank competition
- Data analysis: GBRT, iGBRT
Problem description
In general, we want to rank a set of documents/web pages according to their relevance to a given query. In the machine learning community, learning to rank is treated as a supervised learning problem.
The machine learning framework
Extract a feature vector x for each query-document pair, and train a scoring function h(x). Rank the documents/web pages by the value of h(x): a document x with a larger value of h(x) is ranked higher.
Global and subset ranking
In the aforementioned model, given a query, we would rank all the documents in the training dataset. In applications, however, we do not need to rank all documents for a given query. Thus, the subset ranking model is more common in practice.
Subset model: web search example
Filtering procedure: when the search engine receives a query, it first uses a simple algorithm for initial filtering, which limits the candidates to an initial pool {p_j} of size m (e.g., m = 100000), where p_j is a returned web page, j = 1, 2, ..., m. After this initial filtering, the system uses a more sophisticated algorithm to reorder the pages in the pool.
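The two-stage procedure can be sketched as follows; the random scores stand in for both the cheap filter and the expensive model h(x), and the pool size is shrunk from the slide's m = 100000 purely to keep the example small:

```python
import numpy as np

rng = np.random.default_rng(0)
corpus_scores_cheap = rng.normal(size=1000)   # stand-in for the simple filter's scores
m = 100                                        # pool size (the slide uses m = 100000)

# Stage 1: the cheap filter keeps the top-m candidates as the pool {p_j}.
pool = np.argsort(-corpus_scores_cheap)[:m]

# Stage 2: an expensive model h(x) re-scores only the pool and reorders it.
h_scores = rng.normal(size=m)                  # stand-in for h(x) on the pool
reranked = pool[np.argsort(-h_scores)]

print(len(pool), len(reranked))
```

The point of the design is cost: the expensive ranker touches only m documents, not the whole corpus.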
Data Structure
Table: An example of the training data structure

Query   Document   Feature Vector   Score
q1      p1         x_1^1            y_1^1
        p2         x_2^1            y_2^1
        ...        ...              ...
        p_m1       x_m1^1           y_m1^1
q2      p1         x_1^2            y_1^2
        p2         x_2^2            y_2^2
        ...        ...              ...
        p_m2       x_m2^2           y_m2^2
q3      ...
Data description
The training data can be formally represented as {(x_j^q, y_j^q)}, where q = 1, ..., n indexes the queries and j = 1, ..., m_q indexes the documents for query q. Here x_j^q in R^p is a p-dimensional feature vector for the pair of query q and its j-th document, and y_j^q is the relevance label for x_j^q.
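Concretely, this layout can be held as a per-query container of (feature vector, label) pairs; the queries, feature values, and labels below are made up purely to illustrate the shapes:

```python
import numpy as np

# Hypothetical toy data: n = 2 queries, p = 3 features per query-document pair.
# train[q] is the list of (x_qj, y_qj) pairs for query q.
rng = np.random.default_rng(0)
train = {
    "q1": [(rng.normal(size=3), 2), (rng.normal(size=3), 0), (rng.normal(size=3), 4)],
    "q2": [(rng.normal(size=3), 1), (rng.normal(size=3), 3)],
}

n = len(train)                                    # number of queries
m = {q: len(docs) for q, docs in train.items()}   # m_q: documents per query q
p = len(train["q1"][0][0])                        # feature dimension

print(n, m, p)
```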
Data description: features
The main categories of the features x: web graph, document statistics, document classifier, query, text match, topical matching, click, external references, time. The grade y indicates the degree of relevance of the document to its corresponding query; for example, each grade can be an element of the ordinal set {perfect, excellent, good, fair, bad} and is labeled by human editors.
Example: datasets in the learning-to-rank competition
Table: Datasets released for the challenge

                  dataset1                   dataset2
            Train    Valid   Test      Train   Valid   Test
Queries     19944    2994    6983      1266    1266    3798
Documents   473134   71083   165660    34815   34881   103174
Features    519                        596

Table: Distribution of relevance labels

Grade      Label   dataset1   dataset2
Perfect      4       1.67%      1.89%
Excellent    3       3.88%      7.67%
Good         2      22.30%     28.55%
Fair         1      50.22%     35.80%
Bad          0      21.92%     26.09%
Example: datasets in the learning-to-rank competition
Figure: The number of documents associated with each query.
Data analysis: overview
Many learning-to-rank algorithms exist. Generally, current algorithms can be divided into three categories, according to the objective function they optimize:
- Pointwise: a regression loss or a classification loss.
- Pairwise: a pairwise loss function,
    sum over q, sum over {i, j : y_i^q > y_j^q} of l(h(x_i^q) - h(x_j^q)).
- Listwise: the loss function is defined over all the documents associated with a query.
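The pairwise objective can be made concrete with a small sketch. The slide leaves the loss l unspecified, so the common logistic surrogate l(t) = log(1 + e^(-t)) is assumed here, and the scores and labels are toy values:

```python
import numpy as np

def pairwise_loss(h, y):
    """Sum of l(h_i - h_j) over all pairs with y_i > y_j within one query,
    using the logistic surrogate l(t) = log(1 + exp(-t))."""
    loss = 0.0
    for i in range(len(y)):
        for j in range(len(y)):
            if y[i] > y[j]:
                loss += np.log1p(np.exp(-(h[i] - h[j])))
    return loss

h = np.array([2.0, 1.0, 0.0])   # scores for three documents of one query
y = np.array([2, 1, 0])          # relevance labels: already correctly ordered
print(pairwise_loss(h, y))
```

Note the loss keeps shrinking as correctly ordered scores are pushed further apart, which is what the surrogate is designed to encourage.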
Model evaluation criteria
The Discounted Cumulative Gain (DCG) has been widely used to assess relevance in the context of search engines. A simple variant of DCG:
    DCG_m = sum_{j=1}^{m} G_j / log(j + 1),
where G_j is the gain assigned to the label of the document at position j. Expected reciprocal rank (ERR) and NDCG are also used as evaluation criteria.
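A sketch of this DCG variant; the slide leaves G_j generic, so the common web-search choice G_j = 2^label - 1 with a base-2 logarithm is assumed here:

```python
import numpy as np

def dcg(labels, m=None):
    """DCG_m = sum_{j=1}^m G_j / log2(j + 1), with the assumed gain G_j = 2^label - 1.
    `labels` lists the true grades of the documents in predicted rank order."""
    labels = np.asarray(labels, dtype=float)[:m]
    j = np.arange(1, len(labels) + 1)
    return float(np.sum((2.0 ** labels - 1.0) / np.log2(j + 1)))

# Same four documents (grades 0..4) in a good vs. a bad predicted order:
print(dcg([4, 2, 0, 1]))
print(dcg([0, 1, 2, 4]))
```

Because the discount 1/log2(j + 1) decays with position, placing highly relevant documents first yields the larger DCG.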
Data analysis: learning algorithms
In this presentation, we focus on pointwise methods. Gradient Boosted Regression Trees (GBRT) are a very powerful tool for web search ranking. iGBRT: initialized GBRT, where the initial residuals r_i = y_i - F(x_i) come from a Random Forests estimator F(x_i), and GBRT is then run on these residuals.
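A minimal NumPy sketch of the boosting idea: regression stumps fit to the current residuals under squared loss. In iGBRT the initial estimator F would be a random forest; here `init` is any callable and defaults to zero, purely to keep the sketch self-contained, so this is an illustration of the residual-fitting mechanism rather than the full method.

```python
import numpy as np

def fit_stump(X, r):
    """Best single-split regression stump (feature, threshold, leaf means) for residuals r."""
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f]):
            left = X[:, f] <= t
            if left.all() or (~left).all():
                continue
            lm, rm = r[left].mean(), r[~left].mean()
            sse = ((r[left] - lm) ** 2).sum() + ((r[~left] - rm) ** 2).sum()
            if best is None or sse < best[0]:
                best = (sse, f, t, lm, rm)
    _, f, t, lm, rm = best
    return lambda X: np.where(X[:, f] <= t, lm, rm)

def gbrt(X, y, n_trees=50, lr=0.1, init=None):
    """Gradient boosting with stumps on squared loss. iGBRT would pass the
    random-forest predictor as `init`; the default starts from zero."""
    F = init(X) if init else np.zeros(len(y))
    for _ in range(n_trees):
        h = fit_stump(X, y - F)   # each round fits the current residuals
        F = F + lr * h(X)
    return F

rng = np.random.default_rng(1)
X = rng.uniform(size=(60, 2))
y = (X[:, 0] > 0.5).astype(float) * 2 + X[:, 1]   # toy target: step plus linear part
F = gbrt(X, y)
print(np.mean((y - F) ** 2))
```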
Regression vs. Classification
In practice, classification often works better than regression. Instead of training a function h(x_i) ≈ y_i, we construct binary classification problems, c = 1, 2, 3, 4, for y in {0, 1, 2, 3, 4}. The c-th problem predicts whether the document is less relevant than c, i.e., y_i < c. For each binary problem we train a classifier h_c(·) estimating h_c(x) = P(rel(x) < c). We also define h_0(·) = 0 and h_5(·) = 1. Thus, we can combine the classifiers h_0, h_1, ..., h_5 to compute the probability of each class.
Regression vs. Classification
In our example, we compute the probability that a document x_i has relevance r in {0, 1, 2, 3, 4}:
    P(rel(x_i) = r) = P(rel(x_i) < r + 1) - P(rel(x_i) < r) = h_{r+1}(x_i) - h_r(x_i).
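With cumulative outputs h_0, ..., h_5 in hand, the differencing step is a one-liner; the cumulative values below are made-up classifier outputs for a single document:

```python
import numpy as np

# Hypothetical cumulative classifier outputs for one document:
# h[c] = P(rel(x) < c) for c = 0..5, with h[0] = 0 and h[5] = 1 by definition.
h = np.array([0.0, 0.10, 0.35, 0.70, 0.95, 1.0])

# P(rel = r) = h[r+1] - h[r] for r = 0..4
p = np.diff(h)
print(p, p.sum())
```

The boundary definitions h_0 = 0 and h_5 = 1 guarantee the five probabilities sum to one.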
Results of the data analysis
Table: Performance of GBRT, RF, and iGBRT, evaluated in ERR and NDCG.

ERR
Method   Regr./Class.   dataset1   dataset2
GBRT          R         0.45304    0.45669
RF            R         0.46349    0.46212
iGBRT         R         0.46301    0.46303
GBRT          C         0.45448    0.46008
RF            C         0.46308    0.46200
iGBRT         C         0.46360    0.46246

NDCG
Method   Regr./Class.   dataset1   dataset2
GBRT          R         0.76991    0.76587
RF            R         0.79575    0.77552
iGBRT         R         0.79575    0.77725
GBRT          C         0.77246    0.77132
RF            C         0.79544    0.77373
iGBRT         C         0.79672    0.77591
Future study
Many important topics are not covered in this presentation:
- Algorithms for pairwise and listwise ranking.
- Other ranking models, such as recommendation systems.
- Theory of learning to rank.
- Online learning, etc.
Thank you!

References
Chapelle, Olivier, and Yi Chang. "Yahoo! Learning to Rank Challenge overview." Proceedings of the Learning to Rank Challenge, 2011.
Cossock, David, and Tong Zhang. "Statistical analysis of Bayes optimal subset ranking." IEEE Transactions on Information Theory 54.11 (2008): 5140-5154.
Chapelle, Olivier, Yi Chang, and Tie-Yan Liu. "Future directions in learning to rank." Proceedings of the Learning to Rank Challenge, 2011.
Li, Ping, Qiang Wu, and Christopher J. Burges. "McRank: Learning to rank using multiple classification and gradient boosting." Advances in Neural Information Processing Systems, 2008.
Mohan, Ananth, Zheng Chen, and Kilian Weinberger. "Web-search ranking with initialized gradient boosted regression trees." Proceedings of the Learning to Rank Challenge, 2011.
Zheng, Zhaohui, et al. "A general boosting method and its application to learning ranking functions for web search." Advances in Neural Information Processing Systems, 2008.