One-Pass Ranking Models for Low-Latency Product Recommendations


1 One-Pass Ranking Models for Low-Latency Product Recommendations Martin MIT (Amazon Berlin)

2 One-Pass Ranking Models for Low-Latency Product Recommendations Amazon Machine Learning Team, Berlin Antonino Freno Rodolphe Jenatton Cédric Archambeau

10 Product Recommendations: Constraints
1. Large # of examples, large # of features: requires a small memory footprint
2. Drifting distribution: requires fast training time
3. Real-time ranking (< a few ms): requires low prediction latency

13 Our approach
Requirements for product recommendations: small memory footprint, fast training time, low prediction latency
Techniques: stochastic optimization, one-pass learning, sparse models

16 Learning Ranking Functions
Three broad families of models:
1. Pointwise (logistic regression)
2. Pairwise (RankSVM)
3. Listwise (ListNet)
Loss functions:
- Evaluation functions (NDCG)
- Surrogate functions
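
As an illustration of the evaluation function named above, NDCG can be sketched in a few lines of Python. This is a minimal sketch, not the deck's own code: the exponential gain `2**rel - 1` and the `log2` position discount are one common convention and are assumed here, and the function names `dcg` and `ndcg` are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in ranked order."""
    # Assumed convention: gain 2**rel - 1, discount log2(position + 1), positions from 1.
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """NDCG@k: DCG of the given order divided by the DCG of the ideal order."""
    k = len(relevances) if k is None else k
    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0
```

Because NDCG depends on the scores only through the sort order, it is piecewise constant in the model weights, which is why a surrogate function is optimized in its place.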

22 Loss Function: LambdaRank (Burges et al., 2007)
Example: Product 1, Product 2, Product 3, Product 4, with a pair of products i and j
X: features $x_1, x_2, x_3, x_4$
r: ground-truth rank
Importance of sorting i and j correctly: $\Delta M = M(r) - M(r_{i/j})$
Difference in scores: $\Delta S = \max\{0, w^T x_j - w^T x_i\}$
Loss: $L(X; w) = \sum_{r_i \preceq r_j} \Delta M \cdot \Delta S$
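
The loss on this slide can be sketched in NumPy as follows. This is a hedged illustration under stated assumptions rather than the authors' implementation: the metric M is taken to be DCG computed from per-item relevance grades `rel` (the slide does not say which metric is used), the ground-truth ranking r is the items sorted by decreasing relevance, and r_{i/j} is that ranking with items i and j swapped. The names `dcg` and `lambda_rank_loss` are hypothetical.

```python
import numpy as np


def dcg(rels_in_order):
    """DCG of relevance grades listed in ranked order (assumed choice of metric M)."""
    positions = np.arange(len(rels_in_order))
    return float(np.sum((2.0 ** rels_in_order - 1.0) / np.log2(positions + 2.0)))


def lambda_rank_loss(X, rel, w):
    """Sum over pairs (i ranked above j) of delta_M * delta_S for a linear scorer w."""
    order = np.argsort(-rel, kind="stable")  # ground-truth ranking r
    scores = X @ w
    base = dcg(rel[order])                   # M(r)
    loss = 0.0
    for a in range(len(order)):
        for b in range(a + 1, len(order)):
            i, j = order[a], order[b]
            swapped = order.copy()
            swapped[a], swapped[b] = order[b], order[a]
            delta_m = base - dcg(rel[swapped])          # M(r) - M(r_{i/j})
            delta_s = max(0.0, scores[j] - scores[i])   # hinge on the score gap
            loss += delta_m * delta_s
    return loss
```

A pair contributes only when the model scores j above i even though i is ranked higher, and the contribution is weighted by how much swapping the two would change the metric.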

26 ElasticRank: Introducing Sparsity
Adding $\ell_1$ and $\ell_2$ penalties: $L^{*}(X; w) = L(X; w) + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2$
- Both $\lambda_1$ and $\lambda_2$ control model complexity
- $\lambda_1$ trades off sparsity and performance
- $\lambda_2$ adds strong convexity and improves convergence
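
One way to see why the l1 term produces exact zeros is through the proximal operator of the combined penalty. The sketch below assumes the penalty has the form lam1 * ||w||_1 + lam2 * ||w||_2^2 with no factor of 1/2 on the squared term; the scaling used in the original work may differ, and the function names are hypothetical.

```python
import numpy as np


def elastic_net_penalty(w, lam1, lam2):
    """Value of the assumed penalty lam1 * ||w||_1 + lam2 * ||w||_2^2."""
    return lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)


def elastic_net_prox(v, step, lam1, lam2):
    """argmin_w 0.5 * ||w - v||^2 + step * elastic_net_penalty(w, lam1, lam2)."""
    # Soft-thresholding (from the l1 term) sets small coordinates exactly to zero.
    shrunk = np.sign(v) * np.maximum(np.abs(v) - step * lam1, 0.0)
    # The l2 term only rescales the surviving coordinates.
    return shrunk / (1.0 + 2.0 * step * lam2)
```

For example, with `step=1`, `lam1=0.1`, and `lam2=0.5`, the vector `[0.05, -0.3, 1.0]` maps to `[0, -0.1, 0.45]`: the smallest coordinate is removed and the others shrink toward zero.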

30 Optimization Algorithms: Extensions of Stochastic Gradient Descent
FOBOS: Forward-Backward Splitting (Duchi, 2009)
1. Gradient step
2. Proximal step involving the regularization
RDA: Regularized Dual Averaging (Xiao, 2010)
- Keeps a running average of all past gradients
- Solves a proximal step using the average
PSGD: Pruned Stochastic Gradient Descent
- Prunes every k gradient steps
- If $|w_i| < \theta$, then $w_i = 0$
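
The three update rules on this slide can be sketched as below. These are minimal sketches under stated assumptions, not the implementations evaluated in the deck: the penalty is assumed to be lam1 * ||w||_1 + lam2 * ||w||_2^2, the RDA step uses the auxiliary term (gamma / sqrt(t)) * 0.5 * ||w||^2 following the l1-RDA closed form, and `theta`, `gamma`, `grad_fn`, and all function names are hypothetical.

```python
import numpy as np


def soft_threshold(v, t):
    """Shrink each coordinate toward zero by t, clipping at zero."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def fobos_step(w, grad, step, lam1, lam2):
    """FOBOS: a gradient step followed by a proximal step on the penalty."""
    v = w - step * grad
    return soft_threshold(v, step * lam1) / (1.0 + 2.0 * step * lam2)


def rda_step(grad_avg, t, lam1, lam2, gamma):
    """RDA: closed-form minimizer built from the running average of gradients."""
    return -soft_threshold(grad_avg, lam1) / (2.0 * lam2 + gamma / np.sqrt(t))


def psgd(grad_fn, examples, dim, step, k, theta):
    """Pruned SGD: plain gradient steps, zeroing small weights every k steps."""
    w = np.zeros(dim)
    for t, example in enumerate(examples, start=1):
        w = w - step * grad_fn(w, example)
        if t % k == 0:
            w[np.abs(w) < theta] = 0.0
    return w
```

In a one-pass setting each function is called once per example in stream order; for RDA the caller maintains the average with `grad_avg += (grad - grad_avg) / t`.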

31 Hyper-parameter Optimization
Turn-key inference: automatic adjustment of hyper-parameters
Bayesian approach (Snoek, Larochelle, Adams; 2012): Gaussian process, Thompson sampling
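
To make the combination of a Gaussian process with Thompson sampling concrete, the sketch below searches a one-dimensional grid of candidate hyper-parameter values. It is an illustration under assumptions, not the authors' tuning procedure: the RBF kernel, its length scale, the fixed noise level, the grid search space, and the function names are all hypothetical choices, and `objective` stands for any validation loss to be minimized.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.3):
    """Squared-exponential kernel between two 1-D arrays of points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def thompson_sampling_search(objective, candidates, n_iter=15, noise=1e-4, seed=0):
    """Minimize objective over a 1-D candidate grid with GP Thompson sampling."""
    rng = np.random.default_rng(seed)
    tried = [int(rng.integers(len(candidates)))]
    values = [float(objective(candidates[tried[0]]))]
    for _ in range(n_iter - 1):
        x = candidates[tried]
        y = np.array(values)
        y_mean = y.mean()
        # GP posterior over the whole grid given the evaluations so far.
        K = rbf_kernel(x, x) + noise * np.eye(len(x))
        K_s = rbf_kernel(candidates, x)
        mean = y_mean + K_s @ np.linalg.solve(K, y - y_mean)
        cov = rbf_kernel(candidates, candidates) - K_s @ np.linalg.solve(K, K_s.T)
        cov = 0.5 * (cov + cov.T) + 1e-8 * np.eye(len(candidates))
        # Thompson sampling: draw one function from the posterior, evaluate its minimizer.
        sample = rng.multivariate_normal(mean, cov, check_valid="ignore")
        nxt = int(np.argmin(sample))
        tried.append(nxt)
        values.append(float(objective(candidates[nxt])))
    best = int(np.argmin(values))
    return candidates[tried[best]], values[best]
```

A typical use would pass a grid in log space, for instance `np.linspace(-4, 0, 41)` as candidate values of log10 of a regularization weight, together with a function that trains the model and returns its validation loss.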

32 LETOR Experiments: ElasticRank is comparable with state-of-the-art models
[Figure: results on OHSUMED, TD2003, and TD2004 for Logistic Regression, RankSVM, ListNet, and ElasticRank]

33 Amazon.com Experiments: Experimental Setup
# examples: millions
# features: thousands (millions of dimensions)
Purchase logs from a contiguous time interval, split into training, validation, and testing

34 Experimental Results: ElasticRank performs best
[Figure: comparison of RankSVM, ElasticRank PSGD, ElasticRank FOBOS, ElasticRank RDA, and Logistic Regression]

35 Sparsity vs Performance: RDA achieves the best trade-off
[Figure: performance of PSGD, FOBOS, and RDA against number of weights]

36 Prediction Time
[Figure: prediction time in microseconds against number of weights; labeled values include 8.7 μs and 10.9 μs]

37 Contributions
How to learn ranking functions with:
- A single pass
- A small memory footprint
- Sparse models
WITHOUT sacrificing performance

38 References
C. J. C. Burges, R. Ragno, and Q. V. Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems (NIPS), 2007.
J. C. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research (JMLR), 2009.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research (JMLR), 2010.
J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (NIPS), 2012.

