Learning Dense Models of Query Similarity from User Click Logs


Learning Dense Models of Query Similarity from User Click Logs
Fabio De Bona, Stefan Riezler*, Keith Hall, Massi Ciaramita, Amac Herdagdelen, Maria Holmqvist
Google Research, Zürich; *Dept. of Computational Linguistics, University of Heidelberg

Query Rewriting in Large Scale Web Search

Problem:
- Web search: term mismatch between user queries and web docs. Users describe their information need by a few keywords, which are likely to be different from the index terms of the web documents.
- Sponsored search / ads: additional difficulty of matching queries against very few, very short documents.

Task:
- Conjunctive term matching needs to be relaxed by rewriting query terms into new terms with similar statistical properties (generative models for query expansion).
- Candidate rewrites are ranked w.r.t. criteria such as click-through rate or semantic similarity (discriminative models for rewrite ranking).

Discriminative Models for Rewrite Ranking

Rewrite candidates from different sources need to be filtered according to criteria such as click-through rate or semantic similarity = Learning-to-Rank problem: learn a ranking of query rewrites from data that are ranked according to measures of interest.

Task(s):
- Create training data (by sampling from user logs) and test data (by manually labeling a subsample).
- Feature engineering, incorporating complex models of string similarity as dense features.
- Find the most robust learner, i.e., the learner that performs best under various evaluation metrics on clean test data when trained on a noisy training set.

Extracting Weak Labels from Co-click Data

Training data extraction:
- Assume two queries to be related if they lead to a certain amount of user clicks on the same retrieval results (cf. Fitzpatrick & Dent (1997)'s model of query similarity based on the intersection of retrieval results); see the sketch below.
- A threshold of >= 10 co-clicks suffices to find query pairs that are considered similar by human judges.
- Data set of > 1 billion query-rewrite pairs extracted for the experiments.

Test data labeling:
- 100 queries with 30 rewrites each, sampled in descending order of co-clicks.
- Labeling in two steps: rank the rewrites using a GUI, then (re)assign rank labels and a binary relevance score (see Rubenstein & Goodenough (1965)).
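The slides do not spell out an extraction procedure, so the following is only a minimal sketch of the co-click idea, assuming a log of (query, clicked URL) pairs and counting shared clicked results as co-clicks; the schema, names, and counting convention are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

CO_CLICK_THRESHOLD = 10  # threshold reported on the slide

def extract_coclick_pairs(click_log):
    """click_log: iterable of (query, clicked_url) tuples (hypothetical schema).
    Returns query pairs whose number of shared clicked results reaches the threshold."""
    queries_per_url = defaultdict(set)
    for query, url in click_log:
        queries_per_url[url].add(query)

    coclicks = defaultdict(int)
    for queries in queries_per_url.values():
        for q1, q2 in combinations(sorted(queries), 2):
            coclicks[(q1, q2)] += 1          # one shared clicked result

    return {pair: c for pair, c in coclicks.items() if c >= CO_CLICK_THRESHOLD}

# toy log: two queries clicking the same 10 results
log = [(q, f"result-{i}.example.com") for i in range(10) for q in ("apple", "mac os")]
print(extract_coclick_pairs(log))            # {('apple', 'mac os'): 10}
```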

Data Statistics

                                        Train     Dev     Test
Number of queries                       250,000   2,500   100
Average number of rewrites per query    4,500     4,500   30
Percentage positive rewrites per query  0.2       0.2     43

- Train, Dev, and Test sets are sampled from the same user logs data.
- Different percentage of relevant documents per query.
- The co-click threshold of 10 is just sufficient for a significant correlation between human relevance judgments and automatic labeling.

Features

Features are composed of the following building blocks (see the sketch after this list):

Levenshtein distance, based on the following edit operations:
- insertion
- deletion
- substitution
- all

Cost functions for Levenshtein edit operations:
- unit cost for all operations
- character-based edit distance as cost function for the substitution operation
- probabilistic cost functions for substitution = generalized edit distance
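As a rough illustration of these building blocks, here is a generalized edit distance with a pluggable substitution cost (unit costs by default; a probabilistic cost function can be passed in). Function and parameter names are mine, not the paper's.

```python
def generalized_edit_distance(s, t,
                              sub_cost=lambda a, b: 0.0 if a == b else 1.0,
                              ins_cost=1.0, del_cost=1.0):
    """Levenshtein distance over character (or token) sequences with a
    customizable substitution cost, as used for the dense string features."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + del_cost,                            # deletion
                          d[i][j - 1] + ins_cost,                            # insertion
                          d[i - 1][j - 1] + sub_cost(s[i - 1], t[j - 1]))    # substitution
    return d[m][n]

# unit-cost character edit distance
print(generalized_edit_distance("apple", "apples"))  # 1.0
```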

Features, continued

Probabilistic term substitution models based on Pointwise Mutual Information:

$\mathrm{PMI}(w_i, w_j) = \log \frac{p(w_i, w_j)}{p(w_i)\, p(w_j)}$

- Introduced by Church & Hanks (1990) as the word association ratio.
- Negative PMI values occur in rare cases where strings co-occur less frequently than expected at random: $p(w_i, w_j) < p(w_i)\, p(w_j)$.
- Negative PMI values are set to zero in our case.

Features, continued

Normalizations of PMI: positive PMI values are bounded to the range between 0 and 1 by linear rescaling.

Joint normalization:

$\mathrm{PMI}_J(w_i, w_j) = \frac{\mathrm{PMI}(w_i, w_j)}{-\log p(w_i, w_j)}$

Measures the amount of shared information between two strings relative to the sum of the information of the individual strings.

Features, continued

Specialization normalization:

$\mathrm{PMI}_S(w_i, w_j) = \frac{\mathrm{PMI}(w_i, w_j)}{-\log p(w_i)}$

- $\mathrm{PMI}_S$ favors pairs where $w_j$ is a specialization of $w_i$.
- $\mathrm{PMI}_S$ is at its maximum when $p(w_i, w_j) = p(w_j)$, i.e. when $p(w_i \mid w_j) = 1$.

Generalization normalization ($w_j$ generalizes $w_i$):

$\mathrm{PMI}_G(w_i, w_j) = \frac{\mathrm{PMI}(w_i, w_j)}{-\log p(w_j)}$

Features, continued

Examples:
- $\mathrm{PMI}_G$(apple, mac os) = .2917, $\mathrm{PMI}_S$(apple, mac os) = .3686: evidence for specialization.
- $\mathrm{PMI}_G$(ferrari models, ferrari) = 1, $\mathrm{PMI}_S$(ferrari models, ferrari) = .5558: perfect generalization.

PMI values are computed from Web counts.
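A small sketch of these PMI variants computed from raw counts (names are mine; the slides compute the probabilities from Web counts, which are not reproduced here, so the toy counts below are illustrative only).

```python
import math

def pmi_features(count_ij, count_i, count_j, total):
    """PMI and its joint/specialization/generalization normalizations,
    computed from co-occurrence and marginal counts (negative PMI clipped to 0)."""
    p_ij = count_ij / total
    p_i, p_j = count_i / total, count_j / total
    pmi = max(0.0, math.log(p_ij / (p_i * p_j)))   # negative PMI values set to zero
    return {
        "pmi": pmi,
        "pmi_joint": pmi / -math.log(p_ij),        # shared info relative to the joint
        "pmi_spec": pmi / -math.log(p_i),          # = 1 when p(w_i | w_j) = 1
        "pmi_gen": pmi / -math.log(p_j),           # = 1 when p(w_j | w_i) = 1
    }

# toy counts: w_i = "ferrari models" (count 50), w_j = "ferrari" (count 500),
# and w_i never occurs without w_j, so pmi_gen = 1.0 ("perfect generalization")
print(pmi_features(count_ij=50, count_i=50, count_j=500, total=10_000))
```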

Features, continued

Multiword queries:
- original order of query terms, or
- alphabetically sorted bag-of-words

Estimation of the cost matrix:
- relative frequency of session transitions in a query log of 1.3 billion English queries
- smoothed transition probability from a clustering model trained on user logs

Resulting feature set of about 60 dense features.

Learning to Rank Query Rewrites

Various loss functions are optimized in a Stochastic Gradient Descent framework (see the sketch below):

Training data $S = \{(x_q^{(i)}, y_q^{(i)})\}_{i=1}^{n}$, where $x_q = \{x_{q1}, \ldots, x_{q,n(q)}\}$ is a set of rewrites for query $q$, and $y_q = (y_{q1}, \ldots, y_{q,n(q)})$ is a ranking on the rewrites.

Minimize a regularized objective for the training set by stochastic updating:

$\min_w \sum_{(x_q, y_q)} l(w) + \Omega(w)$

$w_{t+1} = w_t - \eta_t\, g_t$, where $g_t = \nabla\big(l(w) + \Omega(w)\big)$
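A minimal sketch of this SGD scheme with an L2 regularizer and a constant learning rate; the query-level gradient functions for the individual losses (see the following slides) are plugged in via `loss_grad`. All names are my own.

```python
import numpy as np

def sgd(loss_grad, queries, dim, eta=0.1, l2=1e-4, epochs=100):
    """Stochastic gradient descent as on the slide: w <- w - eta * g, where g is
    the gradient of the per-query loss plus the gradient of Omega(w) = (l2/2)*||w||^2.
    queries: list of (x_q, y_q) pairs; loss_grad(w, x_q, y_q) -> gradient vector."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x_q, y_q in queries:                  # one stochastic step per query
            g = loss_grad(w, x_q, y_q) + l2 * w   # loss gradient + regularizer gradient
            w = w - eta * g
    return w
```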

Learning to Rank Rewrites, cont.

Conditional log-linear model on the set(!) of relevant queries (Riezler et al. ACL 02), for binary relevance scores (expressed as rank 1 for relevant and rank 2 for non-relevant rewrites):

$l_{llm}(w) = -\log \frac{\sum_{x_{qi} \in x_q:\, y_{qi}=1} e^{\langle w, \phi(x_{qi})\rangle}}{\sum_{x_{qi} \in x_q} e^{\langle w, \phi(x_{qi})\rangle}}$

Gradient:

$\frac{\partial}{\partial w_k} l_{llm}(w) = -E_{p_w}\big[\phi_k \mid x_q;\, y_{qi}=1\big] + E_{p_w}\big[\phi_k \mid x_q\big]$
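One way to write this gradient in code, assuming a per-query feature matrix and binary labels (array shapes and names are my assumptions); it plugs directly into the SGD sketch above.

```python
import numpy as np

def loglinear_grad(w, x_q, y_q):
    """Gradient of the negative conditional log-likelihood of the set of relevant
    rewrites: E_p[phi | x_q] - E_p[phi | x_q, y=1].
    x_q: (n, d) feature matrix of the rewrites; y_q: binary array, 1 = relevant.
    Assumes at least one relevant rewrite per query."""
    scores = x_q @ w
    scores = scores - scores.max()                   # numerical stability
    exp_scores = np.exp(scores)
    p_all = exp_scores / exp_scores.sum()            # model distribution over all rewrites
    rel = y_q == 1
    p_rel = exp_scores[rel] / exp_scores[rel].sum()  # restricted to relevant rewrites
    return p_all @ x_q - p_rel @ x_q[rel]
```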

Learning to Rank Rewrites, cont.

Listwise hinge loss for a prediction loss $L(y_q, \pi_q)$ = MAP (Mean Average Precision) (= SVM-MAP of Yue et al. SIGIR 07):

$l_{lh}(w) = \big( L(y_q, \pi_q^*) - \langle w, \phi(x_q, y_q) - \phi(x_q, \pi_q^*) \rangle \big)_+$

where $\pi_q^* = \arg\max_{\pi_q \in \Pi_q \setminus y_q} L(y_q, \pi_q) + \langle w, \phi(x_q, \pi_q) \rangle$, $(z)_+ = \max\{0, z\}$, and $\phi(x_q, y_q)$ is a partial-order feature map (see Yue et al. 07).

Gradient:

$\frac{\partial}{\partial w_k} l_{lh}(w) = \begin{cases} 0 & \text{if } \langle w, \phi(x_q, y_q) - \phi(x_q, \pi_q^*) \rangle > L(y_q, \pi_q^*) \\ -\big(\phi_k(x_q, y_q) - \phi_k(x_q, \pi_q^*)\big) & \text{else} \end{cases}$
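The loss-augmented argmax over rankings is the nontrivial part of SVM-MAP (Yue et al. 2007 give an efficient algorithm for it, not reproduced here); the subgradient itself is simple once that argmax is available, so the sketch below treats it as a black-box oracle supplied by the caller.

```python
import numpy as np

def listwise_hinge_grad(w, phi_y, loss_aug_argmax):
    """Subgradient of the listwise hinge loss above.
    phi_y: partial-order feature map of the correct ranking, phi(x_q, y_q).
    loss_aug_argmax(w): oracle returning (phi_pi, loss) for the loss-augmented
    argmax ranking pi_q^* (treated as a black box here)."""
    phi_pi, loss = loss_aug_argmax(w)
    if w @ (phi_y - phi_pi) > loss:    # hinge inactive: zero subgradient
        return np.zeros_like(w)
    return -(phi_y - phi_pi)           # hinge active
```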

Learning to Rank Rewrites, cont.

(Margin-rescaled) pairwise hinge loss (Joachims 02; Cortes et al. ICML 07; Agarwal & Niyogi JMLR 09):

$l_{ph}(w) = \sum_{(i,j) \in P_q} \Big( \big(\tfrac{1}{y_{qi}} - \tfrac{1}{y_{qj}}\big) - \langle w, \phi(x_{qi}) - \phi(x_{qj}) \rangle\, \mathrm{sgn}\big(\tfrac{1}{y_{qi}} - \tfrac{1}{y_{qj}}\big) \Big)_+$

where $P_q$ is the set of pairs of rewrites for query $q$ that need to be compared.

Gradient for SGD on the pair level:

$\frac{\partial}{\partial w_k} l_{ph}(w) = \begin{cases} 0 & \text{if } \langle w, \phi(x_{qi}) - \phi(x_{qj}) \rangle\, \mathrm{sgn}\big(\tfrac{1}{y_{qi}} - \tfrac{1}{y_{qj}}\big) > \tfrac{1}{y_{qi}} - \tfrac{1}{y_{qj}} \\ -\big(\phi_k(x_{qi}) - \phi_k(x_{qj})\big)\, \mathrm{sgn}\big(\tfrac{1}{y_{qi}} - \tfrac{1}{y_{qj}}\big) & \text{else} \end{cases}$
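A per-pair version of this subgradient, suitable for SGD on the pair level (assuming rank labels where 1 is the best rank, so the required margin is the difference of inverse ranks).

```python
import numpy as np

def pairwise_hinge_grad(w, phi_i, phi_j, y_i, y_j):
    """Subgradient of the margin-rescaled pairwise hinge for one rewrite pair.
    y_i, y_j: rank labels (1 = best); pairs far apart in the gold ranking
    get a larger required margin 1/y_i - 1/y_j."""
    margin = 1.0 / y_i - 1.0 / y_j
    s = np.sign(margin)
    if (w @ (phi_i - phi_j)) * s > margin:   # ordered correctly with enough margin
        return np.zeros_like(w)
    return -(phi_i - phi_j) * s              # push the pair apart in the right direction
```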

Learning to Rank Rewrites, cont.

- Bipartite pairwise ranking for binary relevances, e.g., co-clicks >= 10 vs. < 10.
- Multipartite pairwise ranking for relevance levels, e.g., the number of co-clicks.

(See the pairing sketch below.)
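An illustrative way to build the comparison set P_q for the two settings; bucketing rewrites by their raw co-click count is just one possible choice of relevance levels.

```python
from itertools import combinations

def ranking_pairs(coclicks, levels=None, threshold=10):
    """Build the comparison set P_q for one query.
    Bipartite: every rewrite with >= threshold co-clicks is paired against every
    rewrite below the threshold. Multipartite: rewrites are bucketed into
    relevance levels and every rewrite is paired against all rewrites at a
    strictly lower level."""
    if levels is None:                       # bipartite case
        levels = [1 if c >= threshold else 0 for c in coclicks]
    pairs = []
    for i, j in combinations(range(len(coclicks)), 2):
        if levels[i] > levels[j]:
            pairs.append((i, j))             # i should be ranked above j
        elif levels[j] > levels[i]:
            pairs.append((j, i))
    return pairs

# bipartite: rewrites with co-click counts [25, 3, 12, 0]
print(ranking_pairs([25, 3, 12, 0]))                          # [(0,1),(0,3),(2,1),(2,3)]
# multipartite: use the raw co-click counts as relevance levels
print(ranking_pairs([25, 3, 12, 0], levels=[25, 3, 12, 0]))
```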

Experimental Evaluation

Baselines:
- Random shuffling of relevant/non-relevant rewrites.
- Single dense feature that performed best on the development set (the clustering-model log-probability used for cost-matrix estimation).

SGD training:
- Constant learning rates $\eta \in \{1, 0.5, 0.1, 0.01, 0.001\}$.
- Each metaparameter setting evaluated on the development set after every fifth of 100 passes over the training set.

Evaluation:
- Evaluated on the manually labeled test set of 100 queries with 30 rewrites each.
- Evaluation metrics: Mean Average Precision (MAP), Normalized Discounted Cumulative Gain (NDCG), Area under the ROC curve (AUC), Precision@n (see the sketch below).
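Minimal reference implementations of the four metrics; the exact NDCG gain/discount variant used in the paper is not specified here, so a standard one is assumed.

```python
import numpy as np

def average_precision(rels):
    """rels: binary relevance of rewrites in ranked order (MAP averages this over queries)."""
    hits, precisions = 0, []
    for k, r in enumerate(rels, start=1):
        if r:
            hits += 1
            precisions.append(hits / k)
    return float(np.mean(precisions)) if precisions else 0.0

def ndcg_at_k(gains, k=10):
    """gains: graded relevance gains in ranked order; standard log2 discount assumed."""
    gains = np.asarray(gains, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(gains) + 2))
    dcg = (gains[:k] * discounts[:k]).sum()
    ideal = (np.sort(gains)[::-1][:k] * discounts[:k]).sum()
    return dcg / ideal if ideal > 0 else 0.0

def precision_at_n(rels, n):
    """Fraction of relevant rewrites among the top n."""
    return float(np.mean(rels[:n]))

def auc(scores, rels):
    """Fraction of (relevant, non-relevant) pairs ordered correctly (ties count half)."""
    pos = [s for s, r in zip(scores, rels) if r]
    neg = [s for s, r in zip(scores, rels) if not r]
    if not pos or not neg:
        return 0.0
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```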

Experimental Results

                          MAP    NDCG@10  AUC    P@1    P@3    P@5
Random                    51.8   48.7     50.4   45.6   45.6   46.6
Best Feature              71.9   70.2     74.5   70.2   68.1   68.7
Log-linear                74.7   75.1     75.7   75.3   72.2   71.3
SVM-MAP                   74.3   75.2     75.3   76.3   71.8   72.0
SVM-bipartite             73.7   73.7     74.7   79.4   70.1   70.1
SVM-multipartite          76.5   77.3     77.2   83.5   74.2   73.6
SVM-multipartite-margin   75.7   76.0     76.6   82.5   72.9   73.0

Statistical Significance

Statistical significance of result differences for pairwise system comparisons: approximate randomization test with stratified shuffling, applied to results on the query level (Noreen 1989); see the sketch below.

                   Best-feat.  SVM-bipart.  SVM-MAP  Log-linear  SVM-multi.-marg.  SVM-multi.
Best-feature       -           <0.005       <0.005   <0.005      <0.005            <0.005
SVM-bipart.        -           -            0.324    <0.005      <0.005            <0.005
SVM-MAP            -           -            -        0.374       <0.005            <0.005
Log-linear         -           -            -        -           0.053             <0.005
SVM-multi.-marg.   -           -            -        -           -                 <0.005
SVM-multi.         -           -            -        -           -                 -
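A sketch of this significance test: per-query scores of the two systems are randomly swapped (stratified by query), and the p-value is estimated as the fraction of shuffles whose absolute mean difference is at least the observed one. Names and the number of trials are my own choices.

```python
import random

def approximate_randomization(scores_a, scores_b, trials=10000, seed=0):
    """Approximate randomization test with stratified shuffling (Noreen 1989).
    scores_a, scores_b: per-query metric values (e.g. AP) for systems A and B."""
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuf_a, shuf_b = [], []
        for a, b in zip(scores_a, scores_b):   # stratified: swap within each query
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        diff = abs(sum(shuf_a) / len(shuf_a) - sum(shuf_b) / len(shuf_b))
        if diff >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)   # p-value estimate
```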

Experimental Results

Evaluation results:
- SVM-multipartite outperforms all other ranking systems under all evaluation metrics at a significance level >= 0.995.
- Result differences between systems ranked next to each other are not statistically significant.
- All systems outperform the random and best-feature baselines.

Discussion:
- The SVM-multipartite ranker is the most robust across all evaluation metrics.
- Position-sensitive margin rescaling does not help.
- SVM-MAP overtrains on the dev set and thus does not win on the MAP evaluation.

Conclusion

Research questions:
- Is the number of co-clicks useful implicit feedback for creating multipartite rankings to train rankers?
- Are machine learning techniques robust enough to learn from noisy data and achieve good performance w.r.t. human quality standards?

Results:
- Co-click information was shown to correlate well with human judgments of rewrite quality.
- A large-scale experiment finds the multipartite-ranking SVM to be the most robust learner.

TODO: More support needed from extrinsic evaluation / a live search experiment!