
Predictive Indexing for Fast Search

Sharad Goel, John Langford, and Alex Strehl
Yahoo! Research, New York
Modern Massive Data Sets (MMDS), June 25, 2008

Large-Scale Search

Large-scale search is now a ubiquitous operation:

- Web Search: 10 billion web pages; given a search query, we have 10 ms to return the top web page results.
- Web Advertising: 1 billion ads; given a web page, we have 10 ms to choose the best ads for the page.
- Nearest Neighbor: N points in Euclidean space; given a query point, quickly return its k nearest neighbors. This problem arises in data compression, information retrieval, and pattern recognition.

Large-scale search algorithms must satisfy tight computational restrictions. In practice, we are usually satisfied with approximate solutions.

Large-Scale Search: Current Approaches

Existing large-scale search solutions often work by preparing the query string and repeatedly filtering the dataset. In web search, given the query string "the rain in Spain stays mainly in the plain", one approach is to:

1. Rewrite the query string: the (rain OR rains) in Spain stays mainly in the (plain OR plains)
2. Retrieve web pages containing all the query terms
3. Filter these web pages by, for example, PageRank
4. Filter some more...
5. Return the top 10 web page results

Large-Scale Search: Current Approaches

In this approach of query rewriting and successive filtering, a result p is good for a query q if the search system generates that result in response to the query. That is, fitness is determined by architecture.

Alternatively, we could a priori define how good a page is for a query, and then build a system to support that definition. In this way, architecture is determined by fitness. For this latter approach, we need two ingredients:

1. A (machine-learned) score f(q, p) defined for every query/page pair
2. A method for quickly returning the highest-scoring pages for any query

Here we consider only the second problem: rapid retrieval.

The Problem

Consider:

- Q ⊆ ℝ^n, an input space
- W ⊆ ℝ^m, a finite output space of size N
- f : Q × W → ℝ, a known scoring function

Given an input (search query) q ∈ Q, the goal is to find, or closely approximate, the top-k output objects (web pages) p_1, ..., p_k in W (i.e., the top k objects as ranked by f(q, ·)).
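For concreteness, a minimal Python sketch of the exact (brute-force) solution, which fully evaluates f against all N outputs; this O(N) cost per query is precisely what the data structures discussed below try to avoid. The function names are illustrative, not from the talk.

    import heapq

    def top_k(query, outputs, f, k=10):
        # Exact retrieval: score every output against the query and
        # keep the k highest-scoring ones. This costs N full
        # evaluations of f per query.
        return heapq.nlargest(k, outputs, key=lambda p: f(query, p))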

Feature Representation

One concrete way to map web search into this general framework is to represent both queries and pages as sparse binary feature vectors in a high-dimensional Euclidean space. Specifically, we associate each word with a coordinate: a query (page) has a value of 1 for that coordinate if it contains the word, and a value of 0 otherwise. We call this the word-based feature representation, because each query and page can be summarized by the list of features (i.e., words) it contains.

The general framework supports many other possible representations, including those that incorporate the difference between words in the title and words in the body of the web page, the number of times a word occurs, or the IP address of the user entering the query.

Related Work

A standard approach to the rapid retrieval problem is to pre-compute an index (or other data structure) in order to significantly reduce runtime computation. There are two common techniques for doing this:

- Fagin's Threshold Algorithm: we use this as a baseline comparison.
- Inverted Indices: a data structure that maps every page feature (i.e., word) to a list of pages that contain that feature. This works best for sparse scoring rules, which is not typical of the machine-learned rules we consider.

Predictive Indexing: Summary

Suppose we are provided with a categorization of possible queries into related, potentially overlapping, sets. For example, one set might be defined as "queries containing the word Spain". We assume queries are generated from some fixed distribution.

For each query set, the associated predictive index is an ordered list of web pages, sorted by their expected score for random queries drawn from that set. In particular, we expect web pages at the top of the Spain list to be good, on average, for queries containing the word Spain. In contrast to an inverted index, the pages in the Spain list need not themselves contain the word Spain.

To retrieve results for a particular query, we optimize only over web pages in the relevant, pre-computed lists.

Predictive Indexing: Formal Description

Consider a finite collection 𝒬 of sets Q_i ⊆ Q that cover the query space (i.e., Q ⊆ ∪_i Q_i).

For each Q_i, define the conditional probability distribution D_i over queries in Q_i by D_i(·) = D(· | Q_i).

Define f_i : W → ℝ as f_i(p) = E_{q∼D_i}[f(q, p)]. The function f_i(p) is the expected score of the web page p for the (related) queries in Q_i.

Predictive Indexing: Formal Description

For each query set we pre-compute a sorted list of web pages, ordered by f_i(p). At runtime, we search down the lists that correspond to query sets containing the given query, halting when we have exhausted the computational budget.

For the query "the rain in Spain", the relevant per-word lists look like:

    the     rain    in      Spain
    p_53    p_6     p_96    p_58
    p_64    p_91    p_58    p_21
    p_39    p_1     p_65    p_43
    ...     ...     ...     ...
    p_76    p_7     p_58    p_42

where, for example, f_rain(p_6) ≥ f_rain(p_91) ≥ f_rain(p_1) ≥ ⋯ ≥ f_rain(p_7) > 0.
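A sketch of this query-time search in Python. The slides leave the traversal strategy open; here we interleave the relevant lists round-robin until a candidate budget is reached, then fully score the candidates. All names, and the choice of one list per query feature, are assumptions for illustration.

    import heapq
    from itertools import zip_longest

    def retrieve(query_features, index, f, query, budget=500, k=10):
        # index maps each query feature to its pre-sorted page list.
        lists = [index[w] for w in query_features if w in index]
        candidates, seen = [], set()
        # Walk down all relevant lists in parallel (round-robin).
        for row in zip_longest(*lists):
            for p in row:
                if p is not None and p not in seen:
                    seen.add(p)
                    candidates.append(p)
            if len(candidates) >= budget:
                break  # computational budget exhausted
        # Fully score only the collected candidates.
        return heapq.nlargest(k, candidates, key=lambda p: f(query, p))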

Predictive Indexing: Formal Description

To generate the predictive index, we do not assume anything about the particular structure of the scoring function. We do assume, however, that we can sample from the query distribution, in order to approximate the conditional expected scores:

    for t random queries q ∼ D do
        for all query sets Q_j containing q do
            for all pages p in the data set do
                L_j[p] ← L_j[p] + f(q, p)
            end for
        end for
    end for
    for all lists L_j do
        sort L_j
    end for
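A direct Python translation of this construction, as a sketch: sample_query, sets_containing, and the page and score representations are assumptions, since the slides give only the pseudocode.

    from collections import defaultdict

    def build_predictive_index(sample_query, sets_containing, pages, f, t=10000):
        # scores[j][p] accumulates the total score of page p over
        # sampled queries that fall in cover set Q_j.
        scores = defaultdict(lambda: defaultdict(float))
        for _ in range(t):
            q = sample_query()            # q ~ D
            for j in sets_containing(q):  # query sets Q_j with q in Q_j
                for p in pages:
                    scores[j][p] += f(q, p)
        # Sorting by the accumulated sum orders pages by their
        # estimated expected score f_j(p).
        return {j: sorted(L, key=L.get, reverse=True)
                for j, L in scores.items()}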

Empirical Evaluation

We evaluate predictive indexing for two applications:

- Internet advertising
- Approximate nearest neighbor

Empirical Evaluation: Internet Advertising

We consider commercial web advertising data. The data consist of logs of events, where each event represents a visit by a user to a particular web page p, from a set of web pages Q ⊆ ℝ^n (we use a sparse feature representation). From a large set of advertisements W ⊆ ℝ^m, the commercial system chooses a smaller, ordered set of ads to display on the page (generally around 4). The set of ads seen and clicked by users is logged. Note that the role played by web pages has switched, from result to query.

The training data consist of:

- 5 million events (web page ad displays)
- About 650,000 distinct ads; each ad contains, on average, 30 ad features (out of 1,000,000)
- About 500,000 distinct web pages; each page consists of approximately 50 page features (out of 900,000)

Empirical Evaluation: Internet Advertising

We trained a linear scoring rule f of the form

    f(p, a) = ∑_{i,j} w_ij · p_i · a_j

to approximately rank the ads by their probability of click. Here, w_ij are the learned weights (parameters) of the linear model, and p_i, a_j ∈ {0, 1}.

We measured computation time in terms of the number of full evaluations made by the algorithm (i.e., the number of ads scored against a given page). Thus, the true test of an algorithm was to quickly select the most promising T ∈ {100, 200, 300, 400, 500} ads (out of 650,000) to fully score against the page.
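With binary sparse features, the double sum collapses to a sum of weights over pairs of active features. A minimal sketch (the dict-of-weights layout is an assumption):

    def score(page_features, ad_features, w):
        # f(p, a) = sum of w[i, j] over active feature pairs, since
        # p_i = a_j = 1 only for the active features.
        return sum(w.get((i, j), 0.0)
                   for i in page_features
                   for j in ad_features)

With roughly 50 active page features and 30 active ad features per pair, each full evaluation costs on the order of 1,500 weight lookups, which is why the number of full evaluations is a natural unit of computational cost.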

Empirical Evaluation: Internet Advertising

We tested four methods:

- Halted Threshold Algorithm (TA)
- Predictive Indexing, where each feature corresponds to a cover set, namely the set of queries that contain that feature:
  - PI-AVG: pages ordered by expected score
  - PI-DCG: pages ordered by probability of being a top result, using
        DCG_f(p, a) = I[r(p, a) ≤ 16] / log_2(r(p, a) + 1)
- Best global ordering (BO): a degenerate form of predictive indexing that uses a single cover set, with pages ordered by DCG_f(p, a)
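Read as code, the truncated-DCG credit for an ad that f ranks at position r is (a one-line sketch):

    import math

    def dcg_f(rank):
        # Credit 1/log2(rank + 1) for ranks 1..16, zero beyond 16.
        return 1.0 / math.log2(rank + 1) if rank <= 16 else 0.0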

Empirical Evaluation: Internet Advertising

[Figure: Comparison of serving algorithms. Probability of exactly retrieving the 1st result (0.0–1.0) vs. number of full evaluations (100–500), for PI-AVG, PI-DCG, fixed ordering, and halted TA.]

Empirical Evaluation: Internet Advertising

[Figure: Comparison of serving algorithms. Probability of exactly retrieving the 10th result (0.0–1.0) vs. number of full evaluations (100–500), for PI-AVG, PI-DCG, fixed ordering, and halted TA.]

Empirical Evaluation: Approximate Nearest Neighbor

A special-case application of predictive indexing is approximate nearest neighbor search. Given a set of points in n-dimensional Euclidean space and a query point x in that same space, the nearest neighbor problem is to quickly return the top-k neighbors of x. This problem is of considerable interest for a variety of applications, including data compression, information retrieval, and pattern recognition.

In the predictive indexing framework, the nearest neighbor problem corresponds to minimizing a scoring function defined by Euclidean distance:

    f(x, y) = ‖x − y‖_2

Empirical Evaluation: Approximate Nearest Neighbor

To start, we define a covering of the input space ℝ^n, which we borrow from locality-sensitive hashing (LSH), a commonly suggested scheme for the approximate nearest neighbor problem:

- We form α random partitions of the input space.
- Each partition splits the space by β random hyperplanes.
- In total, we have α · 2^β cover sets, and each point is covered by α sets.
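A sketch of this covering in Python with NumPy. Hyperplanes through the origin are one common LSH convention; the slides do not pin down the construction, so treat the details as assumptions.

    import numpy as np

    def make_partitions(dim, alpha, beta, seed=0):
        # alpha partitions, each defined by beta random hyperplane normals.
        rng = np.random.default_rng(seed)
        return [rng.standard_normal((beta, dim)) for _ in range(alpha)]

    def cover_sets(x, partitions):
        # Each partition assigns x a beta-bit sign pattern, i.e. one of
        # 2**beta cells; a point therefore lies in alpha cover sets.
        return [(i, tuple(int(b) for b in (H @ x > 0)))
                for i, H in enumerate(partitions)]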

Empirical Evaluation: Approximate Nearest Neighbor

Given a query point x, LSH evaluates points in the α sets that cover x. Predictive indexing, in contrast, maps each cover set C to an ordered list of points, sorted by their probability of being among the 10 nearest points to queries in C. That is, the lists are sorted by

    h_C(p) = Pr_{q∼D_C}(p is one of the 10 nearest points to q).

For the query x, we then consider those points with large probability h_C for at least one of the α sets that cover x.
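h_C can be approximated the same way as f_i earlier: sample queries from D_C and count how often each point lands in the sampled query's top 10. A hedged sketch (points represented as hashable tuples; all names illustrative):

    from collections import Counter

    def estimate_h(sample_query_in_C, points, distance, trials=1000, k=10):
        # Monte Carlo estimate of h_C(p): the fraction of sampled
        # queries for which p is among the k nearest points.
        counts = Counter()
        for _ in range(trials):
            q = sample_query_in_C()
            nearest = sorted(points, key=lambda p: distance(q, p))[:k]
            counts.update(nearest)
        return {p: c / trials for p, c in counts.items()}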

Empirical Evaluation: Approximate Nearest Neighbor

We compare LSH and predictive indexing over four public data sets:

1. MNIST: 60,000 training and 10,000 test points in 784 dimensions (β = 24 projections per partition)
2. Corel: 37,749 points in 32 dimensions, split randomly into 95% training and 5% test subsets (β = 24 projections per partition)
3. Pendigits: 7,494 training and 3,498 test points in 17 dimensions (β = 63 projections per partition)
4. Optdigits: 3,823 training and 1,797 test points in 65 dimensions (β = 63 projections per partition)

The number of partitions α was varied as an experimental parameter. Larger α corresponds to more full evaluations per query, resulting in improved accuracy at the expense of increased computation time. Both algorithms were allowed the same average number of full evaluations per query.

Empirical Evaluation: Approximate Nearest Neighbor

[Figure: LSH vs. Predictive Indexing. Scatter of the true rank of the 1st result returned by LSH against that returned by predictive indexing (both axes 2–10).]

Empirical Evaluation: Approximate Nearest Neighbor

[Figure: LSH vs. Predictive Indexing, all data sets. Scatter of the true rank of the 10th result returned by LSH against that returned by predictive indexing (both axes 20–100).]

Empirical Evaluation: Approximate Nearest Neighbor

[Figure: LSH vs. Predictive Indexing on MNIST data. Rank of the 1st result (0–40) vs. number of partitions α (20–70), for both methods.]

Empirical Evaluation: Approximate Nearest Neighbor

[Figure: LSH vs. Predictive Indexing on MNIST data. Rank of the 10th result (0–1000) vs. number of partitions α (20–70), for both methods.]

Conclusion

Predictive indexing is based on a very simple idea: build an index that incorporates the query distribution. The method performs well in our empirical evaluations of Internet advertising display and the approximate nearest neighbor problem. We believe the predictive index is the first data structure capable of supporting scalable rapid ranking based on general-purpose machine-learned scoring rules.

Appendix

Related Work: Fagin's Threshold Algorithm

Consider the word-based feature representation, and suppose the scoring function has the form

    f(q, p) = ∑_{i : q_i = 1} g_i(p).

For example, the query "the rain in Spain" corresponds to

    f(q, p) = g_the(p) + g_rain(p) + g_in(p) + g_Spain(p),

where g_i(p) is some measure of how important term i is to document p (e.g., based on tf-idf).

Related Work: Fagin's Threshold Algorithm

For each query feature (i.e., word), TA pre-computes an ordered list L_i sorted by g_i(p):

    the     rain    in      Spain
    p_53    p_6     p_96    p_58
    p_64    p_91    p_58    p_21
    p_39    p_1     p_65    p_43
    ...     ...     ...     ...
    p_76    p_7     p_58    p_42

where, for example, g_rain(p_6) ≥ g_rain(p_91) ≥ g_rain(p_1) ≥ ⋯ ≥ g_rain(p_7) > 0.

Related Work: Fagin's Threshold Algorithm

Given a query (e.g., "the rain in Spain"), TA searches down the relevant lists (e.g., L_the, L_rain, L_in, and L_Spain) in parallel, fully evaluating each new page that it comes across. Clearly, any page with a non-zero score must appear in one of these lists, so searching over these lists is sufficient.

TA, however, avoids searching over all the pages in these lists by maintaining upper and lower bounds on the score of the k-th best page, halting when these bounds cross. The lower bound is the score of the k-th best page evaluated so far. Since the lists are ordered by their partial scores g_i(p), the sum of the partial scores of the next-to-be-scored page in each list bounds the score of any page yet to be seen, giving an upper bound.
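A compact Python sketch of TA as just described (round-robin descent of the per-term lists; the data layout and comparable page IDs are assumptions):

    import heapq

    def threshold_algorithm(lists, g, k=10):
        # lists maps each query term i to its pages sorted by g_i(p)
        # descending; g(i, p) returns the partial score g_i(p).
        # Pages are assumed to be comparable IDs (e.g., strings).
        top, seen = [], set()   # top: min-heap of (score, page)
        for depth in range(max(map(len, lists.values()))):
            threshold = 0.0     # upper bound on any unseen page's score
            for i, lst in lists.items():
                if depth >= len(lst):
                    continue
                p = lst[depth]
                threshold += g(i, p)
                if p not in seen:
                    seen.add(p)  # full evaluation of a new page
                    heapq.heappush(top, (sum(g(j, p) for j in lists), p))
                    if len(top) > k:
                        heapq.heappop(top)
            # Halt when the k-th best score meets the upper bound.
            if len(top) == k and top[0][0] >= threshold:
                break
        return [p for _, p in sorted(top, reverse=True)]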

Related Work: Fagin's Threshold Algorithm

Walkthrough for the query "the rain in Spain": as TA descends the four lists in parallel, the bounds tighten. In the example from the talk, the (upper, lower) bounds evolve from (93.23, 26.42) to (87.17, 80.31) and finally to (85.92, 85.92), at which point the bounds meet and the algorithm halts.

Related Work: Fagin's Threshold Algorithm

Fagin's Threshold Algorithm is particularly effective when a query contains a small number of features, facilitating fast convergence of the upper bound. In our experiments, however, we find that the halting condition is rarely satisfied within the imposed computational restrictions. One can, of course, simply halt the algorithm when it has expended the computational budget; we refer to this as the Halted Threshold Algorithm, and use it as a baseline comparison.

Related Work: Inverted Index

An inverted index is a data structure that maps every page feature i (i.e., word) to a list of pages p that contain i. When a new query arrives, a subset of page features relevant to the query is first determined. For instance, if the query contains "rain", the page feature set might be {"rain", "precipitation", "drizzle", ...}. A distinction is made between query features and page features; in particular, the relevant page features may include many more words than the query itself. Once a set of page features is determined, their respective lists (i.e., inverted indices) are searched, and from them the final list of output pages is chosen.

Related Work: Inverted Index

Approaches based on inverted indices are efficient only when it suffices to search over a relatively small set of inverted indices for each query. Inverted indices require, for each query q, that there exist a small set of page features (typically 100) such that the score of any page against q depends only on those features. In other words, the scoring rule must be extremely sparse, with most words or features in the page contributing nothing to the score for q.

We consider a machine-learned scoring rule, derived from Internet advertising data, with the property that almost all page features have substantial influence on the score for every query, making any straightforward approach based on inverted indices intractable.