Finding predictive search queries for behavioral targeting

Adwait Ratnaparkhi
Yahoo! Labs
4401 Great America Parkway
Santa Clara, CA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ADKDD '10, July 25, 2010, Washington, D.C., USA. Copyright 2010 ACM.

ABSTRACT

Behavioral targeting refers to the use of historical user internet activity to improve the relevance of internet advertisements that are shown to that user. We study one class of features, search queries, that is thought to be a good indicator of user interest. Several feature selection techniques are employed to find informative lists of queries for each behavioral targeting category of interest. These queries are then evaluated in a click probability model, which attempts to predict the probability that a user will click a given ad shown on some page, based on the historical search queries of the user, features of the page, and features of the ad. Our experiments on a large amount of display advertisement data show that queries obtained by our feature selection techniques can improve click prediction for some behavioral categories. Furthermore, we demonstrate that for some categories, the top 20 queries are surprisingly relevant to their corresponding behavioral category, despite being induced from historical data using purely statistical methods.

2. INTRODUCTION

Among internet advertisers, behavioral targeting (BT) is a common way to target internet advertisements towards a segment of the internet audience. BT algorithms attempt to match users to ads based on the historical activity of the users and the perceived category of the ad. For example, a user who browsed web pages related to automobiles yesterday might be a good candidate for seeing an auto-related ad today. There are many kinds of historical user features that are useful in BT; this paper focuses on just one class of features that is thought to be a good indicator of user interest, namely search queries. This paper presents several automatic feature selection strategies for selecting user search queries that indicate interest in a BT category for display advertisements.

At Yahoo!, all graphical display advertisements are editorially classified into one or more BT categories. The BT categories form a hierarchical taxonomy which is designed to capture a broad set of user interests. The most straightforward technique to determine if a query q is relevant to a BT category c is to categorize q (with an automatic technique that examines the content of q), and only consider it relevant if q's category is the same as or related to c. Automatic categorizers are typically built with machine learning methods that require many thousands of categorized examples as a training set. The feature selection techniques in this paper associate queries to BT categories not by their content, but rather by their association with the click event. These techniques have an advantage over a content-based query categorizer in the following scenarios:

1. Queries q1 and q2 could both have content related to the category Autos, yet only q1 could be related to actually clicking on advertisements in the Autos category.
2. A query q3 could have content in some unrelated category, e.g., Insurance, yet it could be highly predictive of clicks in the Autos category.

In the above scenarios, a content-based categorizer will not distinguish q1 from q2, nor even detect that q3 is relevant to Autos. In contrast, a technique that learns from historical data to infer that q1 and q3 are both likely to be followed by a click on a display ad in the Autos category ought to be useful in both scenarios. The goal of our techniques in this paper is to find predictive queries for a click in a BT category, without regard for the category implied by their content.

3. PREVIOUS WORK

Earlier work on BT has used a linear regression model [4] and a Poisson model [3] to estimate the click probability of a user shown a display advertisement in a BT category. In these works, the historical user features (including search queries) are first aggregated at the user level. (In [4] the features are aggregated further into intensity and recency values.) The models in both of these earlier works use the features of the ad (i.e., its BT category) and the historical features of the user to learn the contributions of the user features towards a click event for a given BT category. Other work [7] shows that past page views and search queries are correlated with sponsored search ad clicks on search results pages.

The authors of [7] create user segments by clustering users according to their search queries and page views, and show that the resulting user segments have higher average click-through rates (CTRs) when compared with the CTRs of a user segment that did not use behavioral data. Furthermore, they notice that the search query segments have higher CTRs than the page view segments.

Our work differs from all earlier attempts in that we consider the effect of the page as well as the user and the ad in our click probability model; we attempt to estimate P(click | page, ad, user). Previous work such as [7] does not focus on feature selection for queries, while other work uses either a frequency cutoff [3] or a machine-learned categorizer to select queries [4]. In this paper we go beyond frequency cutoffs and take a deeper look at other feature selection techniques that use only the association between the query and the click event to find predictive queries for a BT category.

4. DATA COLLECTION

We collect data for both our feature selection techniques and model training from the display advertisement serving logs and search engine logs. Each line of the data represents an ad impression, and contains the following fields:

bcookie: An identifier that represents the user.
timestamp: The timestamp of the impression.
BT: A list of BT categories for the advertisement.
pos: The position of the advertisement on the page.
property: The Yahoo! property name on which this ad was shown. Property names are more general than URLs; a single property name can account for many URLs. E.g., sports, shopping, and news are examples of property names.
queries: The historical queries of this user, including the current day (before the time of the ad impression) and 5 days before. Repeated queries in the history are replaced with a single query.
click: Whether or not this impression resulted in a click on the advertisement.

The data from the server logs was filtered by the pos field so that only advertisements in the 5 most prominent ad positions were retained. Table 1 shows the sizes of the training and test set, and shows what fraction of impressions have a query history. Since clicks are very sparse, the training set of impressions needs to be very large (approx. 93 billion) in order for us to collect a large number of clicks. A series of scripts written in the Pig programming language [5] were run on a Hadoop [1] cluster to assemble the data from the display advertisement and search engine logs.

Table 1: Training and test data. Clicks in training and test exceed 10 million. Date range of data and exact number of clicks are confidential information.

                Impressions      Clicks   Impressions with query history
  Training set  92,960,778,679   >10M     27,920,032,253 (30%)
  Test set      27,959,492,938   >10M     8,314,259,579 (29.7%)

5. MODELING FRAMEWORK

We use the conditional maximum entropy framework defined in [2] for click modeling, so that

  p(click | g, a, u) = \frac{1}{Z(g, a, u)} \prod_{j=1}^{k} \alpha_j^{f_j(click, g, a, u)}

  Z(g, a, u) = \sum_{click \in \{0, 1\}} \prod_{j=1}^{k} \alpha_j^{f_j(click, g, a, u)}

where f_j is a feature, α_j > 0 is the corresponding parameter, g is the page, a is the ad, u is a user, and Z(g, a, u) is a normalization factor. Any information about the page, user, or ad that we deem useful for click modeling must be encoded in the features. Note that any feature f_j is defined jointly over the (click, page, ad, user) tuple, written here in a general way:

  f_j(click, g, a, u) =
    \begin{cases}
      1 & \text{if click} = 1 \text{ and } (g, a, u) \text{ contains some context of interest} \\
      0 & \text{otherwise}
    \end{cases}

Historical queries of the user u, page property names of g, and ad categories for the ad a are examples of contexts.
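
To make the scoring concrete, here is a minimal sketch (in Python) of how p(click | g, a, u) could be evaluated for a single impression once the binary features and their weights α_j are known. The feature strings, weight values, and the two default-feature keys below are illustrative placeholders, not the production feature set; they only show the arithmetic of the model above.

```python
def p_click(active_features, weights):
    """Conditional maximum entropy click probability for one impression.

    active_features: feature strings that fire for this (page, ad, user) tuple;
                     context features are defined only for click = 1.
    weights: dict mapping feature name -> alpha_j > 0 (illustrative values).
    """
    # Score for click = 1: the default click=1 feature plus every active context feature.
    score_click = weights["*default:click=1*"]
    for f in active_features:
        score_click *= weights.get(f, 1.0)  # a feature absent from the model contributes a factor of 1

    # Score for click = 0: only the default click=0 feature fires.
    score_noclick = weights["*default:click=0*"]

    # Z(g, a, u) normalizes over click in {0, 1}.
    z = score_click + score_noclick
    return score_click / z


# Hypothetical weights and impression, for illustration only.
weights = {
    "*default:click=1*": 0.001,
    "*default:click=0*": 1.0,
    "acat=Autos": 1.8,
    "acat,prop=Autos,yahoo-sports": 1.2,
    "acat,q=Autos,nissan versa": 2.5,
}
features = ["acat=Autos", "acat,prop=Autos,yahoo-sports", "acat,q=Autos,nissan versa"]
print(p_click(features, weights))  # a small probability, boosted by the query feature
```
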
During model training, the parameters α_j are set to maximize the log-likelihood of the training data:

  L(p) = \sum_{click, g, a, u} \tilde{p}(click, g, a, u) \log p(click | g, a, u)    (1)

where \tilde{p}(click, g, a, u) is the empirical probability of observing (click, g, a, u) in the training set, i.e., the weight of the training instance. In what follows, all features are defined for click = 1, with the exception of the default features defined below, which are defined for both click = 1 and click = 0. We first introduce a baseline model that is trained with an initial feature set looking only at page and ad features. We then augment that baseline feature set with historical query features, and evaluate the impact of adding those features.

5.1 Baseline model

The baseline model has the following kinds of features:

default: There are 2 default features that are used regardless of the (page, ad, user) tuple when computing p(click = 1 | ...) and p(click = 0 | ...). They are denoted as f_0 and f_1:

  f_0(click, g, a, u) = 1 if click = 0, 0 otherwise
  f_1(click, g, a, u) = 1 if click = 1, 0 otherwise

These features are used to model the prior distribution of clicks and non-clicks in the training set.

BT: The BT category of the ad in the impression.

BT & pos: The BT category of the ad conjoined with the position of the ad.

BT & prop: The BT category of the ad conjoined with the property name of the page. An example of this feature might be:

  f_j(click, g, a, u) =
    \begin{cases}
      1 & \text{if click} = 1, \text{ the property of } g \text{ is yahoo-sports, and the category of } a \text{ is Auto} \\
      0 & \text{otherwise}
    \end{cases}

There are 5 possible ad positions, and over 100 possible property names in this data set. A model with this feature set uses information from only the page and the ad; it is effectively computing Pr(click | g, a).
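
As an illustration of how the baseline templates expand, the sketch below generates the BT, BT & pos, and BT & prop feature strings for one logged impression. The record layout follows the fields in Section 4, but the concrete values (and the acat=... naming, which mirrors the example instance shown later in Table 3) are assumptions for the example.

```python
def baseline_features(impression):
    """Expand the baseline feature templates for one ad impression.

    Returns the feature strings that would fire together with click = 1;
    the two default features are handled separately by the model.
    """
    feats = []
    for cat in impression["BT"]:                             # an ad may have several BT categories
        feats.append(f"acat={cat}")                          # BT
        feats.append(f"acat,pos={cat},{impression['pos']}")  # BT & pos
        feats.append(f"acat,prop={cat},{impression['property']}")  # BT & prop
    return feats


# Hypothetical impression record, for illustration only.
impression = {
    "BT": ["Finance"],
    "pos": "FPAD",
    "property": "yahoo-top-page",
    "queries": ["hotmail.com"],
    "click": 0,
}
print(baseline_features(impression))
# ['acat=Finance', 'acat,pos=Finance,FPAD', 'acat,prop=Finance,yahoo-top-page']
```
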

5.2 Feature selection techniques

In this section, we investigate feature selection techniques that integrate user information in the form of historical queries; these models compute Pr(click | g, a, u). The goal of the feature selection techniques is to find pairs (q, c) such that query q in the user's history is predictive of clicks on display ads with BT category c. A pair (q, c) is used to construct a feature f_{q,c} as follows:

  f_{q,c}(click, g, a, u) =
    \begin{cases}
      1 & \text{if click} = 1, c \text{ is a valid category of the ad } a, \text{ and } q \text{ is a historical query of the user } u \\
      0 & \text{otherwise}
    \end{cases}

For all selection techniques, we only consider those (q, c) pairs that have occurred with clicks. The feature sets that are produced from the following methods will be added to the baseline feature set and evaluated for their ability to better predict clicks on display ads. We evaluate the following methods:

Frequency threshold: We select pairs (q, c) such that freq(q, c, click) >= 20, where q is a query in the user's history, c is a BT category of the ad in the impression, and freq(q, c, click) is the frequency of the pair (q, c) occurring with a click.

Top N frequency: We select the top N (= 100K) pairs (q, c) when sorted by freq(q, c, click) in descending order.

CTR ratio: We select pairs (q, c) such that the CTR ratio > 1. The CTR ratio is defined as

  CTRratio = \frac{p(click | c, q)}{p(click | q)}    (2)

The CTR ratio is the conditional click probability of the pair (q, c), normalized by the click probability of just the query q. In practice, the normalization has the effect of reducing the score for queries that have high click propensity but are not related to any particular user interest in our taxonomy, e.g., pornographic queries.

Top N likelihood gain: We select the top N (= 100K) pairs (q, c) when sorted by the likelihood gain statistic given in [6]. In preparation for using this feature selection technique, any pair (q, c) in the training data is used to construct a candidate feature f, as in the definition of f_{q,c} above. A candidate feature f is then evaluated by measuring the gain that it would provide to the likelihood of the training data if it were added to the baseline model. Denote the baseline model by p. If f is a candidate feature, denote by p_f a model which has been trained in such a way that its baseline feature parameters were held to the same values as in p, but where the parameter for f is allowed to vary and fit the training data. The likelihood gain of feature f is defined as L(p_f) - L(p). A non-zero gain would indicate that the feature f has some information beyond the features in the baseline set. An iterative solution for the likelihood gain is given in [2]. A closed-form solution (for the special case of binary-valued features) for the gain is given in [6]. However, [6] computes the gain not for the likelihood function itself, but instead for a non-negative generalization of the likelihood function. The gain computation given in [6] is:

  gain(f) = E_p[f] - E_{\tilde{p}}[f] + E_{\tilde{p}}[f] \log \frac{E_{\tilde{p}}[f]}{E_p[f]}

  E_p[f] = \sum_{click, g, a, u} \tilde{p}(g, a, u)\, p(click | g, a, u)\, f(click, g, a, u)

  E_{\tilde{p}}[f] = \sum_{click, g, a, u} \tilde{p}(click, g, a, u)\, f(click, g, a, u)

Here we use \tilde{p} to denote the empirical probability distribution in the training data.
Then E_p[f] is the expectation of feature f with respect to the (baseline) model p, while E_{\tilde{p}}[f] is the observed expectation of f, and gain(f) is the feature gain. While the above gain computation is not exactly matched to our likelihood function and probability model, we assume for the sake of convenience (in both implementation and run-time) that features ranked according to this gain statistic will be useful in our setting anyway.

In category features: We select pairs (q, c) such that freq(q, c, click) >= 5, where q is a query in the user's history, c is a BT category of the ad in the impression, and freq(q, c, click) is the frequency of the pair (q, c) occurring with a click. Furthermore, the category c must also be a valid category of q, so both the query and the ad belong to the same category. The categories for q are determined by a proprietary machine-learned query categorizer trained from a manually annotated list of queries. While the other feature selection methods aim to induce the list of pairs from statistical association with clicks, this method looks at the content of q to determine the category.

Using the techniques in this list, we augment the feature set of the baseline model and train and test separate models for the resulting feature sets. The feature set sizes are shown in Table 2.

Table 2: Feature set sizes

  Feature selection technique   Feature set size   # of (query, BT category) pairs in feature set
  Baseline
  Frequency
  Top N Frequency
  CTR Ratio
  Top N Likelihood
  In category
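
As a concrete illustration of the frequency-threshold and CTR-ratio selections above, here is a minimal in-memory sketch over aggregated impression records. The record layout and threshold arguments are assumptions; the actual pipeline ran as Pig scripts on a Hadoop cluster over billions of impressions rather than as Python over in-memory logs.

```python
from collections import Counter

def select_pairs(impressions, min_click_freq=20):
    """Select (query, BT category) pairs by click frequency and by CTR ratio.

    impressions: iterable of dicts with keys 'queries' (historical queries of
    the user), 'BT' (categories of the ad) and 'click' (0 or 1).
    """
    pair_imps, pair_clicks = Counter(), Counter()
    query_imps, query_clicks = Counter(), Counter()

    for imp in impressions:
        for q in imp["queries"]:
            query_imps[q] += 1
            query_clicks[q] += imp["click"]
            for c in imp["BT"]:
                pair_imps[(q, c)] += 1
                pair_clicks[(q, c)] += imp["click"]

    freq_selected, ctr_selected = set(), set()
    for (q, c), clicks in pair_clicks.items():
        if clicks == 0:
            continue  # only pairs that have occurred with clicks are considered
        if clicks >= min_click_freq:
            freq_selected.add((q, c))
        # CTR ratio = p(click | c, q) / p(click | q), estimated from counts.
        p_click_cq = clicks / pair_imps[(q, c)]
        p_click_q = query_clicks[q] / query_imps[q]
        if p_click_q > 0 and p_click_cq / p_click_q > 1.0:
            ctr_selected.add((q, c))
    return freq_selected, ctr_selected
```

In practice the conditional click probabilities would be estimated from very large logs (and possibly smoothed), so the counters above are only meant to show the arithmetic behind the two selection rules.
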

5.3 Training a model

Given a feature set, we extract training and test instances for the model, in the form:

  click x_1 ... x_n

where click ∈ {0, 1}, and x_1 ... x_n are the historical contexts of the (page, ad, user) tuple. An example training instance is shown in Table 3. Here Finance is a BT category of the ad, FPAD is a code for an ad position on the page, yahoo-top-page is a property name, and hotmail.com is a query.

Table 3: Example of a training instance

  0  acat=finance  acat,pos=finance,fpad  acat,prop=finance,yahoo-top-page  acat,q=finance,hotmail.com  *default*

Given a feature set together with the training data, we use the improved iterative scaling algorithm [2] to estimate the model parameters from this data. This algorithm attempts to find a parameter setting that maximizes the likelihood (eq. (1)) of the training data. We use a map-reduce implementation of the parameter estimation algorithm on a computing cluster that uses the Hadoop [1] software. We train one model for each feature selection technique listed in Table 2. The resulting models are evaluated on the test data in the next section.

6. EVALUATION USING MAX F1 SCORE

The click prediction accuracy of a model is often measured by precision and recall, defined as:

  correct(t)   = # instances for which click = 1 and p(click = 1 | g, a, u) > t
  proposed(t)  = # instances for which p(click = 1 | g, a, u) > t
  precision(t) = correct(t) / proposed(t)
  recall(t)    = correct(t) / (# instances for which click = 1)

where t is a threshold between 0 and 1. A precision vs. recall graph can be obtained by varying the threshold t. The precision and recall at a threshold t can be summarized into a single statistic, known as the F1 score:

  F_1(t) = \frac{2 \cdot precision(t) \cdot recall(t)}{precision(t) + recall(t)}

and the max F1 score is defined as the highest F1 for any threshold:

  \max F_1 = \max_t F_1(t)

Here the max F1 score is used to summarize an entire precision vs. recall curve.
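
A small sketch of the threshold sweep implied by these definitions, assuming the model's predicted probabilities and the 0/1 click labels for the test instances are available as parallel lists. This is a quadratic-time toy version for clarity; a production implementation would sort the scores once and sweep.

```python
def max_f1(scores, labels):
    """Sweep thresholds t and return the max F1 score, per the definitions above.

    scores: predicted p(click = 1 | g, a, u) for each test instance.
    labels: 1 if the impression was clicked, 0 otherwise.
    """
    total_clicks = sum(labels)
    if total_clicks == 0:
        return 0.0
    best = 0.0
    for t in sorted(set(scores) | {0.0}):
        proposed = [label for s, label in zip(scores, labels) if s > t]
        if not proposed:
            continue
        correct = sum(proposed)          # proposed instances that were actually clicked
        precision = correct / len(proposed)
        recall = correct / total_clicks
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example with made-up scores and labels.
print(max_f1([0.9, 0.2, 0.7, 0.1], [1, 0, 0, 1]))  # ~0.667
```
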
Figure 1 shows the max F1 score for all models when evaluated across the entire test set (denoted by All), as well as on subsets of the test data that correspond to the behavioral category of the ad. The behavioral categories of interest are Auto Insurance, Auto, Cruises, Parenting and Children, Credit Services, and Notebooks (as in notebook computers). For each category, the Y axis represents the percentage gain in max F1 score of the model vs. the baseline model on the subset of the test data for that category.

Figure 1: Max F1 across feature selection techniques

The results of the max F1 score comparison are mixed. Looking at the scores for the experiment labeled All, the statistically induced lists of query and category pairs degrade the accuracy. The in-category features do not degrade the score, but do not greatly improve it either. However, there appears to be clear improvement for the Auto Insurance, Credit Services, and Notebooks categories vs. the baseline model. Notably, for the Auto Insurance category, the top N frequency technique shows a 95% improvement in max F1 over the baseline. The historical query features for the Auto Insurance and Credit Services categories also show improvement vs. the in-category feature set. In contrast, for the Auto category, models that use historical query features greatly degrade the score. The top N frequency, CTR ratio, and top N LL gain techniques each show promise for one category, but no one method consistently outperforms the others.

For confidentiality reasons, we are not able to disclose the number of clicks in each category, but the number of impressions in the test data subsets for each category range from the tens of millions to the hundreds of millions.

6.1 Statistical Significance

To our knowledge, there is no direct way to measure the statistical significance of 2 different max F1 scores.

However, it is possible to perform a paired t-test that determines whether the raw score differences between a pair of models on exactly the same test instances are statistically significant. Given 2 click probability models p_1 and p_2, and a (page, ad, user) tuple (g, a, u) in the test data, we compute p_1(click = 1 | g, a, u) - p_2(click = 1 | g, a, u) for each test instance. If µ is the sample mean of this vector of differences, the null hypothesis is H_0 = {µ = 0}, which means that on average the 2 models return the same scores for the test instances. If H_0 is true for a (p_1, p_2) pair, it means that the models are not behaving differently on the test data.

Given the 5 models derived from the various feature selection techniques listed on the x axis of Figure 1 (not counting the baseline), and 7 categories (including All), we perform one paired t-test for each combination of model and category, for a total of 35 paired t-tests. In each t-test, we compute the score differences between the model of interest and the baseline model, using the subset of the test data corresponding to the category. E.g., a paired t-test was done for the CTR ratio model and the baseline model on the subset of the test data corresponding to the Auto category. In all cases, the t-statistics indicate that we can reject the null hypothesis H_0 at significance level α =.
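
The paired test described above can be sketched as follows, assuming the two models' predicted probabilities on the same test instances are available as parallel lists. scipy's ttest_rel is used here as a stand-in for whatever implementation was actually employed, and the significance level in the example is an illustrative assumption, not the value used in the paper.

```python
from scipy.stats import ttest_rel

def compare_models(p1_scores, p2_scores, alpha=0.01):
    """Paired t-test on per-instance score differences between two click models.

    p1_scores[i] and p2_scores[i] are p(click = 1 | g, a, u) from the two models
    for the same test instance i. The null hypothesis is that the mean
    difference is zero. alpha is an illustrative significance level.
    """
    t_stat, p_value = ttest_rel(p1_scores, p2_scores)
    return t_stat, p_value, p_value < alpha


# Toy scores for two hypothetical models on five shared test instances.
model_a = [0.010, 0.020, 0.015, 0.030, 0.025]
model_b = [0.012, 0.026, 0.021, 0.038, 0.031]
print(compare_models(model_a, model_b))
```
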
6.2 Examples of CTR ratio

Some of the feature selection techniques yield lists of queries that are semantically relevant to the behavioral category. Tables 4, 5, and 6 show the top 20 queries sorted by the CTR ratio technique for the Autos, Cruises, and Notebooks categories. The queries shown in the tables all appear relevant to the BT category, despite having been produced by purely statistical methods that do not look at the content of the query. In all fairness, though, the quality of the queries often gets worse the deeper one goes in the list, and not all categories have queries that are this clean.

Furthermore, the CTR ratio technique can find queries that are not semantically relevant to the BT category, but are nonetheless highly correlated. For example, in the Autos example, the queries cash for clunkers incentives, cash for clunkers, as well as the mis-spelling cash for cluckers (we preserve the original spelling of the queries) are within the top 200 queries. These queries do not contain any auto-related terms, but are very relevant to the Autos category. We would not have been able to make this association with a content-based query categorizer, which assigns a query like cash for clunkers to the Finance category. However, these out-of-category queries are a small proportion of the overall (q, c) pairs: for 98.4% of the (q, c) pairs from the CTR ratio technique, the proprietary query categorizer would also have returned c as a valid category for q.

Table 4: Queries selected by the CTR Ratio technique in the Auto category

  Query                   CTR Ratio
  2010 ford edge
  new kia cars
  nissan versa
  buick enclave
  toyota prius
  equinox
  nissan versa
  chevrolet equinox
  kia forte
  chevy traverse
  new 2008 toyota cars
  nissian
  chevy equinox
  chevrolet equinox
  subaru cars
  toyota prius
  prius
  ford focus
  dodge cars
  best car deals

Table 5: Queries selected by the CTR Ratio technique in the Cruises category

  Query                   CTR Ratio
  vacations
  apple vacations
  cruise deals
  vacation packages
  cheap caribbean
  all inclusive vacations
  vacation to go
  vacations to go
  cheap cruise deals
  all inclusive resorts
  nassau bahamas
  last minute travel deals
  apple vacations all inclusive
  caribbean cruises
  cheap cruises for
  travelzoo.com
  last minute deals
  atlantis bahamas
  discount cruises
  last minute cruise deals

Table 6: Queries selected by the CTR Ratio technique in the Notebooks category

  Query                   CTR Ratio
  dell coupons
  cheap laptops
  lenovo
  dell laptops
  hp laptops
  acer
  newegg
  dell
  gateway computers
  asus
  micro center
  acer laptop
  compusa
  frys
  toshiba
  officemax
  gateway
  dell.com
  tigerdirect
  laptops

7. EFFECT OF THE PAGE

One explanation for why the query features do not help in some categories (e.g., Autos) is that for some categories, the page information already captures the behavioral interest of the user, and additional information in the form of queries does not add any predictive power. E.g., if the model already knows that the user was visiting an autos-related page known to have a high click rate for autos ads (e.g., autos.yahoo.com) at the time of the ad click, the fact that this user issued an autos-related search query in the past few days may not further help in predicting the click.

We measure the impact of using the property name of the page by comparing the following models:

Baseline: This is the same model as described in Section 5.1. This model computes Pr(click | g, a).

Baseline without page: This is the baseline model, but without the BT & prop feature. The BT and BT & pos features are features of the ad, so in effect, this model is computing Pr(click | a).

Table 7 shows a 39% degradation in max F1 score when the page features are removed.

Table 7: Removing the page features from the baseline

  Features                Feature set size   max F1 score (relative to baseline, %)
  Baseline without page
  Baseline

8. CONCLUSION

This paper has presented the first click probability model (to our knowledge) for display advertising that integrates information from the user, the ad, as well as the page in making a click prediction. The scale of the experiment is also larger than previous studies such as [3] and [7], which present models trained on 500 million and 6 million users (respectively); our models have been trained on 93 billion impressions and tested on 28 billion. The paper first presents a baseline model that uses information from the page and ad, and then evaluates several feature selection techniques to incorporate user information in the form of historical queries. Using the CTR ratio ranking for queries, this paper shows that the top 20 ranked queries are semantically related to selected BT categories. Furthermore, there is anecdotal evidence that this ranking will select predictive queries that would normally be missed by a content-based query categorizer. For some categories of interest, the automatically induced queries show a significant improvement in max F1 score (95% in one case) over the baseline model. For other categories, we surmise that the page feature has overshadowed the query features, and we verify that the page feature adds significant prediction ability to the baseline.

9. REFERENCES

[1] Apache Hadoop.
[2] Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.
[3] Ye Chen, Dmitry Pavlov, and John F. Canny. Large-scale behavioral targeting. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2009. ACM.
[4] C. Y. Chung, J. M. Koran, L.-J. Lin, and H. Yin. Model for generating user profiles in a behavioral targeting system. U.S. Patent 11/394,374, March.
[5] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2008. ACM.
[6] Kishore Papineni. Why inverse document frequency? In NAACL '01: Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1-8, Morristown, NJ, USA, 2001. Association for Computational Linguistics.
[7] Jun Yan, Ning Liu, Gang Wang, Wen Zhang, Yun Jiang, and Zheng Chen. How much can behavioral targeting help online advertising? In WWW '09: Proceedings of the 18th International Conference on World Wide Web, New York, NY, USA, 2009. ACM.
