Finding predictive search queries for behavioral targeting

Adwait Ratnaparkhi
Yahoo! Labs
4401 Great America Parkway
Santa Clara, CA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ADKDD '10, July 25, 2010, Washington, D.C., USA. Copyright 2010 ACM.

ABSTRACT

Behavioral targeting refers to the use of historical user internet activity to improve the relevance of internet advertisements that are shown to that user. We study one class of features, search queries, that is thought to be a good indicator of user interest. Several feature selection techniques are employed to find informative lists of queries for each behavioral targeting category of interest. These queries are then evaluated in a click probability model, which attempts to predict the probability that a user will click a given ad shown on some page, based on the historical search queries of the user, features of the page, and features of the ad. Our experiments on a large amount of display advertisement data show that queries obtained by our feature selection techniques can improve click prediction for some behavioral categories. Furthermore, we demonstrate that for some categories, the top 20 queries are surprisingly relevant to their corresponding behavioral category, despite being induced from historical data using purely statistical methods.

2. INTRODUCTION

Among internet advertisers, behavioral targeting (BT) is a common way to target internet advertisements towards a segment of the internet audience. BT algorithms attempt to match users to ads based on the historical activity of the users and the perceived category of the ad. For example, a user who browsed web pages related to automobiles yesterday might be a good candidate for seeing an auto-related ad today. There are many kinds of historical user features that are useful in BT; this paper focuses on just one class of features that is thought to be a good indicator of user interest, namely search queries. This paper presents several automatic feature selection strategies for selecting user search queries that indicate interest in a BT category for display advertisements.

At Yahoo!, all graphical display advertisements are editorially classified into one or more BT categories. The BT categories form a hierarchical taxonomy which is designed to capture a broad set of user interests. The most straightforward technique to determine if a query q is relevant to a BT category c is to categorize q (with an automatic technique that examines the content of q), and only consider it relevant if q's category is the same as or related to c. Automatic categorizers are typically built with machine learning methods that require many thousands of categorized examples as a training set. The feature selection techniques in this paper associate queries to BT categories not by their content, but rather by their association with the click event. These techniques have an advantage over a content-based query categorizer in the following scenarios:

1. Queries q1 and q2 could both have content related to the category Autos, yet only q1 could be related to actually clicking on advertisements in the Autos category.
2. A query q3 could have content in some unrelated category, e.g., Insurance, yet it could be highly predictive of clicks in the Autos category.

In the above scenarios, a content-based categorizer will not distinguish q1 from q2, nor even detect that q3 is relevant to Autos. In contrast, a technique that learns from historical data to infer that q1 and q3 are both likely to be followed by a click on a display ad in the Autos category ought to be useful in both scenarios. The goal of our techniques in this paper is to find predictive queries for a click in a BT category, without regard for the category implied by their content.

3. PREVIOUS WORK

Earlier work on BT has used a linear regression model [4] and a Poisson model [3] to estimate the click probability of a user shown a display advertisement in a BT category. In these works, the historical user features (including search queries) are first aggregated at the user level. (In [4] the features are aggregated further into intensity and recency values.) The models in both of these earlier works use the features of the ad (i.e., its BT category) and the historical features of the user to learn the contributions of the user features towards a click event for a given BT category. Other work [7] shows that past page views and search queries are correlated with sponsored search ad clicks on search results pages.

The authors of [7] create user segments by clustering users according to their search queries and page views, and show that the resulting user segments have higher average click-through rates (CTRs) when compared with the CTRs of a user segment that did not use behavioral data. Furthermore, they notice that the search query segments have higher CTRs than the page view segments.

Our work differs from all earlier attempts in that we consider the effect of the page as well as the user and the ad in our click probability model; we attempt to estimate P(click | page, ad, user). Previous work such as [7] does not focus on feature selection for queries, while other work uses either a frequency cutoff [3] or a machine-learned categorizer to select queries [4]. In this paper we go beyond frequency cutoffs and take a deeper look at other feature selection techniques that use only the association between the query and the click event to find predictive queries for a BT category.

4. DATA COLLECTION

We collect data for both our feature selection techniques and model training from the display advertisement serving logs and search engine logs. Each line of the data represents an ad impression, and contains the following fields:

bcookie: An identifier that represents the user.
timestamp: The timestamp of the impression.
BT: A list of BT categories for the advertisement.
pos: The position of the advertisement on the page.
property: The Yahoo! property name on which this ad was shown. Property names are more general than URLs; a single property name can account for many URLs. E.g., sports, shopping, and news are examples of property names.
queries: The historical queries of this user, including the current day (before the time of the ad impression) and 5 days before. Repeated queries in the history are replaced with a single query.
click: Whether or not this impression resulted in a click on the advertisement.

The data from the server logs was filtered by the pos field so that only advertisements in the 5 most prominent ad positions were retained. Table 1 shows the sizes of the training and test set, and shows what fraction of impressions have a query history. Since clicks are very sparse, the training set of impressions needs to be very large (approx. 93 billion) in order for us to collect a large number of clicks. A series of scripts written in the Pig programming language [5] were run on a Hadoop [1] cluster to assemble the data from the display advertisement and search engine logs.

Table 1: Training and test data. Clicks in training and test exceed 10 million. Date range of data and exact number of clicks are confidential information.

                Impressions      Clicks   Impressions with query history
  Training set  92,960,778,679   >10M     27,920,032,253 (30%)
  Test set      27,959,492,938   >10M     8,314,259,579 (29.7%)

5. MODELING FRAMEWORK

We use the conditional maximum entropy framework defined in [2] for click modeling, so that

  p(click | g, a, u) = \frac{1}{Z(g, a, u)} \prod_{j=1}^{k} \alpha_j^{f_j(click, g, a, u)}

  Z(g, a, u) = \sum_{click \in \{0, 1\}} \prod_{j=1}^{k} \alpha_j^{f_j(click, g, a, u)}

where f_j is a feature, α_j > 0 is the corresponding parameter, g is the page, a is the ad, u is a user, and Z(g, a, u) is a normalization factor. Any information about the page, user, or ad that we deem useful for click modeling must be encoded in the features. Note that any feature f_j is defined jointly over the (click, page, ad, user) tuple, written here in a general way:

  f_j(click, g, a, u) =
    \begin{cases}
      1 & \text{if click} = 1 \text{ and } (g, a, u) \text{ contains some context of interest} \\
      0 & \text{otherwise}
    \end{cases}

Historical queries of the user u, page property names of g, and ad categories for the ad a are examples of contexts.
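
To make the scoring concrete, here is a minimal sketch (in Python) of how p(click | g, a, u) could be evaluated for a single impression once the binary features and their weights α_j are known. The feature strings, weight values, and the two default-feature keys below are illustrative placeholders, not the production feature set; they only show the arithmetic of the model above.

```python
def p_click(active_features, weights):
    """Conditional maximum entropy click probability for one impression.

    active_features: feature strings that fire for this (page, ad, user) tuple;
                     context features are defined only for click = 1.
    weights: dict mapping feature name -> alpha_j > 0 (illustrative values).
    """
    # Score for click = 1: the default click=1 feature plus every active context feature.
    score_click = weights["*default:click=1*"]
    for f in active_features:
        score_click *= weights.get(f, 1.0)  # a feature absent from the model contributes a factor of 1

    # Score for click = 0: only the default click=0 feature fires.
    score_noclick = weights["*default:click=0*"]

    # Z(g, a, u) normalizes over click in {0, 1}.
    z = score_click + score_noclick
    return score_click / z


# Hypothetical weights and impression, for illustration only.
weights = {
    "*default:click=1*": 0.001,
    "*default:click=0*": 1.0,
    "acat=Autos": 1.8,
    "acat,prop=Autos,yahoo-sports": 1.2,
    "acat,q=Autos,nissan versa": 2.5,
}
features = ["acat=Autos", "acat,prop=Autos,yahoo-sports", "acat,q=Autos,nissan versa"]
print(p_click(features, weights))  # a small probability, boosted by the query feature
```
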
During model training, the parameters α_j are set to maximize the log-likelihood of the training data:

  L(p) = \sum_{click, g, a, u} \tilde{p}(click, g, a, u) \log p(click | g, a, u)    (1)

where \tilde{p}(click, g, a, u) is the empirical probability of observing (click, g, a, u) in the training set, i.e., the weight of the training instance. In what follows, all features are defined for click = 1, with the exception of the default features defined below, which are defined for both click = 1 and click = 0. We first introduce a baseline model that is trained with an initial feature set looking only at page and ad features. We then augment that baseline feature set with historical query features, and evaluate the impact of adding those features.

5.1 Baseline model

The baseline model has the following kinds of features:

default: There are 2 default features that are used regardless of the (page, ad, user) tuple when computing p(click = 1 | ...) and p(click = 0 | ...). They are denoted as f_0 and f_1:

  f_0(click, g, a, u) = 1 if click = 0, 0 otherwise
  f_1(click, g, a, u) = 1 if click = 1, 0 otherwise

These features are used to model the prior distribution of clicks and non-clicks in the training set.

BT: The BT category of the ad in the impression.

BT & pos: The BT category of the ad conjoined with the position of the ad.

BT & prop: The BT category of the ad conjoined with the property name of the page. An example of this feature might be:

  f_j(click, g, a, u) =
    \begin{cases}
      1 & \text{if click} = 1, \text{ the property of } g \text{ is yahoo-sports, and the category of } a \text{ is Auto} \\
      0 & \text{otherwise}
    \end{cases}

There are 5 possible ad positions, and over 100 possible property names in this data set. A model with this feature set uses information from only the page and the ad; it is effectively computing Pr(click | g, a).
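
As an illustration of how the baseline templates expand, the sketch below generates the BT, BT & pos, and BT & prop feature strings for one logged impression. The record layout follows the fields in Section 4, but the concrete values (and the acat=... naming, which mirrors the example instance shown later in Table 3) are assumptions for the example.

```python
def baseline_features(impression):
    """Expand the baseline feature templates for one ad impression.

    Returns the feature strings that would fire together with click = 1;
    the two default features are handled separately by the model.
    """
    feats = []
    for cat in impression["BT"]:                             # an ad may have several BT categories
        feats.append(f"acat={cat}")                          # BT
        feats.append(f"acat,pos={cat},{impression['pos']}")  # BT & pos
        feats.append(f"acat,prop={cat},{impression['property']}")  # BT & prop
    return feats


# Hypothetical impression record, for illustration only.
impression = {
    "BT": ["Finance"],
    "pos": "FPAD",
    "property": "yahoo-top-page",
    "queries": ["hotmail.com"],
    "click": 0,
}
print(baseline_features(impression))
# ['acat=Finance', 'acat,pos=Finance,FPAD', 'acat,prop=Finance,yahoo-top-page']
```
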

5.2 Feature selection techniques

In this section, we investigate feature selection techniques that integrate user information in the form of historical queries; these models compute Pr(click | g, a, u). The goal of the feature selection techniques is to find pairs (q, c) such that query q in the user's history is predictive of clicks on display ads with BT category c. A pair (q, c) is used to construct a feature f_{q,c} as follows:

  f_{q,c}(click, g, a, u) =
    \begin{cases}
      1 & \text{if click} = 1, c \text{ is a valid category of the ad } a, \text{ and } q \text{ is a historical query of the user } u \\
      0 & \text{otherwise}
    \end{cases}

For all selection techniques, we only consider those (q, c) pairs that have occurred with clicks. The feature sets that are produced from the following methods will be added to the baseline feature set and evaluated for their ability to better predict clicks on display ads. We evaluate the following methods:

Frequency threshold: We select pairs (q, c) such that freq(q, c, click) >= 20, where q is a query in the user's history, c is a BT category of the ad in the impression, and freq(q, c, click) is the frequency of the pair (q, c) occurring with a click.

Top N frequency: We select the top N (= 100K) pairs (q, c) when sorted by freq(q, c, click) in descending order.

CTR ratio: We select pairs (q, c) such that the CTR ratio > 1. The CTR ratio is defined as

  CTRratio = \frac{p(click | c, q)}{p(click | q)}    (2)

The CTR ratio is the conditional click probability of the pair (q, c), normalized by the click probability of just the query q. In practice, the normalization has the effect of reducing the score for queries that have high click propensity but are not related to any particular user interest in our taxonomy, e.g., pornographic queries.

Top N likelihood gain: We select the top N (= 100K) pairs (q, c) when sorted by the likelihood gain statistic given in [6]. In preparation for using this feature selection technique, any pair (q, c) in the training data is used to construct a candidate feature f, as in the definition of f_{q,c} above. A candidate feature f is then evaluated by measuring the gain that it would provide to the likelihood of the training data if it were added to the baseline model. Denote the baseline model by p. If f is a candidate feature, denote by p_f a model which has been trained in such a way that its baseline feature parameters were held to the same values as in p, but where the parameter for f is allowed to vary and fit the training data. The likelihood gain of feature f is defined as L(p_f) - L(p). A non-zero gain would indicate that the feature f has some information beyond the features in the baseline set. An iterative solution for the likelihood gain is given in [2]. A closed-form solution (for the special case of binary-valued features) for the gain is given in [6]. However, [6] computes the gain not for the likelihood function itself, but instead for a non-negative generalization of the likelihood function. The gain computation given in [6] is:

  gain(f) = E_p[f] - E_{\tilde{p}}[f] + E_{\tilde{p}}[f] \log \frac{E_{\tilde{p}}[f]}{E_p[f]}

  E_p[f] = \sum_{click, g, a, u} \tilde{p}(g, a, u)\, p(click | g, a, u)\, f(click, g, a, u)

  E_{\tilde{p}}[f] = \sum_{click, g, a, u} \tilde{p}(click, g, a, u)\, f(click, g, a, u)

Here we use \tilde{p} to denote the empirical probability distribution in the training data.
Then E_p[f] is the expectation of feature f with respect to the (baseline) model p, while E_{\tilde{p}}[f] is the observed expectation of f, and gain(f) is the feature gain. While the above gain computation is not exactly matched to our likelihood function and probability model, we assume for the sake of convenience (in both implementation and run-time) that features ranked according to this gain statistic will be useful in our setting anyway.

In category features: We select pairs (q, c) such that freq(q, c, click) >= 5, where q is a query in the user's history, c is a BT category of the ad in the impression, and freq(q, c, click) is the frequency of the pair (q, c) occurring with a click. Furthermore, the category c must also be a valid category of q, so both the query and the ad belong to the same category. The categories for q are determined by a proprietary machine-learned query categorizer trained from a manually annotated list of queries. While the other feature selection methods aim to induce the list of pairs from statistical association with clicks, this method looks at the content of q to determine the category.

Using the techniques in this list, we augment the feature set of the baseline model and train and test separate models for the resulting feature sets. The feature set sizes are shown in Table 2.

Table 2: Feature set sizes

  Feature selection technique   Feature set size   # of (query, BT category) pairs in feature set
  Baseline
  Frequency
  Top N Frequency
  CTR Ratio
  Top N Likelihood
  In category
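
As a concrete illustration of the frequency-threshold and CTR-ratio selections above, here is a minimal in-memory sketch over aggregated impression records. The record layout and threshold arguments are assumptions; the actual pipeline ran as Pig scripts on a Hadoop cluster over billions of impressions rather than as Python over in-memory logs.

```python
from collections import Counter

def select_pairs(impressions, min_click_freq=20):
    """Select (query, BT category) pairs by click frequency and by CTR ratio.

    impressions: iterable of dicts with keys 'queries' (historical queries of
    the user), 'BT' (categories of the ad) and 'click' (0 or 1).
    """
    pair_imps, pair_clicks = Counter(), Counter()
    query_imps, query_clicks = Counter(), Counter()

    for imp in impressions:
        for q in imp["queries"]:
            query_imps[q] += 1
            query_clicks[q] += imp["click"]
            for c in imp["BT"]:
                pair_imps[(q, c)] += 1
                pair_clicks[(q, c)] += imp["click"]

    freq_selected, ctr_selected = set(), set()
    for (q, c), clicks in pair_clicks.items():
        if clicks == 0:
            continue  # only pairs that have occurred with clicks are considered
        if clicks >= min_click_freq:
            freq_selected.add((q, c))
        # CTR ratio = p(click | c, q) / p(click | q), estimated from counts.
        p_click_cq = clicks / pair_imps[(q, c)]
        p_click_q = query_clicks[q] / query_imps[q]
        if p_click_q > 0 and p_click_cq / p_click_q > 1.0:
            ctr_selected.add((q, c))
    return freq_selected, ctr_selected
```

In practice the conditional click probabilities would be estimated from very large logs (and possibly smoothed), so the counters above are only meant to show the arithmetic behind the two selection rules.
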

5.3 Training a model

Given a feature set, we extract training and test instances for the model, in the form:

  click x_1 ... x_n

where click ∈ {0, 1}, and x_1 ... x_n are the historical contexts of the (page, ad, user) tuple. An example training instance is shown in Table 3. Here Finance is a BT category of the ad, FPAD is a code for an ad position on the page, yahoo-top-page is a property name, and hotmail.com is a query.

Table 3: Example of a training instance

  0  acat=finance  acat,pos=finance,fpad  acat,prop=finance,yahoo-top-page  acat,q=finance,hotmail.com  *default*

Given a feature set together with the training data, we use the improved iterative scaling algorithm [2] to estimate the model parameters from this data. This algorithm attempts to find a parameter setting that maximizes the likelihood (eq. (1)) of the training data. We use a map-reduce implementation of the parameter estimation algorithm on a computing cluster that uses the Hadoop [1] software. We train one model for each feature selection technique listed in Table 2. The resulting models are evaluated on the test data in the next section.

6. EVALUATION USING MAX F1 SCORE

The click prediction accuracy of a model is often measured by precision and recall, defined as:

  correct(t)   = # instances for which click = 1 and p(click = 1 | g, a, u) > t
  proposed(t)  = # instances for which p(click = 1 | g, a, u) > t
  precision(t) = correct(t) / proposed(t)
  recall(t)    = correct(t) / (# instances for which click = 1)

where t is a threshold between 0 and 1. A precision vs. recall graph can be obtained by varying the threshold t. The precision and recall at a threshold t can be summarized into a single statistic, known as the F1 score:

  F_1(t) = \frac{2 \cdot precision(t) \cdot recall(t)}{precision(t) + recall(t)}

and the max F1 score is defined as the highest F1 for any threshold:

  \max F_1 = \max_t F_1(t)

Here the max F1 score is used to summarize an entire precision vs. recall curve.
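
A small sketch of the threshold sweep implied by these definitions, assuming the model's predicted probabilities and the 0/1 click labels for the test instances are available as parallel lists. This is a quadratic-time toy version for clarity; a production implementation would sort the scores once and sweep.

```python
def max_f1(scores, labels):
    """Sweep thresholds t and return the max F1 score, per the definitions above.

    scores: predicted p(click = 1 | g, a, u) for each test instance.
    labels: 1 if the impression was clicked, 0 otherwise.
    """
    total_clicks = sum(labels)
    if total_clicks == 0:
        return 0.0
    best = 0.0
    for t in sorted(set(scores) | {0.0}):
        proposed = [label for s, label in zip(scores, labels) if s > t]
        if not proposed:
            continue
        correct = sum(proposed)          # proposed instances that were actually clicked
        precision = correct / len(proposed)
        recall = correct / total_clicks
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy example with made-up scores and labels.
print(max_f1([0.9, 0.2, 0.7, 0.1], [1, 0, 0, 1]))  # ~0.667
```
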
Figure 1 shows the max F1 score for all models when evaluated across the entire test set (denoted by All), as well as on subsets of the test data that correspond to the behavioral category of the ad. The behavioral categories of interest are Auto Insurance, Auto, Cruises, Parenting and Children, Credit Services, and Notebooks (as in notebook computers). For each category, the Y axis represents the percentage gain in max F1 score of the model vs. the baseline model on the subset of the test data for that category.

Figure 1: Max F1 across feature selection techniques

The results of the max F1 score comparison are mixed. Looking at the scores for the experiment labeled All, the statistically induced lists of query and category pairs degrade the accuracy. The in-category features do not degrade the score, but do not greatly improve it either. However, there appears to be clear improvement for the Auto Insurance, Credit Services, and Notebooks categories vs. the baseline model. Notably, for the Auto Insurance category, the top N frequency technique shows a 95% improvement in max F1 over the baseline. The historical query features for the Auto Insurance and Credit Services categories also show improvement vs. the in-category feature set. In contrast, for the Auto category, models that use historical query features greatly degrade the score. The top N frequency, CTR ratio, and top N LL gain techniques each show promise for one category, but no one method consistently outperforms the others.

For confidentiality reasons, we are not able to disclose the number of clicks in each category, but the number of impressions in the test data subsets for each category range from the tens of millions to the hundreds of millions.

6.1 Statistical Significance

To our knowledge, there is no direct way to measure the statistical significance of 2 different max F1 scores.

However, it is possible to perform a paired t-test that determines whether the raw score differences between a pair of models on exactly the same test instances are statistically significant. Given 2 click probability models p_1 and p_2, and a (page, ad, user) tuple (g, a, u) in the test data, we compute p_1(click = 1 | g, a, u) - p_2(click = 1 | g, a, u) for each test instance. If µ is the sample mean of this vector of differences, the null hypothesis is H_0 = {µ = 0}, which means that on average the 2 models return the same scores for the test instances. If H_0 is true for a (p_1, p_2) pair, it means that the models are not behaving differently on the test data.

Given the 5 models derived from the various feature selection techniques listed on the x axis of Figure 1 (not counting the baseline), and 7 categories (including All), we perform one paired t-test for each combination of model and category, for a total of 35 paired t-tests. In each t-test, we compute the score differences between the model of interest and the baseline model, using the subset of the test data corresponding to the category. E.g., a paired t-test was done for the CTR ratio model and the baseline model on the subset of the test data corresponding to the Auto category. In all cases, the t-statistics indicate that we can reject the null hypothesis H_0 at significance level α =.
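
The paired test described above can be sketched as follows, assuming the two models' predicted probabilities on the same test instances are available as parallel lists. scipy's ttest_rel is used here as a stand-in for whatever implementation was actually employed, and the significance level in the example is an illustrative assumption, not the value used in the paper.

```python
from scipy.stats import ttest_rel

def compare_models(p1_scores, p2_scores, alpha=0.01):
    """Paired t-test on per-instance score differences between two click models.

    p1_scores[i] and p2_scores[i] are p(click = 1 | g, a, u) from the two models
    for the same test instance i. The null hypothesis is that the mean
    difference is zero. alpha is an illustrative significance level.
    """
    t_stat, p_value = ttest_rel(p1_scores, p2_scores)
    return t_stat, p_value, p_value < alpha


# Toy scores for two hypothetical models on five shared test instances.
model_a = [0.010, 0.020, 0.015, 0.030, 0.025]
model_b = [0.012, 0.026, 0.021, 0.038, 0.031]
print(compare_models(model_a, model_b))
```
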
6.2 Examples of CTR ratio

Some of the feature selection techniques yield lists of queries that are semantically relevant to the behavioral category. Tables 4, 5, and 6 show the top 20 queries sorted by the CTR ratio technique for the Autos, Cruises, and Notebooks categories. The queries shown in the tables all appear relevant to the BT category, despite having been produced by purely statistical methods that do not look at the content of the query. In all fairness, though, the quality of the queries often gets worse the deeper one goes in the list, and not all categories have queries that are this clean.

Furthermore, the CTR ratio technique can find queries that are not semantically relevant to the BT category, but are nonetheless highly correlated. For example, in the Autos example, the queries cash for clunkers incentives, cash for clunkers, as well as the mis-spelling cash for cluckers (we preserve the original spelling of the queries) are within the top 200 queries. These queries do not contain any auto-related terms, but are very relevant to the Autos category. We would not have been able to make this association with a content-based query categorizer, which assigns a query like cash for clunkers to the Finance category. However, these out-of-category queries are a small proportion of the overall (q, c) pairs: for 98.4% of the (q, c) pairs from the CTR ratio technique, the proprietary query categorizer would also have returned c as a valid category for q.

Table 4: Queries selected by the CTR Ratio technique in the Auto category

  Query                   CTR Ratio
  2010 ford edge
  new kia cars
  nissan versa
  buick enclave
  toyota prius
  equinox
  nissan versa
  chevrolet equinox
  kia forte
  chevy traverse
  new 2008 toyota cars
  nissian
  chevy equinox
  chevrolet equinox
  subaru cars
  toyota prius
  prius
  ford focus
  dodge cars
  best car deals

Table 5: Queries selected by the CTR Ratio technique in the Cruises category

  Query                   CTR Ratio
  vacations
  apple vacations
  cruise deals
  vacation packages
  cheap caribbean
  all inclusive vacations
  vacation to go
  vacations to go
  cheap cruise deals
  all inclusive resorts
  nassau bahamas
  last minute travel deals
  apple vacations all inclusive
  caribbean cruises
  cheap cruises for
  travelzoo.com
  last minute deals
  atlantis bahamas
  discount cruises
  last minute cruise deals

Table 6: Queries selected by the CTR Ratio technique in the Notebooks category

  Query                   CTR Ratio
  dell coupons
  cheap laptops
  lenovo
  dell laptops
  hp laptops
  acer
  newegg
  dell
  gateway computers
  asus
  micro center
  acer laptop
  compusa
  frys
  toshiba
  officemax
  gateway
  dell.com
  tigerdirect
  laptops

7. EFFECT OF THE PAGE

One explanation for why the query features do not help in some categories (e.g., Autos) is that for some categories, the page information already captures the behavioral interest of the user, and additional information in the form of queries does not add any predictive power. E.g., if the model already knows that the user was visiting an autos-related page known to have a high click rate for autos ads (e.g., autos.yahoo.com) at the time of the ad click, the fact that this user issued an autos-related search query in the past few days may not further help in predicting the click.

We measure the impact of using the property name of the page by comparing the following models:

Baseline: This is the same model as described in Section 5.1. This model computes Pr(click | g, a).

Baseline without page: This is the baseline model, but without the BT & prop feature. The BT and BT & pos features are features of the ad, so in effect, this model is computing Pr(click | a).

Table 7 shows a 39% degradation in max F1 score when the page features are removed.

Table 7: Removing the page features from the baseline

  Features                Feature set size   max F1 score (relative to baseline, %)
  Baseline without page
  Baseline

8. CONCLUSION

This paper has presented the first click probability model (to our knowledge) for display advertising that integrates information from the user, the ad, as well as the page in making a click prediction. The scale of the experiment is also larger than previous studies such as [3] and [7], which present models trained on 500 million and 6 million users (respectively); our models have been trained on 93 billion impressions and tested on 28 billion. The paper first presents a baseline model that uses information from the page and ad, and then evaluates several feature selection techniques to incorporate user information in the form of historical queries. Using the CTR ratio ranking for queries, this paper shows that the top 20 ranked queries are semantically related to selected BT categories. Furthermore, there is anecdotal evidence that this ranking will select predictive queries that would normally be missed by a content-based query categorizer. For some categories of interest, the automatically induced queries show a significant improvement in max F1 score (95% in one case) over the baseline model. For other categories, we surmise that the page feature has overshadowed the query features, and we verify that the page feature adds significant prediction ability to the baseline.

9. REFERENCES

[1] Apache Hadoop.
[2] Adam L. Berger, Vincent J. Della Pietra, and Stephen A. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39-71, 1996.
[3] Ye Chen, Dmitry Pavlov, and John F. Canny. Large-scale behavioral targeting. In KDD '09: Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2009. ACM.
[4] C. Y. Chung, J. M. Koran, L.-J. Lin, and H. Yin. Model for generating user profiles in a behavioral targeting system. U.S. Patent 11/394,374, March.
[5] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig Latin: a not-so-foreign language for data processing. In SIGMOD '08: Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, New York, NY, USA, 2008. ACM.
[6] Kishore Papineni. Why inverse document frequency? In NAACL '01: Second Meeting of the North American Chapter of the Association for Computational Linguistics on Language Technologies, pages 1-8, Morristown, NJ, USA, 2001. Association for Computational Linguistics.
[7] Jun Yan, Ning Liu, Gang Wang, Wen Zhang, Yun Jiang, and Zheng Chen. How much can behavioral targeting help online advertising? In WWW '09: Proceedings of the 18th International Conference on World Wide Web, New York, NY, USA, 2009. ACM.
