Effectively Detecting Content Spam on the Web Using Topical Diversity Measures

2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology

Cailing Dong, Department of Information Systems, University of Maryland, Baltimore County, Baltimore, MD, USA
Bin Zhou, Department of Information Systems, University of Maryland, Baltimore County, Baltimore, MD, USA

Abstract: Recent studies about web spam detection have utilized various content-based and link-based features to construct a spam classification model. In this paper, we conduct a thorough analysis of content spam on the web using topic models and propose several novel topical diversity measures for content spam detection. We adopt the web spam benchmark data set WEBSPAM-UK2007 for evaluation, and the experimental results verify that integrating our topical diversity measures greatly improves the performance of state-of-the-art web spam detection methods. In addition, compared to existing features for training spam classification models, our topical diversity measures can achieve high spam detection performance using a small set of training data. In personalized web spam detection, the training data (i.e., a user's spam labeling results) are typically small. Our finding makes personalized web spam detection highly achievable. We develop an efficient and effective regression model using topical diversity measures for personalized web spam detection, and present some promising results obtained from an empirical study.

Keywords: web spam; topic model; classification; personalization

I. INTRODUCTION

Web search engines have been widely adopted by users to retrieve and rank relevant web pages. Because highly ranked popular web pages can bring huge business opportunities, spammers have used many spam tricks to increase the rankings of their targeted pages in web search results.
Conceptually, web spam refers to any deliberate action that brings selected web pages an unjustifiably favorable relevance or importance [1]. Web spam has become an obstacle to maintaining high-quality information retrieval on the web. Thus, combating web spam is one of the top priorities in web search and has attracted much attention in recent years.

Most of the existing methods model web spam detection as a traditional classification problem. A supervised learning model is often adopted, e.g., learning a binary classifier from a training set of pages labeled as spam or non-spam. Web pages can then be categorized as either spam or non-spam using the classifier. For learning, recent studies about web spam detection have utilized various content-based features, such as the number of words in the page [2], the average length of words in the page [2], content duplication [3], and the number of invisible words [4], and various link-based features, such as the structure of linked pages [5], probabilistic counting [6], and propagation of ranking contribution from trusted pages [7].

In general, there are two types of spam tricks, named content spam and link spam [1]. Link spam tries to manipulate link-based rankings in web search engines (e.g., PageRank [8]), and existing link-based features have been shown to be effective for identifying link spam [4]-[7]. Content spam targets content-based rankings in web search engines (e.g., TF-IDF [8]). The existing content-based features mainly extract statistics of words within pages. However, these word-level statistics may not capture the characteristics of the content of spam pages well enough.
In this paper, we propose to use topic-level statistics of web page content for effective content spam detection. Unlike word-level statistics of web pages, which ignore the semantic relatedness between words, we believe that topic-level statistics can capture linguistic features hidden in text and reveal the underlying intention of spam in a meaningful way. We conduct a thorough analysis of content spam using topic models. Topic models such as Latent Dirichlet Allocation (LDA) [9] are statistical language models for discovering hidden topics that occur in a collection of documents. We propose several novel topical diversity measures based on topic models of individual pages for effective content spam detection.

An experimental evaluation using the proposed topical diversity measures is conducted on the web spam benchmark data set WEBSPAM-UK2007. Two promising results are identified. First, compared to the state-of-the-art web spam detection results, the performance of web spam detection in terms of AUC (Area Under the Curve) and F-measure increases greatly when topical diversity measures are considered during model learning. Second, to achieve comparable spam detection performance, topical diversity measures need a smaller training set than other content-based features.

The latter finding is extremely promising since we can achieve high spam detection performance even with a small training set. The critical difference between a non-spam page and a spam page is whether the ranking in the search results is justifiable. Such a measurement of justifiability is often subtle and subjective. Moreover, the scale of the web is huge and keeps increasing at an accelerating pace. Thus, obtaining a large and reliable training set for effective web spam detection is often difficult. Learning with our topical diversity measures does not need a large training set, and the performance of spam detection is still high.

Personalized web spam detection also benefits from topical diversity measures. Most of the existing methods assume, explicitly or implicitly, that spam detection is run at search engine sites. As the decision of whether a page is spam or non-spam is often subjective, it is highly desirable that spam detection can be personalized. An example of personalized web spam detection is an intelligent web browser on a user's computer. When a web page is shown in an intelligent web browser, the integrated spam detector can determine online whether the page is spam or not. Learning a personalized classifier with existing features from each individual's historical spam labeling data, which is typically small, cannot achieve good results. We develop an efficient and effective regression model using topical diversity measures for personalized web spam detection. Some preliminary results in Section V-B verify the applicability of our method.

The rest of the paper is organized as follows. In Section II, we briefly review some related studies about web spam detection. In Section III, we analyze the topic-level statistics of web pages using topic models and propose several novel topical diversity measures. An evaluation using topical diversity measures for web spam detection is presented in Section IV. In Section V, we discuss how these topical diversity measures are useful for personalized web spam detection, and show some preliminary results. Finally, Section VI concludes the paper and outlines some future research directions.

II. RELATED WORK

Link spam and content spam are the two main categories of web spam tricks [1], [10]. Gyöngyi et al.
[11] used the term link spam for cases where spammers set up structures of interconnected pages, called link spam farms, with the sole purpose of boosting link-based rankings. Many methods have been proposed to detect link spam effectively. For example, Fetterly et al. [12] analyzed statistical features such as the in-degrees and out-degrees of web pages. Outliers are marked as spam candidates. Gyöngyi et al. [7] utilized the concept of spam mass, a measure of the impact of link spam on a page's ranking, for link spam detection. Benczúr et al. [13] proposed a method called SpamRank. Its central assumption builds on the concept of personalized PageRank to detect pages with an undeservedly high PageRank score. Zhou and Pei [5] introduced the concept of page farms. A page farm contains the set of pages contributing most to a target page's ranking. By analyzing the structural properties of page farms, pages that benefit substantially from link spam can be identified.

Unlike link spam, content spam disguises the content of a page to make it appear relevant to many popular web searches. Most of the existing content spam detection methods adopt statistical analysis. For example, Fetterly et al. [12] analyzed certain content-based properties of web pages, and found that some features, such as long host names, host names containing many unusual symbols, and little variation in the number of words in pages within a site, are good indicators of content spam. Fetterly et al. [3] further showed that content spam pages are often mosaics of textual chunks copied from other legitimate pages. Web pages with large chunks of duplicate content are likely to be content spam pages. In [2], Ntoulas et al. presented a number of word-level statistics for detecting content spam. A C4.5 classifier is built by combining the proposed heuristic methods to detect content spam pages.
Recent work [14] conducted an empirical study on how various spam features and machine learning algorithms contribute to the quality of content spam detection methods. LogitBoost and RandomForest are reported to achieve superior classification results.

The above-mentioned content-based features are more or less word-level statistics. They cannot capture linguistic features hidden in the text of content spam pages. To close this gap, a few recent proposals investigated the potential of statistical language models for understanding hidden characteristics of content spam pages. For example, Martinez-Romo and Araujo [15], [16] compared the probabilistic language models of two linked pages. The assumption is that two linked non-spam pages should be topically related, even though the hyperlink between them is a weak contextual relation. Bíró et al. [17], [18] applied statistical language models to corpora of spam pages and non-spam pages separately. The goal is to uncover which topics are more likely to be spam topics. An unlabeled page that contains many spam topics is a strong suspect of content spam.

How is Our Work Different: Existing studies about content spam detection using statistical language models mainly compare language models from different corpora. Usually, two sets of topics, representing the spam corpus and the non-spam corpus, are extracted. They do not analyze the underlying characteristics of individual pages. We propose several topical diversity measures to analyze the statistics of topic distributions within an individual spam or non-spam page. Since the calculation is performed on each individual page, our proposed topical diversity measures can be calculated efficiently online, which is critical in personalized web spam detection, where online response is often required.

III. TOPICAL DIVERSITY MEASURES FOR CONTENT SPAM DETECTION

In this section, we first briefly review topic models in information retrieval, and then discuss several topical diversity measures built upon topic models of individual pages.

A. Topic Model in Information Retrieval

A topic model is a type of statistical model which discovers the latent topics that occur in a collection of documents. Among different variations of topic models, LDA [9] has been widely used for analyzing document corpora. In general, LDA models each latent topic as a probability distribution over a word vocabulary, and each document as a probability distribution over the latent topics. These distributions are sampled from Dirichlet distributions. Specifically, we assume a vocabulary $W$ consisting of a set of words, a set $T$ of $m$ topics, and a set $D$ of $n$ documents of arbitrary lengths. For a topic $t \in T$, a distribution $\phi_t$ on $W$ is sampled from $\mathrm{Dir}(\beta)$, where $\beta \in \mathbb{R}_+^{|W|}$ is a smoothing parameter. Similarly, for a document $d \in D$, a distribution $\delta_d$ on $T$ is sampled from $\mathrm{Dir}(\alpha)$, where $\alpha \in \mathbb{R}_+^{|T|}$ is a smoothing parameter. To generate a document $d$, for each word position a topic $t$ is drawn from $\delta_d$, and then a word is drawn from $\phi_t$ and placed at that position.

B. Topical Diversity Measures

Machine-generated pages are in fact the majority of content spam on the web. To receive high rankings in web search engines, content spam pages have a strong tendency to belong to targeted contents and topics, like insurance or commercial ads. If we can analyze some linguistic features hidden in those texts, we may find that such machine-generated content is quite different from the content of non-spam pages. Driven by this intuition, we propose to analyze web page contents using topic models.

When using topic models for text analysis, we generally want to fit a model to the available text data and estimate all the parameters. These parameters usually include a set of word distributions corresponding to latent topics. After conducting topic modeling on the text data (e.g., web page content), we can discover and characterize hidden topics in the text.
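
The generative process reviewed in Section III-A can be illustrated with a short sketch. This is not the authors' implementation (they report using JGibbLDA for inference); it only simulates the sampling steps, assuming symmetric Dirichlet parameters. The function names `sample_dirichlet` and `generate_documents`, the integer word and topic identifiers, and the default values are illustrative assumptions.

```python
import random


def sample_dirichlet(rng, concentration, size):
    """Draw from a symmetric Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [x / total for x in draws]


def generate_documents(n_docs, doc_len, vocab_size, n_topics,
                       alpha=0.5, beta=0.1, seed=0):
    """Simulate the LDA generative process (assumption: symmetric alpha and beta).

    Returns (phi, docs): phi[t] is topic t's distribution over word ids, and
    each entry of docs is a pair (delta, words) for one generated document.
    """
    rng = random.Random(seed)
    # One word distribution phi_t per topic, sampled from Dir(beta).
    phi = [sample_dirichlet(rng, beta, vocab_size) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # One topic distribution delta_d per document, sampled from Dir(alpha).
        delta = sample_dirichlet(rng, alpha, n_topics)
        words = []
        for _ in range(doc_len):
            # For each word position: draw a topic, then a word from that topic.
            topic = rng.choices(range(n_topics), weights=delta)[0]
            word = rng.choices(range(vocab_size), weights=phi[topic])[0]
            words.append(word)
        docs.append((delta, words))
    return phi, docs
```

Fitting a topic model runs this process in reverse: given only the observed words, inference (e.g., Gibbs sampling) estimates the word distributions and topic weights that are used by the measures in this section.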
Since each latent topic is associated with a corresponding weight that indicates how frequently this topic appears in the corpus, in general we obtain a topic distribution for the available text data. For content spam detection on the web, our goal is to understand whether spam pages and non-spam pages have unique characteristics in terms of the topic distributions within those pages.

The process of conducting topic modeling (e.g., LDA) for web pages is as follows. Given a web page $d$, we represent the content of $d$ using the bag-of-words model. The vocabulary of $d$, which consists of all the words in $d$, is denoted $W$. We then directly apply the LDA model to $d$ to identify $m$ latent topics, denoted $t_1, t_2, \ldots, t_m$, where each topic $t_i$ ($1 \le i \le m$) is a word distribution $\phi_{t_i}$. For each word $w \in W$, its probability of belonging to topic $t_i$ is $\phi(w \mid t_i)$. We use $W(t_i)$ to denote the set of words whose probability of belonging to $t_i$ is larger than 0. The weight of each topic $t_i$, which equals $\sum_{w \in W(t_i)} \phi(w \mid t_i)$, is denoted $\delta_{t_i}$. Thus, $\sum_{i=1}^{m} \delta_{t_i} = 1$.

Figure 1. Distributions of topic weights for spam pages and non-spam pages (plotted figures represent the results for randomly selected pages from the data set WEBSPAM-UK2007). Horizontal axis: Topic (Topic 1 to Topic 5); vertical axis: Topic Weight; series: SpamPage1, SpamPage2, Non-SpamPage1, Non-SpamPage2.

1) Distribution-based Topical Diversity Measure: Spam pages are likely to be topic-centric, that is, they have a specific set of targeted topics. Intuitively, spam pages and non-spam pages tend to have different topic distributions. To understand how the topic distributions differ, we examined the pages in the latest web spam benchmark data set WEBSPAM-UK2007. A detailed description of the data set is presented in Section IV-A. All the pages in WEBSPAM-UK2007 are manually labeled as spam or non-spam. We conducted topic modeling for various spam and non-spam pages.
In Figure 1, we plot the distribution of topic weights for two randomly selected spam pages and two randomly selected non-spam pages. We have two interesting findings. First, the distribution of topic weights for spam pages tends to follow a power-law-like trend. For example, in both spam pages, one topic has a very large weight while the weights of the remaining topics are relatively small. This is consistent with our previous conjecture that content spam pages tend to be topic-centric, as spammers want to make the pages relevant to a targeted set of search queries with similar topics. Second, non-spam pages tend to cover a set of topics with comparable weights. For example, both non-spam pages in WEBSPAM-UK2007 have several topics with almost equal weights. This result is not surprising. When users create these web pages, they usually try to include many pieces of information in the host. For example, a personal homepage usually mentions information such as education history, work history, and so on. A commercial product's homepage usually covers information including technical specifications, highlighted features, etc.

To quantitatively capture this characteristic of the distribution of topic weights for spam and non-spam pages, we propose a variance-based topical diversity measure.

Definition 1 (Variance-based Topical Diversity Measure): Given a web page $d$, its topic distribution is $T(d) = \{t_1, t_2, \ldots, t_m\}$, where each topic $t_i$ ($1 \le i \le m$) is associated with a weight $\delta_{t_i}$. The variance-based topical diversity measure for $d$, denoted as $\mathrm{TopicVar}(d)$, is calculated as

$$\mathrm{TopicVar}(d) = \frac{\sum_{i=1}^{m} (\delta_{t_i} - u)^2}{m}, \tag{1}$$

where $u = \frac{\sum_{i=1}^{m} \delta_{t_i}}{m} = \frac{1}{m}$.

We calculate the variance-based topical diversity measure for every page in WEBSPAM-UK2007. Figure 2 shows the distribution of the variance-based topical diversity measures. In this figure, the horizontal axis depicts a set of value ranges. We use the bar graph to represent the percentage of pages that fall into a particular value range, and the line graph to represent the percentage of pages in each value range that are classified as spam pages. The left scale of the vertical axis refers to the bar graph, and the right scale of the vertical axis refers to the line graph.

Figure 2. Prevalence of spam relative to the variance-based topical diversity measure. Horizontal axis: Variance-based Topical Diversity Measure; left vertical axis: Fraction of Pages; right vertical axis: Probability of Spam.

As can be observed from Figure 2, the variance-based topical diversity measure is distributed differently for spam pages and non-spam pages. Since spam pages are more topic-centric, the variance of their topic weights is relatively large compared to that of non-spam pages. A majority of pages in the data set have small variance-based topical diversity measures. Among them, the percentage of spam pages is small. As the value of the variance-based topical diversity measure increases, the percentage of spam pages increases as well. The results in Figure 2 clearly indicate that the variance-based topical diversity measure is a good indicator for detecting content spam.

2) Semantic-based Topical Diversity Measure: The distribution-based topical diversity measure only considers the distribution of topic weights; it does not consider whether these topics are semantically related. Spammers create spam pages with the sole purpose of boosting the relevance of the pages for a set of targeted search queries with similar topics. As can be imagined, the different topics of spam pages are more likely to be semantically related.
Only then can spammers make the pages more relevant to those search queries. In topic models, each topic hidden in the page content is a word distribution. How can we measure the semantic relations between these topics?

A practical solution, which we propose in the following, is based on measuring the semantic relations between individual words. Given a web page $d$, its topic distribution is $T(d) = \{t_1, t_2, \ldots, t_m\}$. The probability that a word $w$ belongs to a topic $t_i$ ($1 \le i \le m$) is $\phi(w \mid t_i)$. Each topic $t_i$ ($1 \le i \le m$) is represented as a set of words, denoted $W(t_i)$. Intuitively, two topics $t_i$ and $t_j$ ($1 \le i, j \le m$) are semantically related if the words in $W(t_i)$ and $W(t_j)$ are semantically related.

We use a similarity function $\mathrm{Sim}(w_i, w_j)$ to obtain the semantic relation between two words $w_i$ and $w_j$. Many similarity functions exist in the literature. In this work, we adopt the similarity function calculated from WordNet, a popular lexical database. To measure the semantic relation between two topics $t_i$ and $t_j$ ($1 \le i, j \le m$), we take the similarities between each pair of words from the two topics, weighted by the probabilities of the words belonging to the corresponding topics, and calculate the average similarity score. That is,

$$\mathrm{Sim}(t_i, t_j) = \frac{\sum_{w_k \in W(t_i),\, w_l \in W(t_j)} \mathrm{Sim}(w_k, w_l) \cdot \phi(w_k \mid t_i) \cdot \phi(w_l \mid t_j)}{|W(t_i)| \cdot |W(t_j)|}, \tag{2}$$

where $\mathrm{Sim}(t_i, t_j)$ represents the semantic relation between two topics $t_i$ and $t_j$, $|W(t_i)|$ and $|W(t_j)|$ represent the sizes of the sets $W(t_i)$ and $W(t_j)$, respectively, and $\phi(w_k \mid t_i)$ and $\phi(w_l \mid t_j)$ are the probabilities of words $w_k$ and $w_l$ belonging to topics $t_i$ and $t_j$, respectively.

By using topic models to analyze web pages, we obtain $m$ latent topics. Based on Equation 2, we can derive a semantic-based topical diversity measure from the pairwise similarities of the $m$ latent topics.

Definition 2 (Semantic-based Topical Diversity Measure): Given a web page $d$, its topic distribution is $T(d) = \{t_1, t_2, \ldots, t_m\}$.
The semantic-based topical diversity measure for $d$, denoted as $\mathrm{TopicSim}(d)$, is calculated as

$$\mathrm{TopicSim}(d) = \frac{\sum_{1 \le i < j \le m} \mathrm{Sim}(t_i, t_j)}{\frac{1}{2} m (m - 1)}, \tag{3}$$

where $\mathrm{Sim}(t_i, t_j)$ is defined in Equation 2.

We calculate the semantic-based topical diversity measure for every page in WEBSPAM-UK2007. Figure 3 shows the distribution of the semantic-based topical diversity measures. As in Figure 2, we use the bar graph and the line graph to represent the percentage of pages that fall into a particular value range, and the percentage of pages in each value range that are classified as spam pages, respectively.

Figure 3. Prevalence of spam relative to the semantic-based topical diversity measure. Horizontal axis: Semantic-based Topical Diversity Measure; left vertical axis: Fraction of Pages; right vertical axis: Probability of Spam.

As can be observed from Figure 3, spam pages are more likely to have topics that are semantically related. When the semantic-based topical diversity measure is large, the percentage of spam pages is high. This is consistent with our previous conjecture that spammers would like to make the spam pages focus on a targeted set of topics. We can adopt the semantic-based topical diversity measure as a good indicator for content spam detection.

Equation 3 returns the average of the pairwise similarities of different topics. In some cases, the largest of the pairwise similarities of the $m$ topics is also useful, especially when the two topics have large weights. We define a maxSemantic-based topical diversity measure.

Definition 3 (MaxSemantic-based Topical Diversity Measure): Given a web page $d$, its topic distribution is $T(d) = \{t_1, t_2, \ldots, t_m\}$. The maxSemantic-based topical diversity measure for $d$, denoted as $\mathrm{TopicSimMax}(d)$, is calculated as

$$\mathrm{TopicSimMax}(d) = \max\{\mathrm{Sim}(t_i, t_j) \mid 1 \le i < j \le m\}, \tag{4}$$

where $\mathrm{Sim}(t_i, t_j)$ is defined in Equation 2.

We calculate the maxSemantic-based topical diversity measure for every page in WEBSPAM-UK2007. Figure 4 shows the distribution of the maxSemantic-based topical diversity measures. The settings of Figure 4 are similar to those of Figure 2 and Figure 3.

Figure 4. Prevalence of spam relative to the maxSemantic-based topical diversity measure. Horizontal axis: MaxSemantic-based Topical Diversity Measure; left vertical axis: Fraction of Pages; right vertical axis: Probability of Spam.

Compared to Figure 3, spam pages are even more likely to have a large maxSemantic-based topical diversity measure. This is not hard to understand. When spammers create spam pages, the targeted set of topics is already determined. Thus, the topics in a spam page should be highly similar, which results in a large maxSemantic-based topical diversity measure. The results in Figure 4 verify that we can use the maxSemantic-based topical diversity measure for detecting content spam.

So far, we have discussed three topical diversity measures built upon topic models of individual pages. Each measure can be used as an indicator for content spam.
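
Equations 1 to 4 can be sketched in a few lines of code. The following is a minimal illustration, not the authors' Java/Perl implementation. It assumes each topic is given as a dictionary mapping the words of $W(t_i)$ to $\phi(w \mid t_i)$, and that `word_sim` is a caller-supplied word similarity function; the paper uses a WordNet-based similarity, for which the toy `exact_match` function below is only a hypothetical stand-in.

```python
from itertools import combinations


def topic_var(weights):
    """Equation 1: variance of the topic weights around their mean u."""
    m = len(weights)
    u = sum(weights) / m  # equals 1/m when the weights sum to 1
    return sum((w - u) ** 2 for w in weights) / m


def topic_pair_sim(topic_i, topic_j, word_sim):
    """Equation 2: probability-weighted average word similarity of two topics."""
    total = sum(
        word_sim(wk, wl) * pk * pl
        for wk, pk in topic_i.items()
        for wl, pl in topic_j.items()
    )
    return total / (len(topic_i) * len(topic_j))


def topic_sim(topics, word_sim):
    """Equation 3: mean of Sim(t_i, t_j) over all m(m-1)/2 topic pairs (m >= 2)."""
    pairs = list(combinations(topics, 2))
    return sum(topic_pair_sim(a, b, word_sim) for a, b in pairs) / len(pairs)


def topic_sim_max(topics, word_sim):
    """Equation 4: largest Sim(t_i, t_j) over all topic pairs (m >= 2)."""
    return max(topic_pair_sim(a, b, word_sim) for a, b in combinations(topics, 2))


def exact_match(w1, w2):
    """Toy word similarity (assumption): 1.0 for identical words, else 0.0."""
    return 1.0 if w1 == w2 else 0.0


# Illustrative input only; these topics and weights are made up.
example_topics = [
    {"loan": 0.5, "credit": 0.5},
    {"loan": 0.5, "rate": 0.5},
    {"garden": 1.0},
]
example_features = (
    topic_var([0.8, 0.1, 0.1]),
    topic_sim(example_topics, exact_match),
    topic_sim_max(example_topics, exact_match),
)
```

The three values in `example_features` correspond to the three indicators discussed above and could serve as one page's feature vector for a classifier.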
Machine learning algorithms such as Random Forest and regression can then be adopted to learn a web spam classification model from these measures.

IV. EXPERIMENTAL RESULTS OF CONTENT SPAM DETECTION

In this section, we report a thorough evaluation of web spam detection using our topical diversity measures. We first describe the data set used for evaluation, and then discuss the performance results of web spam detection. All the experiments were conducted on a MacBook Pro computer running the Mac OS X Lion operating system, with a 2.4 GHz Intel Core i5 CPU, 4.0 GB main memory, and a 500 GB hard disk. The programs were implemented in Java and Perl.

A. Web Spam Benchmark Data Set

We adopted the WEBSPAM-UK2007 data set released by the Search Engine Spam Project at Yahoo! Research Barcelona. The data set is the result of the effort of a team of volunteers. The base data is a set of 105,896,555 pages in 114,529 hosts in the .uk domain. The data were downloaded in May 2007 by the Laboratory of Web Algorithmics, Università degli Studi di Milano. The spam collection data set contains a training set and a testing set. Since the number of spam pages in each set is very small, we combined the two sets and adopted 10-fold cross validation for evaluation.

Based on the URLs of the pages, we collected the content of those pages to build topic models. Some pages are no longer available online. To make a fair comparison, we removed those pages from the spam collection. The final spam collection data set consists of 4,081 different pages chosen from the base data set, among which 163 pages are labeled as spam and the remaining 3,918 pages are labeled as non-spam. By default, we used this data set in the evaluation.

B. Web Spam Detection using Topical Diversity Measures

To implement the LDA model, we first extracted words from each web page using the Alchemy API (alchemyapi.com/api). A stemming and POS tagging procedure was performed on the extracted words. We also removed stop words and words with uninformative POS tags. The words of each page that remained after this data cleaning process were used for building the corresponding LDA model. We adopted a Java implementation of LDA, JGibbLDA, which uses Gibbs sampling for parameter estimation and inference. The two smoothing parameters $\alpha$ and $\beta$ in LDA were set to 0.5 and 0.1 by default.

In order to determine whether a page is spam or non-spam, we adopted supervised learning algorithms to train a web spam classification model using the proposed topical diversity measures. In this work, we considered various machine learning algorithms available in Weka (cs.waikato.ac.nz/ml/weka/). The evaluation metrics we examined for comparing the results of web spam detection include: (1) TPR (true positive rate); (2) FPR (false positive rate); (3) F-measure (the harmonic mean of precision and recall); and (4) AUC (area under the curve). Among them, F-measure and AUC are widely adopted in the web spam detection community for performance evaluation.

Table I. The results of web spam detection using different parameter settings in LDA. Columns: Parameters in LDA (#topics, #words), TPR, FPR, F-Measure, AUC.

1) Parameter settings of #topics and #words in LDA: In LDA, we can set the number of topics (#topics) and the number of words in each topic (#words). To understand how these two parameters may affect the performance of web spam detection, we tried different parameter values. Table I shows the performance of web spam detection under different parameter settings. The results in Table I were achieved by the RandomForest algorithm using the proposed three topical diversity measures only. We chose RandomForest here because this algorithm usually achieves the best detection accuracy.
This is consistent with the results recently reported in [14]. Unlike previous studies, which usually choose quite large values for #topics and #words, we chose small values since our language model is built on each individual web page instead of a corpus of web pages.

We can observe several interesting results from Table I. On one hand, if we fix the value of #topics and increase the value of #words, the performance of web spam detection increases with regard to F-measure and AUC. On the other hand, if we fix the value of #words and increase the value of #topics, the performance of web spam detection increases as well. In general, among the various parameter settings reported in Table I, the web spam detection algorithm (i.e., RandomForest) achieves the best results when #topics and #words are both set to 5. It is possible to use even larger values for the two parameters; however, doing so may introduce more computational overhead for the topical diversity measures. Thus, in the following experiments, we fixed the values of #topics and #words in LDA to 5 and 5 by default.

Table II. The results of web spam detection using different feature sets. Columns: Algorithm, Feature Set, TPR, FPR, F-Measure, AUC. Rows: RandomForest and RandomTree, each with the Content, Topic, and All feature sets.

2) Comparison of Web Spam Detection: We also examined how useful the proposed topical diversity measures are for web spam detection. To make a comparison, we considered several other useful content-based features available in WEBSPAM-UK2007, such as average word length and fraction of anchor text. In total, we obtained 24 different features. We name them the Content feature set. The proposed three topical diversity measures form the Topic feature set. We also combined the two feature sets to form the All feature set. From the Weka package, we identified 36 different machine learning algorithms.
These algorithms cover a majority of the applicable algorithm families, including Bayes, functions, rules, and trees. We applied each of these algorithms to detect web spam in the data set. Each algorithm was run three times using different feature sets: the Content set, the Topic set, and the All set. The results of web spam detection were obtained by 10-fold cross validation.

We first examined which feature set is most useful for detecting web spam. Among the 36 different machine learning algorithms, 7 achieve their highest AUC with the Topic feature set, 5 achieve their highest AUC with the Content feature set, and the remaining 24 achieve their highest AUC with the All feature set. The results verify that the proposed topical diversity measures are good indicators for web spam detection.

We then analyzed how much improvement the proposed topical diversity measures can bring to web spam detection. Table II shows the different evaluation metrics for the two algorithms (RandomForest and RandomTree) that achieve the highest performance. The results clearly show that integrating the proposed topical diversity measures improves the existing web spam detection algorithms.
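
The four evaluation metrics listed in Section IV-B can be computed from predicted labels and scores as sketched below. This is a generic illustration rather than Weka's implementation; it assumes labels are encoded as 1 for spam and 0 for non-spam, and it computes AUC as the fraction of (spam, non-spam) pairs in which the spam page receives the higher score, with ties counted as one half. The function names are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives (1 = spam, 0 = non-spam)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def detection_metrics(y_true, y_pred):
    """Return TPR, FPR, and F-measure for the spam class."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    tpr = tp / (tp + fn) if tp + fn else 0.0        # recall on spam pages
    fpr = fp / (fp + tn) if fp + tn else 0.0        # non-spam flagged as spam
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + tpr
    f_measure = 2 * precision * tpr / denom if denom else 0.0
    return {"TPR": tpr, "FPR": fpr, "F-measure": f_measure}


def auc(y_true, scores):
    """Pairwise AUC; requires at least one spam and one non-spam example."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))
```

With only 163 spam pages among 4,081, a detector can keep FPR low simply by rarely predicting spam, which is one reason F-measure and AUC are reported alongside TPR and FPR.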

C. The Size of the Training Data

In most machine learning algorithms, the quality of the training data is critical for achieving good performance. In web spam detection, obtaining high-quality training data with sufficient records may not be an easy task. Thus, we also examined how the scale of the training data affects the overall performance of web spam detection.

To simulate a lack of training data, we randomly sampled small portions (e.g., 20%) of the original spam collection data set. We assume that the sampled data are the only data available for model learning. We want to compare how the different feature sets (e.g., the Content set and the Topic set) perform with a small set of training data. In Figure 5, we plot the results of web spam detection (in terms of F-measure and AUC) using the Content feature set and the Topic feature set with different scales of training data. The results reported in Figure 5 were achieved by the RandomForest algorithm. We randomly sampled small portions of the training data (varied from 5% to 20%).

Figure 5. The results of web spam detection using small scale of training data (Left: F-measure; Right: AUC). Horizontal axis: Sampled Percentage of Training Data (%); series: Content, Topic.

It is surprising to find that even when the size of the training data is small, the performance using the proposed topical diversity measures is still high. The results in Figure 5 also indicate that the topical diversity measures are more useful than other content-based features when there are not enough training data for web spam detection. This result is exciting since we can use topical diversity measures for web spam detection even when a large training set cannot be obtained.

V. PERSONALIZED WEB SPAM DETECTION

Most of the existing studies about web spam detection explicitly or implicitly assume that web spam detection is conducted on the search engine's server side. If a page is judged as spam by a search engine, it is removed from the search results. In practice, there are other scenarios in which spam detection needs to be conducted on the client side. A typical example is an intelligent web browser running on a user's computer. When a page is opened in the intelligent web browser, the integrated spam detector needs to determine online whether the page is spam or not. Intelligent web browsers provide users with more information about web pages and more control in the course of web surfing.

A distinct requirement of spam detection in intelligent web browsers, compared to traditional spam detection on the search engine side, is personalization. As described before, the critical difference between a spam page and a non-spam page is whether the relevance of a page is justifiable or not. Different users have different opinions on such a justification. A page categorized as spam by one user is not necessarily categorized as spam by another user. As a result, personalized web spam detection is desired in intelligent web browsers so that the web spam detection results are tailored to each individual's judgement on spam.

For personalized web spam detection, we need to collect the user's judgements on spam. Such judgements mainly refer to the individual's historical labeling results of web spam. In intelligent web browsers, users can provide spam labels for the pages they visit. Users can also update the spam labels if they do not agree with the results provided by the spam detector. These activities can be considered explicit feedback on spam judgement provided by users. In addition, there is also implicit feedback on spam judgement.
For example, if a page is labeled as spam/non-spam and the user does not update the spam detection result, it implicitly means that the user agrees with the spam judgement. In those intelligent web browsers, all the explicit and implicit feedback is collected, and the data are useful for learning a personalized web spam detector.

It is worth mentioning that the collected user's judgements on spam are usually very small in scale. However, as reported in Section IV-C, the web spam detection performance is still high even when only a small training set is adopted. This unique property of the proposed topical diversity measures indicates that these measures are valuable for achieving personalized web spam detection.

In personalized web spam detection, binary classification of web spam may not be suitable. It is more desirable that users have more control over determining how likely a page is spam. An intuitive solution is to provide a numerical score to indicate the likelihood of a page being spam. Users can set low/high threshold values to categorize more/fewer pages as spam according to particular situations. In [5], a spamicity score is proposed to achieve such a goal. However, the spamicity score is an ad-hoc measure which does not involve a self-learning process. In this paper, we adopt a regression model using the proposed topical diversity measures for personalized web spam detection.

A. Regression Model for Personalized Web Spam Detection

Personalized web spam detection is achieved by learning a personalized spam detector from the collected data of the user's spam judgements. Technically, our personalized web spam detector estimates the probability $p$ that a page will be categorized as spam given three topical diversity measures. For simplicity, we denote them as $x_1$, $x_2$, and $x_3$, respectively. The detector takes the collected training set to learn a logistic regression model.
The formula of the logistic regression is

$$\ln \frac{p}{1-p} = \gamma_0 + \sum_{i=1}^{3} \gamma_i x_i,$$

where $\gamma_0, \ldots, \gamma_3$ are the coefficients. In other words, the probability $p$ that a page will be categorized as spam is measured as

$$p = \frac{1}{1 + e^{-(\gamma_0 + \sum_{i=1}^{3} \gamma_i x_i)}}.$$

The probability $p$ is in the range [0,1]. The larger the value of $p$, the more likely the corresponding page is spam.

There are several methods to learn the regression coefficients, among which the Newton-Raphson method [19] is a popular one. Once the coefficients are learned, for each new web page, the spam detector uses the model to label the page as spam if its probability is above a certain threshold value. This threshold value can be tuned by the user to adjust how aggressive the detector should be in detecting spam.

As users keep generating labeled data of web spam, we need to re-train the regression model to fit the new training data. After several rounds of re-training, the model is able to precisely capture the user's spam judgements and provide high-quality personalized web spam detection results.

B. Results of Personalized Web Spam Detection

We report some results obtained from an empirical study of personalized web spam detection. We used the labeling results of pages from the WEBSPAM-UK2007 data set as the collected user's spam judgements. A regression model is learned from the data. We varied the threshold values, and the results of personalized web spam detection are reported in Table III. Clearly, when the threshold value decreases, the spam detection is more aggressive. Thus, more pages are categorized as spam. Users can choose different threshold values to control how many pages are categorized as spam and how aggressive the spam detection should be.

[Table III. The results of personalized web spam detection using different threshold values. Columns: Threshold, TPR, FPR, F-Measure, AUC.]

Personalized web spam detection needs to be conducted online. Thus, a fast response is critical. We randomly selected 100 pages from the WEBSPAM-UK2007 data set and examined the total time needed to categorize them. On average, each page needs less than 0.1 second for the web spam categorization.
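The regression model and thresholding described in this section can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the detector evaluated here: the function names, the small `ridge` term added to the Hessian for numerical stability, and the toy feature values are all hypothetical, and the toy data are synthetic rather than drawn from WEBSPAM-UK2007. It fits $\gamma_0, \ldots, \gamma_3$ by Newton-Raphson and then flags pages whose probability $p$ exceeds a user-chosen threshold.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(gamma, x):
    """p = 1 / (1 + exp(-(gamma_0 + sum_i gamma_i * x_i)))."""
    return sigmoid(gamma[0] + sum(g * v for g, v in zip(gamma[1:], x)))


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (M[r][n] - sum(M[r][j] * z[j] for j in range(r + 1, n))) / M[r][r]
    return z


def fit_newton(X, y, iters=25, ridge=1e-6):
    """Newton-Raphson updates: gamma <- gamma + H^{-1} g, where g is the gradient
    of the log-likelihood and H = sum_n p_n (1 - p_n) r_n r_n^T (plus ridge * I)."""
    rows = [[1.0] + list(x) for x in X]  # leading 1.0 carries the intercept gamma_0
    d = len(rows[0])
    gamma = [0.0] * d
    for _ in range(iters):
        p = [sigmoid(sum(g * v for g, v in zip(gamma, r))) for r in rows]
        grad = [sum((yi - pi) * r[j] for r, yi, pi in zip(rows, y, p)) for j in range(d)]
        hess = [[sum(pi * (1.0 - pi) * r[j] * r[k] for r, pi in zip(rows, p))
                 + (ridge if j == k else 0.0) for k in range(d)] for j in range(d)]
        step = solve(hess, grad)
        gamma = [g + s for g, s in zip(gamma, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    return gamma


def flag_spam(gamma, pages, threshold):
    """Indices of pages whose spam probability exceeds the user's threshold."""
    return [i for i, x in enumerate(pages) if predict_proba(gamma, x) > threshold]


# Synthetic toy data: (x1, x2, x3), number of spam labels, number of non-spam labels.
TOY_GROUPS = [
    ((0.2, 0.1, 0.3), 1, 3),
    ((0.8, 0.1, 0.3), 3, 1),
    ((0.2, 0.9, 0.3), 2, 2),
    ((0.2, 0.1, 0.7), 3, 1),
]


def expand(groups):
    """Turn grouped counts into one (features, label) pair per labeled page."""
    X, y = [], []
    for x, n_spam, n_ham in groups:
        X += [x] * (n_spam + n_ham)
        y += [1] * n_spam + [0] * n_ham
    return X, y


if __name__ == "__main__":
    X, y = expand(TOY_GROUPS)
    gamma = fit_newton(X, y)
    for t in (0.3, 0.6):
        print("threshold", t, "-> pages flagged:", len(flag_spam(gamma, X, t)))
```

Lowering `threshold` can only add pages to the flagged set, which is the more aggressive behavior described above. Re-training after new user labels arrive amounts to calling `fit_newton` again on the enlarged training set.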
In general, the personalized web spam detection can be achieved efficiently.

VI. CONCLUSION

In this paper, we conduct a thorough analysis of content spam on the web using topic models. Unlike many existing word-level statistics, topic models can identify topic-level linguistic features hidden in text. We propose several topical diversity measures based on topic models. The evaluation conducted on the web spam benchmark data set indicates that by integrating our topical diversity measures, the performance of the state-of-the-art web spam detection algorithms can be greatly improved. We also provide some studies on personalized web spam detection.

There are several interesting future research directions. For example, we only considered the topical diversity measures within each individual page. It is interesting to explore whether the topical diversity measures can be integrated with some link-based features, e.g., page farms [5].

ACKNOWLEDGMENT

This research is supported in part by a UMBC Special Research Assistantship/Initiative Support (SRAIS) grant. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agency.

REFERENCES

[1] Z. Gyöngyi and H. Garcia-Molina, Web spam taxonomy, in AIRWeb '05,
[2] A. Ntoulas et al., Detecting spam web pages through content analysis, in WWW '06, pp
[3] D. Fetterly et al., Detecting phrase-level duplication on the world wide web, in SIGIR '05, pp
[4] B. Zhou et al., A spamicity approach to web spam detection, in SDM '08, pp
[5] B. Zhou and J. Pei, Link spam target detection using page farms, ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 3, no. 3, 2009.
[6] L. Becchetti et al., Using rank propagation and probabilistic counting for link-based spam detection, in WebKDD '06,
[7] Z. Gyöngyi et al., Link spam detection based on mass estimation, in VLDB '06, pp
[8] C. D. Manning et al., Introduction to Information Retrieval, Cambridge University Press,
[9] D. M. Blei et al., Latent dirichlet allocation, vol. 3, JMLR.org, Mar. 2003, pp
[10] N. Spirin and J. Han, Survey on web spam detection: principles and algorithms, SIGKDD Explorations, vol. 13, no. 2, pp ,
[11] Z. Gyöngyi and H. Garcia-Molina, Link spam alliances, in VLDB '05, pp
[12] D. Fetterly et al., Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages, in WebDB '04, pp
[13] A. A. Benczur et al., Spamrank: Fully automatic link spam detection, in AIRWeb '05,
[14] M. Erdélyi et al., Web spam classification: a few features worth more, in Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality, pp
[15] J. Martinez-Romo and L. Araujo, Web spam identification through language model analysis, in AIRWeb '09, pp
[16] L. Araujo and J. Martinez-Romo, Web spam detection: new classification features based on qualified link analysis and language models, IEEE Transactions on Information Forensics and Security, vol. 5, no. 3, pp , 2010.
[17] I. Bíró et al., Latent dirichlet allocation in web spam filtering, in AIRWeb '04, pp
[18] I. Bíró et al., Linked latent dirichlet allocation in web spam filtering, in AIRWeb '09, pp
[19] T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer,


More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Ranking web pages using machine learning approaches

Ranking web pages using machine learning approaches University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 Ranking web pages using machine learning approaches Sweah Liang Yong

More information

A Deep Relevance Matching Model for Ad-hoc Retrieval

A Deep Relevance Matching Model for Ad-hoc Retrieval A Deep Relevance Matching Model for Ad-hoc Retrieval Jiafeng Guo 1, Yixing Fan 1, Qingyao Ai 2, W. Bruce Croft 2 1 CAS Key Lab of Web Data Science and Technology, Institute of Computing Technology, Chinese

More information

Identifying Web Spam With User Behavior Analysis

Identifying Web Spam With User Behavior Analysis Identifying Web Spam With User Behavior Analysis Yiqun Liu, Rongwei Cen, Min Zhang, Shaoping Ma, Liyun Ru State Key Lab of Intelligent Tech. & Sys. Tsinghua University 2008/04/23 Introduction simple math

More information

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2

A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 A Survey Of Different Text Mining Techniques Varsha C. Pande 1 and Dr. A.S. Khandelwal 2 1 Department of Electronics & Comp. Sc, RTMNU, Nagpur, India 2 Department of Computer Science, Hislop College, Nagpur,

More information

A P2P-based Incremental Web Ranking Algorithm

A P2P-based Incremental Web Ranking Algorithm A P2P-based Incremental Web Ranking Algorithm Sumalee Sangamuang Pruet Boonma Juggapong Natwichai Computer Engineering Department Faculty of Engineering, Chiang Mai University, Thailand sangamuang.s@gmail.com,

More information

Feature LDA: a Supervised Topic Model for Automatic Detection of Web API Documentations from the Web

Feature LDA: a Supervised Topic Model for Automatic Detection of Web API Documentations from the Web Feature LDA: a Supervised Topic Model for Automatic Detection of Web API Documentations from the Web Chenghua Lin, Yulan He, Carlos Pedrinaci, and John Domingue Knowledge Media Institute, The Open University

More information

Regularization and model selection

Regularization and model selection CS229 Lecture notes Andrew Ng Part VI Regularization and model selection Suppose we are trying select among several different models for a learning problem. For instance, we might be using a polynomial

More information

Assisting Trustworthiness Based Web Services Selection Using the Fidelity of Websites *

Assisting Trustworthiness Based Web Services Selection Using the Fidelity of Websites * Assisting Trustworthiness Based Web Services Selection Using the Fidelity of Websites * Lijie Wang, Fei Liu, Ge Li **, Liang Gu, Liangjie Zhang, and Bing Xie Software Institute, School of Electronic Engineering

More information

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning

Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Online Pattern Recognition in Multivariate Data Streams using Unsupervised Learning Devina Desai ddevina1@csee.umbc.edu Tim Oates oates@csee.umbc.edu Vishal Shanbhag vshan1@csee.umbc.edu Machine Learning

More information

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining

HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining HUKB at NTCIR-12 IMine-2 task: Utilization of Query Analysis Results and Wikipedia Data for Subtopic Mining Masaharu Yoshioka Graduate School of Information Science and Technology, Hokkaido University

More information

Web Spam Challenge 2008

Web Spam Challenge 2008 Web Spam Challenge 2008 Data Analysis School, Moscow, Russia K. Bauman, A. Brodskiy, S. Kacher, E. Kalimulina, R. Kovalev, M. Lebedev, D. Orlov, P. Sushin, P. Zryumov, D. Leshchiner, I. Muchnik The Data

More information

Rank Measures for Ordering

Rank Measures for Ordering Rank Measures for Ordering Jin Huang and Charles X. Ling Department of Computer Science The University of Western Ontario London, Ontario, Canada N6A 5B7 email: fjhuang33, clingg@csd.uwo.ca Abstract. Many

More information

Semantic Search in s

Semantic Search in  s Semantic Search in Emails Navneet Kapur, Mustafa Safdari, Rahul Sharma December 10, 2010 Abstract Web search technology is abound with techniques to tap into the semantics of information. For email search,

More information

Spatial Latent Dirichlet Allocation

Spatial Latent Dirichlet Allocation Spatial Latent Dirichlet Allocation Xiaogang Wang and Eric Grimson Computer Science and Computer Science and Artificial Intelligence Lab Massachusetts Tnstitute of Technology, Cambridge, MA, 02139, USA

More information

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction

Chapter 5: Summary and Conclusion CHAPTER 5 SUMMARY AND CONCLUSION. Chapter 1: Introduction CHAPTER 5 SUMMARY AND CONCLUSION Chapter 1: Introduction Data mining is used to extract the hidden, potential, useful and valuable information from very large amount of data. Data mining tools can handle

More information

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing

Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Generalized Additive Model and Applications in Direct Marketing Sandeep Kharidhi and WenSui Liu ChoicePoint Precision Marketing Abstract Logistic regression 1 has been widely used in direct marketing applications

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

Web Page Recommender System based on Folksonomy Mining for ITNG 06 Submissions

Web Page Recommender System based on Folksonomy Mining for ITNG 06 Submissions Web Page Recommender System based on Folksonomy Mining for ITNG 06 Submissions Satoshi Niwa University of Tokyo niwa@nii.ac.jp Takuo Doi University of Tokyo Shinichi Honiden University of Tokyo National

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background

More information

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS

CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS CHAPTER 4 CLASSIFICATION WITH RADIAL BASIS AND PROBABILISTIC NEURAL NETWORKS 4.1 Introduction Optical character recognition is one of

More information

A PERSONALIZED RECOMMENDER SYSTEM FOR TELECOM PRODUCTS AND SERVICES

A PERSONALIZED RECOMMENDER SYSTEM FOR TELECOM PRODUCTS AND SERVICES A PERSONALIZED RECOMMENDER SYSTEM FOR TELECOM PRODUCTS AND SERVICES Zui Zhang, Kun Liu, William Wang, Tai Zhang and Jie Lu Decision Systems & e-service Intelligence Lab, Centre for Quantum Computation

More information

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier

LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier LRLW-LSI: An Improved Latent Semantic Indexing (LSI) Text Classifier Wang Ding, Songnian Yu, Shanqing Yu, Wei Wei, and Qianfeng Wang School of Computer Engineering and Science, Shanghai University, 200072

More information

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach

Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Detecting Spam Bots in Online Social Networking Sites: A Machine Learning Approach Alex Hai Wang College of Information Sciences and Technology, The Pennsylvania State University, Dunmore, PA 18512, USA

More information

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India

Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Web Page Classification using FP Growth Algorithm Akansha Garg,Computer Science Department Swami Vivekanad Subharti University,Meerut, India Abstract - The primary goal of the web site is to provide the

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information