Effectively Detecting Content Spam on the Web Using Topical Diversity Measures



2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology

Cailing Dong, Department of Information Systems, University of Maryland, Baltimore County, Baltimore, MD, USA
Bin Zhou, Department of Information Systems, University of Maryland, Baltimore County, Baltimore, MD, USA

Abstract: Recent studies on web spam detection have used various content-based and link-based features to construct spam classification models. In this paper, we conduct a thorough analysis of content spam on the web using topic models and propose several novel topical diversity measures for content spam detection. We adopt the web spam benchmark data set WEBSPAM-UK2007 for evaluation, and the experimental results verify that integrating our topical diversity measures greatly improves the performance of state-of-the-art web spam detection methods. In addition, compared with existing features for training spam classification models, our topical diversity measures achieve high spam detection performance using a small set of training data. In personalized web spam detection, the training data (i.e., a user's spam labeling results) are typically small. Our finding makes personalized web spam detection highly achievable. We develop an efficient and effective regression model using topical diversity measures for personalized web spam detection, and present some promising results obtained from an empirical study.

Keywords: web spam; topic model; classification; personalization

I. INTRODUCTION

Web search engines have been widely adopted by users to retrieve and rank relevant web pages. Because highly ranked popular web pages can bring huge business opportunities, spammers use many tricks to increase the rankings of their targeted pages in web search results.
In [1], web spam is defined conceptually as any deliberate action that brings to selected web pages an unjustifiably favorable relevance or importance. Web spam has become an obstacle to maintaining high-quality information retrieval on the web. Thus, combating web spam is one of the top priorities in web search and has attracted much attention in recent years.

Most existing methods model web spam detection as a traditional classification problem. A supervised learning model is often adopted, e.g., a binary classifier learned from a training set of pages labeled as spam or non-spam. Web pages can then be categorized as spam or non-spam using the classifier. For learning, recent studies on web spam detection have used various content-based features, such as the number of words in the page [2], the average length of words in the page [2], content duplication [3], and the number of invisible words [4], as well as various link-based features, such as the structure of linked pages [5], probabilistic counting [6], and propagation of ranking contribution from trusted pages [7].

In general, there are two types of spam tricks, named content spam and link spam [1]. Link spam tries to manipulate link-based rankings in web search engines (e.g., PageRank [8]), and existing link-based features have been shown to be effective for identifying link spam [4], [7]. Content spam aims to affect content-based rankings in web search engines (e.g., TFIDF [8]). Existing content-based features mainly extract statistics of words within pages. However, these word-level statistics may not capture the characteristics of the content in spam pages well enough.
In this paper, we propose to use topic-level statistics of web page content for effective content spam detection. Unlike word-level statistics of web pages, which ignore the semantic relatedness between words, topic-level statistics can capture linguistic features hidden in text and reveal the underlying intention of spam in a meaningful way. We conduct a thorough analysis of content spam using topic models. Topic models such as Latent Dirichlet Allocation (LDA) [9] are statistical language models for discovering hidden topics that occur in a collection of documents. We propose several novel topical diversity measures based on topic models of individual pages for effective content spam detection.

An experimental evaluation using the proposed topical diversity measures is conducted on the web spam benchmark data set WEBSPAM-UK2007. Two promising results are identified. First, compared with state-of-the-art web spam detection results, the performance of web spam detection in terms of AUC (Area Under the Curve) and F-measure increases greatly when topical diversity measures are considered during model learning. Second, to achieve comparable spam detection performance, topical diversity measures require a smaller training set than other content-based features.

The latter finding is extremely promising, since high spam detection performance can be achieved even with a small training set. The critical difference between a non-spam page and a spam page is whether the ranking in the search results is justifiable. Such a measurement of justifiability is often subtle and subjective. Moreover, the scale of the web is huge and keeps increasing at an accelerating pace. Thus, obtaining a large and reliable training set for effective web spam detection is often difficult. Learning with our topical diversity measures does not need a large training set, and the performance of spam detection remains high.

Personalized web spam detection also benefits from topical diversity measures. Most existing methods assume, explicitly or implicitly, that spam detection runs at search engine sites. Because the decision of whether a page is spam or non-spam is often subjective, it is highly desirable that spam detection can be personalized. An example of personalized web spam detection is an intelligent web browser on a user's computer. When a web page is shown in an intelligent web browser, the integrated spam detector can determine online whether the page is spam or not. A personalized classifier learned with existing features from an individual's historical spam labeling data, which is typically small, cannot achieve good results. We develop an efficient and effective regression model using topical diversity measures for personalized web spam detection. Some preliminary results in Section V-B verify the applicability of our method.

The rest of the paper is organized as follows. In Section II, we briefly review related studies on web spam detection. In Section III, we analyze the topic-level statistics of web pages using topic models and propose several novel topical diversity measures. An evaluation using topical diversity measures for web spam detection is presented in Section IV. In Section V, we discuss how these topical diversity measures are useful for personalized web spam detection, and show some preliminary results of personalized web spam detection. Finally, Section VI concludes the paper and outlines some future research directions.

II. RELATED WORK

Link spam and content spam are the two main categories of web spam tricks [1], [10]. Gyöngyi et al.
[11] used the term link spam for cases where spammers set up structures of interconnected pages, called link spam farms, with the sole purpose of boosting link-based rankings. Many methods have been proposed to detect link spam effectively. For example, Fetterly et al. [12] analyzed statistical features such as the in-degrees and out-degrees of web pages, and marked outliers as spam candidates. Gyöngyi et al. [7] used the concept of spam mass, a measure of the impact of link spam on a page's ranking, for link spam detection. Benczur et al. [13] proposed a method called SpamRank, which builds on the concept of personalized PageRank to detect pages with an undeservedly high PageRank score. Zhou and Pei [5] introduced the concept of page farms. A page farm contains the set of pages that contribute most to a target page's ranking. By analyzing the structural properties of page farms, pages that benefit dramatically from link spam can be identified.

Content spam, in contrast to link spam, disguises the content of a page to make it appear relevant to many popular web searches. Most content spam detection methods proposed so far adopt statistical analysis. For example, Fetterly et al. [12] analyzed certain content-based properties of web pages, and found that some features, such as long host names, host names containing many unusual symbols, and little variation in the number of words in pages within a site, are good indicators of content spam. Fetterly et al. [3] further showed that content spam pages are often mosaics of textual chunks copied from other legitimate pages. Web pages with large chunks of duplicate content are likely to be content spam pages. In [2], Ntoulas et al. presented a number of word-level statistics for detecting content spam. A C4.5 classifier is built by combining the proposed heuristic methods to detect content spam pages.
Recent work [14] conducted an empirical study of how various spam features and machine learning algorithms contribute to the quality of content spam detection methods. LogitBoost and RandomForest are reported to achieve superior classification results.

The content-based features mentioned above are, more or less, word-level statistics. They cannot capture linguistic features hidden in the text of content spam pages. To close this gap, a few recent proposals investigated statistical language models for understanding hidden characteristics of content spam pages. For example, Martinez-Romo and Araujo [15], [16] compared the probabilistic language models of two linked pages. The assumption is that two non-spam pages that are linked should be topically related, even though the hyperlink between them is a weak contextual relation. Bíró et al. [17], [18] applied statistical language models to a corpus of spam pages and a corpus of non-spam pages separately. The goal is to uncover which topics are more likely to be spam topics. An unlabeled page that contains many spam topics is a strong suspect of content spam.

How is Our Work Different: Existing studies on content spam detection using statistical language models mainly compare language models built from different corpora. Usually, two sets of topics, representing the spam corpus and the non-spam corpus, are extracted. These studies do not analyze the underlying characteristics of individual pages. We propose several topical diversity measures that analyze the statistics of topic distributions within each individual spam or non-spam page. Since the calculation is performed on each individual page, our proposed topical diversity measures can be calculated efficiently online, which is critical in personalized web spam detection, where online response is often required.

III. TOPICAL DIVERSITY MEASURES FOR CONTENT SPAM DETECTION

In this section, we first briefly review topic models in information retrieval, and then discuss several topical diversity measures built upon topic models of individual pages.

A. Topic Model in Information Retrieval

A topic model is a type of statistical model which discovers the latent topics that occur in a collection of documents. Among different variations of topic models, LDA [9] has been widely used for analyzing document corpora. In general, LDA models each latent topic as a probability distribution over a word vocabulary, and each document as a probability distribution over the latent topics. These distributions are sampled from Dirichlet distributions. Specifically, we assume that we have a vocabulary $W$ which consists of a set of words, a set $T$ of $m$ topics, and a set $D$ of $n$ documents with arbitrary lengths. For a topic $t \in T$, a distribution $\phi_t$ on $W$ is sampled using $\mathrm{Dir}(\beta)$, where $\beta \in \mathbb{R}_{+}^{|W|}$ is a smoothing parameter. Similarly, for a document $d \in D$, a distribution $\delta_d$ on $T$ is sampled using $\mathrm{Dir}(\alpha)$, where $\alpha \in \mathbb{R}_{+}^{|T|}$ is a smoothing parameter. To generate a document $d$ with several words, for each word position, a topic $t$ is drawn from $\delta_d$, and then a word is drawn from $\phi_t$ and filled into that position.

B. Topical Diversity Measures

Machine-generated pages are in fact the majority of content spam on the web. To receive high rankings in web search engines, content spam pages have a strong tendency to belong to targeted contents and topics, like insurance or commercial ads. If we can analyze some linguistic features hidden in those texts, we may find that such machine-generated contents are quite different from the contents of non-spam pages. Driven by this intuition, we propose to analyze web page contents using topic models. When using topic models for text analysis, we generally want to fit a model to the available text data and estimate all the parameters. These parameters usually include a set of word distributions corresponding to latent topics. After conducting topic modeling on the text data (e.g., web page content), we can discover and characterize hidden topics in the text.
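The generative process described in Section III-A can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in the experiments: the function names, the toy vocabulary, and the use of symmetric scalar values for the smoothing parameters (instead of full vectors $\alpha$ and $\beta$) are assumptions made for brevity.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one sample from Dir(alphas) by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [x / total for x in draws]

def generate_document(vocab, num_topics, doc_length, alpha=0.5, beta=0.1, seed=0):
    """Generate one document with the LDA generative process.

    alpha and beta are symmetric smoothing parameters (an assumption of this
    sketch); the returned pair is (delta_d, list of generated words).
    """
    rng = random.Random(seed)
    # One word distribution phi_t per topic, sampled from Dir(beta).
    phi = [sample_dirichlet([beta] * len(vocab), rng) for _ in range(num_topics)]
    # One topic distribution delta_d for the document, sampled from Dir(alpha).
    delta = sample_dirichlet([alpha] * num_topics, rng)
    words = []
    for _ in range(doc_length):
        # For each word position: draw a topic from delta_d, then a word from phi_t.
        topic = rng.choices(range(num_topics), weights=delta, k=1)[0]
        word = rng.choices(vocab, weights=phi[topic], k=1)[0]
        words.append(word)
    return delta, words

# Hypothetical toy vocabulary, for illustration only.
toy_vocab = ["loan", "credit", "insurance", "garden", "recipe", "travel"]
delta_d, doc = generate_document(toy_vocab, num_topics=3, doc_length=20)
```

Fitting a topic model runs this process in reverse: given only the observed words, inference (for example Gibbs sampling) estimates the topic-word and document-topic distributions.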
Since each latent topic is associated with a corresponding weight that indicates how frequently this topic appears in the corpus, in general we obtain a topic distribution for the available text data. For content spam detection on the web, our goal is to understand whether spam pages and non-spam pages have unique characteristics in terms of the topic distributions within those pages.

The process of conducting topic modeling (e.g., LDA) for web pages is as follows. Given a web page $d$, we can represent the content of $d$ using the bag-of-words model. The vocabulary of $d$, which consists of all the words in $d$, is represented as $W$. We can then directly apply the LDA model on $d$ to identify $m$ latent topics, denoted as $t_1, t_2, \ldots, t_m$, where each topic $t_i$ ($1 \le i \le m$) is a word distribution $\phi_{t_i}$. For each word $w \in W$, its probability of belonging to a topic $t_i$ is identified as $\phi(w \mid t_i)$. We use $W(t_i)$ to represent the set of words whose probability of belonging to $t_i$ is larger than 0. The weight of each topic $t_i$, which equals $\sum_{w \in W(t_i)} \phi(w \mid t_i)$, is denoted as $\delta_{t_i}$. Thus, $\sum_{i=1}^{m} \delta_{t_i} = 1$.

[Figure 1. Distributions of topic weights for spam pages and non-spam pages (plotted figures represent the results for randomly selected pages from the data set WEBSPAM-UK2007). Horizontal axis: Topic (Topic 1 to Topic 5); vertical axis: Topic Weight; series: SpamPage1, SpamPage2, Non-SpamPage1, Non-SpamPage2.]

1) Distribution-based Topical Diversity Measure: Spam pages are likely to be topic-centric, that is, they have a specific set of targeted topics. Intuitively, spam pages and non-spam pages tend to have different topic distributions. To understand how the topic distributions differ, we examined the pages in the latest web spam benchmark data set, WEBSPAM-UK2007. A detailed description of the data set is presented in Section IV-A. All the pages in WEBSPAM-UK2007 are manually labeled as spam or non-spam. We conducted topic modeling for various spam and non-spam pages.
In Figure 1, we plot the distribution of topic weights for two randomly selected spam pages and two randomly selected non-spam pages. We have two interesting findings. First, the distributions of topic weights for spam pages are more likely to follow a trend similar to a power-law distribution. For example, in both spam pages, one topic has a very large weight while the weights of the remaining topics are relatively small. This is consistent with our conjecture that content spam pages tend to be topic-centric, as spammers want to make the pages relevant to a targeted set of search queries with similar topics. Second, non-spam pages tend to cover a set of topics with comparable weights. For example, both non-spam pages in WEBSPAM-UK2007 have several topics with almost equal weights. This result is not surprising. When users create these web pages, they usually try to include many pieces of information in the host. For example, a personal homepage usually mentions information such as education history, work history, and so on. A commercial product's homepage usually covers information including technical specifications, highlighted features, etc.

To quantitatively capture this characteristic of the distribution of topic weights for spam and non-spam pages, we propose a variance-based topical diversity measure.

Definition 1 (Variance-based Topical Diversity Measure): Given a web page $d$, its topic distribution is $T(d) = \{t_1, t_2, \ldots, t_m\}$, where each topic $t_i$ ($1 \le i \le m$) is associated with a weight $\delta_{t_i}$. The variance-based topical diversity measure for $d$, denoted as $\mathrm{TopicVar}(d)$, is calculated as

$$\mathrm{TopicVar}(d) = \frac{\sum_{i=1}^{m} (\delta_{t_i} - u)^2}{m}, \qquad (1)$$

where $u = \frac{\sum_{i=1}^{m} \delta_{t_i}}{m} = \frac{1}{m}$.

[Figure 2. Prevalence of spam relative to the variance-based topical diversity measure. Horizontal axis: Variance-based Topical Diversity Measure; left vertical axis: Fraction of Pages (bars); right vertical axis: Probability of Spam (line).]

We calculate the variance-based topical diversity measure for every page in WEBSPAM-UK2007. Figure 2 shows the distribution of the variance-based topical diversity measures. In this figure, the horizontal axis depicts a set of value ranges. We use the bar graph to represent the percentage of pages that fall into a particular value range, and the line graph to represent the percentage of pages in each value range that are classified as spam pages. The left scale of the vertical axis refers to the bar graph, and the right scale of the vertical axis refers to the line graph.

As can be observed from Figure 2, the variance-based topical diversity measures have different distributions for spam pages and non-spam pages. Since spam pages are more topic-centric, the variance of their topic weights is relatively large compared with that of non-spam pages. A majority of pages in the data set have small variance-based topical diversity measures, and among them the percentage of spam pages is small. As the value of the variance-based topical diversity measure increases, the percentage of spam pages increases as well. The results in Figure 2 clearly indicate that the variance-based topical diversity measure is a good indicator for detecting content spam.

2) Semantic-based Topical Diversity Measure: The distribution-based topical diversity measure considers only the distribution of topic weights; it does not consider whether the topics are semantically related. Spammers create spam pages with the sole purpose of boosting the relevance of the pages for a set of targeted search queries with similar topics. Accordingly, the different topics of a spam page are more likely to be semantically related.
Only in such a situation can spammers make the pages more relevant to those search queries. In topic models, each topic hidden in the page content is a word distribution. How can we measure the semantic relations between these topics?

The practical solution we propose below is based on measuring semantic relatedness between individual words. Given a web page $d$, its topic distribution is $T(d) = \{t_1, t_2, \ldots, t_m\}$. The probability that a word $w$ belongs to a topic $t_i$ ($1 \le i \le m$) is identified as $\phi(w \mid t_i)$. Each topic $t_i$ ($1 \le i \le m$) is represented as a set of words, denoted as $W(t_i)$. Intuitively, two topics $t_i$ and $t_j$ ($1 \le i, j \le m$) are semantically related if words from $W(t_i)$ and $W(t_j)$ are semantically related. We use a similarity function $\mathrm{Sim}(w_i, w_j)$ to obtain the semantic relation between two words $w_i$ and $w_j$. Many similarity functions exist in the literature. In this work, we adopt the similarity function calculated from WordNet, a popular lexical database.

To measure the semantic relation between two topics $t_i$ and $t_j$ ($1 \le i, j \le m$), we obtain the similarities between each pair of words from the two topics, weight them by the probabilities of the words belonging to the corresponding topics, and calculate the average similarity score. That is,

$$\mathrm{Sim}(t_i, t_j) = \frac{\sum_{w_k \in W(t_i),\, w_l \in W(t_j)} \mathrm{Sim}(w_k, w_l) \cdot \phi(w_k \mid t_i) \cdot \phi(w_l \mid t_j)}{|W(t_i)| \cdot |W(t_j)|}, \qquad (2)$$

where $\mathrm{Sim}(t_i, t_j)$ represents the semantic relation between two topics $t_i$ and $t_j$, $|W(t_i)|$ and $|W(t_j)|$ represent the sizes of the sets $W(t_i)$ and $W(t_j)$, respectively, and $\phi(w_k \mid t_i)$ and $\phi(w_l \mid t_j)$ are the probabilities of words $w_k$ and $w_l$ belonging to topics $t_i$ and $t_j$, respectively.

By using topic models to analyze web pages, we obtain $m$ latent topics. Based on Equation 2, we can derive a semantic-based topical diversity measure from the pairwise similarities of the $m$ latent topics.

Definition 2 (Semantic-based Topical Diversity Measure): Given a web page $d$, its topic distribution is $T(d) = \{t_1, t_2, \ldots, t_m\}$.
The semantic-based topical diversity measure for $d$, denoted as $\mathrm{TopicSim}(d)$, is calculated as

$$\mathrm{TopicSim}(d) = \frac{\sum_{1 \le i < j \le m} \mathrm{Sim}(t_i, t_j)}{\frac{1}{2} m (m - 1)}, \qquad (3)$$

where $\mathrm{Sim}(t_i, t_j)$ is defined in Equation 2.

We calculate the semantic-based topical diversity measure for every page in WEBSPAM-UK2007. Figure 3 shows the distribution of the semantic-based topical diversity measures. As in Figure 2, we use the bar graph and the line graph to represent the percentage of pages that fall into a particular value range, and the percentage of pages in each value range that are classified as spam pages, respectively.

[Figure 3. Prevalence of spam relative to the semantic-based topical diversity measure. Horizontal axis: Semantic-based Topical Diversity Measure; left vertical axis: Fraction of Pages (bars); right vertical axis: Probability of Spam (line).]

As can be observed from Figure 3, spam pages are more likely to have topics that are semantically related. When the semantic-based topical diversity measure is large, the percentage of spam pages is high. This is consistent with our conjecture that spammers would like to make spam pages focus on a targeted set of topics. We can adopt the semantic-based topical diversity measure as a good indicator for content spam detection.

Equation 3 returns the average score of the pairwise similarities of different topics. In some cases, the largest of the pairwise similarity scores of the $m$ topics is also useful, especially when the two topics have large weights. We therefore define a maxSemantic-based topical diversity measure.

Definition 3 (MaxSemantic-based Diversity Measure): Given a web page $d$, its topic distribution is $T(d) = \{t_1, t_2, \ldots, t_m\}$. The maxSemantic-based topical diversity measure for $d$, denoted as $\mathrm{TopicSimMax}(d)$, is calculated as

$$\mathrm{TopicSimMax}(d) = \max\{\mathrm{Sim}(t_i, t_j) \mid 1 \le i < j \le m\}, \qquad (4)$$

where $\mathrm{Sim}(t_i, t_j)$ is defined in Equation 2.

We calculate the maxSemantic-based topical diversity measure for every page in WEBSPAM-UK2007. Figure 4 shows the distribution of the maxSemantic-based topical diversity measures. The settings of Figure 4 are similar to those of Figure 2 and Figure 3.

[Figure 4. Prevalence of spam relative to the maxSemantic-based topical diversity measure. Horizontal axis: MaxSemantic-based Topical Diversity Measure; left vertical axis: Fraction of Pages (bars); right vertical axis: Probability of Spam (line).]

Compared with Figure 3, spam pages are even more likely to have a large maxSemantic-based topical diversity measure. This is not hard to understand. When spammers create spam pages, the targeted set of topics is already determined. Thus, topics in the spam pages should be highly similar, which results in a large maxSemantic-based topical diversity measure. The results in Figure 4 verify that we can use the maxSemantic-based topical diversity measure for detecting content spam.

So far, we have discussed three topical diversity measures built upon topic models of individual pages. Each measure can be used as an indicator for content spam.
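To make Definitions 1 to 3 concrete, the following Python sketch computes the three measures for a single page. It is an illustration under stated assumptions, not the authors' code: each topic is represented as a dictionary mapping the words of $W(t_i)$ to $\phi(w \mid t_i)$, at least two topics are assumed, and `exact_match_sim` is a hypothetical stand-in for the WordNet-based word similarity, which is not reproduced here.

```python
from itertools import combinations

def topic_var(weights):
    """Definition 1: variance of the topic weights around their mean u."""
    m = len(weights)
    u = sum(weights) / m  # equals 1/m when the weights sum to 1
    return sum((w - u) ** 2 for w in weights) / m

def topic_pair_sim(topic_a, topic_b, word_sim):
    """Equation 2: probability-weighted average word similarity of two topics."""
    total = sum(
        word_sim(wk, wl) * pk * pl
        for wk, pk in topic_a.items()
        for wl, pl in topic_b.items()
    )
    return total / (len(topic_a) * len(topic_b))

def topic_sim(topics, word_sim):
    """Definition 2: average of Sim(t_i, t_j) over all m(m-1)/2 topic pairs."""
    m = len(topics)
    pair_sims = [topic_pair_sim(a, b, word_sim) for a, b in combinations(topics, 2)]
    return sum(pair_sims) / (0.5 * m * (m - 1))

def topic_sim_max(topics, word_sim):
    """Definition 3: largest Sim(t_i, t_j) over all topic pairs."""
    return max(topic_pair_sim(a, b, word_sim) for a, b in combinations(topics, 2))

def exact_match_sim(w1, w2):
    """Hypothetical word similarity: 1.0 for identical words, 0.0 otherwise."""
    return 1.0 if w1 == w2 else 0.0

# Hypothetical topic weights: one dominant topic versus evenly spread topics.
spam_like_weights = [0.8, 0.05, 0.05, 0.05, 0.05]
balanced_weights = [0.2, 0.2, 0.2, 0.2, 0.2]

# Hypothetical topics, each mapping words to phi(w | t_i).
toy_topics = [
    {"loan": 0.6, "credit": 0.4},
    {"loan": 0.5, "mortgage": 0.5},
    {"garden": 1.0},
]

var_spam_like = topic_var(spam_like_weights)
var_balanced = topic_var(balanced_weights)
avg_sim = topic_sim(toy_topics, exact_match_sim)
max_sim = topic_sim_max(toy_topics, exact_match_sim)
```

With these toy inputs, the dominant-topic weight vector yields a clearly larger variance than the evenly spread one, which mirrors the intuition behind Figure 1. Because all three functions operate on a single page's topics, they can be evaluated per page without access to a corpus.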
Some machine learning algorithms, such as Random Forest and regression, can then be adopted to learn a web spam classification model.

IV. EXPERIMENTAL RESULTS OF CONTENT SPAM DETECTION

In this section, we report a thorough evaluation of web spam detection using our topical diversity measures. We first describe the data set used for evaluation, and then discuss the performance results of web spam detection. All the experiments were conducted on a MacBook Pro computer running the Mac OS X Lion operating system, with a 2.4 GHz Intel Core i5 CPU, 4.0 GB main memory, and a 500 GB hard disk. The programs were implemented in Java and Perl.

A. Web Spam Benchmark Data Set

We adopted the WEBSPAM-UK2007 data set released by the Search Engine Spam Project at Yahoo! Research Barcelona. The data set is the result of the effort of a team of volunteers. The base data is a set of 105,896,555 pages in 114,529 hosts in the .uk domain. The data were downloaded in May 2007 by the Laboratory of Web Algorithmics, Università degli Studi di Milano. The spam collection data set contains a training set and a testing set. Since the number of spam pages in each set is very small, we combined the two sets and adopted 10-fold cross validation for evaluation.

Based on the URLs of the pages, we collected the content of those pages to build topic models. Some pages are no longer available online. To make a fair comparison, we removed those pages from the spam collection. The final spam collection data set consists of 4,081 different pages chosen from the base data set, among which 163 pages are labeled as spam and the remaining 3,918 pages are labeled as non-spam. By default, we used this data set in the evaluation.

B. Web Spam Detection using Topical Diversity Measures

To implement the LDA model, we first extracted words from each web page using the Alchemy API (alchemyapi.com/api). A stemming and POS tagging procedure was performed on the extracted words. We also removed stop words and words with uninformative POS tags. The words of each page that remained after this data cleaning process were used to build the corresponding LDA model. We adopted a Java implementation of LDA, JGibbLDA, which uses Gibbs sampling for parameter estimation and inference. The two smoothing parameters $\alpha$ and $\beta$ in LDA were set to 0.5 and 0.1 by default.

To determine whether a page is spam or non-spam, we adopted supervised learning algorithms to train a web spam classification model using the proposed topical diversity measures. In this work, we considered various machine learning algorithms available in Weka (cs.waikato.ac.nz/ml/weka/). The evaluation metrics we examined for comparing the results of web spam detection include: (1) TPR (true positive rate); (2) FPR (false positive rate); (3) F-measure (the harmonic mean of precision and recall); and (4) AUC (area under the curve). Among them, F-measure and AUC are widely adopted in the web spam detection community for performance evaluation.

1) Parameter settings of #topics and #words in LDA: In LDA, we can set the number of topics (#topics) and the number of words in each topic (#words). To understand how these two parameters may affect the performance of web spam detection, we chose different parameter values. Table I shows the performance results of web spam detection with regard to different parameter settings.

[Table I. The results of web spam detection using different parameter settings in LDA. Columns: Parameters in LDA (#topics, #words), TPR, FPR, F-Measure, AUC.]

The results of web spam detection in Table I were achieved by using the RandomForest algorithm based on the three proposed topical diversity measures only. We chose RandomForest here because this algorithm usually achieves the best detection accuracy.
This is consistent with the results recently reported in [14]. Unlike previous studies, which usually choose quite large values for #topics and #words, we use small values, since our language model is built on each individual web page instead of on a corpus of web pages.

We can observe several interesting results from Table I. On one hand, if we fix the value of #topics and increase the value of #words, the performance of web spam detection increases with regard to F-measure and AUC. On the other hand, if we fix the value of #words and increase the value of #topics, the performance of web spam detection increases as well. In general, among the parameter settings reported in Table I, the web spam detection algorithm (i.e., RandomForest) achieves the best results when #topics and #words are both set to 5. It is possible to use even larger values for the two parameters; however, doing so may introduce more computational overhead for the topical diversity measures. Thus, in the following experiments, we fixed the values of #topics and #words in LDA to 5 and 5 by default.

[Table II. The results of web spam detection using different feature sets. Columns: Algorithm, Feature Set, TPR, FPR, F-Measure, AUC. Rows: RandomForest and RandomTree, each with the Content, Topic, and All feature sets.]

2) Comparison of Web Spam Detection: We also examined how useful the proposed topical diversity measures are for web spam detection. To make a comparison, we considered several other useful content-based features available in WEBSPAM-UK2007, such as average word length and fraction of anchor text. In total, we obtained 24 different features. We name them the Content feature set. The three proposed topical diversity measures form the Topic feature set. We also considered the combination of the two feature sets, which forms the All feature set. From the Weka package, we identified 36 different machine learning algorithms.
These algorithms cover a majority of the applicable algorithm families, including bayes, function, rules, and trees. We applied each of these algorithms to detect web spam in the data set. Each algorithm was run three times using different feature sets: the Content set, the Topic set, and the All set. The results of web spam detection were obtained by 10-fold cross validation.

We first examined which feature set is most useful for detecting web spam. Among the 36 different machine learning algorithms, we found that 7 of them achieve the highest AUC when the Topic feature set is used, 5 of them achieve the highest AUC when the Content feature set is used, and the remaining 24 achieve the highest AUC when the All feature set is used. The results verify that the proposed topical diversity measures are good indicators for web spam detection.

We then analyzed how much improvement the proposed topical diversity measures bring to existing web spam detection. Table II shows the evaluation metrics for the two algorithms (RandomForest and RandomTree) that achieve the highest performance. The results clearly show that integrating the proposed topical diversity measures improves existing web spam detection algorithms.
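For readers who want to reproduce the four evaluation metrics used in Tables I and II, a minimal Python sketch follows. It is not the Weka implementation used in the experiments: the function names and the toy labels and scores are hypothetical, spam is encoded as label 1, and AUC is computed through its standard pairwise-ranking interpretation (the probability that a randomly chosen spam page receives a higher score than a randomly chosen non-spam page, with ties counted as one half).

```python
def classification_metrics(labels, predictions):
    """Return (TPR, FPR, F-measure) for binary labels where 1 means spam."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    tpr = tp / (tp + fn) if tp + fn else 0.0  # recall on the spam class
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    # F-measure: harmonic mean of precision and recall.
    f_measure = (2 * precision * tpr / (precision + tpr)) if precision + tpr else 0.0
    return tpr, fpr, f_measure

def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of spam and non-spam scores."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels and classifier scores, for illustration only.
toy_labels = [1, 1, 0, 0, 0, 1]
toy_scores = [0.9, 0.4, 0.5, 0.2, 0.1, 0.8]
toy_predictions = [1 if s >= 0.5 else 0 for s in toy_scores]

tpr, fpr, f_measure = classification_metrics(toy_labels, toy_predictions)
area = auc(toy_labels, toy_scores)
```

TPR, FPR, and F-measure depend on the decision threshold (0.5 in this toy example), whereas AUC summarizes the ranking quality over all thresholds, which is one reason both are commonly reported together.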

[Figure 5. The results of web spam detection using small scale of training data (Left: F-measure; Right: AUC). Horizontal axis: Sampled Percentage of Training Data (%); series: Content, Topic.]

C. The Size of the Training Data

In most machine learning algorithms, the quality of the training data is critical for achieving good performance. In web spam detection, obtaining high-quality training data with sufficient records may not be an easy task. Thus, we also examined how the scale of the training data may affect the overall performance of web spam detection.

To simulate a lack of training data, we randomly sampled small portions of data (e.g., 20%) from the original spam collection data set. We assume that the sampled data are the only data available for model learning. We want to compare how the different feature sets (e.g., the Content set and the Topic set) perform with a small set of training data.

In Figure 5, we plot the results of web spam detection (in terms of F-measure and AUC) using the Content feature set and the Topic feature set with different scales of training data. The results reported in Figure 5 were achieved by the RandomForest algorithm. We randomly sampled small portions of training data (varied from 5% to 20%). Surprisingly, even when the size of the training data is small, the performance using the proposed topical diversity measures is still high. The results in Figure 5 also indicate that the topical diversity measures are more useful than other content-based features when there are not enough training data for web spam detection. This result is exciting, since we can use topical diversity measures for web spam detection even when a large training set is not possible to obtain.

V. PERSONALIZED WEB SPAM DETECTION

Most existing studies on web spam detection explicitly or implicitly assume that web spam detection is conducted on the search engine's server side. If a page is judged as spam by a search engine, it is removed from the search results. In practice, there are other scenarios in which spam detection needs to be conducted on the client side. A typical example is an intelligent web browser running on a user's computer. When a page is opened in the intelligent web browser, the integrated spam detector needs to determine online whether the page is spam or not. Intelligent web browsers provide users with more information about web pages and more control in the course of web surfing.

A distinct requirement of spam detection in intelligent web browsers, compared with traditional spam detection on the search engine side, is personalization. As described before, the critical difference between a spam page and a non-spam page is whether the relevance of the page is justifiable or not. Different users have different opinions on such a justification. A page categorized as spam by one user is not necessarily categorized as spam by another user. As a result, personalized web spam detection is desired in intelligent web browsers, so that the web spam detection results are tailored to each individual's judgement on spam.

For personalized web spam detection, we need to collect a user's judgements on spam. Such judgements mainly refer to the individual's historical labeling results of web spam. In intelligent web browsers, users can provide spam labels for the pages they visit. Users can also update the spam labels if they do not agree with the results provided by the spam detector. These activities can be considered explicit feedback on spam judgement provided by users. In addition, there also exists implicit feedback on spam judgement.
For example, if a page is labeled as spam/non-spam and the user does not update the spam detection result, it implicitly means that the user agrees with the spam judgement. In intelligent web browsers, all explicit and implicit feedback is collected, and the data are useful for learning a personalized web spam detector.

It is worth mentioning that the collected user judgements on spam are usually very small in scale. However, as reported in Section IV-C, web spam detection performance remains high even when only a small training set is used. This property of the proposed topical diversity measures indicates that they are valuable for achieving personalized web spam detection.

In personalized web spam detection, binary classification of web spam may not be suitable. It is more desirable that users have more control over determining how likely a page is to be spam. An intuitive solution is to provide a numerical score that indicates the likelihood of a page being spam. Users can set low/high threshold values to categorize more/fewer pages as spam according to particular situations. In [5], a spamicity score is proposed to achieve such a goal. However, the spamicity score is an ad-hoc measure that does not involve a self-learning process. In this paper, we adopt a regression model using the proposed topical diversity measures for personalized web spam detection.

A. Regression Model for Personalized Web Spam Detection

Personalized web spam detection is achieved by learning a personalized spam detector from the collected data of the user's spam judgements. Technically, our personalized web spam detector estimates the probability $p$ that a page will be categorized as spam given the three topical diversity measures. For simplicity, we denote them as $x_1$, $x_2$, and $x_3$, respectively. The detector takes the collected training set to learn a logistic regression model.
The formula of the logistic regression is

$$\ln \frac{p}{1-p} = \gamma_0 + \sum_{i=1}^{3} \gamma_i x_i,$$

where $\gamma_0, \ldots, \gamma_3$ are the coefficients. In other words, the probability $p$ that a page will be categorized as spam is measured as

$$p = \frac{1}{1 + e^{-\left(\gamma_0 + \sum_{i=1}^{3} \gamma_i x_i\right)}}.$$

The probability $p$ is in the range [0,1]. The larger the value of $p$, the more likely the corresponding page is spam. There are several methods to learn the regression coefficients, among which the Newton-Raphson method [19] is a popular one.

Once the coefficients are learned, for each new web page the spam detector uses the model to label the page as spam if its probability is above a certain threshold value. This threshold value can be tuned by the user to adjust how aggressive the detector should be in detecting spam. As users keep generating labeled data of web spam, we need to re-train the regression model on the new training data. After several rounds of re-training, the model is able to capture the user's spam judgements precisely and provide high-quality personalized web spam detection results.

B. Results of Personalized Web Spam Detection

We report some results obtained from an empirical study of personalized web spam detection. We used the labeling results of pages from the WEBSPAM-UK2007 data set as the collected user spam judgements. A regression model is learned from the data. We varied the threshold values, and the results of personalized web spam detection are reported in Table III. Clearly, when the threshold value decreases, the spam detection is more aggressive, and thus more pages are categorized as spam. Users can choose different threshold values to control how many pages are flagged as spam and how aggressive the spam detection should be.

Table III. The results of personalized web spam detection using different threshold values. (Columns: Threshold, TPR, FPR, F-Measure, AUC.)

Personalized web spam detection needs to be conducted online. Thus, a fast online response is critical. We randomly selected 100 pages from the WEBSPAM-UK2007 data set and examined the total time needed to categorize them. On average, each page needs less than 0.1 second for the web spam categorization.
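To make the regression model of Section V-A concrete, the following is a minimal sketch of a logistic regression fitted by Newton-Raphson, with a user-adjustable threshold for labeling. It is not the implementation used in the study. The function names, the toy feature values, and the small ridge term (added here only to keep the Hessian invertible, and applied to the intercept as well) are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = s / M[r][r]
    return x


def fit_logistic(X, y, iters=25, ridge=0.1):
    """Newton-Raphson for the coefficients [gamma_0, gamma_1, ...].
    The ridge penalty is an assumption of this sketch, not part of the paper."""
    rows = [[1.0] + list(x) for x in X]  # prepend 1 for the intercept gamma_0
    d = len(rows[0])
    gamma = [0.0] * d
    for _ in range(iters):
        grad = [-ridge * g for g in gamma]
        hess = [[ridge if i == j else 0.0 for j in range(d)] for i in range(d)]
        for r, t in zip(rows, y):
            p = sigmoid(sum(g * v for g, v in zip(gamma, r)))
            w = p * (1.0 - p)
            for i in range(d):
                grad[i] += (t - p) * r[i]
                for j in range(d):
                    hess[i][j] += w * r[i] * r[j]
        step = solve(hess, grad)  # Newton step: (negative Hessian)^-1 * gradient
        gamma = [g + s for g, s in zip(gamma, step)]
    return gamma


def predict_proba(gamma, x):
    """p = 1 / (1 + exp(-(gamma_0 + sum_i gamma_i * x_i)))."""
    return sigmoid(gamma[0] + sum(g * v for g, v in zip(gamma[1:], x)))


def label_pages(gamma, pages, threshold):
    """Label a page as spam when its probability is above the threshold."""
    return [predict_proba(gamma, x) > threshold for x in pages]


# Hypothetical (x1, x2, x3) values and user labels (1 = spam), for illustration only.
X = [(0.9, 0.8, 0.7), (0.8, 0.9, 0.8), (0.7, 0.6, 0.9), (0.5, 0.5, 0.4),
     (0.1, 0.2, 0.1), (0.2, 0.1, 0.3), (0.3, 0.3, 0.2), (0.5, 0.4, 0.6)]
y = [1, 1, 1, 1, 0, 0, 0, 0]
gamma = fit_logistic(X, y)
```

Lowering the threshold passed to `label_pages` can only keep or increase the number of pages flagged as spam, which matches the behavior described in Section V-B. Re-training after new user feedback amounts to calling `fit_logistic` again on the enlarged training set.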
In general, personalized web spam detection can be achieved efficiently.

VI. CONCLUSION

In this paper, we conduct a thorough analysis of content spam on the web using topic models. Unlike many existing word-level statistics, topic models can identify topic-level linguistic features hidden in text. We propose several topical diversity measures based on topic models. The evaluation conducted on the web spam benchmark data set indicates that by integrating our topical diversity measures, the performance of state-of-the-art web spam detection algorithms can be greatly improved. We also provide some studies on personalized web spam detection.

There are several interesting future research directions. For example, we only considered the topical diversity measures within each individual page. It would be interesting to explore whether the topical diversity measures can be integrated with link-based features, e.g., page farms [5].

ACKNOWLEDGMENT

This research is supported in part by a UMBC Special Research Assistantship/Initiative Support (SRAIS) grant. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agency.

REFERENCES

[1] Z. Gyöngyi and H. Garcia-Molina, "Web spam taxonomy," in AIRWeb '05.
[2] A. Ntoulas et al., "Detecting spam web pages through content analysis," in WWW '06.
[3] D. Fetterly et al., "Detecting phrase-level duplication on the world wide web," in SIGIR '05.
[4] B. Zhou et al., "A spamicity approach to web spam detection," in SDM '08.
[5] B. Zhou and J. Pei, "Link spam target detection using page farms," ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 3, no. 3, 2009.
[6] L. Becchetti et al., "Using rank propagation and probabilistic counting for link-based spam detection," in WebKDD '06.
[7] Z. Gyöngyi et al., "Link spam detection based on mass estimation," in VLDB '06.
[8] C. D. Manning et al., Introduction to Information Retrieval, Cambridge University Press.
[9] D. M. Blei et al., "Latent dirichlet allocation," vol. 3, JMLR.org, Mar. 2003.
[10] N. Spirin and J. Han, "Survey on web spam detection: principles and algorithms," SIGKDD Explorations, vol. 13, no. 2.
[11] Z. Gyöngyi and H. Garcia-Molina, "Link spam alliances," in VLDB '05.
[12] D. Fetterly et al., "Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages," in WebDB '04.
[13] A. A. Benczur et al., "Spamrank: Fully automatic link spam detection," in AIRWeb '05.
[14] M. Erdélyi et al., "Web spam classification: a few features worth more," in Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality.
[15] J. Martinez-Romo and L. Araujo, "Web spam identification through language model analysis," in AIRWeb '09.
[16] L. Araujo and J. Martinez-Romo, "Web spam detection: new classification features based on qualified link analysis and language models," IEEE Transactions on Information Forensics and Security, vol. 5, no. 3, 2010.
[17] I. Bíró et al., "Latent dirichlet allocation in web spam filtering," in AIRWeb 04.
[18] I. Bíró et al., "Linked latent dirichlet allocation in web spam filtering," in AIRWeb '09.
[19] T. Hastie et al., The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer.
