Towards Building A Personalized Online Web Spam Detector in Intelligent Web Browsers


2013 IEEE/WIC/ACM International Conferences on Web Intelligence (WI) and Intelligent Agent Technology (IAT)

Cailing Dong, Bin Zhou, Lina Zhou
Department of Information Systems, University of Maryland, Baltimore County, Baltimore, MD, USA
{cailing.dong, bzhou, zhoul}@umbc.edu

Abstract
Most extant studies of web spam detection assume, explicitly or implicitly, that detection is performed on the server side of a search engine. In this paper, we argue that in some scenarios web spam detection is better conducted on the client side (e.g., in intelligent web browsers). When a page is viewed in an intelligent web browser, an integrated personalized web spam detector can determine whether the page is spam according to the user's own judgements. We propose a framework for implementing a personalized web spam detector. The results of an empirical evaluation confirm the effectiveness of the proposed personalized web spam detection method.

Keywords: web spam; personalization; intelligent web browser

I. INTRODUCTION
In recent years, web search engines such as Google and Microsoft's Bing have become the main entryways through which billions of users surf the web. Users issue a keyword query, and the search engine returns a ranked list of pages relevant to the query. A page ranked high in the search results attracts more visitors, which in turn brings more business opportunities to the website owner. As a result, spammers have adopted many tricks to boost the rankings of their targeted pages in web search results. The phenomenon in which spammers take deliberate actions that bring selected web pages an unjustifiably favorable relevance or importance [12] is referred to as web spam. Web spam undoubtedly decreases the quality of the information found on the web.
Combating web spam effectively is a growing challenge for web search. The problem of web spam detection has attracted much attention, and many methods have been proposed in the literature [1], [2], [5], [6], [8], [10], [11], [15], [17], [20]. Most existing studies model web spam detection as a traditional classification problem. Supervised learning is widely adopted: a binary web spam classifier is learned from a training set of pages labeled as spam or non-spam, and an unseen page is then categorized as spam or non-spam by the classifier. The success of building a trustworthy web spam classifier relies mainly on two factors: whether a large, reliable training set exists, and whether a set of features is available for distinguishing spam pages from non-spam pages. Search engine companies usually have full access to both. Thus, it is not surprising that most existing studies explicitly or implicitly assume that web spam detection is performed on the search engine's server side.

In practice, a reliable training data set is hard to obtain. It usually starts with a subset manually labeled by domain experts, but even among domain experts a universal labeling agreement is difficult to achieve. Take the well-known web spam benchmark data set WEBSPAM-UK2007 as an example: although the volunteers who participated in the labeling task were researchers and experts in web spam detection, and detailed labeling guidelines were provided, inconsistent labels account for as much as 17.3% of the results (details are discussed in Section IV). Because manual labeling is time-consuming, training data sets are usually not updated frequently. This raises a critical issue for search-engine-side web spam detection: it takes a long time to reach agreement on data exhibiting new spam tricks before any action is taken to detect them.
Therefore, we propose a spam detection approach that not only follows an individual's judgement of spam but also detects new spam tricks in a timely fashion.

There is no doubt that search engines are victims of web spam. Search engine companies have strong incentives to wipe out web spam and provide high-quality search results to users. However, the individual users who rely on search engines for web surfing are in fact the ultimate victims of web spam, so they should have an even stronger desire for a spam-free search experience. To address this problem, in this paper we focus on the situation where web spam detection is initiated and conducted on the user side (i.e., the client side). Intelligent web browsers running on users' computers are a typical application scenario. When a page is retrieved and viewed in an intelligent web browser, an integrated web spam detector can determine whether the page is spam according to the user's own judgements of spam. Intelligent web browsers provide users with more information about, and more control over, web pages. The training data set is built automatically over the course of web surfing, and new spam tricks can be detected in a timely manner.

Compared with web spam detection conducted on the search engine's server side, client-side web spam detection has several advantages.

First, the results of web spam detection conducted on the client side can be personalized. As described above, one critical difference between a spam page and a non-spam page lies in whether the relevance of the page is justifiable. Different users have different opinions on such justifications: one person's spam page may be another person's treasure. In addition, a user's personal opinion on spam may evolve over time. As a result, server-side web spam detection does not always provide satisfactory results. Client-side web spam detection is built specifically to match each individual's judgements of spam.

Second, a reliable training set for building a personalized web spam classifier is relatively easy to collect on the client side. The quality of the training set greatly affects the quality of the web spam classifier. As mentioned before, universal agreement on spam labeling is not easy to achieve. In an intelligent web browser, we can collect each individual's judgements of spam from his or her browsing history. The collected data reflect that individual's personal opinion on spam and are therefore suitable for constructing a high-quality personalized web spam classifier.

Third, client-side web spam detection can respond to new spam tricks in a timely fashion. The web is evolving, and new spam tricks keep emerging. Web search engine companies need to collect sufficient evidence of a new spam trick before acting on it, so there can be a considerable delay before search engines wipe it out. On the client side, users can identify new spam tricks and provide spam labels whenever the pages are viewed.
Last but not least, users have more control over web search when web spam detection is performed on the client side. Willing or not, web search engines currently determine what is trustworthy and what is not [18]. If a page is judged as spam by a search engine, it is removed from the search results. We believe that the trustworthiness of a page should be a personal decision, not an absolute quality of the page. The proposed personalized web spam detector can help individual users make more informed decisions about the quality of the information they find on the web. Meanwhile, once users' preferences are considered in web search, different users will see different search results. This makes it much harder, if not impossible, for spammers to make their targeted pages appear universally in the top-ranked results. Thus, spammers' motivation to adopt spam tricks can be largely reduced.

It is worth mentioning that client-side web spam detection complements the existing studies of server-side web spam detection. The two detection processes are conducted at different stages: after search engine companies detect and filter global spam pages, a client-side web spam detector can still produce personalized web spam detection results. Moreover, personalized web spam detection is different from personalized web search [19], although both try to provide different search results for different users based on their interests and preferences. Personalized web search is conducted on the basis of the spam-free results given by search engines, while personalized web spam detection uses all the available information on the web and filters out user-defined spam.

In this paper, we describe in detail the technical approaches to building a personalized web spam detector in intelligent web browsers. The remainder of the paper is organized as follows.
In Section II, we briefly review related studies on web spam detection and personalization techniques. In Section III, we present the framework of a personalized web spam detector. Section IV reports experimental results obtained from an empirical evaluation. Finally, Section V concludes the paper and outlines some future research directions.

II. RELATED WORK
Web spam detection has attracted much attention in the past several years. Among the different categories of web spam tricks, link spam and content spam are the two main ones [2], [7]. In [11], Gyöngyi et al. described link spam as the situation where spammers set up structures of interconnected pages, called link spam farms, with the sole purpose of boosting link-based rankings. Many efficient techniques have been developed to identify web pages that adopt link spam tricks. Statistical and structural properties of local web graph structures [1], [5], [8], as well as link-based evaluation measures and methods such as spam mass [10], SpamRank [3], and page farms [20], have been used to detect link spam.

Content spam is the other main type of web spam trick. Unlike link spam, content spam disguises the content of a page to make it appear relevant to many popular web searches. Statistical analysis of web page content, such as long host names [8] and duplicated textual chunks [9], has been widely used in content spam detection [5]. In addition to word-level statistics, topic-level measurements have been proposed to differentiate spam from non-spam pages [6].

Search engine companies typically conduct web spam detection by integrating content spam detection and link spam detection techniques. Recently, Erdélyi et al. conducted an empirical study of how various spam features and machine learning algorithms contribute to the quality of web spam detection methods [7]. LogitBoost and RandomForest were reported to achieve superior classification results.
While web spam detection is usually conducted on the search engine's server side, there are no existing studies that focus on developing personalized web spam detection approaches. In the literature, personalization techniques have been widely used in recommender systems and web-based services, including personalized web search [9] and online social networks [4].

III. A PERSONALIZED WEB SPAM DETECTOR
Because personalized web spam detection is conducted on the client side, the client (e.g., a web browser) needs to classify a web page as either spam or non-spam. We refer to such a web browser as an intelligent web browser. A typical implementation integrates a personalized web spam detector into a regular web browser; in practice, the detector can be developed as a browser plug-in. In this paper, we focus on the design of the personalized web spam detector.

The framework of the personalized web spam detector is depicted in Figure 1. Entities in rectangular boxes with solid lines are the major components of the spam detector. The process flow of detecting web spam in a personalized way is as follows.

1) When using an intelligent web browser, a user submits a keyword query $q$ directly to the web search engine. The search engine returns a ranked list of $k$ web pages, denoted as $L_g = \{(p_1, s_g(p_1)), (p_2, s_g(p_2)), \ldots, (p_k, s_g(p_k))\}$. We use $p_i$ ($1 \le i \le k$) to represent the $i$-th ranked page in $L_g$, and $s_g(p_i)$ ($1 \le i \le k$) to represent the probability that $p_i$ is a spam page. The value of $s_g(p_i)$ is provided by the web search engine using the Global Spam Classifier maintained on the search engine side. In other words, $s_g(p_i)$ captures the search engine's judgement of spam.

2) The list $L_g$ is then forwarded to the Personalized Spam Classifier component. The personalized spam classifier is constructed from the user's previous spam labeling results. It analyzes each page $p_i$ ($1 \le i \le k$) in $L_g$ and produces a personalized spamicity score $s_p(p_i)$.
The greater the value of $s_p(p_i)$, the more likely $p_i$ is to be categorized as a spam page by the user. We use $L_p = \{(p_1, s_p(p_1)), (p_2, s_p(p_2)), \ldots, (p_k, s_p(p_k))\}$ to represent the list of pages and their corresponding personalized spamicity scores.

3) The two lists $L_g$ and $L_p$ are then passed to the Personalized Spam Fuser component. The goal of this component is to calculate a fused spamicity score that integrates both the user's and the search engine's judgements of spam. For each page $p_i$ ($1 \le i \le k$), a fused spamicity score, denoted as $s_f(p_i)$, is calculated as
$$s_f(p_i) = s_p(p_i) \cdot \lambda + s_g(p_i) \cdot (1 - \lambda).$$
The parameter $\lambda \in [0, 1]$ determines the weight of $s_p(p_i)$ in the fused spamicity score. When $\lambda$ is set to 1, the fused spamicity score depends solely on $s_p(p_i)$, the user's historical judgements of spam. When $\lambda$ is set to 0, the fused spamicity score equals $s_g(p_i)$, the search engine's judgement of spam. In practice, the value of $\lambda$ can be learnt to meet each user's requirements. We use $L_f = \{(p_1, s_f(p_1)), (p_2, s_f(p_2)), \ldots, (p_k, s_f(p_k))\}$ to represent the list of pages and their corresponding fused spamicity scores.

Figure 1. A framework of personalized web spam detection. (Components shown: User; Web Search Engine with Global Spam Classifier; Personalized Spam Detector comprising Personalized Spam Classifier, Personalized Spam Fuser, Personalized Spam Filter, and Feedback Collector.)

4) The list $L_f$ is used as the input to the Personalized Spam Filter component. This component uses a spamicity threshold $\theta$ ($0 \le \theta \le 1$) to distinguish spam pages from non-spam pages. For a page $p_i \in L_f$ ($1 \le i \le k$), if $s_f(p_i) \ge \theta$, $p_i$ is categorized as a spam page by the personalized spam filter and is removed from the final results. The value of $\theta$ thus controls how aggressive the web spam detection is.
The value of $\theta$ that achieves the best spam detection performance is recommended to the user by the detector. Meanwhile, each user has full control over the value of $\theta$ as his or her judgements of spam evolve over time.

5) After the pages categorized as spam are filtered out, a spam-free list of pages, denoted as $L_{sf}$, is displayed to the user as the final result for the keyword query $q$. The user can click on pages in $L_{sf}$ to see whether they are relevant to $q$. During this step, the user's feedback about spam can be collected. For example, if a page $p_i \in L_{sf}$ ($1 \le i \le k$) is clicked by the user but the user feels that $p_i$ should have been filtered out, the user can submit explicit feedback to the Feedback Collector, which is integrated into the intelligent web browser. If the user does not submit any explicit feedback for a page $p_j$, this is regarded as implicit feedback that the user agrees with the current judgement of $p_j$. Based on the user's feedback, the personalized spam classifier is updated to capture the user's new spam judgements.
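Steps 3 and 4 above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' plug-in code: it assumes the fused score is $s_f = \lambda s_p + (1 - \lambda) s_g$ and that a page is dropped when $s_f \ge \theta$, and the function names, page identifiers, and score values are all hypothetical.

```python
from typing import Dict, List


def fuse_scores(s_p: Dict[str, float], s_g: Dict[str, float], lam: float) -> Dict[str, float]:
    """Fused spamicity: s_f(p) = s_p(p) * lam + s_g(p) * (1 - lam)."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return {page: s_p[page] * lam + s_g[page] * (1.0 - lam) for page in s_g}


def filter_spam(ranked_pages: List[str], s_f: Dict[str, float], theta: float) -> List[str]:
    """Keep, in the original rank order, the pages whose fused score is below theta."""
    return [page for page in ranked_pages if s_f[page] < theta]


# Hypothetical scores for three result pages (illustrative values only).
ranked = ["a.example", "b.example", "c.example"]
s_g = {"a.example": 0.0, "b.example": 1.0, "c.example": 0.0}  # search engine's view
s_p = {"a.example": 0.2, "b.example": 0.4, "c.example": 0.9}  # user's view

s_f = fuse_scores(s_p, s_g, lam=0.5)
spam_free = filter_spam(ranked, s_f, theta=0.6)
```

With $\lambda = 0.5$ the search engine's label removes `b.example`; setting `lam=1.0` would instead remove `c.example`, the page this hypothetical user dislikes. The same threshold therefore yields different result lists depending on how much weight the user's own history receives.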

When implementing the proposed personalized web spam detector in intelligent web browsers, there are several technical challenges. We address these challenges in the following.

A. Obtaining the value of $s_g(p_i)$
Web search engines conduct web spam detection on the server side using the global spam classifier, but the current major web search engines do not explicitly provide such spam judgement scores. Alternatively, we can rely on publicly available APIs of web spam detection tools to obtain binary spam labels (e.g., http://tool.motoricerca.info/spam-detector/). If $p_i$ is labeled as spam, we set $s_g(p_i) = 1$; otherwise $s_g(p_i)$ is set to 0. Existing studies indicate that the performance of web spam detection using those APIs is comparable to that of state-of-the-art methods.

B. Constructing a personalized spam classifier
As in many recommender systems, we encounter a cold start problem [16] when constructing the personalized spam classifier: the classifier is built from the user's collected historical spam judgements, and such data do not exist when the user first starts to use the intelligent web browser. To address this problem, we initially set the parameter $\lambda$, which determines the weight of $s_p(p_i)$ in the fused spamicity score, to 0. Once a set of the user's spam judgements has been collected, we use an adaptive parameter tuning method to update the value of $\lambda$, which is introduced in Section III-C.

To build a personalized spam classifier, we need to measure the probability $s_p$ that a page will be categorized as spam (i.e., the personalized spamicity score). Suppose $l$ features, denoted as $x_1, \ldots, x_l$, are obtained for each web page. We can then train a logistic regression model as the spam classifier. The logistic regression equation is defined as
$$\ln\frac{s_p}{1 - s_p} = \gamma_0 + \sum_{i=1}^{l} \gamma_i x_i,$$
where $\gamma_0, \ldots, \gamma_l$ are the coefficients.
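As a hedged illustration of how such a model turns a feature vector into a score, the Python sketch below evaluates the linear predictor $\gamma_0 + \sum_i \gamma_i x_i$ and maps it to a probability with the logistic function. The coefficient values and the two features are invented for illustration; the paper does not report fitted coefficients, and the function names are hypothetical.

```python
import math
from typing import Sequence


def spamicity(features: Sequence[float], gamma0: float, gammas: Sequence[float]) -> float:
    """Return s_p such that ln(s_p / (1 - s_p)) = gamma0 + sum_i gamma_i * x_i."""
    if len(features) != len(gammas):
        raise ValueError("need exactly one coefficient per feature")
    z = gamma0 + sum(g * x for g, x in zip(gammas, features))
    # Evaluate the logistic function in a form that avoids overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(p: float) -> float:
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


# Made-up coefficients for two made-up features.
GAMMA0 = -1.0
GAMMAS = [0.5, 2.0]

score_neutral = spamicity([2.0, 0.0], GAMMA0, GAMMAS)  # linear predictor is 0
score_spammy = spamicity([1.0, 1.0], GAMMA0, GAMMAS)   # linear predictor is 1.5
```

A linear predictor of 0 corresponds to a spamicity of exactly 0.5, and applying `logit` to any returned score recovers the linear predictor, which is a quick way to check that the two forms of the model agree.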
In other words, the probability $s_p$ that a page will be categorized as spam is
$$s_p = \frac{1}{1 + e^{-(\gamma_0 + \sum_{i=1}^{l} \gamma_i x_i)}}.$$
There are several methods for learning the regression coefficients, among which the Newton-Raphson method [13] is a popular one. When a new set of the user's spam judgements is collected, this method can be used to update the personalized spam classifier efficiently.

C. Adaptive parameter tuning in personalized spam fuser
In the Personalized Spam Fuser, the value of $\lambda$ is initialized to 0, so that the decision of whether a page is spam depends on the global spamicity score $s_g(p_i)$. Then $\lambda$ is updated automatically to optimize spam detection performance. The adaptive parameter tuning is achieved by considering how satisfied the user is with the current spam detection performance. Given a spamicity threshold $\theta$, let $C_g = \{p_i \mid s_g(p_i) \ge \theta\}$ and $C_p = \{p_i \mid s_p(p_i) \ge \theta\}$, where $C_g$ and $C_p$ represent the sets of pages that are categorized as spam when the global spam classifier and the personalized spam classifier are considered, respectively. We use $C_u$ to denote the set of pages in the final list $L_{sf}$ on which user $u$ provides explicit feedback. That is, $C_u$ contains the pages that user $u$ feels should be categorized as spam. Intuitively, if $C_u$ is more similar to $C_p$ than to $C_g$, we should assign a higher weight to $s_p(p_i)$ when calculating the fused spamicity score. Specifically, let $\lambda_t$ be the current value of $\lambda$ for computing $s_f(p_i)$. We derive the new value $\lambda_{t+1}$ as follows:
$$\lambda_{t+1} = \begin{cases} \lambda_t + \alpha \cdot \beta, & \mathrm{Dice}(C_u, C_p) \ge \mathrm{Dice}(C_u, C_g) \\ \lambda_t - (1 - \lambda_t) \cdot \alpha \cdot \beta, & \mathrm{Dice}(C_u, C_p) < \mathrm{Dice}(C_u, C_g) \end{cases}$$
where $\mathrm{Dice}(C_u, C_p)$ is the Dice coefficient [6], defined as
$$\mathrm{Dice}(C_u, C_p) = \frac{2\,|C_u \cap C_p|}{|C_u| + |C_p|}.$$
The parameter $\alpha$ is a real number in the range $[0, 1]$ that controls the adjustment rate. The parameter $\beta$ is equal to $|C_u| / |L_{sf}|$. The larger the size of $C_u$, the faster the value of $\lambda$ is updated.

D. Recommending the value for the spamicity threshold $\theta$
The value of $\theta$ controls the number of pages detected as spam. When $\theta$ is low, more pages are classified as spam. Although users have full control over the value of $\theta$, it is often desirable for the system to recommend an appropriate value. In general, the smaller the value of $\theta$, the higher the recall but the lower the precision. A good value for $\theta$ should lead to both relatively high precision and relatively high recall. Therefore, we choose the value of $\theta$ that leads to the best F-measure, the harmonic mean of precision and recall. Such a $\theta$ can be obtained using the Simulated Annealing algorithm [14].

IV. AN EMPIRICAL EVALUATION
The proposed personalized web spam detector is implemented as a Firefox plug-in. In this section, we report experimental results obtained from an empirical evaluation. Our major contribution in this paper is the development of the personalized web spam detection framework. In the following, we focus mainly on the performance of the Personalized Spam Classifier.

A. Web Spam Benchmark Data Set
We adopted the WEBSPAM-UK2007 data set released by the Search Engine Spam Project at Yahoo! Research Barcelona. The spam collection contains a training set and a testing set. A team of 48 volunteers manually labeled the pages as spam, normal, or undecided, based on a comprehensive list of labeling guidelines. A page is labeled as undecided if the volunteer cannot determine whether the page is a spam page or a normal page (i.e., a borderline page). In the data set, each page was labeled by at least one volunteer, and more than 90% of the pages were labeled by at least three volunteers. Given that the number of spam pages in each set is very small, we merged the two sets and adopted 10-fold cross validation for performance evaluation. In total, the data set consists of 6479 unique web pages crawled from the .uk domain.

Table I. Frequency of diverse labels provided by volunteers in the WEBSPAM-UK2007 data set (S: spam; N: normal; U: undecided). Columns: S, N, U, S&N, S&U, N&U, S&N&U; row: #Pages.

B. Diversity of Spam Judgements Across Individuals
We first examined whether the spam judgements of different volunteers were consistent. For each page in the data set, we checked whether the labels provided by different volunteers were exactly the same. The statistics of the spam judgements are summarized in Table I. The numbers of pages that have only the spam label, only the normal label, and only the undecided label are 182, 4780, and 395, respectively. The remaining 1,122 pages, about 17.3% of the data set, have at least two different labels. In addition, 84 pages have all three labels. This result clearly shows that each user has his or her own judgements of spam.

C. Performance of Personalized Spam Classifier
We conducted experiments to evaluate how effectively the personalized spam classifier can identify spam pages. For this purpose, we set the value of $\lambda$ to 1 and $\theta$ to the recommended value with the highest F-measure. We extracted a total of 42 different features from each page when training the classifier. These features cover a wide variety of properties of web pages, including link-based features (e.g., in-degree, out-degree, PageRank) and content-based features (e.g., number of words in the page or title, average word length).
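To make the content-based examples concrete, the Python sketch below computes three such quantities from raw text. It is an illustration only and is not the feature extraction used by the WEBSPAM-UK2007 collection: the tokenization rule, the feature names, and the sample strings are assumptions chosen for brevity.

```python
import re
from typing import Dict

# Assumed tokenization: runs of ASCII letters, digits, or apostrophes count as words.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def content_features(title: str, body: str) -> Dict[str, float]:
    """Compute a few simple content-based features for one page."""
    title_words = WORD_RE.findall(title)
    body_words = WORD_RE.findall(body)
    if body_words:
        avg_len = sum(len(w) for w in body_words) / len(body_words)
    else:
        avg_len = 0.0  # empty page: define the average word length as 0
    return {
        "num_words_page": float(len(body_words)),
        "num_words_title": float(len(title_words)),
        "avg_word_length": avg_len,
    }


# Hypothetical page text.
features = content_features("Cheap pills", "buy cheap pills now")
```

Feature dictionaries of this kind would then be turned into the vector $x_1, \ldots, x_l$ consumed by the classifier, alongside link-based quantities that require the web graph rather than the page text.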
Among all the features provided in the WEBSPAM-UK2007 spam collection, these features are reported to achieve high performance for web spam detection.

When volunteers were asked to label the pages, they were required to open each page, check its content, and then assign a label, one page at a time. We treat this as a simulation of using the proposed personalized web spam detector: volunteers open the pages one by one in the intelligent web browser and provide explicit feedback based on their judgements of spam. We were interested in whether the performance of the personalized spam classifier improves as the number of the user's spam judgements grows. For each volunteer, we partitioned the pages that the volunteer labeled into 10 parts of equal size. The $i$-th set of pages is denoted as $S_i$ ($1 \le i \le 10$).

Figure 2. The performance of the personalized spam classifiers across different steps (Left: Volunteer A; Right: Volunteer B). Axes: AUC versus step.

Figure 3. Precision and recall of the personalized spam classifier when the spamicity threshold varies. Axes: precision and recall versus threshold ($\theta$).

Volunteers first provided labels for the pages in $S_1$, then for the pages in $S_2$, and so on. Correspondingly, the personalized spam classifier in step $i$ is built using the user's spam judgements collected from $\{S_1, \ldots, S_i\}$, and its performance is measured on $S_{i+1}$ ($1 \le i \le 9$). The AUC value was chosen as the metric for comparing the performance of the personalized spam classifiers. In Figure 2, we depict the AUC values for two randomly selected volunteers. We identified similar trends for the other volunteers in the data set. As more of the user's spam judgements are included in training the personalized spam classifier, the performance improves. At some steps, the spam detection performance is worse than at the previous step. This may be caused by new spam tricks used in those spam pages, which cannot be identified in the current step.
However, as Figure 2 shows, these new spam tricks are well captured in the following step(s). This result demonstrates that the personalized web spam detector can respond to new spam tricks in a timely fashion.

D. Results by Varying the Spamicity Threshold
We also conducted experiments to evaluate the effect of the spamicity threshold $\theta$ in the proposed Personalized Spam Filter component. In this experiment, a personalized web spam classifier is trained using only each volunteer's labeled data. A 10-fold cross validation is performed to evaluate the average performance of personalized web spam detection across the participating volunteers. We varied the threshold value $\theta$ from 0.2 to 0.9. Precision and recall for the personalized spam classifier are shown in Figure 3. Clearly, when the value of $\theta$ is small, web spam detection is aggressive, which results in low precision and high recall. As the value of $\theta$ increases, fewer pages are categorized as spam. Both precision and recall are affected by the threshold value $\theta$. In our proposed framework, experienced users can choose different threshold values to control how aggressive the web spam detection should be. For inexperienced users, the personalized spam detector can automatically recommend a value (0.6 in this experiment).

V. CONCLUSIONS
In this paper, we addressed the scenario where web spam detection is conducted on the client side (e.g., in intelligent web browsers). We identified the advantages of performing personalized web spam detection and developed a practical framework for implementing a personalized web spam detector in intelligent web browsers. An empirical evaluation confirmed the advantages of the proposed personalized web spam detection solution. In future research, we plan to incorporate additional user preferences to enhance personalized web spam detection. For example, an individual's activities in public social media can serve as a valuable data source for this purpose.

ACKNOWLEDGMENT
This research is supported in part by a Samsung Global Research Outreach grant and a UMBC Special Research Assistantship/Initiative Support grant. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agency.

REFERENCES
[1] J. Abernethy, O. Chapelle, and C. Castillo. Web spam identification through content and hyperlinks. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web. ACM.
[2] S. P. Algur and N. T. Pendari. Hybrid spamicity score approach to web spam detection. In 2012 International Conference on Pattern Recognition, Informatics and Medical Engineering (PRIME). IEEE, 2012.
[3] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank: Fully automatic link spam detection. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '05).
[4] D. Carmel, N. Zwerdling, I. Guy, S. Ofek-Koifman, N. Har'El, I. Ronen, E. Uziel, S. Yogev, and S. Chernov. Personalized social search based on the user's social network. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM.
[5] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
[6] C. Dong and B. Zhou. Effectively detecting content spam on the web using topical diversity measures. In Proceedings of the 2012 IEEE/WIC/ACM International Conference on Web Intelligence (WI '12), 2012.
[7] M. Erdélyi, A. Garzó, and A. A. Benczúr. Web spam classification: a few features worth more. In Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality (WebQuality '11), pages 27-34, New York, NY, USA, 2011. ACM.
[8] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB '04), pages 1-6, New York, NY, USA. ACM Press.
[9] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05), pages 170-177, New York, NY, USA. ACM Press.
[10] Z. Gyöngyi, P. Berkhin, H. Garcia-Molina, and J. Pedersen. Link spam detection based on mass estimation. In Proceedings of the 32nd International Conference on Very Large Databases (VLDB '06). ACM.
[11] Z. Gyöngyi and H. Garcia-Molina. Link spam alliances. In Proceedings of the 31st International Conference on Very Large Databases (VLDB '05). ACM.
[12] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '05).
[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[14] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, 1983.
[15] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW '06), pages 83-92, New York, NY, USA. ACM Press.
[16] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
[17] N. Spirin and J. Han. Survey on web spam detection: principles and algorithms. SIGKDD Explorations, 13(2):50-64, 2011.
[18] M. Totty and M. Mangalindan. As Google becomes web's gatekeeper, sites fight to get in. Wall Street Journal XXCLI, 39.
[19] J.-R. Wen, Z. Dou, and R. Song. Personalized web search. Encyclopedia of Database Systems.
[20] B. Zhou and J. Pei. Link spam target detection using page farms. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(3).
