Towards Building A Personalized Online Web Spam Detector in Intelligent Web Browsers


2013 IEEE/WIC/ACM International Conferences on Web Intelligence (WI) and Intelligent Agent Technology (IAT)

Cailing Dong, Bin Zhou, Lina Zhou
Department of Information Systems
University of Maryland, Baltimore County
Baltimore, MD, USA
{cailing.dong, bzhou, zhoul}@umbc.edu

Abstract

Most extant studies of web spam detection assume, explicitly or implicitly, that detection is performed on the server side of a search engine. In this paper, we argue that in some scenarios it is preferable to conduct web spam detection on the client side (e.g., in intelligent web browsers). When a page is viewed using an intelligent web browser, an integrated personalized web spam detector can determine whether the page is spam according to the user's own judgements. We propose a framework for implementing a personalized web spam detector. Experimental results from an empirical evaluation confirm the effectiveness of the proposed personalized web spam detection method.

Keywords: web spam; personalization; intelligent web browser

I. INTRODUCTION

In recent years, web search engines such as Google and Microsoft's Bing have become the main entryways for billions of users to surf the web. Users issue a keyword query, and the search engine returns a ranked list of pages that are relevant to the query. A page ranked high in the search results attracts more visitors, which in turn brings more business opportunities to the website owner. As a result, spammers have adopted many spam tricks to boost the rankings of their targeted pages in web search results. The phenomenon in which spammers take deliberate actions that bring to selected web pages an unjustifiably favorable relevance or importance [12] is referred to as web spam. Web spam undoubtedly decreases the quality of information found on the web.
Combating web spam effectively is a growing challenge for web search. The problem of web spam detection has attracted much attention, and many methods have been proposed in the literature [1], [2], [5], [6], [8], [10], [11], [15], [17], [20]. Most existing studies model web spam detection as a traditional classification problem. Supervised learning is widely adopted: a binary web spam classifier is learned from a training set of pages labeled as spam or non-spam, and an unseen page is then categorized as either spam or non-spam using the classifier. The success of building a trustworthy web spam classifier relies on two key factors: whether a large, reliable training set exists, and whether a set of features is available for distinguishing spam pages from non-spam pages. Search engine companies usually have full access to both. Thus, it is not surprising that most existing studies of web spam detection explicitly or implicitly assume that detection is performed on the search engine's server side.

In practice, a reliable training data set is hard to obtain. Usually, it starts with a subset manually labeled by domain experts. But even for domain experts, universal agreement on labels is difficult to achieve. Take the well-known web spam benchmark data set WEBSPAM-UK2007 as an example: although the volunteers participating in the labeling task were researchers and experts in web spam detection, and detailed labeling guidelines were provided, inconsistent labels account for as much as 17.3% of the results (details are discussed in Section IV). Because manual labeling is time-consuming, training data sets are usually not updated frequently. This raises one critical issue of search-engine-side web spam detection: it takes a long time to reach agreement on data that exhibit new spam tricks before any action can be taken to detect them.
Therefore, we propose a spam detection approach that not only follows each individual's judgement on spam but also detects new spam tricks in a timely fashion.

There is no doubt that search engines are victims of web spam. Search engine companies have strong incentives to wipe out web spam and provide high-quality search results to users. However, the individual users who rely on search engines for web surfing are in fact the ultimate victims of web spam. Users should therefore have an even stronger desire for a spam-free experience when using web search engines. To address this problem, in this paper we focus on the situation where web spam detection is initiated and conducted on the user side (i.e., the client side). Intelligent web browsers running on users' computers are a typical application scenario. When a page is retrieved and viewed using an intelligent web browser, an integrated web spam detector can determine whether the page is spam according to the user's own judgements of spam. Intelligent web browsers provide users with more information about, and more control over, web pages. The training data set is built automatically over the course of web surfing, and new spam tricks can be detected in a timely manner.

Compared to the scenario where web spam detection is conducted on the search engine's server side, client-side web spam detection has several advantages.

First, the results of web spam detection conducted on the client side can be personalized. As described above, one critical difference between a spam page and a non-spam page lies in whether the relevance of the page is justifiable. Different users have different opinions on such justifications. One person's spam page may be another person's treasure. In addition, a user's personal opinion on spam may evolve over time. As a result, server-side web spam detection does not always provide satisfactory results. Client-side web spam detection is built specifically to meet each individual's judgements on spam.

Second, a reliable training set for building a personalized web spam classifier is relatively easy to collect on the client side. The quality of the training set greatly affects the quality of the web spam classifier. As mentioned before, universal agreement on spam labeling is not easy to achieve. In intelligent web browsers, we can collect each individual's judgements on spam from previous browsing history. The collected data reflect that individual's personal opinion on spam and are suitable for constructing a high-quality personalized web spam classifier.

Third, client-side web spam detection can respond to new spam tricks in a timely fashion. The web is evolving, and new spam tricks keep emerging. Web search engine companies need to collect sufficient evidence for new spam tricks, so there can be a considerable delay before search engines wipe them out. On the client side, users can identify new spam tricks and provide spam labels whenever the pages are viewed.
Last but not least, users have more control over web search when web spam detection is performed on the client side. Willing or not, web search engines currently determine what is trustworthy and what is not [18]. If a page is judged as spam by a search engine, it is removed from the search results. We believe that the trustworthiness of a page should be a personal decision, not an absolute quality of the page. The proposed personalized web spam detector can help individual users make more informed decisions about the quality of the information they find on the web. Meanwhile, once users' preferences are considered in web search, different users will see different web search results. This makes it much harder, if not impossible, for spammers to make their targeted pages appear universally among the top-ranked results. Thus, spammers' motivation for adopting spam tricks can be largely reduced.

It is worth mentioning that client-side web spam detection complements existing work on server-side web spam detection. The two detection processes are conducted at different stages. After search engine companies detect and filter global spam pages, a client-side web spam detector can still be applied to obtain personalized web spam detection results. Moreover, personalized web spam detection is different from personalized web search [19], although both try to provide different search results for different users based on their interests and preferences. Personalized web search is conducted on the basis of spam-free results given by search engines, while personalized web spam detection uses all the available information on the web and filters out user-defined spam.

In this paper, we describe in detail the technical approaches to building a personalized web spam detector in intelligent web browsers. The remainder of the paper is organized as follows.
In Section II, we briefly review some related studies on web spam detection and personalization techniques. In Section III, we present the framework of a personalized web spam detector. Section IV reports some experimental results obtained from an empirical evaluation. Finally, Section V concludes the paper and outlines some future research directions.

II. RELATED WORK

Web spam detection has attracted much attention in the past several years. Among the different categories of web spam tricks, link spam and content spam are the two main ones [12], [17].

In [11], Gyöngyi et al. described link spam as the situation where spammers set up structures of interconnected pages, called link spam farms, with the sole purpose of boosting link-based ranking. Many efficient techniques have been developed to identify web pages that adopt link spam tricks. Statistical and structural properties of local web graph structures [1], [5], [8], as well as link-based evaluation measures and methods such as spam mass [10], SpamRank [3], and page farms [20], have been used in detecting link spam.

Content spam is the other main type of web spam trick. Unlike link spam tricks, content spam tricks aim at disguising the content of a page to make it appear relevant to many popular web searches. Statistical analysis of web page content, such as long host names [8] and duplicated textual chunks [9], has been widely used in content spam detection [15]. In addition to word-level statistics, topic-level measurements have also been proposed to differentiate spam from non-spam pages [6].

Search engine companies typically conduct web spam detection by integrating both content spam detection and link spam detection techniques. Recently, Erdélyi et al. conducted an empirical study of how various spam features and machine learning algorithms contribute to the quality of web spam detection methods [7]. LogitBoost and RandomForest are reported to achieve superior classification results.
While web spam detection is usually conducted on the search engine's server side, no existing studies focus on developing personalized web spam detection approaches. In the literature, personalization techniques have been widely used in many recommender systems and web-based services, including personalized web search [19], online social networks [4], etc.

III. A PERSONALIZED WEB SPAM DETECTOR

As personalized web spam detection is conducted on the client side, the client (e.g., a web browser) needs to classify a web page as either spam or non-spam. We refer to such a web browser as an intelligent web browser. A typical implementation of an intelligent web browser integrates a personalized web spam detector into a regular web browser. In practice, the personalized web spam detector can be developed as a web browser plug-in. In this paper, we focus on the design of the personalized web spam detector.

The framework of the personalized web spam detector is depicted in Figure 1. Entities in rectangular boxes with solid lines are the major components of the spam detector.

Figure 1. A framework of personalized web spam detection (components: Web Search Engine with Global Spam Classifier; Personalized Spam Detector comprising Personalized Spam Classifier, Personalized Spam Fuser, Personalized Spam Filter, and Feedback Collector).

The process flow of detecting web spam in a personalized way is as follows.

1) When using an intelligent web browser, a user can directly submit a keyword query $q$ to the web search engine. A ranked list of $k$ web pages, denoted as $L_g = \{(p_1, s_g(p_1)), (p_2, s_g(p_2)), \ldots, (p_k, s_g(p_k))\}$, is returned by the search engine. We use $p_i$ ($1 \le i \le k$) to represent the $i$-th ranked page in $L_g$, and $s_g(p_i)$ ($1 \le i \le k$) to represent the probability that $p_i$ is a spam page. The value of $s_g(p_i)$ in $L_g$ is provided by the web search engine using the Global Spam Classifier maintained on the search engine side. In other words, $s_g(p_i)$ captures the search engine's judgement of spam.

2) The list $L_g$ is then forwarded to the Personalized Spam Classifier component. The personalized spam classifier is constructed from the user's previous spam labeling results. It analyzes each page $p_i$ ($1 \le i \le k$) in $L_g$ and produces a personalized spamicity score $s_p(p_i)$. The greater the value of $s_p(p_i)$, the more likely $p_i$ is to be categorized as a spam page by the user. We use $L_p = \{(p_1, s_p(p_1)), (p_2, s_p(p_2)), \ldots, (p_k, s_p(p_k))\}$ to represent the list of pages and their corresponding personalized spamicity scores.

3) The two lists $L_g$ and $L_p$ are then passed to the Personalized Spam Fuser component. The goal of this component is to calculate a fused spamicity score that integrates both the user's and the search engine's judgements of spam. For each page $p_i$ ($1 \le i \le k$), a fused spamicity score, denoted as $s_f(p_i)$, is calculated as

$$s_f(p_i) = s_p(p_i) \cdot \lambda + s_g(p_i) \cdot (1 - \lambda).$$

The parameter $\lambda \in [0, 1]$ determines the weight of $s_p(p_i)$ in the fused spamicity score. When $\lambda$ is set to 1, the fused spamicity score depends solely on $s_p(p_i)$, the user's historical judgements of spam. When $\lambda$ is set to 0, the fused spamicity score equals $s_g(p_i)$, the search engine's judgement of spam. In practice, the value of $\lambda$ can be learned to meet each user's requirements. We use $L_f = \{(p_1, s_f(p_1)), (p_2, s_f(p_2)), \ldots, (p_k, s_f(p_k))\}$ to represent the list of pages and their corresponding fused spamicity scores.

4) The list $L_f$ is used as the input to the Personalized Spam Filter component. This component uses a spamicity threshold $\theta$ ($0 \le \theta \le 1$) to distinguish spam pages from non-spam pages. For a page $p_i \in L_f$ ($1 \le i \le k$), if $s_f(p_i) \ge \theta$, $p_i$ is categorized as a spam page by the personalized spam filter and is removed from the final results. The value of $\theta$ thus controls how aggressive the web spam detection is. The detector recommends to the user the value of $\theta$ that achieves the best spam detection performance. Meanwhile, each user has full control over the value of $\theta$ as his or her judgements of spam evolve over time.

5) After the pages categorized as spam are filtered out, a spam-free list of pages, denoted as $L_{sf}$, is displayed to the user as the final result for the keyword query $q$. The user can click on pages in $L_{sf}$ to see whether they are relevant to $q$. During this step, the user's feedback about spam can be collected. For example, if a page $p_i \in L_{sf}$ ($1 \le i \le k$) is clicked by the user but the user feels that $p_i$ should have been filtered out, the user can simply submit explicit feedback to the Feedback Collector, which is integrated in the intelligent web browser. If the user does not submit any explicit feedback for a page $p_j$, this is regarded as implicit feedback that the user agrees with the current judgement on $p_j$. Based on the user's feedback, the personalized spam classifier is updated to capture the user's new spam judgements.
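Steps 3 and 4 of the process flow can be sketched in a few lines of Python. This is only an illustrative sketch, not the plug-in's actual code: the `ScoredPage` container and the function names `fused_spamicity` and `spam_free_list` are hypothetical, and both spamicity scores are assumed to be already available for every page.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ScoredPage:
    """One search result with both spamicity scores (hypothetical container)."""
    url: str
    s_g: float  # global spamicity from the search engine side, in [0, 1]
    s_p: float  # personalized spamicity from the user's classifier, in [0, 1]


def fused_spamicity(s_p: float, s_g: float, lam: float) -> float:
    """Step 3: s_f = s_p * lam + s_g * (1 - lam), with lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return s_p * lam + s_g * (1.0 - lam)


def spam_free_list(pages: List[ScoredPage], lam: float, theta: float) -> List[ScoredPage]:
    """Step 4: drop every page with s_f >= theta, keeping the original rank order."""
    return [p for p in pages if fused_spamicity(p.s_p, p.s_g, lam) < theta]
```

Calling `spam_free_list(pages, lam=0.0, theta=0.6)` reproduces the search engine's own decision, while `lam=1.0` relies on the user's classifier alone; intermediate values blend the two.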

When implementing the proposed personalized web spam detector in intelligent web browsers, there are several technical challenges. We address these challenges in the following.

A. Obtaining the value of $s_g(p_i)$

Web search engines conduct web spam detection on the server side using the global spam classifier, but the current major web search engines do not explicitly provide such spam judgement scores. Alternatively, we can rely on publicly available APIs of web spam detection tools to obtain binary spam labeling results (e.g., http://tool.motoricerca.info/spam-detector/). If $p_i$ is labeled as spam, we set $s_g(p_i) = 1$; otherwise $s_g(p_i)$ is set to 0. Existing studies indicate that the performance of web spam detection using those APIs is comparable to that of state-of-the-art methods.

B. Constructing a personalized spam classifier

Similar to many recommender systems, we encounter a cold start problem [16] in constructing the personalized spam classifier: the classifier is built upon the user's collected historical spam judgements, and such data do not exist when the user first starts to use the intelligent web browser. To address this problem, we initially set the parameter $\lambda$, which determines the weight of $s_p(p_i)$ in the fused spamicity score, to 0. Once a set of the user's spam judgements has been collected, we use an adaptive parameter tuning method to update the value of $\lambda$, which is introduced in Section III-C.

To build a personalized spam classifier, we need to measure the probability $s_p$ that a page will be categorized as spam (i.e., the personalized spamicity score). Suppose $l$ features, denoted as $x_1, \ldots, x_l$, are obtained for each web page. We can then train a logistic regression model as the spam classifier. The logistic regression equation is defined as follows:

$$\ln \frac{s_p}{1 - s_p} = \gamma_0 + \sum_{i=1}^{l} \gamma_i x_i,$$

where $\gamma_0, \ldots, \gamma_l$ are the coefficients.
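Solving the logit equation above for $s_p$ turns a page's feature vector into a score in (0, 1). The following is a minimal sketch of that scoring step only, assuming the coefficients have already been learned; the function names `log_odds` and `personalized_spamicity` are hypothetical and not taken from the paper.

```python
import math
from typing import Sequence


def log_odds(x: Sequence[float], gamma0: float, gammas: Sequence[float]) -> float:
    """Right-hand side of the logit equation: gamma_0 + sum_i gamma_i * x_i."""
    if len(x) != len(gammas):
        raise ValueError("need exactly one coefficient per feature")
    return gamma0 + sum(g * xi for g, xi in zip(gammas, x))


def personalized_spamicity(x: Sequence[float], gamma0: float, gammas: Sequence[float]) -> float:
    """Invert the logit to obtain s_p, using a form that avoids overflow in exp."""
    z = log_odds(x, gamma0, gammas)
    if z >= 0.0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0, so exp(z) is at most 1 and cannot overflow
    return ez / (1.0 + ez)
```

A page whose log-odds are zero receives a score of exactly 0.5; large positive log-odds push the score toward 1 and large negative log-odds push it toward 0.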
In other words, the probability $s_p$ that a page will be categorized as spam is measured as

$$s_p = \frac{1}{1 + e^{-(\gamma_0 + \sum_{i=1}^{l} \gamma_i x_i)}}.$$

There are several methods to learn the regression coefficients, among which the Newton-Raphson method [13] is a popular one. When a new set of the user's spam judgements is collected, this method can be used to update the personalized spam classifier efficiently.

C. Adaptive parameter tuning in the personalized spam fuser

In the Personalized Spam Fuser, the value of $\lambda$ is initialized to 0, so that the spam decision initially depends on the global spamicity score $s_g(p_i)$. Then $\lambda$ is updated automatically to optimize spam detection performance. The adaptive parameter tuning considers how satisfied the user is with the current spam detection performance.

Given a spamicity threshold $\theta$, let $C_g = \{p_i \mid s_g(p_i) \ge \theta\}$ and $C_p = \{p_i \mid s_p(p_i) \ge \theta\}$, where $C_g$ and $C_p$ represent the sets of pages that are categorized as spam when the global spam classifier and the personalized spam classifier are considered, respectively. We use $C_u$ to denote the set of pages in the final list $L_{sf}$ on which user $u$ provides explicit feedback. That is, $C_u$ contains the pages that user $u$ feels should be categorized as spam. Intuitively, if $C_u$ is more similar to $C_p$ than to $C_g$, we should assign a higher weight to $s_p(p_i)$ when calculating the fused spamicity score. Specifically, let $\lambda_t$ be the current value of $\lambda$ for computing $s_f(p_i)$. We derive the new value $\lambda_{t+1}$ as follows:

$$\lambda_{t+1} = \begin{cases} \lambda_t + \alpha \beta & \text{if } Dice(C_u, C_p) \ge Dice(C_u, C_g), \\ \lambda_t - (1 - \lambda_t)\, \alpha \beta & \text{if } Dice(C_u, C_p) < Dice(C_u, C_g), \end{cases}$$

where $Dice(C_u, C_p)$ is the Dice coefficient [6], defined as

$$Dice(C_u, C_p) = \frac{2\,|C_u \cap C_p|}{|C_u| + |C_p|}.$$

The parameter $\alpha$ is a real number in the range $[0, 1]$ controlling the adjustment rate. The parameter $\beta$ is equal to $|C_u| / |L_{sf}|$. The larger the size of $C_u$, the faster the value of $\lambda$ is updated.

D. Recommending the value for the spamicity threshold $\theta$

The value of $\theta$ controls the number of pages detected as spam. When $\theta$ is low, more pages are classified as spam. Although users have full control over the value of $\theta$, it is often desirable for the system to recommend an appropriate value. In general, the smaller the value of $\theta$, the higher the recall but the lower the precision. A good value for $\theta$ should lead to both relatively high precision and relatively high recall. Therefore, we choose the value of $\theta$ that leads to the best F-measure, which is the harmonic mean of precision and recall. Such a $\theta$ can be obtained using the Simulated Annealing algorithm [14].

IV. AN EMPIRICAL EVALUATION

The proposed personalized web spam detector is implemented as a Firefox plug-in. In this section, we report some experimental results obtained from an empirical evaluation. Our major contribution in this paper is the development of the personalized web spam detection framework. In the following, we mainly focus on the performance of the Personalized Spam Classifier.

A. Web Spam Benchmark Data Set

We adopted the WEBSPAM-UK2007 data set released by the Search Engine Spam Project at Yahoo! Research Barcelona. The spam collection contains a training set and a testing set. A team of 48 volunteers manually labeled the pages as spam, normal, or undecided, based on a comprehensive list of labeling guidelines. A page is labeled as undecided if the volunteer cannot determine whether the page is a spam page or a normal page (i.e., a borderline page). In the data set, each page was labeled by at least one volunteer, and more than 90% of the pages were labeled by at least three volunteers. Given that the number of spam pages in each set is very small, we merged the two sets and adopted 10-fold cross validation for performance evaluation. In total, the data set consists of 6479 unique web pages crawled from the .uk domain.

Table I. Frequency of diverse labels provided by volunteers in the WEBSPAM-UK2007 data set (S: spam; N: normal; U: undecided). Columns: S, N, U, S&N, S&U, N&U, S&N&U; row: #Pages.

B. Diversity of Spam Judgements Across Individuals

We first examined whether spam judgements from different volunteers were consistent. For each page in the data set, we checked whether the labels provided by different volunteers were exactly the same. The results are summarized in Table I. The numbers of pages that received only the spam label, only the normal label, and only the undecided label are 182, 4780, and 395, respectively. The remaining 1,122 pages, about 17.3% of the data set, have at least two different labels. In addition, 84 pages have all three different labels. This result clearly shows that each user has his or her personal judgements on spam.

C. Performance of Personalized Spam Classifier

We conducted experiments to evaluate how effectively the personalized spam classifier can identify spam pages. For this purpose, we set the value of $\lambda$ to 1, and $\theta$ to the recommended value with the highest F-measure.

We extracted a total of 42 different features from each page when training the classifier. These features cover a wide variety of properties of web pages, including link-based features (e.g., in-degree, out-degree, PageRank, etc.) and content-based features (e.g., number of words in the page/title, average word length, etc.).
Among all the available features provided in the WEBSPAM-UK2007 spam collections, these features are reported to achieve high performance for web spam detection.

When volunteers were asked to label those pages, they were required to open each page, check its content, and then assign a label, one page at a time. We treated this as a simulation of using the proposed personalized web spam detector: volunteers open pages one by one in the intelligent web browser and provide explicit feedback based on their judgements of spam. We were interested in whether the performance of the personalized spam classifier improves as the number of collected spam judgements grows. For the pages labeled by each volunteer, we partitioned the pages into 10 parts of equal size. The $i$-th set of pages is denoted as $S_i$ ($1 \le i \le 10$). Volunteers first provided labels for pages in $S_1$, then pages in $S_2$, and so on. Correspondingly, the personalized spam classifier in step $i$ is built using the user's spam judgements collected from $\{S_1, \ldots, S_i\}$, and its performance is measured on $S_{i+1}$ ($1 \le i \le 9$). The AUC value was chosen as the metric for comparing the performance of the personalized spam classifiers.

Figure 2. The performance of the personalized spam classifiers across different steps (Left: Volunteer A; Right: Volunteer B). Axes: Step vs. AUC.

Figure 3. Precision and recall of the personalized spam classifier when the spamicity threshold varies. Axes: Threshold ($\theta$) vs. Precision/Recall.

In Figure 2, we depict the AUC values for two randomly selected volunteers. We identified similar trends for the other volunteers in the data set. When more of the user's spam judgements are included for training the personalized spam classifier, the performance improves. At some particular steps, the spam detection performance may be worse than in the previous step. This may have been caused by new spam tricks used in those spam pages, which cannot be identified in the current step.
However, as Figure 2 shows, these new spam tricks are well captured in the following step(s). This result demonstrates that the personalized web spam detector can respond to new spam tricks in a timely fashion.

D. Results by Varying the Spamicity Threshold

We also conducted experiments to evaluate the effect of the spamicity threshold $\theta$ in the proposed Personalized Spam Filter component. In this experiment, a personalized web spam classifier is trained using only each volunteer's labeled data. A 10-fold cross validation is performed to evaluate the average performance of personalized web spam detection across the participating volunteers. We varied the threshold value $\theta$ from 0.2 to 0.9. Precision and recall for the personalized spam classifier are shown in Figure 3. Clearly, when the value of $\theta$ is small, web spam detection is aggressive, which results in low precision and high recall. When the value of $\theta$ increases, fewer pages are categorized as spam pages. Both precision and recall are affected by the threshold value $\theta$. In our proposed framework, experienced users can choose different threshold values to control how aggressive the web spam detection should be. For inexperienced users, the personalized spam detector can automatically recommend a value (0.6 in this experiment).

V. CONCLUSIONS

In this paper, we addressed the scenario where web spam detection is conducted on the client side (e.g., in intelligent web browsers). We identified the advantages of performing personalized web spam detection, and developed a practical framework for implementing the personalized web spam detector in intelligent web browsers. An empirical evaluation confirmed the advantages of the proposed personalized web spam detection solution. In future research, we plan to incorporate additional user preferences to enhance personalized web spam detection. For example, an individual's activities in public social media can serve as a valuable data source for this purpose.

ACKNOWLEDGMENT

This research is supported in part by a Samsung Global Research Outreach grant and a UMBC Special Research Assistantship/Initiative Support grant. All opinions, findings, conclusions and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agency.

REFERENCES

[1] J. Abernethy, O. Chapelle, and C. Castillo. Web spam identification through content and hyperlinks. In Proceedings of the 4th International Workshop on Adversarial Information Retrieval on the Web. ACM.
[2] S. P. Algur and N. T. Pendari. Hybrid spamicity score approach to web spam detection. In Pattern Recognition, Informatics and Medical Engineering (PRIME), 2012 International Conference on. IEEE, 2012.
[3] A. A. Benczur, K. Csalogany, T. Sarlos, and M. Uher. SpamRank: Fully automatic link spam detection. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '05).
[4] D. Carmel, N. Zwerdling, I. Guy, S. Ofek-Koifman, N. Har'El, I. Ronen, E. Uziel, S. Yogev, and S. Chernov. Personalized social search based on the user's social network. In Proceedings of the 18th ACM Conference on Information and Knowledge Management. ACM.
[5] C. Castillo, D. Donato, A. Gionis, V. Murdock, and F. Silvestri. Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
[6] C. Dong and B. Zhou. Effectively detecting content spam on the web using topical diversity measures. In Proceedings of the 2012 IEEE/WIC/ACM International Conference on Web Intelligence (WI '12), 2012.
[7] M. Erdélyi, A. Garzó, and A. A. Benczúr. Web spam classification: a few features worth more. In Proceedings of the 2011 Joint WICOW/AIRWeb Workshop on Web Quality (WebQuality '11), pages 27-34, New York, NY, USA, 2011. ACM.
[8] D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proceedings of the 7th International Workshop on the Web and Databases (WebDB '04), pages 1-6, New York, NY, USA. ACM Press.
[9] D. Fetterly, M. Manasse, and M. Najork. Detecting phrase-level duplication on the world wide web. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '05), pages 170-177, New York, NY, USA. ACM Press.
[10] Z. Gyöngyi, P. Berkhin, H. Garcia-Molina, and J. Pedersen. Link spam detection based on mass estimation. In Proceedings of the 32nd International Conference on Very Large Databases (VLDB '06). ACM.
[11] Z. Gyöngyi and H. Garcia-Molina. Link spam alliances. In Proceedings of the 31st International Conference on Very Large Databases (VLDB '05). ACM.
[12] Z. Gyöngyi and H. Garcia-Molina. Web spam taxonomy. In Proceedings of the 1st International Workshop on Adversarial Information Retrieval on the Web (AIRWeb '05).
[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[14] S. Kirkpatrick, C. D. Gelatt Jr., and M. P. Vecchi. Optimization by simulated annealing. Science, 220(4598):671-680, 1983.
[15] A. Ntoulas, M. Najork, M. Manasse, and D. Fetterly. Detecting spam web pages through content analysis. In Proceedings of the 15th International World Wide Web Conference (WWW '06), pages 83-92, New York, NY, USA. ACM Press.
[16] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM.
[17] N. Spirin and J. Han. Survey on web spam detection: principles and algorithms. SIGKDD Explorations, 13(2):50-64, 2011.
[18] M. Totty and M. Mangalindan. As Google becomes web's gatekeeper, sites fight to get in. Wall Street Journal XXCLI, 39.
[19] J.-R. Wen, Z. Dou, and R. Song. Personalized web search. Encyclopedia of Database Systems.
[20] B. Zhou and J. Pei. Link spam target detection using page farms. ACM Transactions on Knowledge Discovery from Data (TKDD), 3(3).


Fraudulent Support Telephone Number Identification Based on Co-occurrence Information on the Web Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence Fraudulent Support Telephone Number Identification Based on Co-occurrence Information on the Web Xin Li, Yiqun Liu, Min Zhang,

More information

Previous: how search engines work

Previous: how search engines work detection Ricardo Baeza-Yates,3 ricardo@baeza.cl With: L. Becchetti 2, P. Boldi 5, C. Castillo, D. Donato, A. Gionis, S. Leonardi 2, V.Murdock, M. Santini 5, F. Silvestri 4, S. Vigna 5. Yahoo! Research

More information

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham

More information

A Method for Finding Link Hijacking Based on Modified PageRank Algorithms

A Method for Finding Link Hijacking Based on Modified PageRank Algorithms DEWS2008 A10-1 A Method for Finding Link Hijacking Based on Modified PageRank Algorithms Young joo Chung Masashi Toyoda Masaru Kitsuregawa Institute of Industrial Science, University of Tokyo 4-6-1 Komaba

More information

Fast or furious? - User analysis of SF Express Inc

Fast or furious? - User analysis of SF Express Inc CS 229 PROJECT, DEC. 2017 1 Fast or furious? - User analysis of SF Express Inc Gege Wen@gegewen, Yiyuan Zhang@yiyuan12, Kezhen Zhao@zkz I. MOTIVATION The motivation of this project is to predict the likelihood

More information

VisoLink: A User-Centric Social Relationship Mining

VisoLink: A User-Centric Social Relationship Mining VisoLink: A User-Centric Social Relationship Mining Lisa Fan and Botang Li Department of Computer Science, University of Regina Regina, Saskatchewan S4S 0A2 Canada {fan, li269}@cs.uregina.ca Abstract.

More information

University of TREC 2009: Indexing half a billion web pages

University of TREC 2009: Indexing half a billion web pages University of Twente @ TREC 2009: Indexing half a billion web pages Claudia Hauff and Djoerd Hiemstra University of Twente, The Netherlands {c.hauff, hiemstra}@cs.utwente.nl draft 1 Introduction The University

More information

ResPubliQA 2010

ResPubliQA 2010 SZTAKI @ ResPubliQA 2010 David Mark Nemeskey Computer and Automation Research Institute, Hungarian Academy of Sciences, Budapest, Hungary (SZTAKI) Abstract. This paper summarizes the results of our first

More information

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems

A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems A novel supervised learning algorithm and its use for Spam Detection in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University of Economics

More information

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH Abstract We analyze the factors contributing to the relevance of a web-page as computed by popular industry web search-engines. We also

More information

Search Costs vs. User Satisfaction on Mobile

Search Costs vs. User Satisfaction on Mobile Search Costs vs. User Satisfaction on Mobile Manisha Verma, Emine Yilmaz University College London mverma@cs.ucl.ac.uk, emine.yilmaz@ucl.ac.uk Abstract. Information seeking is an interactive process where

More information

Link Prediction for Social Network

Link Prediction for Social Network Link Prediction for Social Network Ning Lin Computer Science and Engineering University of California, San Diego Email: nil016@eng.ucsd.edu Abstract Friendship recommendation has become an important issue

More information

Learning to Detect Web Spam by Genetic Programming

Learning to Detect Web Spam by Genetic Programming Learning to Detect Web Spam by Genetic Programming Xiaofei Niu 1,3, Jun Ma 1,, Qiang He 1, Shuaiqiang Wang 2, and Dongmei Zhang 1,3 1 School of Computer Science and Technology, Shandong University, Jinan

More information

A Method for Finding Link Hijacking Based on Modified PageRank Algorithms

A Method for Finding Link Hijacking Based on Modified PageRank Algorithms DEWS2008 A10-1 A Method for Finding Link Hijacking Based on Modified PageRank Algorithms Young joo Chung Masashi Toyoda Masaru Kitsuregawa Institute of Industrial Science, University of Tokyo 4-6-1 Komaba

More information

Assisting Trustworthiness Based Web Services Selection Using the Fidelity of Websites *

Assisting Trustworthiness Based Web Services Selection Using the Fidelity of Websites * Assisting Trustworthiness Based Web Services Selection Using the Fidelity of Websites * Lijie Wang, Fei Liu, Ge Li **, Liang Gu, Liangjie Zhang, and Bing Xie Software Institute, School of Electronic Engineering

More information

Reducing Redundancy with Anchor Text and Spam Priors

Reducing Redundancy with Anchor Text and Spam Priors Reducing Redundancy with Anchor Text and Spam Priors Marijn Koolen 1 Jaap Kamps 1,2 1 Archives and Information Studies, Faculty of Humanities, University of Amsterdam 2 ISLA, Informatics Institute, University

More information

Fraud Detection of Mobile Apps

Fraud Detection of Mobile Apps Fraud Detection of Mobile Apps Urmila Aware*, Prof. Amruta Deshmuk** *(Student, Dept of Computer Engineering, Flora Institute Of Technology Pune, Maharashtra, India **( Assistant Professor, Dept of Computer

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Survey on Community Question Answering Systems

Survey on Community Question Answering Systems World Journal of Technology, Engineering and Research, Volume 3, Issue 1 (2018) 114-119 Contents available at WJTER World Journal of Technology, Engineering and Research Journal Homepage: www.wjter.com

More information

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating

Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Combining Review Text Content and Reviewer-Item Rating Matrix to Predict Review Rating Dipak J Kakade, Nilesh P Sable Department of Computer Engineering, JSPM S Imperial College of Engg. And Research,

More information

New Classification Method Based on Decision Tree for Web Spam Detection

New Classification Method Based on Decision Tree for Web Spam Detection Research Article International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347-5161 2014 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Rashmi

More information

Optimized Searching Algorithm Based On Page Ranking: (OSA PR)

Optimized Searching Algorithm Based On Page Ranking: (OSA PR) Volume 2, No. 5, Sept-Oct 2011 International Journal of Advanced Research in Computer Science RESEARCH PAPER Available Online at www.ijarcs.info ISSN No. 0976-5697 Optimized Searching Algorithm Based On

More information

IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK

IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK IMAGE RETRIEVAL SYSTEM: BASED ON USER REQUIREMENT AND INFERRING ANALYSIS TROUGH FEEDBACK 1 Mount Steffi Varish.C, 2 Guru Rama SenthilVel Abstract - Image Mining is a recent trended approach enveloped in

More information

Splog Detection Using Self-Similarity Analysis on Blog Temporal Dynamics. Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura and Belle Tseng

Splog Detection Using Self-Similarity Analysis on Blog Temporal Dynamics. Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura and Belle Tseng Splog Detection Using Self-Similarity Analysis on Blog Temporal Dynamics Yu-Ru Lin, Hari Sundaram, Yun Chi, Junichi Tatemura and Belle Tseng NEC Laboratories America, Cupertino, CA AIRWeb Workshop 2007

More information

Comparison of Optimization Methods for L1-regularized Logistic Regression

Comparison of Optimization Methods for L1-regularized Logistic Regression Comparison of Optimization Methods for L1-regularized Logistic Regression Aleksandar Jovanovich Department of Computer Science and Information Systems Youngstown State University Youngstown, OH 44555 aleksjovanovich@gmail.com

More information

Ranking web pages using machine learning approaches

Ranking web pages using machine learning approaches University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 Ranking web pages using machine learning approaches Sweah Liang Yong

More information

Community-Based Recommendations: a Solution to the Cold Start Problem

Community-Based Recommendations: a Solution to the Cold Start Problem Community-Based Recommendations: a Solution to the Cold Start Problem Shaghayegh Sahebi Intelligent Systems Program University of Pittsburgh sahebi@cs.pitt.edu William W. Cohen Machine Learning Department

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 5, NO. 3, SEPTEMBER

IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 5, NO. 3, SEPTEMBER IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY, VOL. 5, NO. 3, SEPTEMBER 2010 581 Web Spam Detection: New Classification Features Based on Qualified Link Analysis and Language Models Lourdes Araujo

More information

Analyzing and Detecting Review Spam

Analyzing and Detecting Review Spam Seventh IEEE International Conference on Data Mining Analyzing and Detecting Review Spam Nitin Jindal and Bing Liu Department of Computer Science University of Illinois at Chicago nitin.jindal@gmail.com,

More information

Overview of the TREC 2013 Crowdsourcing Track

Overview of the TREC 2013 Crowdsourcing Track Overview of the TREC 2013 Crowdsourcing Track Mark D. Smucker 1, Gabriella Kazai 2, and Matthew Lease 3 1 Department of Management Sciences, University of Waterloo 2 Microsoft Research, Cambridge, UK 3

More information

WEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS

WEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS WEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS Juan Martinez-Romo and Lourdes Araujo Natural Language Processing and Information Retrieval Group at UNED * nlp.uned.es Fifth International Workshop

More information

An Empirical Study of Lazy Multilabel Classification Algorithms

An Empirical Study of Lazy Multilabel Classification Algorithms An Empirical Study of Lazy Multilabel Classification Algorithms E. Spyromitros and G. Tsoumakas and I. Vlahavas Department of Informatics, Aristotle University of Thessaloniki, 54124 Thessaloniki, Greece

More information

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track

Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Challenges on Combining Open Web and Dataset Evaluation Results: The Case of the Contextual Suggestion Track Alejandro Bellogín 1,2, Thaer Samar 1, Arjen P. de Vries 1, and Alan Said 1 1 Centrum Wiskunde

More information

RSDC 09: Tag Recommendation Using Keywords and Association Rules

RSDC 09: Tag Recommendation Using Keywords and Association Rules RSDC 09: Tag Recommendation Using Keywords and Association Rules Jian Wang, Liangjie Hong and Brian D. Davison Department of Computer Science and Engineering Lehigh University, Bethlehem, PA 18015 USA

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION

LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION Evgeny Kharitonov *, ***, Anton Slesarev *, ***, Ilya Muchnik **, ***, Fedor Romanenko ***, Dmitry Belyaev ***, Dmitry Kotlyarov *** * Moscow Institute

More information

Identifying Spam Link Generators for Monitoring Emerging Web Spam

Identifying Spam Link Generators for Monitoring Emerging Web Spam Identifying Spam Link Generators for Monitoring Emerging Web Spam Young-joo Chung chung@tkl.iis.utokyo.ac.jp Masashi Toyoda toyoda@tkl.iis.utokyo.ac.jp Institute of Industrial Science, The University of

More information

Linking Entities in Chinese Queries to Knowledge Graph

Linking Entities in Chinese Queries to Knowledge Graph Linking Entities in Chinese Queries to Knowledge Graph Jun Li 1, Jinxian Pan 2, Chen Ye 1, Yong Huang 1, Danlu Wen 1, and Zhichun Wang 1(B) 1 Beijing Normal University, Beijing, China zcwang@bnu.edu.cn

More information

Detecting Spam Web Pages

Detecting Spam Web Pages Detecting Spam Web Pages Marc Najork Microsoft Research Silicon Valley About me 1989-1993: UIUC (home of NCSA Mosaic) 1993-2001: Digital Equipment/Compaq Started working on web search in 1997 Mercator

More information

A novel approach of web search based on community wisdom

A novel approach of web search based on community wisdom University of Wollongong Research Online Faculty of Engineering - Papers (Archive) Faculty of Engineering and Information Sciences 2008 A novel approach of web search based on community wisdom Weiliang

More information

Topical TrustRank: Using Topicality to Combat Web Spam

Topical TrustRank: Using Topicality to Combat Web Spam Topical TrustRank: Using Topicality to Combat Web Spam Baoning Wu Vinay Goel Brian D. Davison Department of Computer Science & Engineering Lehigh University Bethlehem, PA 18015 USA {baw4,vig204,davison}@cse.lehigh.edu

More information

Effective Page Refresh Policies for Web Crawlers

Effective Page Refresh Policies for Web Crawlers For CS561 Web Data Management Spring 2013 University of Crete Effective Page Refresh Policies for Web Crawlers and a Semantic Web Document Ranking Model Roger-Alekos Berkley IMSE 2012/2014 Paper 1: Main

More information

Web Data mining-a Research area in Web usage mining

Web Data mining-a Research area in Web usage mining IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 13, Issue 1 (Jul. - Aug. 2013), PP 22-26 Web Data mining-a Research area in Web usage mining 1 V.S.Thiyagarajan,

More information

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München

Evaluation Measures. Sebastian Pölsterl. April 28, Computer Aided Medical Procedures Technische Universität München Evaluation Measures Sebastian Pölsterl Computer Aided Medical Procedures Technische Universität München April 28, 2015 Outline 1 Classification 1. Confusion Matrix 2. Receiver operating characteristics

More information

Effective Latent Space Graph-based Re-ranking Model with Global Consistency

Effective Latent Space Graph-based Re-ranking Model with Global Consistency Effective Latent Space Graph-based Re-ranking Model with Global Consistency Feb. 12, 2009 1 Outline Introduction Related work Methodology Graph-based re-ranking model Learning a latent space graph A case

More information

A Reference Collection for Web Spam

A Reference Collection for Web Spam A Reference Collection for Web Spam Carlos Castillo 1,3, Debora Donato 1,3, Luca Becchetti 1, Paolo Boldi 2, Stefano Leonardi 1, Massimo Santini 2 and Sebastiano Vigna 2 1 Università di Roma 2 Università

More information

University of Delaware at Diversity Task of Web Track 2010

University of Delaware at Diversity Task of Web Track 2010 University of Delaware at Diversity Task of Web Track 2010 Wei Zheng 1, Xuanhui Wang 2, and Hui Fang 1 1 Department of ECE, University of Delaware 2 Yahoo! Abstract We report our systems and experiments

More information

Automated Online News Classification with Personalization

Automated Online News Classification with Personalization Automated Online News Classification with Personalization Chee-Hong Chan Aixin Sun Ee-Peng Lim Center for Advanced Information Systems, Nanyang Technological University Nanyang Avenue, Singapore, 639798

More information

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008 Countering Spam Using Classification Techniques Steve Webb webb@cc.gatech.edu Data Mining Guest Lecture February 21, 2008 Overview Introduction Countering Email Spam Problem Description Classification

More information

Deep Character-Level Click-Through Rate Prediction for Sponsored Search

Deep Character-Level Click-Through Rate Prediction for Sponsored Search Deep Character-Level Click-Through Rate Prediction for Sponsored Search Bora Edizel - Phd Student UPF Amin Mantrach - Criteo Research Xiao Bai - Oath This work was done at Yahoo and will be presented as

More information

Mining Frequent Itemsets for data streams over Weighted Sliding Windows

Mining Frequent Itemsets for data streams over Weighted Sliding Windows Mining Frequent Itemsets for data streams over Weighted Sliding Windows Pauray S.M. Tsai Yao-Ming Chen Department of Computer Science and Information Engineering Minghsin University of Science and Technology

More information

Detecting Tag Spam in Social Tagging Systems with Collaborative Knowledge

Detecting Tag Spam in Social Tagging Systems with Collaborative Knowledge 2009 Sixth International Conference on Fuzzy Systems and Knowledge Discovery Detecting Tag Spam in Social Tagging Systems with Collaborative Knowledge Kaipeng Liu Research Center of Computer Network and

More information

A Time-based Recommender System using Implicit Feedback

A Time-based Recommender System using Implicit Feedback A Time-based Recommender System using Implicit Feedback T. Q. Lee Department of Mobile Internet Dongyang Technical College Seoul, Korea Abstract - Recommender systems provide personalized recommendations

More information

A Framework for Securing Databases from Intrusion Threats

A Framework for Securing Databases from Intrusion Threats A Framework for Securing Databases from Intrusion Threats R. Prince Jeyaseelan James Department of Computer Applications, Valliammai Engineering College Affiliated to Anna University, Chennai, India Email:

More information

Robust Relevance-Based Language Models

Robust Relevance-Based Language Models Robust Relevance-Based Language Models Xiaoyan Li Department of Computer Science, Mount Holyoke College 50 College Street, South Hadley, MA 01075, USA Email: xli@mtholyoke.edu ABSTRACT We propose a new

More information

An Archetype for Web Mining with Enhanced Topical Crawler and Link-Spam Trapper

An Archetype for Web Mining with Enhanced Topical Crawler and Link-Spam Trapper International Journal of Engineering Science Invention Volume 2 Issue 3 ǁ March. 2013 An Archetype for Web Mining with Enhanced Topical Crawler and Link-Spam Trapper * N.Kabilan *Student, BE Computer Science

More information

Songtao Hei. Thesis under the direction of Professor Christopher Jules White

Songtao Hei. Thesis under the direction of Professor Christopher Jules White COMPUTER SCIENCE A Decision Tree Based Approach to Filter Candidates for Software Engineering Jobs Using GitHub Data Songtao Hei Thesis under the direction of Professor Christopher Jules White A challenge

More information

Inferring User Search for Feedback Sessions

Inferring User Search for Feedback Sessions Inferring User Search for Feedback Sessions Sharayu Kakade 1, Prof. Ranjana Barde 2 PG Student, Department of Computer Science, MIT Academy of Engineering, Pune, MH, India 1 Assistant Professor, Department

More information

Byzantine Consensus in Directed Graphs

Byzantine Consensus in Directed Graphs Byzantine Consensus in Directed Graphs Lewis Tseng 1,3, and Nitin Vaidya 2,3 1 Department of Computer Science, 2 Department of Electrical and Computer Engineering, and 3 Coordinated Science Laboratory

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2

RETRACTED ARTICLE. Web-Based Data Mining in System Design and Implementation. Open Access. Jianhu Gong 1* and Jianzhi Gong 2 Send Orders for Reprints to reprints@benthamscience.ae The Open Automation and Control Systems Journal, 2014, 6, 1907-1911 1907 Web-Based Data Mining in System Design and Implementation Open Access Jianhu

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,

More information

Query Independent Scholarly Article Ranking

Query Independent Scholarly Article Ranking Query Independent Scholarly Article Ranking Shuai Ma, Chen Gong, Renjun Hu, Dongsheng Luo, Chunming Hu, Jinpeng Huai SKLSDE Lab, Beihang University, China Beijing Advanced Innovation Center for Big Data

More information

Detecting Content Spam on the Web through Text Diversity Analysis

Detecting Content Spam on the Web through Text Diversity Analysis Detecting Content Spam on the Web through Text Diversity Analysis Anton Pavlov pavvloff@yandex.ru M.V. Lomonosov Mosco State University, Faculty of Computational Mathematics and Cybernetics Boris Dobrov

More information

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity

Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Disambiguating Search by Leveraging a Social Context Based on the Stream of User s Activity Tomáš Kramár, Michal Barla and Mária Bieliková Faculty of Informatics and Information Technology Slovak University

More information

REPORT DOCUMENTATION PAGE

REPORT DOCUMENTATION PAGE REPORT DOCUMENTATION PAGE Form Approved OMB NO. 0704-0188 The public reporting burden for this collection of information is estimated to average 1 hour per response, including the time for reviewing instructions,

More information

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems

A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems A modified and fast Perceptron learning rule and its use for Tag Recommendations in Social Bookmarking Systems Anestis Gkanogiannis and Theodore Kalamboukis Department of Informatics Athens University

More information

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming

Automatic New Topic Identification in Search Engine Transaction Log Using Goal Programming Proceedings of the 2012 International Conference on Industrial Engineering and Operations Management Istanbul, Turkey, July 3 6, 2012 Automatic New Topic Identification in Search Engine Transaction Log

More information

Web Usage Mining: A Research Area in Web Mining

Web Usage Mining: A Research Area in Web Mining Web Usage Mining: A Research Area in Web Mining Rajni Pamnani, Pramila Chawan Department of computer technology, VJTI University, Mumbai Abstract Web usage mining is a main research area in Web mining

More information

Improving the Quality of Search in Personalized Web Search

Improving the Quality of Search in Personalized Web Search Improving the Quality of Search in Personalized Web Search P.Ramya PG Scholar, Dept of CSE, Chiranjeevi Reddy Institute of Engineering & Technology, AP, India. S.Sravani Assistant Professor, Dept of CSE,

More information

Multi-label classification using rule-based classifier systems

Multi-label classification using rule-based classifier systems Multi-label classification using rule-based classifier systems Shabnam Nazmi (PhD candidate) Department of electrical and computer engineering North Carolina A&T state university Advisor: Dr. A. Homaifar

More information

Deep Web Crawling and Mining for Building Advanced Search Application

Deep Web Crawling and Mining for Building Advanced Search Application Deep Web Crawling and Mining for Building Advanced Search Application Zhigang Hua, Dan Hou, Yu Liu, Xin Sun, Yanbing Yu {hua, houdan, yuliu, xinsun, yyu}@cc.gatech.edu College of computing, Georgia Tech

More information

The Comparative Study of Machine Learning Algorithms in Text Data Classification*

The Comparative Study of Machine Learning Algorithms in Text Data Classification* The Comparative Study of Machine Learning Algorithms in Text Data Classification* Wang Xin School of Science, Beijing Information Science and Technology University Beijing, China Abstract Classification

More information

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers

Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers Tour-Based Mode Choice Modeling: Using An Ensemble of (Un-) Conditional Data-Mining Classifiers James P. Biagioni Piotr M. Szczurek Peter C. Nelson, Ph.D. Abolfazl Mohammadian, Ph.D. Agenda Background

More information

FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM

FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM FEATURE SELECTION-MODEL-BASED CONTENT ANALYSIS FOR COMBATING WEB SPAM Shipra Mittal 1* and Akanksha Juneja 2 Department of Computer Science & Engineering, National Institute of Technology, Delhi, India

More information

A Survey of Major Techniques for Combating Link Spamming

A Survey of Major Techniques for Combating Link Spamming Journal of Information & Computational Science 7: (00) 9 6 Available at http://www.joics.com A Survey of Major Techniques for Combating Link Spamming Yi Li a,, Jonathan J. H. Zhu b, Xiaoming Li c a CNDS

More information

Estimating Credibility of User Clicks with Mouse Movement and Eye-tracking Information

Estimating Credibility of User Clicks with Mouse Movement and Eye-tracking Information Estimating Credibility of User Clicks with Mouse Movement and Eye-tracking Information Jiaxin Mao, Yiqun Liu, Min Zhang, Shaoping Ma Department of Computer Science and Technology, Tsinghua University Background

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information
