Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam

Size: px
Start display at page:

Download "Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam"

Transcription

1 Thwarting the Nigritude Ultramarine: Learning to Identify Link Spam Isabel Drost and Tobias Scheffer Humboldt-Universität zu Berlin, Department of Computer Science Unter den Linden 6, 0099 Berlin, Germany {drost, Abstract. The page rank of a commercial web site has an enormous economic impact because it directly influences the number of potential customers that find the site as a highly ranked search engine result. Link spamming inflating the page rank of a target page by artificially creating many referring pages has therefore become a common practice. In order to maintain the quality of their search results, search engine providers try to oppose efforts that decorrelate page rank and relevance and maintain blacklists of spamming pages while spammers, at the same time, try to camouflage their spam pages. We formulate the problem of identifying link spam and discuss a methodology for generating training data. Experiments reveal the effectiveness of classes of intrinsic and relational attributes and shed light on the robustness of classifiers against obfuscation of attributes by an adversarial spammer. We identify open research problems related to web spam. Introduction Search engines combine the similarity between query and page content with the page rank [8] of candidate pages to rank their results. Intuitively, every web page creates a small quantity of page rank, collects additional rank via inbound hyperlinks, and propagates its total rank via its outbound links. A web page that is referred to by many highly ranked pages thus becomes more likely to be returned in response to a search engine query. The success of commercial web sites crucially depends on the number of visitors that find the site while searching for a particular product. Because of the enormous commercial impact of a high page rank, an entire new business sector search engine optimization is rapidly developing. Search engine optimizers offer a service that is referred to as link spamming: theycreatelink farms, arrays of densely linked web pages that refer to the target page, thus inflating its page rank. In 2004, the search engine optimization industry tournamented in the DarkBlue SEO Challenge. The goal of this competition was to be ranked first on Google for the query nigritude ultramarine, a nonsense term that used to produce zero hits prior to the challenge and produced over 500,000 hits by the Proceedings of the European Conference on Machine Learning. 2005

2 competition deadline. Industry insiders believe that as many as 75 million out of the 50 million web servers that are online today may be operated with the sole purpose of increasing the page rank of their target sites. As the page rank becomes subject to manipulation, it loses its correlation to the true relevance of a web page. This deteriorates the quality of search engine results. Search engine companies maintain blacklists of link spamming pages, but they fight an uneven battle in which humans identify and penalize spamming pages and software tools automatically create new spamming domains, and camouflage them, for instance by filling in inconspicuous content. We formulate and analyze the problem of link spam identification, present a method for generation of labeled examples, and discuss intrinsic and relational features. Experiments with recursive feature elimination shed light on the relevance of several classes of attributes and their contribution to the classifier s robustness against adversarial obfuscation of discriminating properties. The rest of the paper is organized as follows. We discuss related work in Section 2 and introduce our problem setting in Section 3. Section 4 introduces the features and employed methods. Section 5 details our experimental results. We discuss open research problems in Section 6 and conclude in Section 7. 2 Related Work Henzinger [5] refers to automatic identification of link spam as one of the most important challenges for search engines. Davison [8] studies the problem of recognizing nepotistic links. A decision tree experiment using features that refer to URL, IP, content, and some linkage properties indicates a number of relevant features. Classifying links imposes the effort of labeling individual links, not just entire pages, on the user. Lempel et al. [7] use similar features to classify web pages into business, university, and other general classes. Fetterly et al. [] analyze the distribution of many web page features over 429 million pages. They find that many outliers in several features are link spam. Pages that change frequently and clusters of near-identical pages are also more likely to be spam [2, 0]. They conclude that experiments should be conducted to reveal whether these features can be used by a machine learning method. Similarly, Broder et al. [5] and Bharat et al. [3] analyze large amounts of web pages and observe that the in-degree and out-degree of web pages that should be governed by Zipf s law, empirically deviates from this distribution. They find that artificially generated link farms distort the distribution. The TrustRank approach [4] propagates trust weights along the hyperlinks. But while page rank is generated by every web page (including link farms), trust rank is generated only by manually selected trusted web pages. The manual selection of trusted pages, however, creates a perceptive bias as unknown and remote websites become less visible. Wu and Davison [9] study a simple heutistic that discovers some link farms: pages which have many inlinks and outlinks whose domains match are collected as candidate pages. They use a paopagation

3 strategy, and remove edges between likely link farms to compute a page rank that is less prone to manipulation. 3 Problem Description The correlation between page rank and relevance of a web page is based on the assumption that each hyperlink expresses a vote for the relevance of a site. The PageRank algorithm [8] calculates the rank (Equation ) of a page y that consists of a small constant amount (d corresponds to the probability that a surfer follows a link rather than restarting at a random site, N is the number of web pages), plus the accumulated rank of all referring pages x. R(y) = d N + d x y R(x) outlinks(x) () Link spamming is referred to as any intentional manipulation of the page rank of a web page. Pages created for this purpose are called link spam. Artificially generated arrays of web pages violate the assumption that links are independent votes of confidence, and therefore decorrelate page rank and relevance. Several common techniques inflate the page rank of a target site. They are often combined. We will describe these techniques briefly. A more detailed taxonomy of web spam techniques is presented by Gyöngyi [3]. Link farms are densely connected arrays of pages. A target page harvests the constant amount of page rank that each page creates and propagates through its links. Each page within the link farm has to be connected to the central, strongly connected component of the web. The farming pages have to propagate their page rank to the target. This can for instance be achieved by a or funnel-shaped architecture in which the majority of links points directly or indirectly towards the target page. In order to camouflage link farms, tools fill in inconspicuous content, for instance, by copying news bulletins. Link exchange services create listings of (often unrelated) hyperlinks. In order to be listed, businesses have to provide a back link that enhances the page rank of the exchange service and, in most cases, pay a fee. Guestbooks, discussion boards, and weblogs ( blogs ) allow readers to create HTML comments via a web interface. Automatic tools post large amounts of messages to many such boards, each message contains a hyperlink to the target website. In order to maintain a tight coupling between page rank and relevance, it is necessary to eliminate the influence that link spam has on the page rank. Search engines maintain a blacklist of spamming pages. It is believed that Google employs the BadRank algorithm. The bad rank (Equation 2) is initialized to a high value E(x) for blacklisted pages. It propagates bad rank to all referring pages (with a damping factor) thus penalizing pages that refer to spam.

4 BR(x) =E(x)( d)+d BR(y) inlinks(y) x y (2) This method, however, is not suited to automatically detect newly created link farms that do not include links to an older, already blacklisted site. Search engines therefore rely on manual detection of new link spam. We focus on the problem of automatically identifying link spam. We seek to construct a classifier that receives a URL as input and decides whether the encoded page is created for the sole purpose of manipulating the page rank of some page ( spam ), or whether it is a regular web page created for a different purpose ( ham ). The typical application scenario of a link spam identification method would be as follows. Search engines interleave crawling and page rank updating. After crawling a web page, the page s impact on the page rank of referenced pages is updated the BadRank and blacklist status of the page are considered at this point. In addition, the output of a spam classifier for the crawled page can now be taken into account. Depending on the output, the page s impact on the page rank can be reduced, or the page can be disregarded entirely. Since the class distribution of spam versus ham is not known, we use the area under the ROC curve () as evaluation metric. The equals the probability that a randomly drawn spam page receives a higher decision function value than a random ham page; the is invariant of the class prior. Link spam identification is an adversarial classification problem [7]. Spammers will probe any filter that is in effect and manipulate the properties of their generated pages in order to dodge the classifier. Therefore, a high classification accuracy is not sufficient to imply its practical usefulness. The practical benefit of a classifier is furthermore dependent of its robustness against purposeful obfuscation of attributes by an adversary. We study how the performance of classifiers deteriorates as an increasing number of attributes becomes obfuscated as the corresponding properties of spam pages are adapted to those of ham pages. 4 Representing and Obtaining Examples In this section, we address the issues of representing instances and obtaining training examples. We describe our publicly available link spam data set. Table provides an overview on the features that we use in our experiments to represent an instance x 0. Many of these features are reimplementations of, or have been inspired by, features suggested by [8] and []. Notably, however, the tfidf representation of the page, and also other features including the MD5 hash features, have not been studied for web spam problems before. Figure illustrates the neighborhood of the pivotal page x 0 represented by intrinsic and relational properties. Most elementarily, the tfidf (term frequency, inverse document frequency) representation of the page content provides an intrinsic representation. It creates one dimension in the feature vector for each word in the corpus. It gives large weights to terms that are frequent in the document but infrequent in the whole document corpus.

5 Table. Attributes of web page x 0. Textual content of the page x 0; tfidf vector. The following features for x 0 are computed () for the pivotal page itself (X = {x 0}). Here, no aggregation is necessary. (2) For the predecessors X = pred(x 0). The attributes are aggregated over the elements of X using aggregation functions sum and average. (3) For the successors (X = succ(x 0)). Aggregation functions sum and average. Boolean features are aggregated by treating true as, and false as 0. Number of tokens in keyword meta-tag, aggregated over all pages x X. Number of tokens in title, aggregated over all pages x X. Number of tokens in description meta-tag, aggregated over all pages x X. Is the page a redirection? Aggregated over all pages x X. Number of inlinks of x, aggregated over all pages x X. Number of outlinks of x, aggregated over all pages x X. Number of characters in the URL of x, aggregated over all pages x X. Number of characters in the domain name of x, aggregated over all pages x X. Number of subdomains in the URL of x, aggregated over all pages x X. Page length of x, aggregated over all pages x X. Domain ending.edu or.org? Aggregated over all pages x X Domain ending.com or.biz? Aggregated over all pages x X. URL contains tilde? Aggregated over all pages x X. The following block of context similarity features are calculated () for predecessors (X = pred(x 0)) and (2) for the successors (X = succ(x 0)); sum and ratio are used to aggregate the features over all elements of X. Clustering coefficient of X; sum and ratio of elements of X with links between them. Elements of X with the same IP address as some other element of X; sum and ratio. Elements of X that have the same length as some other element of X; sum and ratio. Pages that are referred to in x 0 andalsoinanelementsofx, sum and ratio. Pages referred to from an element of X that are also referred to from another element of X; sum and ratio. Pages in X that have the same MD5 hash code as some other element of X. This value is high if X contains many identical pages; sum and ratio. Elements of X that have the same IP as x 0. Elements of X that have the same length as x 0. Pages in X that have the same MD5 hash code as x 0. The next block of attributes is determined for the page x 0 itself as well as for its predecessors pred(x 0 ) and successors succ(x 0 ). The features are determined for every element x pred(x 0 )(orx succ(x 0 ), respectively) and aggregated over all elements x. Both, summation and averaging are used as aggregation functions. The third block quantifies collective features of x 0 and its predecessors, and x 0 and its successors. Some of them require further elaboration. Following [9], we define the clustering coefficient of a set X of web pages as the number of linked pairs divided by the number of all possible pairs, X ( X ). The clustering coefficient is if all elements in X are mutually linked and 0 if

6 ? Fig.. Neighborhood of page x 0 to classify. The tfidf vector is calculated for x 0;intrinsic features (second block of Table ) are calculated for x 0, pred(x 0), and succ(x 0); context similarity features (third block of Table ) for pred(x 0)andsucc(x 0). no links between elements of X exist. The MD5 hash is frequently used as a mechanism for digital signatures. It maps a text to a code word of 28 bit such that collisions are unlikely. That is, when two documents have the same MD5 hash code, then it is unlikely that they differ. The MD5 features quantify the number and ratio of predecessors and successors of x 0 that are textually equal. In total, each page is represented by 89 features plus its tfidf vector. Web pages explicitly contain outbound links but, of course, not their inbound links. To be able to determine the feature vector of a given page x 0, it is necessary to crawl its direct successors with standard web crawling tools (we use nutch [6] for our experiments). Crawling backwards (moving from a page to its references) can be achieved by specific search engine query tokens (e.g., link: in Google). 4. Crawling Examples Training a classifier requires labeled samples. Deciding whether an example web page is link spam requires human judgment, but we can exploit efforts already exercised by other persons. The Dmoz open directory project is a taxonomy of web pages that covers virtually all topics on the web. All entries have been reviewed by volunteers. While the Dmoz entries do contain pages whose rank is being inflated by link farms (e.g., online casinos) the listed pages themselves contain valid content and there are no instances of spam. We create examples of ham (non-link-spam) by drawing 854 Dmoz entries at random. Search engine operators maintain blacklists of spamming pages. They could, if public, provide a rich supply of examples of spam. In order to investigate their distinct characteristics, we differentiate between guestbook spam and a second class of spam that includes link farms and link exchange sites, and generate distinct samples for these sets. We obtain 25 examples of guestbook spam by drawing URLs from a publicly available manually edited blacklist that aims at helping guestbook and weblog operators to identify and remove spam entries. Link farms are never returned as a highly ranked search result this is not their goal but they promote the ranking of their target site. Afterposting36 search queries (e.g. gift links, shopping links, dvd links, wedding links,

7 seo links, pharmacy links, viagra, and casino links ), we draw some of the top-rated search results. We manually identify 80 link farm and link exchange pages. In order to correctly decide which pages are link spam, we review each page s content and context (referring and linked pages). This careful manual labeling consumed the largest part of the effort of assembling the data set. We do not distinguish between link exchange and link farm pages because this distinction cannot easily be made for many of the examples. 5 Experimental Evaluation tfidf - guestbook vs. ham tfidf - link farm vs. ham tfidf - spam vs. ham link - guestbook vs. ham link - link farm vs. ham link - spam vs. ham merged - guestbook vs. ham merged - link farm vs. ham merged - spam vs. ham Fig. 2. Comparison of kernels. In our experiments, we want to explore how well guestbook spam and link farms can be discriminated against regular web pages, and which features contribute the most to a discrimination. In addition we are interested in the robustness of classifiers against obfuscation of attributes by an adversarial spammer who purposefully adapts properties of spam pages to those of ham pages. We use the Support Vector Machine SVM light [6] with standard parameters. We would also like to find out which kernel is appropriate.

8 5. Finding Discriminative Features In the following experiments we discriminate ham versus spam, where spam is the union of all categories discussed in Section 3. In order to investigate possible differences between guestbook spam and link farms, we furthermore discriminate guestbook spam against ham, and link farm plus link exchange pages versus ham. In order to obtain learning curves, we randomly draw a training subset of specified size from the sample, use the remaining examples for testing, and average the results over 0 randomly resampled iterations. We first study the suitability of several kernels for three instance representations: we consider the tfidf representation, a representation based only on link based features (Table ), and the joint attribute vectors of concatenated tfidf and link features. We tune the degree (for kernels), the kernel width for RBF kernels (on a separate tuning set) and use default parameters otherwise. Figure 2 shows that, on average, kernels perform best for the tfidf vectors and link features. Both, RBF and kernels work for the combined representation; kernels perform poorly for the link and combined representation. Next, we study learning curves for the tfidf representation, the attributes of Table, and the joint features. Figure 3 shows that link and merged representations work equally well for guestbook spam, the merged representation performs best for link farms. For the mixed spam dataset, tfidf is the most discriminative feature set, but the offset to the combined representation is small. Using the combined represantation is always better than using only the link based features. 5 guestbook vs. ham tfidf () link () merged () link farm vs. ham tfidf link () merged () spam vs. ham 5 5 tfidf () link () merged () Fig. 3. Comparison of feature representations. Which features are discriminative? We use 0 resampled iterations of recursive feature elimination and determine the average rank of all features. The recursive feature elimination procedure initialy uses an active feature set containing all features. It then trains a Support Vector Machine, eliminates the feature with the least weight from the active set, and recurs. The best (highest ranked) features are those that remain in the active set longest. Table 2 presents the 20 highest ranked link features for the spam vs. ham discrimination. We do not rank tfidf features for individual words. Treated as a whole, the entire block of tfidf features has the highest rank. Among the most discriminative link features are the number of inbound and outbound links of pred(x 0 ), the title length of succ(x 0 ); also, clustering coefficient and the MD5 feature are relevant.

9 Table 2. RFE ranking for spam vs. ham Sign Rank Attribute - Average number of inlinks of pages in pred(x 0). + 2 Average number of tokens in title of pages in succ(x 0). + 3 Number of elements of pred(x 0) that have the same length as some other element of pred(x 0). - 4 Average number of in- and outlinks of pages in pred(x 0). + 5 Average number of outlinks of pages in pred(x 0). + 6 Number of tokens in title of x Summed number of outlinks of pages in succ(x 0). - 8 Summed number of inlinks of pages in pred(x 0). + 9 Clustering coefficient of pages in pred(x 0). + 0 Summed number of tokens in title of pages in succ(x 0). + Number of outlinks of pages in succ(x 0). + 2 Average number of characters of URLs of pages in pred(x 0). - 3 Number of pages in pred(x 0)andsucc(x 0) with same MD5 hash as x Number of characters in domain name of (x 0). - 5 Number of pages in pred(x 0)withsameIPasx Average number of characters in domain name of pages in succ(x 0). - 7 Average Number of in- and outlinks of pages in succ(x 0). + 8 Number of elements of succ(x 0) that have the same length as some other element of succ(x 0). + 9 Average number of in- and outlinks of elements in succ(x 0) Number of pages in succ(x 0)withsameIPasx Robustness against Adversarial Obfuscation The previous experiments show that today the tfidf representation is the most discriminative feature set. Once a spam filter is in effect, spammers can be expected to probe the filter and to modify properties of their link farms in order to deceive the filter. Any single feature of a set of web pages can, at some cost, be adapted to the distribution which governs that attribute in ham pages. For instance, the tfidf feature can be obfuscated by copying the content of a randomly drawn ham page from the Dmoz directory, the number of inbound or outbound links can be decreased by splitting pages or increased by merging pages. We consider the following classifiers and study their robustness against such obfuscations of discriminative attributes.. A purely text-based classifier which refers to only the tfidf features. 2. A simple contextual classifier uses the set of intrinsic features of the pivotal page x 0 with aggregation function identity. 3. The complex contextual classifier utilizes all features. We use the following experimental protocol to assess the three classifiers robustness. We first train the classifiers using their respective attribute sets. Now we simulate the adversary s attempts to deceive the classifiers. The adversary obfuscates an increasing number of attributes, starting with the entire block of tfidf features (this can be achieved by pasting the textual content of a ham

10 page into the spam page), and subsequently obfuscating additional attributes in the order of their discriminatory power (Table 2). The obfuscation of attributes is simulated as follows. For each instance, the value of the obfuscated attributes is replaced by the attribute value of a randomly drawn ham page. This simulates that the adversary has adapted the generation program to match some property of the link farm pages to that of natural web pages. We evaluate the performance of the four classifiers on the obfuscated test sets guestbook vs. ham simple contextual complex text-based number of features manipulated link farm vs. ham simple contextual complex text-based number of features manipulated spam vs. ham simple contextual complex text-based number of features manipulated Fig. 4. Influence of attribute obfuscation Figure 4 shows that the purely text-based classifier is immediately rendered useless when the content of the spam pages is replaced by the content of a randomly drawn page from the Dmoz directory. The combined classifier that utilizes multiple features deteriorates slightly slower. However, Figure 4 emphasizes the need to re-train a classifier quickly when the underlying distribution changes. 6 Open Problems The problem area of link spam contains a plentitude of open research questions. We summarize some of them. Collective Classification. Rather than classifying individual web pages, search engine operators will have to classify all pages on the web. Hence, the problem intrinsically is a collective classification problem. A benchmark collective classification data set has to include a network of example instances with a reasonable degree of context. Because of the small world property of the web, such a dataset quickly becomes very large. Game Theory. Link spam identification is an adversarial classification problem [7]. Rather than being stationary, the distribution of instances is changed by an adversary over time. The adversary probes any link spam filter that is being used by a search engine, and modifies properties of the generated link spam such as to dodge the filtering algorithm. Possible modifications include, for instance, changing the link topology, generating pages with differing content and experimenting with various sub-domain names. The topic of learning the ranking function of a search engine from rankings and page features is adressed by Bifet et al. [4]. Yet, identifying conditions under which a filtering strategy can be shown to be an equilibrium of the spam filtering game is a great challenge.

11 Identifying Google Bombers. Search engines associate the anchor text that is used to refer to a page with that page. By referring to target pages with anchor terms that have a negative connotation, malicious sites cause these targets to become search results for negative query terms. This form of web spam is often referred to as Google bombing. For instance the query miserable failure usually returns the CVs of George W. Bush or Michael Moore as the first result, depending on whose supporters are currently heading this particular Google bombing arms race. Adali et. al [] study the layout of an optimal link bomb. The influence of more general collusion topologies on page rank is examined by Baeza-Yates et al. [2]. But the development of methods that decide whether a reference is unbiased or malicious is still an open research goal. Other Forms of Web Spam. Click spamming is a particularly vicious form of web spam. Companies allocate a fixed budged to the sponsored links program of, for instance, Google. The sponsored link is returned along with results of related search queries until the budget is used up. Rivaling companies now employ click bots that post a search query, and then automatically click on their competitor s sponsored link until the budged is exceeded and the link disappears. This practice undermines the benefit of the sponsored link program, and search engine operators therefore have to identify whether a reference to a sponsored link has been made by a human, or by a rogue bot. This classification task is extremely challenging because the state-less HTTP protocol provides hardly any information about the client. 7 Conclusion We motivated and introduced the problem of link farm discovery. We discussed intrinsic and contextual features of web pages, and presented a methodology for collecting training examples. Our experiments show that today the tfidf representation of the page provides the most discriminatory attribute set. Many additional contextual attributes contribute to a more accurate discrimination. We identify the most discriminatory relational attributes. Our experiments also show that a purely text-based classifier is brittle and can easily be deceived; the contextual classifiers have the potential to be more robust because deceiving them requires to adapt a larger number of properties to the distribution of values observed in ham pages. In order to be able to react to purposeful obfuscation of characteristic properties of link farms, a repository of discriminatory features is required. Web spam is a major challenge for search engines. We sketched open research challenges; research in this direction has the potential to substantially improve search engine technology and make it more robust to manipulations. Acknowledgment This work has been supported by the German Science Foundation DFG under grant SCHE 540/0-.

12 References. S. Adali, T. Liu, and M. Magdon-Ismail. Optimal link bombs are uncoordinated. In Proc. of the Workshop on Adversarial IR on the Web, R. Baeza-Yates, C. Castillo, and V. López. Pagerank increase under different collusion topologies. In Proc. of the Workshop on Adversarial IR on the Web, K. Bharat, B. Chang, M. Henzinger, and M. Ruhl. Who links to whom: Mining linkage between web sites. In Proc. of the IEEE International Conference on Data Mining, A. Bifet, C. Castillo, P.-A. Chirita, and I. Weber. An analysis of factors used in search engine ranking. In Proc. of the Workshop on Adversarial IR on the Web, A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web. In Proc. of the International WWW Conference, M. Cafarella and D. Cutting. Building Nutch: Open source search. ACM Queue, 2(2), N. Dalvi, P. Domingos, Mausam, S. Sanghai, and D. Verma. Adversarial classification. In Proc. of the ACM International Conference on Knowledge Discovery and Data Mining, B. Davison. Recognizing nepotistic links on the web, In Proceedings of the AAAI-2000 Workshop on Artificial Intelligence for Web Search. 9. Holger Ebel, Lutz-Ingo Mielsch, and Stefan Bornholdt. Scale free topology of networks. Physical Review E, D. Fetterly, M. Manasse, and M. Najork. On the evolution of clusters of nearduplicate web pages. In Proc.oftheLatinAmericanWebCongress, D. Fetterly, M. Manasse, and M. Najork. Spam, damn spam, and statistics: Using statistical analysis to locate spam web pages. In Proc. of the International Workshop on the Web and Databases, D. Fetterly, M. Manasse, M. Najork, and J. Wiener. A large-scale study of the evolution of web pages. In Proc. of the International WWW Conference, Zoltn Gyngyi and Hector Garcia. Web spam taxonomy. In Proc. of the Workshop on Adversarial IR on the Web, Z. Gyongyi, H. Garcia-Molina, and J. Pedersen. Combating web spam with TrustRank. In Proc. of the International Conf. on Very Large Data Bases, M. Henzinger, R. Motwani, and C. Silverstein. Challenges in web search engines. In Proc. of the International Joint Conference on Artificial Intelligence, T. Joachims. Making large-scale SVM learning practical. In Advances in Kernel Methods Support Vector Learning. MIT Press, R. Lempel, E. Amitay, D. Carmel, A. Darlow, and A. Soffer. The connectivity sonar: Detecting site functionality by structural patterns. Journal of Digital Information, 4(3), L. Page and S. Brin. The anatomy of a large-scale hypertextual web search engine. In Proc. of the Seventh International World-Wide Web Conference, Baoning Wu and Brian D. Davison. Identifying link farm spam pages. In Proc. of the 4th International WWW Conference, 2005.

Developing Intelligent Search Engines

Developing Intelligent Search Engines Developing Intelligent Search Engines Isabel Drost Abstract Developers of search engines today do not only face technical problems such as designing an efficient crawler or distributing search requests

More information

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology

Web Spam. Seminar: Future Of Web Search. Know Your Neighbors: Web Spam Detection using the Web Topology Seminar: Future Of Web Search University of Saarland Web Spam Know Your Neighbors: Web Spam Detection using the Web Topology Presenter: Sadia Masood Tutor : Klaus Berberich Date : 17-Jan-2008 The Agenda

More information

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai

CS 8803 AIAD Prof Ling Liu. Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai CS 8803 AIAD Prof Ling Liu Project Proposal for Automated Classification of Spam Based on Textual Features Gopal Pai Under the supervision of Steve Webb Motivations and Objectives Spam, which was until

More information

Page Rank Link Farm Detection

Page Rank Link Farm Detection International Journal of Engineering Inventions e-issn: 2278-7461, p-issn: 2319-6491 Volume 4, Issue 1 (July 2014) PP: 55-59 Page Rank Link Farm Detection Akshay Saxena 1, Rohit Nigam 2 1, 2 Department

More information

Web Spam Detection with Anti-Trust Rank

Web Spam Detection with Anti-Trust Rank Web Spam Detection with Anti-Trust Rank Viay Krishnan Computer Science Department Stanford University Stanford, CA 4305 viayk@cs.stanford.edu Rashmi Ra Computer Science Department Stanford University Stanford,

More information

CS47300 Web Information Search and Management

CS47300 Web Information Search and Management CS47300 Web Information Search and Management Search Engine Optimization Prof. Chris Clifton 31 October 2018 What is Search Engine Optimization? 90% of search engine clickthroughs are on the first page

More information

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008

Countering Spam Using Classification Techniques. Steve Webb Data Mining Guest Lecture February 21, 2008 Countering Spam Using Classification Techniques Steve Webb webb@cc.gatech.edu Data Mining Guest Lecture February 21, 2008 Overview Introduction Countering Email Spam Problem Description Classification

More information

Exploring both Content and Link Quality for Anti-Spamming

Exploring both Content and Link Quality for Anti-Spamming Exploring both Content and Link Quality for Anti-Spamming Lei Zhang, Yi Zhang, Yan Zhang National Laboratory on Machine Perception Peking University 100871 Beijing, China zhangl, zhangyi, zhy @cis.pku.edu.cn

More information

Link Analysis in Web Mining

Link Analysis in Web Mining Problem formulation (998) Link Analysis in Web Mining Hubs and Authorities Spam Detection Suppose we are given a collection of documents on some broad topic e.g., stanford, evolution, iraq perhaps obtained

More information

World Wide Web has specific challenges and opportunities

World Wide Web has specific challenges and opportunities 6. Web Search Motivation Web search, as offered by commercial search engines such as Google, Bing, and DuckDuckGo, is arguably one of the most popular applications of IR methods today World Wide Web has

More information

Characterizing the Splogosphere

Characterizing the Splogosphere Characterizing the Splogosphere Pranam Kolari and Akshay Java and Tim Finin University of Maryland Baltimore County 1 Hilltop Circle Baltimore, MD, USA {kolari1, aks1, finin}@cs.umbc.edu ABSTRACT Weblogs

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Graph Data: Social Networks [Source: 4-degrees of separation, Backstrom-Boldi-Rosa-Ugander-Vigna,

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Characterizing the Splogosphere

Characterizing the Splogosphere Characterizing the Splogosphere Pranam Kolari and Akshay Java and Tim Finin University of Maryland Baltimore County 1 Hilltop Circle Baltimore, MD, USA {kolari1, aks1, finin}@cs.umbc.edu ABSTRACT Weblogs

More information

Previous: how search engines work

Previous: how search engines work detection Ricardo Baeza-Yates,3 ricardo@baeza.cl With: L. Becchetti 2, P. Boldi 5, C. Castillo, D. Donato, A. Gionis, S. Leonardi 2, V.Murdock, M. Santini 5, F. Silvestri 4, S. Vigna 5. Yahoo! Research

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

Detecting Spam Web Pages

Detecting Spam Web Pages Detecting Spam Web Pages Marc Najork Microsoft Research Silicon Valley About me 1989-1993: UIUC (home of NCSA Mosaic) 1993-2001: Digital Equipment/Compaq Started working on web search in 1997 Mercator

More information

Measuring Similarity to Detect Qualified Links

Measuring Similarity to Detect Qualified Links Measuring Similarity to Detect Qualified Links Xiaoguang Qi, Lan Nie and Brian D. Davison Department of Computer Science & Engineering Lehigh University {xiq204,lan2,davison}@cse.lehigh.edu December 2006

More information

arxiv:cs/ v1 [cs.ir] 26 Apr 2002

arxiv:cs/ v1 [cs.ir] 26 Apr 2002 Navigating the Small World Web by Textual Cues arxiv:cs/0204054v1 [cs.ir] 26 Apr 2002 Filippo Menczer Department of Management Sciences The University of Iowa Iowa City, IA 52242 Phone: (319) 335-0884

More information

Topical TrustRank: Using Topicality to Combat Web Spam

Topical TrustRank: Using Topicality to Combat Web Spam Topical TrustRank: Using Topicality to Combat Web Spam Baoning Wu Vinay Goel Brian D. Davison Department of Computer Science & Engineering Lehigh University Bethlehem, PA 18015 USA {baw4,vig204,davison}@cse.lehigh.edu

More information

Anti-Trust Rank for Detection of Web Spam and Seed Set Expansion

Anti-Trust Rank for Detection of Web Spam and Seed Set Expansion International Journal of Information and Computation Technology. ISSN 0974-2239 Volume 3, Number 4 (2013), pp. 241-250 International Research Publications House http://www. irphouse.com /ijict.htm Anti-Trust

More information

A Method for Finding Link Hijacking Based on Modified PageRank Algorithms

A Method for Finding Link Hijacking Based on Modified PageRank Algorithms DEWS2008 A10-1 A Method for Finding Link Hijacking Based on Modified PageRank Algorithms Young joo Chung Masashi Toyoda Masaru Kitsuregawa Institute of Industrial Science, University of Tokyo 4-6-1 Komaba

More information

Detecting Semantic Cloaking on the Web

Detecting Semantic Cloaking on the Web Detecting Semantic Cloaking on the Web Baoning Wu and Brian D. Davison Department of Computer Science & Engineering Lehigh University Bethlehem, PA 18015 USA {baw4,davison}@cse.lehigh.edu ABSTRACT By supplying

More information

A Method for Finding Link Hijacking Based on Modified PageRank Algorithms

A Method for Finding Link Hijacking Based on Modified PageRank Algorithms DEWS2008 A10-1 A Method for Finding Link Hijacking Based on Modified PageRank Algorithms Young joo Chung Masashi Toyoda Masaru Kitsuregawa Institute of Industrial Science, University of Tokyo 4-6-1 Komaba

More information

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures Anatomy of a search engine Design criteria of a search engine Architecture Data structures Step-1: Crawling the web Google has a fast distributed crawling system Each crawler keeps roughly 300 connection

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Unit VIII. Chapter 9. Link Analysis

Unit VIII. Chapter 9. Link Analysis Unit VIII Link Analysis: Page Ranking in web search engines, Efficient Computation of Page Rank using Map-Reduce and other approaches, Topic-Sensitive Page Rank, Link Spam, Hubs and Authorities (Text Book:2

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

Optimizing Search Engines using Click-through Data

Optimizing Search Engines using Click-through Data Optimizing Search Engines using Click-through Data By Sameep - 100050003 Rahee - 100050028 Anil - 100050082 1 Overview Web Search Engines : Creating a good information retrieval system Previous Approaches

More information

A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET

A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET Shervin Daneshpajouh, Mojtaba Mohammadi Nasiri¹ Computer Engineering Department, Sharif University of Technology, Tehran, Iran daneshpajouh@ce.sharif.edu,

More information

A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS

A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS INTERNATIONAL JOURNAL OF RESEARCH IN COMPUTER APPLICATIONS AND ROBOTICS ISSN 2320-7345 A SURVEY ON WEB FOCUSED INFORMATION EXTRACTION ALGORITHMS Satwinder Kaur 1 & Alisha Gupta 2 1 Research Scholar (M.tech

More information

Analyzing and Detecting Review Spam

Analyzing and Detecting Review Spam Seventh IEEE International Conference on Data Mining Analyzing and Detecting Review Spam Nitin Jindal and Bing Liu Department of Computer Science University of Illinois at Chicago nitin.jindal@gmail.com,

More information

Measuring Similarity to Detect

Measuring Similarity to Detect Measuring Similarity to Detect Qualified Links Xiaoguang Qi, Lan Nie, and Brian D. Davison Dept. of Computer Science & Engineering Lehigh University Introduction Approach Experiments Discussion & Conclusion

More information

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE Bohar Singh 1, Gursewak Singh 2 1, 2 Computer Science and Application, Govt College Sri Muktsar sahib Abstract The World Wide Web is a popular

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

WEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS

WEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS WEB SPAM IDENTIFICATION THROUGH LANGUAGE MODEL ANALYSIS Juan Martinez-Romo and Lourdes Araujo Natural Language Processing and Information Retrieval Group at UNED * nlp.uned.es Fifth International Workshop

More information

Local Methods for Estimating PageRank Values

Local Methods for Estimating PageRank Values Local Methods for Estimating PageRank Values Yen-Yu Chen Qingqing Gan Torsten Suel CIS Department Polytechnic University Brooklyn, NY 11201 yenyu, qq gan, suel @photon.poly.edu Abstract The Google search

More information

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur

More information

Method to Study and Analyze Fraud Ranking In Mobile Apps

Method to Study and Analyze Fraud Ranking In Mobile Apps Method to Study and Analyze Fraud Ranking In Mobile Apps Ms. Priyanka R. Patil M.Tech student Marri Laxman Reddy Institute of Technology & Management Hyderabad. Abstract: Ranking fraud in the mobile App

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Reading Time: A Method for Improving the Ranking Scores of Web Pages

Reading Time: A Method for Improving the Ranking Scores of Web Pages Reading Time: A Method for Improving the Ranking Scores of Web Pages Shweta Agarwal Asst. Prof., CS&IT Deptt. MIT, Moradabad, U.P. India Bharat Bhushan Agarwal Asst. Prof., CS&IT Deptt. IFTM, Moradabad,

More information

Site Level Noise Removal for Search Engines

Site Level Noise Removal for Search Engines Site Level Noise Removal for Search Engines André Luiz da Costa Carvalho 1, Paul - Alexandru Chirita 2, Edleno Silva de Moura 1 Pável Calado 3,Wolfgang Nejdl 2 1 Federal University of Amazonas,Av.RodrigoOctávioRamos3000,Manaus,Brazil

More information

Ranking Algorithms For Digital Forensic String Search Hits

Ranking Algorithms For Digital Forensic String Search Hits DIGITAL FORENSIC RESEARCH CONFERENCE Ranking Algorithms For Digital Forensic String Search Hits By Nicole Beebe and Lishu Liu Presented At The Digital Forensic Research Conference DFRWS 2014 USA Denver,

More information

Information Retrieval. Lecture 9 - Web search basics

Information Retrieval. Lecture 9 - Web search basics Information Retrieval Lecture 9 - Web search basics Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 30 Introduction Up to now: techniques for general

More information

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion

A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion A Semi-Supervised Approach for Web Spam Detection using Combinatorial Feature-Fusion Ye Tian, Gary M. Weiss, Qiang Ma Department of Computer and Information Science Fordham University 441 East Fordham

More information

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page International Journal of Soft Computing and Engineering (IJSCE) ISSN: 31-307, Volume-, Issue-3, July 01 Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page Neelam Tyagi, Simple

More information

Brian D. Davison Department of Computer Science Rutgers, The State University of New Jersey New Brunswick, NJ USA

Brian D. Davison Department of Computer Science Rutgers, The State University of New Jersey New Brunswick, NJ USA From: AAAI Technical Report WS-00-01. Compilation copyright 2000, AAAI (www.aaai.org). All rights reserved. Recognizing Nepotistic Links on the Web Brian D. Davison Department of Computer Science Rutgers,

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #11: Link Analysis 3 Seoul National University 1 In This Lecture WebSpam: definition and method of attacks TrustRank: how to combat WebSpam HITS algorithm: another algorithm

More information

On Finding Power Method in Spreading Activation Search

On Finding Power Method in Spreading Activation Search On Finding Power Method in Spreading Activation Search Ján Suchal Slovak University of Technology Faculty of Informatics and Information Technologies Institute of Informatics and Software Engineering Ilkovičova

More information

Slides based on those in:

Slides based on those in: Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org A 3.3 B 38.4 C 34.3 D 3.9 E 8.1 F 3.9 1.6 1.6 1.6 1.6 1.6 2 y 0.8 ½+0.2 ⅓ M 1/2 1/2 0 0.8 1/2 0 0 + 0.2 0 1/2 1 [1/N]

More information

HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL

HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL International Journal of Mechanical Engineering & Computer Sciences, Vol.1, Issue 1, Jan-Jun, 2017, pp 12-17 HYBRIDIZED MODEL FOR EFFICIENT MATCHING AND DATA PREDICTION IN INFORMATION RETRIEVAL BOMA P.

More information

Web Spam Challenge 2008

Web Spam Challenge 2008 Web Spam Challenge 2008 Data Analysis School, Moscow, Russia K. Bauman, A. Brodskiy, S. Kacher, E. Kalimulina, R. Kovalev, M. Lebedev, D. Orlov, P. Sushin, P. Zryumov, D. Leshchiner, I. Muchnik The Data

More information

CLOAK OF VISIBILITY : DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB

CLOAK OF VISIBILITY : DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB CLOAK OF VISIBILITY : DETECTING WHEN MACHINES BROWSE A DIFFERENT WEB CIS 601: Graduate Seminar Prof. S. S. Chung Presented By:- Amol Chaudhari CSU ID 2682329 AGENDA About Introduction Contributions Background

More information

Dynamic Visualization of Hubs and Authorities during Web Search

Dynamic Visualization of Hubs and Authorities during Web Search Dynamic Visualization of Hubs and Authorities during Web Search Richard H. Fowler 1, David Navarro, Wendy A. Lawrence-Fowler, Xusheng Wang Department of Computer Science University of Texas Pan American

More information

RSDC 09: Tag Recommendation Using Keywords and Association Rules

RSDC 09: Tag Recommendation Using Keywords and Association Rules RSDC 09: Tag Recommendation Using Keywords and Association Rules Jian Wang, Liangjie Hong and Brian D. Davison Department of Computer Science and Engineering Lehigh University, Bethlehem, PA 18015 USA

More information

Bruno Martins. 1 st Semester 2012/2013

Bruno Martins. 1 st Semester 2012/2013 Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4

More information

IDENTIFYING SIMILAR WEB PAGES BASED ON AUTOMATED AND USER PREFERENCE VALUE USING SCORING METHODS

IDENTIFYING SIMILAR WEB PAGES BASED ON AUTOMATED AND USER PREFERENCE VALUE USING SCORING METHODS IDENTIFYING SIMILAR WEB PAGES BASED ON AUTOMATED AND USER PREFERENCE VALUE USING SCORING METHODS Gandhimathi K 1 and Vijaya MS 2 1 MPhil Research Scholar, Department of Computer science, PSGR Krishnammal

More information

Compact Encoding of the Web Graph Exploiting Various Power Laws

Compact Encoding of the Web Graph Exploiting Various Power Laws Compact Encoding of the Web Graph Exploiting Various Power Laws Statistical Reason Behind Link Database Yasuhito Asano, Tsuyoshi Ito 2, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 Department

More information

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH

ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH ASCERTAINING THE RELEVANCE MODEL OF A WEB SEARCH-ENGINE BIPIN SURESH Abstract We analyze the factors contributing to the relevance of a web-page as computed by popular industry web search-engines. We also

More information

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS

A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS A PRELIMINARY STUDY ON THE EXTRACTION OF SOCIO-TOPICAL WEB KEYWORDS KULWADEE SOMBOONVIWAT Graduate School of Information Science and Technology, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-0033,

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Brief (non-technical) history

Brief (non-technical) history Web Data Management Part 2 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal

More information

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Evaluating the Usefulness of Sentiment Information for Focused Crawlers Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,

More information

Detecting Malicious Web Links and Identifying Their Attack Types

Detecting Malicious Web Links and Identifying Their Attack Types Detecting Malicious Web Links and Identifying Their Attack Types Anti-Spam Team Cellopoint July 3, 2013 Introduction References A great effort has been directed towards detection of malicious URLs Blacklisting

More information

An introduction to Web Mining part II

An introduction to Web Mining part II An introduction to Web Mining part II Ricardo Baeza-Yates, Aristides Gionis Yahoo! Research Barcelona, Spain & Santiago, Chile ECML/PKDD 2008 Antwerp Yahoo! Research Agenda Statistical methods: the size

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

Abstract. 1. Introduction

Abstract. 1. Introduction A Visualization System using Data Mining Techniques for Identifying Information Sources on the Web Richard H. Fowler, Tarkan Karadayi, Zhixiang Chen, Xiaodong Meng, Wendy A. L. Fowler Department of Computer

More information

Personalizing PageRank Based on Domain Profiles

Personalizing PageRank Based on Domain Profiles Personalizing PageRank Based on Domain Profiles Mehmet S. Aktas, Mehmet A. Nacar, and Filippo Menczer Computer Science Department Indiana University Bloomington, IN 47405 USA {maktas,mnacar,fil}@indiana.edu

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

Combinatorial Algorithms for Web Search Engines - Three Success Stories

Combinatorial Algorithms for Web Search Engines - Three Success Stories Combinatorial Algorithms for Web Search Engines - Three Success Stories Monika Henzinger Abstract How much can smart combinatorial algorithms improve web search engines? To address this question we will

More information

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org J.

More information

CS Search Engine Technology

CS Search Engine Technology CS236620 - Search Engine Technology Ronny Lempel Winter 2008/9 The course consists of 14 2-hour meetings, divided into 4 main parts. It aims to cover both engineering and theoretical aspects of search

More information

Robust PageRank and Locally Computable Spam Detection Features

Robust PageRank and Locally Computable Spam Detection Features Robust PageRank and Locally Computable Spam Detection Features Reid Andersen reidan@microsoftcom John Hopcroft Cornell University jeh@cscornelledu Christian Borgs borgs@microsoftcom Kamal Jain kamalj@microsoftcom

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google Web Search Engines 1 Web Search before Google Web Search Engines (WSEs) of the first generation (up to 1998) Identified relevance with topic-relateness Based on keywords inserted by web page creators (META

More information

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p.

Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. Introduction p. 1 What is the World Wide Web? p. 1 A Brief History of the Web and the Internet p. 2 Web Data Mining p. 4 What is Data Mining? p. 6 What is Web Mining? p. 6 Summary of Chapters p. 8 How

More information

A Document-centered Approach to a Natural Language Music Search Engine

A Document-centered Approach to a Natural Language Music Search Engine A Document-centered Approach to a Natural Language Music Search Engine Peter Knees, Tim Pohle, Markus Schedl, Dominik Schnitzer, and Klaus Seyerlehner Dept. of Computational Perception, Johannes Kepler

More information

New Classification Method Based on Decision Tree for Web Spam Detection

New Classification Method Based on Decision Tree for Web Spam Detection Research Article International Journal of Current Engineering and Technology E-ISSN 2277 4106, P-ISSN 2347-5161 2014 INPRESSCO, All Rights Reserved Available at http://inpressco.com/category/ijcet Rashmi

More information

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today 3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Information Discovery, Extraction and Integration for the Hidden Web

Information Discovery, Extraction and Integration for the Hidden Web Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk

More information

Automatic Identification of User Goals in Web Search [WWW 05]

Automatic Identification of User Goals in Web Search [WWW 05] Automatic Identification of User Goals in Web Search [WWW 05] UichinLee @ UCLA ZhenyuLiu @ UCLA JunghooCho @ UCLA Presenter: Emiran Curtmola@ UC San Diego CSE 291 4/29/2008 Need to improve the quality

More information

Selection of Best Web Site by Applying COPRAS-G method Bindu Madhuri.Ch #1, Anand Chandulal.J #2, Padmaja.M #3

Selection of Best Web Site by Applying COPRAS-G method Bindu Madhuri.Ch #1, Anand Chandulal.J #2, Padmaja.M #3 Selection of Best Web Site by Applying COPRAS-G method Bindu Madhuri.Ch #1, Anand Chandulal.J #2, Padmaja.M #3 Department of Computer Science & Engineering, Gitam University, INDIA 1. binducheekati@gmail.com,

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Information Retrieval Issues on the World Wide Web

Information Retrieval Issues on the World Wide Web Information Retrieval Issues on the World Wide Web Ashraf Ali 1 Department of Computer Science, Singhania University Pacheri Bari, Rajasthan aali1979@rediffmail.com Dr. Israr Ahmad 2 Department of Computer

More information

Link Analysis in Web Information Retrieval

Link Analysis in Web Information Retrieval Link Analysis in Web Information Retrieval Monika Henzinger Google Incorporated Mountain View, California monika@google.com Abstract The analysis of the hyperlink structure of the web has led to significant

More information

Automated Path Ascend Forum Crawling

Automated Path Ascend Forum Crawling Automated Path Ascend Forum Crawling Ms. Joycy Joy, PG Scholar Department of CSE, Saveetha Engineering College,Thandalam, Chennai-602105 Ms. Manju. A, Assistant Professor, Department of CSE, Saveetha Engineering

More information

Comment Extraction from Blog Posts and Its Applications to Opinion Mining

Comment Extraction from Blog Posts and Its Applications to Opinion Mining Comment Extraction from Blog Posts and Its Applications to Opinion Mining Huan-An Kao, Hsin-Hsi Chen Department of Computer Science and Information Engineering National Taiwan University, Taipei, Taiwan

More information

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW ISSN: 9 694 (ONLINE) ICTACT JOURNAL ON COMMUNICATION TECHNOLOGY, MARCH, VOL:, ISSUE: WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW V Lakshmi Praba and T Vasantha Department of Computer

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

Incorporating Trust into Web Search

Incorporating Trust into Web Search Incorporating Trust into Web Search Lan Nie Baoning Wu Brian D. Davison Department of Computer Science & Engineering Lehigh University {lan2,baw4,davison}@cse.lehigh.edu January 2007 Abstract The Web today

More information

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University Instead of generic popularity, can we measure popularity within a topic? E.g., computer science, health Bias the random walk When

More information

A P2P-based Incremental Web Ranking Algorithm

A P2P-based Incremental Web Ranking Algorithm A P2P-based Incremental Web Ranking Algorithm Sumalee Sangamuang Pruet Boonma Juggapong Natwichai Computer Engineering Department Faculty of Engineering, Chiang Mai University, Thailand sangamuang.s@gmail.com,

More information

The Illusion in the Presentation of the Rank of a Web Page with Dangling Links

The Illusion in the Presentation of the Rank of a Web Page with Dangling Links JASEM ISSN 1119-8362 All rights reserved Full-text Available Online at www.ajol.info and www.bioline.org.br/ja J. Appl. Sci. Environ. Manage. December 2013 Vol. 17 (4) 551-558 The Illusion in the Presentation

More information