IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 1, JANUARY 2004

A Unified Probabilistic Framework for Web Page Scoring Systems

Michelangelo Diligenti, Marco Gori, Fellow, IEEE, and Marco Maggini, Member, IEEE Computer Society

Abstract: The definition of efficient page ranking algorithms is becoming an important issue in the design of the query interface of Web search engines. Information flooding is a common experience, especially when broad topic queries are issued. Queries containing only one or two keywords usually match a huge number of documents, while users can only afford to visit the first positions of the returned list, which do not necessarily refer to the most appropriate answers. Some successful approaches to page ranking in a hyperlinked environment, like the Web, are based on link analysis. In this paper, we propose a general probabilistic framework for Web Page Scoring Systems (WPSS), which incorporates and extends many of the relevant models proposed in the literature. In particular, we introduce scoring systems for both generic (horizontal) and focused (vertical) search engines. Whereas horizontal scoring algorithms are based only on the topology of the Web graph, vertical rankings also take the page contents into account and are the basis for focused and user-adapted search interfaces. Experimental results are reported to show the properties of some of the proposed scoring systems, with special emphasis on vertical search.

Index Terms: Web page scoring systems, random walks, HITS, PageRank, focused PageRank.

The authors are with the Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, Via Roma 56, I-53100 Siena, Italy. E-mail: {michi,marco,maggini}@dii.unisi.it. Manuscript received 1 Sept. 2002; revised 1 Apr. 2003; accepted 10 Apr. For information on obtaining reprints of this article, please send e-mail to tkde@computer.org and reference the IEEECS Log Number.

1 INTRODUCTION

The Web, with its popularity, fast and uncontrollable growth, and heterogeneity, poses serious challenges to search engine designers, even if most of the techniques required for searching a collection of documents had already been studied in the related fields of databases and information retrieval. The new scenario is quite different from traditional information retrieval applications, which deal with more controlled environments. The Web is extremely dynamic, its rate of growth is impressive, and, above all, there is no central control over which documents are published, or where and when. Nowadays, almost any user of the Web can publish his/her own pages, with any contents, whether he/she is a good author and an expert on the topic he/she is writing about or just a spammer. Thus, there are three main challenges which a search engine has to face.

The first problem is how to find and index the documents on the Web. Search engines are de facto the only central indexes of the information available on the Web, which would otherwise be accessible only by navigating through hyperlinks. Unfortunately, search engines are not able to track the publication of new pages, and the only way they have to build their indexes is to collect the documents by crawling the Web graph. Crawling must be performed continuously and, nowadays, a complete crawl of the whole Web is not feasible. Both the size and the structure of the Web, as well as freshness requirements, force search engines to cover only a fraction of the whole Web [1], [2].

The second issue concerns the size and the heterogeneity of the information available on the Web. The different document formats, the various authoring styles used to write Web documents, and the huge quantity of information require accurate processing to create reliable and efficient indexes.

Finally, the user search interface is probably one of the principal keys to the success of a search engine. Even if most search engines offer advanced search interfaces, most users just use the simple keyword-based one. The user issues his/her query as a list of words, looking for documents which contain all (or some) of these words. While the techniques to retrieve the documents which match the query are relatively simple using inverted indexes, it is difficult to provide high quality and relevant results to the user. Typical queries are based on only a few words (often just one or two) and, thus, they can be described as broad-topic queries. When thousands of documents match a query, the user is flooded by information and can typically only afford to check a very small fraction of the returned answers. Thus, the definition of good document ranking functions turns out to be a crucial issue in search engine design. Proper criteria must be devised to automatically compute a score which evaluates both the relevance of the document with respect to the query and the quality of its contents.

The analysis of the hyperlinks on the Web [3] has been proposed as a way to derive a quality measure for the information on the Web. The structure of the hyperlinks on the Web is the result of the collaborative activity of the community of Web authors. Authors usually like to link resources they consider authoritative, and authority emerges from the dynamics of popularity of the resources on the Web. Sophisticated algorithms have been studied to compute reliable measures of authority from the topological structure of the interconnections among the Web pages. A simple count of the number of links to a page does not take into account the fact that not all the citations have the same authority. PageRank [4], used by the Google search engine, is a noticeable example of a topology-based ranking criterion. The authority of a page is computed recursively as a function of the authorities of the pages that link the target page. HITS [5] is another well-known algorithm which computes two values related to topological properties of the Web pages, the authority and the hubness. The HITS scheme is query-dependent. User queries are issued to a search engine in order to create a set of seed pages. Crawling the Web forward and backward from that seed is performed to mirror the Web portion containing the information which is likely to be useful. A ranking criterion based on topological analysis can then be applied to the pages belonging to the selected Web portion. Very interesting results in this direction have been proposed in [6], [7], [8]. In [9], a Bayesian approach is used to compute hubs and authorities, whereas, in [10], both topological information and information about the page content are included in the distillation of information sources performed by a Bayesian approach. Recently, other approaches which also include the page contents in the score computation have been proposed to define focused measures of document quality [11], [12].

In this paper, we propose a general probabilistic framework for Web Page Scoring Systems (WPSS) which incorporates and extends many of the relevant models proposed in the literature. A first report of the research described in this paper can be found in [13]. Here, we propose a further extension of WPSS and provide additional experimental results. The general Web page scoring model proposed in this paper extends both PageRank [4] and the HITS scheme [5]. In addition, the proposed model exhibits a number of novel features which turn out to be very useful, especially for focused (vertical) search. The content of the pages is combined with the graph structure of the Web, giving rise to scoring mechanisms which are focused on a specific topic. Moreover, in the proposed model, vertical search schemes can take into account the mutual relationships among different topics. In so doing, the discovery of pages with a high score for a given topic affects the score of pages with related topics. Experiments were carried out to assess the features of the proposed scoring systems, with special emphasis on vertical search. The very promising experimental results reported in the paper provide a clear validation of the proposed general scheme for Web page scoring systems.

The paper is organized as follows: The next section introduces the general probabilistic framework based on random walks which can be used to describe the different WPSSs. Section 3 describes the horizontal ranking schemes using the proposed framework. Horizontal rankings are based only on the graph topology and do not consider the page contents. In particular, the well-known PageRank and HITS algorithms and some extensions are derived from the common framework. Vertical scoring systems are described in Section 4. Vertical ranking functions are useful for focused search interfaces and for user-adapted applications. Some different models are described as examples of focused ranking functions which can be derived in the proposed framework. Finally, Section 5 presents a set of experimental evaluations of both horizontal and vertical WPSSs and, in Section 6, the conclusions are drawn.
2 RANDOM WALKS AND PAGE RANKING

The Web can be viewed as a graph $G$ whose nodes correspond to the pages and whose arcs are defined by the hyperlinks between the pages. If $p$ and $q$ are the nodes corresponding to the pages $D_p$ and $D_q$, then there is the arc $(p, q)$ in $G$ if the page $D_p$ contains a hyperlink to the page $D_q$. The topology of the Web graph is quite complex and is the result of the behavior of the community of Web authors. Thus, the graph topology carries much information related to the cooperative interaction of many agents. One of the emerging properties of the resulting graph is that high quality resources tend to be referenced by many Web authors. The idea of using the collaborative judgments on Web resources hidden in the structure of the Web topology has been proposed as the basis for defining page ranking criteria. In particular, random walk theory has been proposed as a framework to define models that compute the absolute relevance of a page [4], [8]. The relevance $x_p$ of the page $p$ is computed as the probability of visiting that page in a random walk on the Web graph. The most popular pages (i.e., the most linked ones) are the most likely to be visited during the random walk on the Web.

2.1 The Single-Surfer Walk

In order to define a general probabilistic framework for random walks, we model the actions of a generic Web surfer. At each step of the walk, the surfer can perform one of the following atomic actions: jump to any node of the graph (action $j$), follow a hyperlink from the current page (action $l$), follow a hyperlink in the inverse direction (action $b$), or stay in the current node (action $s$). Thus, the set of atomic actions is $O = \{j, l, b, s\}$.

At each step, the behavior of the surfer depends on the current page. For example, if the surfer considers the current page relevant, a hyperlink contained in that page will likely be followed, whereas, if the page is not interesting, the surfer is likely to jump to a page not linked by the current one. Thus, the surfer's behavior can be modeled by a set of conditional probabilities which depend on the current page $q$:
- $x(l|q)$: the probability of following a hyperlink from $q$,
- $x(b|q)$: the probability of following a back-link from $q$,
- $x(j|q)$: the probability of jumping from $q$, and
- $x(s|q)$: the probability of remaining in $q$.
These values must satisfy the normalization constraint $\sum_{o \in O} x(o|q) = 1$.

Most of these actions need to specify their targets. We assume that the surfer's behavior is time-invariant and that the model can assign a specific weight to each link of a page (as in [14]). Thus, we can model the targets for jumps, hyperlinks, and back-links by using the following parameters:
- $x(p|q, j)$: the probability of jumping from page $q$ to page $p$,
- $x(p|q, l)$: the probability of selecting a hyperlink from page $q$ to page $p$; this value is not null only for those pages $p$ linked directly by page $q$, i.e., $p \in ch(q)$, $ch(q)$ being the set of the children of node $q$ in the graph $G$, and
- $x(p|q, b)$: the probability of going back from page $q$ to page $p$; this value is not null only for the pages $p$ which directly link page $q$, i.e., $p \in pa(q)$, $pa(q)$ being the set of the parents of node $q$ in the graph $G$.
These sets of values must satisfy the following probability normalization constraints for each page $q \in G$:
$$\sum_{p \in G} x(p|q, j) = 1, \qquad \sum_{p \in ch(q)} x(p|q, l) = 1, \qquad \sum_{p \in pa(q)} x(p|q, b) = 1.$$

The random walk is defined by the sequence of actions performed by the surfer. The probabilistic model can be used to compute the probability $x_p(t)$ that the surfer is located in page $p$ at time $t$. The probability distribution on all the pages is represented by the vector $x(t) = [x_1(t), \ldots, x_N(t)]'$, $N$ being the total number of pages. The probabilities $x_p(t)$ are updated at each step of the random walk, taking into account the surfer model and, in particular, the probabilities associated to the actions that can be taken, using the following equation:
$$x_p(t+1) = \sum_{q \in G} x(p|q)\, x_q(t) = \sum_{q \in G} x(p|q, j)\, x(j|q)\, x_q(t) + \sum_{q \in pa(p)} x(p|q, l)\, x(l|q)\, x_q(t) + \sum_{q \in ch(p)} x(p|q, b)\, x(b|q)\, x_q(t) + x(s|p)\, x_p(t), \quad (1)$$
where the probability $x(p|q)$ of moving from page $q$ to page $p$ is expanded by considering the surfer's actions. The probabilities defining the surfer model can be organized in the following $N \times N$ matrices:
- the forward matrix $\Delta$, whose element $(p, q)$ is the probability $x(p|q, l)$;
- the backward matrix $\Gamma$, collecting the probabilities $x(p|q, b)$; and
- the jump matrix $\Sigma$, which is defined by the jump probabilities $x(p|q, j)$.
The forward and backward matrices are related to the Web adjacency matrix $W$, whose entry $(p, q)$ is one if page $p$ links page $q$ and zero otherwise. In particular, the forward matrix $\Delta$ can have nonnull entries only in the positions corresponding to the entries equal to 1 in the matrix $W'$, and the backward matrix $\Gamma$ can have nonnull entries only in the positions corresponding to the entries equal to 1 in the matrix $W$.

We can also define the set of action matrices, which collect the probabilities of taking one of the possible actions from a given page $q$. These are $N \times N$ diagonal matrices defined as follows: $D_j$, whose diagonal values $(q, q)$ are the probabilities $x(j|q)$; $D_l$, collecting the probabilities $x(l|q)$; $D_b$, containing the values $x(b|q)$; and $D_s$, having the probabilities $x(s|q)$ on its diagonal. Hence, (1) can be written in matrix form as
$$x(t+1) = \Sigma D_j\, x(t) + \Delta D_l\, x(t) + \Gamma D_b\, x(t) + D_s\, x(t). \quad (2)$$
By defining the transition matrix as $T = \Sigma D_j + \Delta D_l + \Gamma D_b + D_s$, (2) can be written as
$$x(t+1) = T\, x(t). \quad (3)$$
Given the initial distribution $x(0)$, (3) can be applied recursively to compute the probability distribution at a given time step $t$, yielding
$$x(t) = T^t\, x(0). \quad (4)$$
The absolute page rank for the pages on the Web is obtained by considering the stationary distribution of the Markov chain defined by the previous equation. $T'$ is the state transition matrix of this Markov chain. The computation is stable, since $T$ is a stochastic matrix (each of its columns sums to one) having its maximum eigenvalue equal to 1. Since the state vector $x(t)$ evolves following the equation of a Markov chain, it is guaranteed that, if $\sum_{q \in G} x_q(0) = 1$, then $\sum_{q \in G} x_q(t) = 1$, $t = 1, 2, \ldots$

Proposition 1. If $x(j|q) \neq 0$ and $x(p|q, j) \neq 0$, $\forall p, q \in G$, then there exists $x^\star$ such that $\lim_{t \to \infty} x(t) = x^\star$ and $x^\star$ does not depend on the initial state vector $x(0)$.

Proof. Because of the hypotheses, the matrix $\Sigma D_j$ is strictly positive, i.e., all its entries are greater than 0. Since the transition matrix $T$ is obtained by adding nonnegative matrices to $\Sigma D_j$, $T$ is strictly positive as well. Thus, the resulting Markov chain is irreducible and, consequently, it has a unique stationary distribution given by the solution of the equation $x^\star = T x^\star$, where $x^\star$ satisfies $(x^\star)' e = 1$, $e$ being the $N$-dimensional vector whose entries are all equal to 1 (see, e.g., [15]).
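To make the computation of (1)-(4) concrete, the following short Python sketch builds the transition matrix $T$ for a small hand-made graph and iterates $x(t+1) = T x(t)$ until convergence. The graph, the action probabilities, and the stopping threshold are illustrative assumptions and are not taken from the paper.

```python
# A minimal sketch of the single-surfer walk of Section 2.1 on a toy graph.
# The matrix names (Sigma, Delta, Gamma, D_j, ...) follow the text; all
# numerical values below are illustrative assumptions.
import numpy as np

# Adjacency matrix W: W[p, q] = 1 if page p links page q.
W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N = W.shape[0]

out_deg = W.sum(axis=1)            # |ch(q)| for each page q
in_deg = W.sum(axis=0)             # |pa(q)| for each page q

# Target distributions: element (p, q) is the probability of reaching p from q.
Sigma = np.full((N, N), 1.0 / N)            # x(p|q, j): uniform jump
Delta = W.T / np.maximum(out_deg, 1)        # x(p|q, l): uniform over ch(q)
Gamma = W / np.maximum(in_deg, 1)           # x(p|q, b): uniform over pa(q)

# Action probabilities x(j|q), x(l|q), x(b|q), x(s|q) (they sum to 1 per page).
D_j = np.diag(np.full(N, 0.2))
D_l = np.diag(np.full(N, 0.5))
D_b = np.diag(np.full(N, 0.2))
D_s = np.diag(np.full(N, 0.1))

# Transition matrix of (3): column q holds the distribution of the next page.
T = Sigma @ D_j + Delta @ D_l + Gamma @ D_b + D_s

x = np.full(N, 1.0 / N)            # x(0): uniform initial distribution
for _ in range(1000):
    x_new = T @ x                  # x(t+1) = T x(t), as in (3)
    if np.abs(x_new - x).sum() < 1e-10:
        break
    x = x_new
print("stationary scores:", x_new)
```

Since the jump component makes every entry of $T$ strictly positive, the iteration converges to the same stationary vector regardless of $x(0)$, in agreement with Proposition 1.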

2.2 The Multisurfer Walk

A model based on a single variable may not be able to capture the complex relationships and dependencies among Web pages needed to compute their absolute relevance. Ranking schemes based on multiple variables have been proposed in [5], [8], where two variables are used to measure the hubness and the authority of each page. The random walk framework can be extended by considering a pool of Web surfers having different behaviors, in order to capture different properties of the Web. Each surfer can be modeled by using different values for the parameters in the random walk (2), in order to define different policies for evaluating the absolute importance of the pages. Moreover, the surfers may interact by accepting suggestions from each other.

The multisurfer model considers $M$ different surfers. For each surfer $i$, $i = 1, \ldots, M$, $x_q^{(i)}(t)$ represents the probability that surfer $i$ is visiting page $q$ at time $t$. Each surfer may accept the suggestion of another surfer before taking an action. The interaction among the surfers is modeled by a set of parameters $b(i|k)$ which define the probability that surfer $k$ jumps to the page currently visited by surfer $i$. Thus, in this model, we hypothesize that the interaction does not depend on the pages which the surfers are currently visiting, but only on how much the surfers trust each other. The values $b(i|k)$ must satisfy the probability normalization constraint $\sum_{s=1}^{M} b(s|k) = 1$, $\forall k = 1, \ldots, M$.

Hence, before taking any action, surfer $i$ moves to page $p$ with probability $v_p^{(i)}(t)$ due to the suggestions of the other surfers. This probability is computed as
$$v_p^{(i)}(t) = \sum_{s=1}^{M} b(s|i)\, x_p^{(s)}(t). \quad (5)$$
The intermediate distribution $v_p^{(i)}$ is computed before taking the action that generates the new probability distribution $x_p^{(i)}$ at the following time step. This intermediate step is introduced to synchronize the pool of surfers.

We can organize the probability distributions on the pages for the $M$ surfers, $x^{(i)}(t)$, as the columns of an $N \times M$ matrix $X(t)$. Moreover, we can define the $M \times M$ matrix $A$ whose $(i, k)$ element is $b(i|k)$. The matrix $A$ will be referred to as the interaction matrix. The modified probability distributions $v_p^{(i)}(t)$, due to the interaction among the surfers, can be collected in an $N \times M$ matrix $V(t)$, which is obtained as $V(t) = X(t) A$. Finally, the behavior of surfer $i$ is modeled by the set of jump, forward, and backward matrices $\Sigma^{(i)}$, $\Delta^{(i)}$, $\Gamma^{(i)}$, and by the action matrices $D_j^{(i)}$, $D_l^{(i)}$, $D_b^{(i)}$, $D_s^{(i)}$, as in (2). Thus, the transition matrix for the Markov chain associated to surfer $i$ is
$$T^{(i)} = \Sigma^{(i)} D_j^{(i)} + \Delta^{(i)} D_l^{(i)} + \Gamma^{(i)} D_b^{(i)} + D_s^{(i)}.$$
The set of the $M$ interacting surfers can be described by the following equations:
$$x^{(i)}(t+1) = T^{(i)}\, X(t)\, A^{(i)}, \qquad i = 1, \ldots, M, \quad (6)$$
where $A^{(i)}$ denotes the $i$th column of the matrix $A$. When the surfers are independent of each other (i.e., $b(i|i) = 1$ and $b(i|j) = 0$ for $j \neq i$, $i, j = 1, \ldots, M$), the model reduces to a set of $M$ independent models as described by (4).

The general model described herein gives rise to many different ranking schemes, some of which are classified in Table 1 and will be analyzed in detail in the remainder of the paper.

TABLE 1: Main Features of the Proposed Ranking Functions. The H (V) labels refer to functions for, respectively, horizontal (vertical) scoring systems. The S and M labels indicate whether the ranking function is underlaid by, respectively, a single surfer or a pool of collaborative surfers. The jump, back, and forward columns indicate whether the corresponding parameter, describing a surfer behavior, is focused (F) or uniform (U) for each proposed ranking function. This table is not exhaustive: other ranking functions (with specific features) could be derived from the proposed general framework by choosing appropriate settings.
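The interacting-surfer update of (5) and (6) can be sketched along the same lines as the single-surfer walk. In the Python fragment below, the toy graph, the two surfer behaviors, and the interaction matrix $A$ are illustrative assumptions chosen only to show how the columns of $X(t)$ are mixed through $A$ before each surfer applies its own transition matrix.

```python
# A toy sketch of the multisurfer update (6): two surfers, one mostly
# following links and one mostly following back-links. All values are
# illustrative assumptions, not taken from the paper.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N, M = W.shape[0], 2

def transition(d_link, d_back, d_jump):
    """Build T^(i) = Sigma D_j + Delta D_l + Gamma D_b + D_s for one surfer,
    with page-independent action probabilities and uniform target choices."""
    Sigma = np.full((N, N), 1.0 / N)
    Delta = W.T / np.maximum(W.sum(axis=1), 1)
    Gamma = W / np.maximum(W.sum(axis=0), 1)
    d_stay = 1.0 - d_link - d_back - d_jump
    return (Sigma * d_jump + Delta * d_link + Gamma * d_back
            + np.eye(N) * d_stay)

# Surfer 1 mostly follows links, surfer 2 mostly follows back-links.
T = [transition(0.7, 0.1, 0.2), transition(0.1, 0.7, 0.2)]

# Interaction matrix A: element (i, k) is b(i|k); each column sums to 1.
A = np.array([[0.8, 0.3],
              [0.2, 0.7]])

X = np.full((N, M), 1.0 / N)        # column i = distribution of surfer i
for _ in range(1000):
    V = X @ A                        # suggestions, V(t) = X(t) A as in (5)
    X_new = np.column_stack([T[i] @ V[:, i] for i in range(M)])  # eq. (6)
    if np.abs(X_new - X).max() < 1e-12:
        break
    X = X_new
print("surfer scores:\n", X_new)
```

Because each column of $A$ sums to one and each $T^{(i)}$ is column-stochastic, every column of $X(t)$ keeps summing to one, so the pool of surfers preserves the probabilistic interpretation of the scores.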
3 HORIZONTAL WPSS

Horizontal WPSSs define an absolute ranking on a set of Web pages using only the topological information represented by the Web graph. These approaches are motivated by the idea that, in a hyperlinked environment, the structure of the interconnections should reflect the quality of the resources, i.e., scarcely linked pages are low quality pages, whereas highly referenced pages are relevant sources of information. Different criteria can be defined by refining this simple idea in order to define the authority of a page in the hyperlinked environment. In the proposed probabilistic framework, horizontal WPSSs are characterized by the fact that the parameters used in the probability computations are independent of the page contents.

In particular, in this section, we show how the two most popular WPSSs, PageRank and HITS, can be described as special cases of the random walk framework, even if the original HITS algorithm violates the probabilistic assumptions. We also introduce a hybrid version of these two algorithms.

3.1 PageRank

The computation of PageRank [16] can be modeled by a single-surfer random walk by choosing a surfer model based only on two actions: the surfer jumps to a new random page with probability $x(j|p) = 1 - d$ or he follows one link from the current page with probability $x(l|p) = d$. The probabilities of the other two actions, considered in the general model, are null, i.e., $x(b|p) = 0$ and $x(s|p) = 0$. All these values are clearly independent of the page $p$. Given that a jump is taken, its target is selected using a uniform probability distribution over all the $N$ Web pages, i.e., $x(p|j) = 1/N$, $\forall p \in G$. Finally, the probability of following the hyperlink from page $q$ to page $p$ does not depend on the page $p$, i.e., $x(p|q, l) = \sigma_q$. In order to meet the normalization constraint, $\sigma_q = 1/h_q$, where the hubness of page $q$, $h_q = |ch(q)|$, is the number of links exiting from page $q$ (the number of children of the node $q$ in $G$).

This requirement cannot be met by sink pages, i.e., pages which do not contain any link to other pages. In order to keep the probabilistic interpretation of PageRank, all sink nodes must be removed, unless the computation is slightly modified as described further.

By using the PageRank surfer model, (1) can be rewritten as
$$x_p(t+1) = \frac{1-d}{N} \sum_{q \in G} x_q(t) + d \sum_{q \in pa(p)} \sigma_q\, x_q(t) = \frac{1-d}{N} + d \sum_{q \in pa(p)} \frac{x_q(t)}{h_q}, \quad (7)$$
where $\sum_{q \in G} x_q(t) = 1$ because the probabilistic interpretation is valid. The fact that $0 < d < 1$ and, thus, $x(j|p) = 1 - d > 0$ guarantees that the PageRank vector converges to a distribution of page scores that does not depend on the initial distribution. The matrix form of the PageRank equation is
$$x(t+1) = \frac{1-d}{N}\, e + d\, W' \Omega\, x(t), \quad (8)$$
where $W$ is the adjacency matrix of the Web graph, $\Omega$ is the diagonal matrix whose $(p, p)$ element is the inverse of the hubness $h_p$ of page $p$, and $e$ is the $N$-dimensional vector of all ones. Thus, because of the hypothesis of independence of the parameters $x(p|q, l)$ from the page $p$, it follows that $\Delta D_l = d\, W' \Omega$, i.e., the forward matrix can be factorized into the product of the transposed adjacency matrix of the graph $W'$ and the hubness diagonal matrix $\Omega$.

Sink nodes violate the probabilistic constraints, since no link can actually be followed from a sink node, while the surfer model considers this possibility as a valid action (i.e., $x(l|q) = d \neq 0$). In order to overcome this problem, it should be $x(l|q) = 0$ and, consequently, $x(j|q) = 1$ for any sink node $q$. Thus, in order to consider also the sink nodes, the PageRank computation should be modified by using
$$x(j|q) = 1 - d \ \ \text{if } ch(q) \neq \emptyset, \qquad x(j|q) = 1 \ \ \text{if } ch(q) = \emptyset. \quad (9)$$
In this case, the contribution of the jump probabilities does not sum to a constant term as happens in (7), but the value $x(p|j, t) = \frac{1}{N} \sum_{q \in G} x(j|q)\, x_q(t)$, which represents the probability of jumping to page $p$ at time $t$, must be computed at the beginning of each iteration. This is the computational scheme we used in our experiments.
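A compact sketch of this computational scheme is given below: the jump term is recomputed at every iteration as described for (9), so sink pages simply redistribute their whole probability mass uniformly. The toy graph and the value $d = 0.85$ are illustrative assumptions.

```python
# A sketch of the PageRank surfer of Section 3.1 with the sink handling of (9).
# The graph and d = 0.85 are illustrative assumptions.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)   # page 3 is a sink (no out-links)
N = W.shape[0]
d = 0.85

h = W.sum(axis=1)                            # hubness h_q = |ch(q)|
is_sink = (h == 0)
x_j = np.where(is_sink, 1.0, 1.0 - d)        # x(j|q) as in (9)

x = np.full(N, 1.0 / N)
for _ in range(100):
    # Probability mass routed through jumps at this iteration (uniform targets).
    jump_mass = np.dot(x_j, x)
    follow = np.where(is_sink, 0.0, d) * x / np.maximum(h, 1)
    x = jump_mass / N + W.T @ follow         # x_p(t+1), cf. (7)-(9)
    # x keeps summing to 1, so no renormalization is needed.
print("PageRank scores:", x)
```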
3.2 HITS

The HITS algorithm was proposed to identify authoritative documents relying only on the information hidden in the connections among them, due to cocitation or Web hyperlinks [5]. The algorithm assigns two values to each page $p$: the authority is a measure of the page relevance as an information source, while the hubness refers to the quality of a page as a link to authoritative resources. Thus, the two values computed by the HITS algorithm allow us to distinguish between pages which are authorities and pages which are hubs. In the original formulation, these values are computed by iteratively applying the following equations:
$$a_q(t+1) = \sum_{p \in pa(q)} h_p(t), \qquad h_q(t+1) = \sum_{p \in ch(q)} a_p(t), \quad (10)$$
where $a_q$ indicates the authority of page $q$ and $h_q$ its hubness. If $a(t)$ is the vector collecting all the authorities at step $t$ and $h(t)$ is the hubness vector at step $t$, the previous equations can be rewritten in matrix form as
$$a(t+1) = W'\, h(t), \qquad h(t+1) = W\, a(t), \quad (11)$$
where $W$ is the adjacency matrix of the Web graph. At each time step, the HITS algorithm requires normalizing the two vectors $a(t)$ and $h(t)$ to have unit length.

It can be demonstrated that, as $t$ tends to infinity, the direction of the authority vector tends to be parallel to the main eigenvector of the matrix $W'W$ (the cocitation matrix, whose entry $(p, q)$ is the number of pages which jointly link the pages $p$ and $q$ [17]), whereas the hubness vector tends to be parallel to the main eigenvector of the matrix $WW'$ (the bibliographic coupling matrix, whose entry $(p, q)$ is the number of pages jointly linked by the pages $p$ and $q$ [17]). See [5] for further details.

The HITS ranking scheme can be represented in the general Web surfer framework, even if some of the assumptions violate the probabilistic interpretation. Since HITS uses two state variables, the hubness and the authority of a page, the corresponding random walk model is a multisurfer scheme based on the activity of two surfers. Surfer 1 is associated to the authority of the pages, whereas surfer 2 is associated to their hubness. For both surfers, the probabilities of remaining in the same page, $x^{(i)}(s|p)$, and of jumping to a random page, $x^{(i)}(j|p)$, are null. Surfer 1 never follows a back-link, i.e., $x^{(1)}(b|p) = 0$, $\forall p \in G$, whereas he always follows a link, i.e., $x^{(1)}(l|p) = 1$, $\forall p \in G$. In order to obtain the original HITS computation, we must set $x^{(1)}(p|q, l) = 1$ for each page $p$ linked by page $q$. This assumption violates the probability normalization constraints, since $\sum_{p \in ch(q)} x^{(1)}(p|q, l) = |ch(q)| \geq 1$. Surfer 2 has the opposite behavior with respect to surfer 1: he always follows a back-link, i.e., $x^{(2)}(b|p) = 1$, $\forall p \in G$, and he never follows a link, i.e., $x^{(2)}(l|p) = 0$. In this case, the normalization constraint is violated for the values $x^{(2)}(p|q, b)$, because the HITS scheme defines $x^{(2)}(p|q, b) = 1$ for each page $p$ linking page $q$ and, thus, $\sum_{p \in pa(q)} x^{(2)}(p|q, b) = |pa(q)| \geq 1$.

The HITS equations can easily be modified in order to obtain a probabilistically coherent model. We just need to choose $x^{(1)}(p|q, l) = 1/|ch(q)|$ and $x^{(2)}(p|q, b) = 1/|pa(q)|$. This model is analyzed in [8]. Thus, the action matrices describing the HITS surfers are $D_l^{(1)} = I$ and $D_b^{(2)} = I$, $I$ being the identity matrix, whereas $D_j^{(1)}$, $D_b^{(1)}$, $D_s^{(1)}$, $D_j^{(2)}$, $D_l^{(2)}$, $D_s^{(2)}$ are all equal to the null matrix. Moreover, the interaction between the surfers is described by the matrix
$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \quad (12)$$
The interpretation of the interactions represented by this matrix is that surfer 1 considers surfer 2 as an expert in finding hubs and always moves to the position suggested by that surfer before taking his own action. On the other hand, surfer 2 considers surfer 1 as an expert in discovering authorities and, thus, he always moves to the position suggested by that surfer before choosing the next action. In this case, (6) becomes
$$x^{(1)}(t+1) = \Delta^{(1)}\, X(t)\, A^{(1)}, \qquad x^{(2)}(t+1) = \Gamma^{(2)}\, X(t)\, A^{(2)}. \quad (13)$$
Using (12) and the HITS assumptions $\Delta^{(1)} = W'$ and $\Gamma^{(2)} = W$, we obtain
$$x^{(1)}(t+1) = W'\, x^{(2)}(t), \qquad x^{(2)}(t+1) = W\, x^{(1)}(t), \quad (14)$$
which, redefining $a(t) = x^{(1)}(t)$ and $h(t) = x^{(2)}(t)$, is equivalent to the HITS computation of (11).

The HITS model violates the probabilistic interpretation and this makes the computation unstable, since the $W'W$ matrix has a principal eigenvalue larger than 1. Hence, the HITS algorithm requires the score vectors to be normalized at the end of each iteration. Finally, the HITS scheme suffers from other drawbacks. In particular, large, highly connected communities of Web pages tend to attract the principal eigenvector of $W'W$, thus pushing to zero the relevance of all the other pages. As a result, the page scores tend to decrease rapidly to zero for the pages outside those communities. In [8], this effect is analyzed in detail and referred to as the Tightly Knit Community effect. Because of this, the HITS algorithm can be reliably applied only to small subgraphs of the whole Web, after an accurate pruning of the links. In fact, the HITS computation has been proposed as a scoring algorithm to be applied to the result set of a query (the root set), augmented by the pages which link and are linked by those in the root set, and not to the whole Web. Recently, some heuristics have been proposed to reduce the problems affecting the original HITS algorithm, even if such behavior cannot be generally avoided because of the properties of the dynamic system associated to the HITS computation [18].
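For reference, the original HITS iteration of (10) and (11), with the per-step renormalization the algorithm requires, can be sketched as follows. The probabilistically coherent variant discussed above would instead weight each link by $1/|ch(q)|$ and each back-link by $1/|pa(q)|$. The small graph is an illustrative assumption; $W[p, q] = 1$ means that page $p$ links page $q$.

```python
# A short sketch of the basic HITS iteration (10)-(11) with renormalization.
# The graph below is an illustrative assumption.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
N = W.shape[0]

a = np.ones(N)                      # authority scores
h = np.ones(N)                      # hub scores
for _ in range(100):
    a_new = W.T @ h                 # a_q = sum of the hubness of pages linking q
    h_new = W @ a                   # h_q = sum of the authorities q links to
    a = a_new / np.linalg.norm(a_new)   # normalization keeps the iteration bounded
    h = h_new / np.linalg.norm(h_new)
print("authority:", a)
print("hubness:  ", h)
```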
3.3 PageRank-HITS

The multisurfer model allows us to combine the properties of the PageRank and HITS algorithms. Both of these algorithms have benefits and limitations, and the aim of the PageRank-HITS model is to combine the positive characteristics of the two techniques. In fact, the computation of PageRank is stable and has a well-defined behavior because of its probabilistic interpretation. Moreover, it can be applied to large page collections, since small communities are not overwhelmed by larger ones but still influence the ranking. On the other hand, PageRank is too simple to take into account the complex relationships of Web page citations. The HITS algorithm is not stable, only the largest Web community influences the ranking and, thus, it cannot be applied to large page collections. On the other hand, the hub and authority model can capture the relationships among Web pages in more detail than PageRank.

The PageRank-HITS model employs two surfers. Surfer 1 follows a back-link with probability $x^{(1)}(b|q) = d^{(1)}$ or jumps to a random page with probability $x^{(1)}(j|q) = 1 - d^{(1)}$, $\forall q \in G$. In both cases, the target page $p$ is selected using a uniform probability distribution, i.e., $x^{(1)}(p|q, b) = 1/|pa(q)|$ and $x^{(1)}(p|q, j) = 1/N$. Surfer 2 follows a forward link with probability $x^{(2)}(l|q) = d^{(2)}$ or jumps to a random page with probability $x^{(2)}(j|q) = 1 - d^{(2)}$, $\forall q \in G$. In both cases, the target page $p$ is selected using a uniform probability distribution, i.e., $x^{(2)}(p|q, l) = 1/|ch(q)|$ and $x^{(2)}(p|q, j) = 1/N$. Thus, surfer 2 implements the PageRank model, while surfer 1 can be regarded as performing a backward PageRank computation. (As shown in Section 3.1, sink nodes violate the probabilistic constraints for surfer 2; in this model, supersource nodes, i.e., the nodes $q$ having $|pa(q)| = 0$, also violate the probabilistic constraints for surfer 1. The equations can be modified straightforwardly to eliminate these problems.) As in the HITS scheme, the interaction between the surfers is described by the matrix
$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
In this case, (6) becomes
$$x^{(1)}(t+1) = \frac{1 - d^{(1)}}{N}\, e + d^{(1)}\, W \Pi\, x^{(2)}(t), \qquad x^{(2)}(t+1) = \frac{1 - d^{(2)}}{N}\, e + d^{(2)}\, W' \Omega\, x^{(1)}(t), \quad (15)$$
where $\Gamma^{(1)} = W \Pi$ and $\Delta^{(2)} = W' \Omega$, $\Pi$ being the diagonal matrix with element $(p, p)$ equal to $1/|pa(p)|$ and $\Omega$ the diagonal matrix with element $(p, p)$ equal to $1/|ch(p)|$. This page rank is stable, the scores sum up to 1, and no normalization is required at the end of each iteration. Moreover, the two state variables can capture and process more complex relationships among pages. In particular, setting $d^{(1)} = d^{(2)} = 1$ yields a normalized version of HITS, which has been proposed in [6].

4 VERTICAL WPSS

Vertical WPSSs compute a relative ranking of pages when focusing on a specific topic. When applying scoring techniques to focused search, the page contents should be taken into account besides the graph topology. A vertical WPSS uses a set of features (e.g., a set of keywords) representing the page contents and a classifier which assigns to each page its degree of relevance with respect to the topic of interest. The general random walk framework for WPSSs proposed in this paper can be used to define a vertical approach to page scoring. Several models can be derived which combine the ideas of the topology-based criteria and the topic relevance measure provided by the text classifier. In particular, the text classifier can be used to compute the values of the probabilities needed by the random walk model. As shown by the experimental results, vertical WPSSs produce much more accurate results in ranking topic-specific pages.

4.1 Focused PageRank

In the PageRank framework, when choosing to follow a link from a page $q$, each link has the same probability $1/|ch(q)|$ of being followed. In the focused domain, we can consider the model of a surfer who follows the links according to the suggestions provided by a text classifier. Thus, this approach removes the assumption of complete randomness in the movements of the Web surfer. In this case, the surfer is aware of what he is searching and he will trust the classifier suggestions, following the links with a probability proportional to the topic relevance of the page which is the target of the link. This allows us to derive a topic-specific page ranking. For example, the Microsoft home page is highly authoritative according to the topic-generic PageRank, whereas it is not highly authoritative when searching for Perl language tutorials. In fact, even if that page is highly linked, most of the links are scarcely related to the target topic and their contribution will be negligible.

If the surfer is located at page $q$ and the pages linked by page $q$ are assigned the scores $s(ch_1(q)), \ldots, s(ch_{h_q}(q))$ by the classifier, the probability that the surfer follows the $i$th link is defined as
$$x(ch_i(q)|q, l) = \frac{s(ch_i(q))}{\sum_{j=1}^{h_q} s(ch_j(q))}. \quad (16)$$
Thus, the forward matrix depends on the classifier outputs on the pages in the data set. Hence, the modified equation to compute the combined page scores using a PageRank-like scheme is
$$x_p(t+1) = \frac{1-d}{N} + d \sum_{q \in pa(p)} x(p|q, l)\, x_q(t), \quad (17)$$
where $x(p|q, l)$ is computed as in (16).

4.2 Double Focused PageRank

The focused PageRank surfer, described in the previous section, uses a topic-specific distribution for selecting the link to follow, but the decision on the action to take does not depend on the contents of the current page. A more accurate model should consider that the decision about which action to take usually depends on the contents of the current page. For example, let us suppose that two surfers are searching for a Perl language tutorial and that the first one is located at a page related to that topic, while the second one is located at a page unrelated to it. Clearly, it is more likely that the first surfer will decide to follow a link from the current page, while the second one will prefer to jump to another page which is related to the topic he is interested in.

We can model this behavior by adapting the action probabilities using the contents of the current page, thus modeling a focused choice of the surfer's actions. In particular, the probability of following a hyperlink can be chosen to be proportional to the degree of relevance $s(p)$ of the current page $p$ with respect to the target topic, i.e.,
$$x(l|p) = d\, \frac{s(p)}{\max_{q \in G} s(q)}, \quad (18)$$
where $s(p)$ is computed by the text classifier. On the other hand, the probability of jumping away from a page decreases proportionally to $s(p)$, i.e.,
$$x(j|p) = 1 - d\, \frac{s(p)}{\max_{q \in G} s(q)}. \quad (19)$$
Finally, we assume that the probability of landing in a page $p$ after a jump is proportional to its relevance $s(p)$, i.e.,
$$x(p|j) = \frac{s(p)}{\sum_{q \in G} s(q)}. \quad (20)$$
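The following sketch combines the focused link choice of (16) with the content-dependent action probabilities (18)-(20) into a Double Focused PageRank iteration. The toy graph, the relevance scores $s(p)$, and the value $d = 0.85$ are illustrative assumptions; in the model described above, $s(p)$ would be produced by the text classifier.

```python
# A sketch of the Double Focused PageRank surfer of Section 4.2, eqs. (16)-(20).
# Graph, relevance scores s, and d = 0.85 are illustrative assumptions.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N = W.shape[0]
d = 0.85
s = np.array([0.9, 0.1, 0.7, 0.3])           # topic relevance of each page

# Focused forward matrix, eq. (16): links are weighted by the score of their target.
Delta = W.T * s[:, None]                      # element (p, q) ~ s(p) if q links p
Delta = Delta / np.maximum(Delta.sum(axis=0), 1e-12)   # normalize over ch(q)

x_l = d * s / s.max()                         # x(l|p), eq. (18)
x_j = 1.0 - x_l                               # x(j|p), eq. (19)
jump_target = s / s.sum()                     # x(p|j), eq. (20)

x = np.full(N, 1.0 / N)
for _ in range(200):
    jump_mass = np.dot(x_j, x)                # total probability of jumping
    x = jump_target * jump_mass + Delta @ (x_l * x)
print("double-focused scores:", x)
```

Since $x(l|p) + x(j|p) = 1$ for every page, the scores keep summing to one, and on-topic pages both attract jumps and retain the surfer, which is exactly the behavior the double focusing is meant to capture.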
4.3 Focused HITS

The multisurfer model may be used to derive a modification of the HITS algorithm similar to that proposed in [19], [20]. This model also takes into account textual information, in order to enforce the influence of the links pointing to on-topic pages and to filter the noise introduced by links to off-topic pages. The model is based on the two coupled surfers implementing the HITS algorithm, as shown in Section 3.2. In the focused version, each surfer selects the links (or back-links) to follow using the scores assigned by the text classifier to the target pages.

Given the scores $s(ch_1(q)), \ldots, s(ch_{h_q}(q))$ assigned by the text classifier to the pages $ch_i(q)$ linked by page $q$, the first surfer selects the link $i$ to follow from page $q$ by using the probability distribution defined by
$$x^{(1)}(ch_i(q)|q, l) = \frac{s(ch_i(q))}{\sum_{j=1}^{h_q} s(ch_j(q))}. \quad (21)$$
Likewise, given the scores $s(pa_1(q)), \ldots, s(pa_m(q))$ assigned by the text classifier to the $m$ pages $pa_i(q)$ linking page $q$, the second surfer selects the back-link $i$ to follow from page $q$ using the probability distribution
$$x^{(2)}(pa_i(q)|q, b) = \frac{s(pa_i(q))}{\sum_{j=1}^{m} s(pa_j(q))}. \quad (22)$$
The focused HITS model is thus represented by the two equations
$$a_p(t+1) = \sum_{q \in pa(p)} h_q(t)\, x^{(1)}(p|q, l), \qquad h_p(t+1) = \sum_{q \in ch(p)} a_q(t)\, x^{(2)}(p|q, b), \quad (23)$$
where $a_p(t) = x_p^{(1)}(t)$ represents the focused authority computed by the first surfer, $h_p(t) = x_p^{(2)}(t)$ is the focused hubness measured by the second surfer, and the probabilities $x^{(1)}(p|q, l)$ and $x^{(2)}(p|q, b)$ are computed as in (21) and (22).

4.4 Multitopic Rank

Topic hierarchies and topic correlations are of fundamental importance to perform focused search on the Web. Typically, pages on a specific topic may be reached through a path of pages belonging to correlated topics. In particular, Fig. 1 shows a typical scenario on the Web, where a set of researcher home pages can be reached following a path through pages belonging to different categories, such as the home pages of a university department, the home pages of a university faculty, and the home pages of a university. In order to enhance the ranking functions for focused information, a model based on a multisurfer walk can be devised. This model can capture the correlations among topics and reveal more complex properties of the pages due to the topological structure of the topics on the Web.

Fig. 1. Example of topic transitions among connected pages on the Web. In particular, we consider researcher home pages, which are likely to be connected to pages of different categories, among which we consider department home pages, faculty home pages, and university home pages. (a) An example of the neighborhood of a set of researcher home pages. (b) Transition probabilities estimated from the sample Web portion shown in (a), taking into account the four considered categories. Each table row collects the probabilities that a page in the corresponding category points to pages belonging to each considered category.

Fig. 2. The distribution of the page scores for two different topics: (a) Linux and (b) cooking recipes.

By analyzing Web portions densely populated by interesting documents, we can identify a set of topics related to the target one. For example, the set of correlated topics can be discovered automatically using a clustering algorithm. This is the approach we used in the experiments described further on. Once a set $T$ of topics is defined, the probability of the transition between each pair of topics can be estimated from a sample of the Web. $P(\tau'|\tau)$ indicates the probability that a page on topic $\tau$ links a page on topic $\tau'$. We can use these probability values, which reflect the correlations among the $|T|$ topics, to define the interaction matrix of a pool of $|T|$ surfers, where the $\tau$th surfer is focused on the $\tau$th topic. Thus, if topic $\tau'$ is highly correlated to topic $\tau$, then surfer $\tau'$ will be strongly influenced by the activity of surfer $\tau$. Formally, the probability $v_p^{(\tau')}(t)$ that surfer $\tau'$ moves to page $p$ due to the suggestions of the other surfers is
$$v_p^{(\tau')}(t) = \sum_{\tau=1}^{|T|} P(\tau'|\tau)\, x_p^{(\tau)}(t). \quad (24)$$
Thus, the multitopic rank considers an interaction matrix $A$ whose element $(\tau', \tau)$ is equal to $P(\tau'|\tau)$. This choice allows each surfer to move to a position which is more likely to lead to a page on his topic of interest. Finally, each surfer can be modeled using one of the focused approaches described in the previous sections.

5 EXPERIMENTAL RESULTS

We performed a set of experiments in order to analyze the properties of some of the proposed scoring systems and to compare the different rankings. Since we were mainly interested in evaluating the performance of scoring systems for vertical (topic-specific) applications, we based our tests on a set of single-topic data sets.

Fig. 3. The eight top score pages for the data set Linux.

Each data set was collected using the focus crawler described in [21]. In particular, the focus crawler employs a Naive Bayes classifier ([22], chapter 6), which computes the correlation between the content of each downloaded page and the considered topic. The classifier directs the crawl to the most promising Web regions by selecting the links starting from the pages having the highest scores. About 150,000 pages were downloaded for each single crawl. The topics of the page collections were selected to be not too specific, in order to cover many different subtopics in each data set. The selected topics were: pages on the Linux operating system (data set Linux), pages on cooking recipes (data set cooking recipes), pages concerning the sport of golf (data set golf), and documents related to wines (data set wine). For each selected topic, a relevance score was assigned to each page by the Naive Bayes classifiers which had previously been used to focus-crawl the Web. The scores produced by the models were stored to be used in the computation of the vertical page ranks. Considering the hyperlinks contained in each page, a Web subgraph was created from each data set to perform the evaluation of the different WPSSs proposed in the previous sections.

Besides the ranking systems described in the previous sections, we also report the results for the In-link surfer. Such a surfer is located in a page with probability proportional to the number of in-links of that page. For all the PageRank surfers (focused or not), the parameter d was set to the same value in all the experiments.

5.1 Score Distributions

We performed an analysis of the distribution of page scores using the different algorithms proposed in this paper. For each ranking function, we normalized the score values by their maximum over all the pages (thus yielding values in [0, 1]). Then, we sorted the pages according to their ranks and plotted the distribution of the normalized rank values. Fig. 2 reports the plots for the two categories Linux and cooking recipes. In both cases, the HITS surfer assigns a score value significantly greater than zero only to the small set of pages associated to the principal community of the subgraph. On the other hand, PageRank yields a smoother curve for the score distribution. This is the effect of the homogeneous term (1-d)/N in (7). The focused versions of PageRank are still smooth, but concentrate the scores on a smaller set of authoritative pages which are more specific for the considered topic. This reflects the fact that the vertical WPSSs are able to discriminate the authorities on the specific topic, whereas the classical PageRank scheme considers the authoritative pages regardless of their topic.

5.2 Top Lists

Figs. 3 and 4 show the eight top score pages for four different WPSSs on the data sets Linux and cooking recipes, respectively. For the HITS surfer pool, we report the pages with the top authority values.

10 DILIGENTI ET AL.: A UNIFIED PROBABILISTIC FRAMEWORK FOR WEB PAGE SCORING SYSTEMS 13 PageRank can filter many off-topic authoritative pages from the top list. In particular, the Double Focused PageRank WPSS pushes all the authorities on the relevant topic to the top positions. Fig. 4. The eight top score pages for the data set cooking recipes. As shown in Fig. 3, all pages selected by the HITS algorithm are from the same site. This is due to the wellknown property of the HITS algorithm which produces a score vector in the direction of the most interconnected communities. In many cases, it is difficult to reduce this undesirable behavior of the HITS ranking by properly pruning the links among pages. For example, in order to reduce the nepotism of Web pages, for the data set cooking recipes, the connectivity map of pages was pruned removing all the intrasite links. However, as shown in the HITS section of Fig. 4, the Web site com, which is subdivided into a collection of Web sites ( etc.) which are strongly interconnected, occupies all the top positions in the ranking list. In [18], the content of pages is considered in order to propagate relevance scores only over the subset of links pointing to pages on a specific topic. However, in this case, the performance cannot be improved even by using this approach since all the sites in the community are effectively on-topic and, thus, the interconnections are semantically coherent. The PageRank algorithm is not topic dependent, and, consequently, highly interconnected pages result in being authoritative regardless of the topic of interest. For example, pages like com, etc., are shown in the top list even if they are not closely related to the specific topic. The focused versions of 5.3 Comparison of the WPPSs In this section, compare the results obtained by the In-link surfer, the Page Rank surfer, the Focused Page Rank scheme, the Double Focused Page Rank scheme, and the HITS surfer pool. We follow a methodology similar to that one presented in [23]. For each topic, we created a collection of pages which were evaluated by a pool of 10 human experts. The experts independently labeled each page in the collection as authoritative or not authoritative for the specific topic. In particular, the top 15 pages for each ranking function were shown to a set of experts. Each expert provided either a positive, negative, or null feedback on each single page. The labeled pages were used to measure the percentage of positive (negative) results in the top list returned by each ranking function. The length of the top list was varied between 1 and 300. The evaluation was performed on the two data sets Linux and Golf. Fig. 5 reports the percentage of all pages labeled as authoritative by experts among the first N pages in the top list produced by five different WPSSs for the two data sets. In both cases, the HITS algorithm produces the worst ranking. This result confirms the fact that HITS can only be used as a query-dependent ranking schema [3]. As previously reported in [23], in spite of its simplicity, the In-link algorithm has a performance similar to PageRank. In our experiments, PageRank outperformed the In-Links algorithm on the category Golf, whereas it was outperformed on the category Linux. However, in both cases, the gap is small. The two focused ranking functions clearly outperformed all the not focused ones, demonstrating that when searching focused authorities, a higher accuracy is provided by taking into account the page contents. 
In both cases, more than 60 percent of the authoritative pages are in the top 50 pages suggested by the Double Focused PageRank. 5.4 The Multitopic WPSS We evaluated the scoring model which considers the correlation among different topics proposed in Section 4.4. Each surfer was associated to a subtopic correlated to the main topic by using a text classifier. The set of subtopics was determined automatically by the following procedure. First, a set of seed pages for the topic of interest was selected. Then, the context graph of each of these pages was built by back-crawling the Web up to three levels (i.e., the pages one, two, and three clicks away from the seed ones). A hierarchical clustering algorithm on the bag-of-words vectors representing the documents was used to split the set of the pages in the context graph into subsets corresponding to the subtopics. In the experiments, we fixed the maximum number of clusters to be 10. Each cluster obtained in the previous step is associated to a surfer. In order to facilitate the integration with the probabilistic model which is used to compute the page scores, a set of naive Bayes classifiers (see e.g., [22, chapter 6])

11 14 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 1, JANUARY 2004 Fig. 5. The percentage of the authoritative pages as defined by a set of 10 users in the best N pages returned by the various WPSSs, respectively, for the topic (a) Linux and (b) Golf (the highest the best). Vice-versa, in (c), the percentage of the nonauthoritative pages for topic Golf in the best N pages returned by the various WPSSs (the lowest the best) is shown. Similar results hold for the Linux topic. was trained using the documents in each cluster. Finally, the matrix of topic-transition probabilities was estimated from the context graph by counting the number of pages in cluster j which link a page in cluster i and by normalizing this value using the total number of links to the pages in cluster i. The estimated topic-transition matrix was used as the interaction matrix for the multiple surfer model used to compute the page scores. Each surfer was focused on the particular subtopic corresponding to the associated cluster and the surfer behavior was chosen to be focused in the choice of the links to follow, in the jumps to take, and in the bias among these two actions (the value of the parameter d). We performed a set of experiments on the three topics wine, golf, and cooking recipes. Fig. 6 shows the plots of the scores assigned to the pages by each of the 10 surfers for the three data sets. Surfer 0 is the one associated to the topic of interest as defined by the seed pages. Each curve is normalized with respect to the maximum score assigned to a page. As can be seen from the curve corresponding to surfer 0, which is mostly flat, for the topic wine (plot c), only one page in the data set receives a high score by surfer 0 (winelibrary.com), while many pages are assigned similar scores. The scores assigned by the other surfers correspond to the context subtopics and they show a less uniform distribution. For the other data sets, the distribution of the scores assigned by surfer 0 is less uniform. 6 CONCLUSIONS In this paper, we have proposed a general probabilistic framework based on random walks for the definition of ranking functions on a set of hyperlinked documents. The proposed framework allows us the definition of both horizontal (topology-based) and vertical (topic-topology based) rankings. The proposed scheme incorporates many relevant scoring models proposed in the literature. Moreover, it contains novel features which look very appropriate especially for vertical (focused) search engines. In particular, in some of the proposed ranking algorithms, the topological structure of the Web, as well as the content of the Web pages, jointly play a crucial role for the computation of the scores. The experimental results support the effectiveness of the proposal which clearly

12 DILIGENTI ET AL.: A UNIFIED PROBABILISTIC FRAMEWORK FOR WEB PAGE SCORING SYSTEMS 15 Fig. 6. The distribution of page scores for each surfer when using a multisurfer model with 10 surfers. (a) Data set golf. (b) Data set cooking recipes. (c) Data set wine. emerge especially for focused search. Finally, it is worth mentioning that the model described in this paper is very well-suited for the construction of learning-based WPSS, which can, in principle, incorporate the user information while surfing the Web. ACKNOWLEDGMENTS The authors would like to thank Ottavio Calzone and Francesco Scali (DII, University of Siena) who performed some of the experimental evaluations of the scoring systems. Some fruitful discussions with Nicola Baldini concerning the focuseek project ( were also very stimulating and useful for the development of the general framework described in the paper. Finally, the authors would like to thank the anonymous reviewers for the useful suggestions. REFERENCES [1] S. Lawrence and C.L. Giles, Searching the Web, Science, vol. 281, no. 5374, p. 175, [2] S. Lawrence and C.L. Giles, Accessibility of Information on the Web, Nature, vol. 400, no. 8, pp , [3] M. Henzinger, Hyperlink Analysis for the Web, IEEE Internet Computing, vol. 1, no. 5, pp , [4] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, technical report, Computer Science Dept., Stanford Univ., [5] J.M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, J. ACM, vol. 46, no. 5, pp , [6] K. Bharat and M.R. Henzinger, Improved Algorithms for Topic Distillation in a Hyperlinked Environment, Proc. 21st Ann. Int l ACM SIGIR Conf. Research and Development in Information Retrieval, pp , [7] R. Lempel and S. Moran, The Stochastic Approach for Link- Structure Analysis (SALSA) and the TKC Effect, Proc. Ninth World Wide Web Conf. (WWW9), pp , [8] R. Lempel and S. Moran, Salsa: The Stochastic Approach for Link-Structure Analysis, ACM Trans. Information Systems, vol. 19, no. 2, pp , [9] D. Cohn and H. Chang, Learning to Probabilistically Identify Authoritative Documents, Proc. 17th Int l Conf. Machine Learning (ICML), pp , [10] D. Cohn and T. Hofmann, The Missing Link: A Probabilistic Model of Document Content and Hypertext Connectivity, Advances in Neural Information Processing Systems 13, pp , [11] M. Richardson and P. Domingos, The Intelligent Surfer: Probabilistic Combination of Link and Content Information in Pagerank, Advances in Neural Information Processing Systems 14, pp , [12] T. H. Haveliwala, Topic-Sensitive Pagerank, Proc. 11th World Wide Web Conf. (WWW2002), pp , [13] M. Diligenti, M. Gori, and M. Maggini, Web Page Scoring Systems for Horizontal and Vertical Search, Proc. 11th World Wide Web Conf. (WWW2002), pp , [14] G. Greco, S. Greco, and E. Zumpano, A Probabilistic Approach for Distillation and Ranking of Web Pages, World Wide Web, vol. 4, no. 3, pp , 2001.

13 16 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 1, JANUARY 2004 [15] E. Seneta, Non-Negative Matrices and Markov Chains. Springer- Verlag, [16] S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proc. Seventh World Wide Web Conf. (WWW7), pp , [17] M.M. Kessler, Bibliographic Coupling between Scientific Papers, Am. Documentation, vol. 14, pp , [18] S. Chakrabarti, M. Joshi, and V. Tawde, Enhanced Topic Distillation Using Text, Markup Tags, and Hyperlinks, Proc. 24th Ann. Int l ACM SIGIR Conf. Research and Development in Information Retrieval, pp , [19] S. Chakrabarti, M. Van der Berg, and B. Dom, Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, Proc. Eighth Int l World Wide Web Conf. (WWW8), pp , [20] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg, Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text, Proc. Seventh World Wide Web Conf. (WWW7), pp , [21] M. Diligenti, F. Coetzee, S. Lawrence, L. Giles, and M. Gori, Focus Crawling by Context Graphs, Proc. 26th Int l Conf. Very Large Databases (VLDB 2000), pp , [22] T.M. Mitchell, Machine Learning. McGraw-Hill, [23] B. Amento, L. Terveen, and W. Hill, Does Authority Mean Quality? Predicting Expert Quality Ratings of Web Documents, Proc. 23rd Ann. Int l ACM SIGIR Conf. Research and Development in Information Retrieval, pp , Michelangelo Diligenti received the PhD degree in computer science and system engineering in 2002 from the University of Florence, Italy. Currently, he is a research associate at the University of Siena, Italy. He has collaborated with the University of Wollongong and the NEC Research Institute, Pricetown, New Jersey. His main research interests are pattern recognition, text categorization, visual databases, and machine learning applied to the World Wide Web. Marco Gori received the Laurea degree in electronic engineering from Università di Firenze, Italy, in 1984, and the PhD degree in 1990 from Università di Bologna, Italy. From October 1988 to June 1989, he was a visiting student at the School of Computer Science, McGill University, Montreal. In 1992, he became an associate professor of computer science at Università di Firenze and, in November 1995, he joined the University of Siena, where he is currently full professor. His main research interests are in neural networks, pattern recognition, and applications of machine learning to information retrieval on the Internet. He has led a number of research projects on these themes with either national or international partners, and has been involved in the organization of many scientific events, including the IEEE-INNS International Joint Conference on Neural Networks, for which he acted as the program chair (2000). Dr. Gori serves (served) as an associate editor of a number of technical journals related to his areas of expertise, including Pattern Recognition, the IEEE Transactions Neural Networks, Neurocomputing, and the International Journal on Pattern Recognition and Artificial Intelligence. He is the Italian chairman of the IEEE Neural Network Council (R.I.G.), is acting as the cochair of the TC3 technical committee of the IAPR (International Association for Pattern Recognition) on Neural Networks, and is the president of the Italian Association for Artificial Intelligence. Dr. Gori is a fellow of the IEEE. 
Marco Maggini received the Laurea degree (cum laude) in electronic engineering and the PhD degree in computer science and control systems from the University of Firenze in 1991 and 1995, respectively. In February 1996, he became an assistant professor of computer engineering in the School of Engineering at the University of Siena, where, since March 2001, he has been an associate professor. His main research interests are machine learning, neural networks, human-machine interaction, technologies for distributing and searching information on the Internet, and nonstructured databases. He has been collaborating with the NEC Research Institute, Princeton, New Jersey, on parallel processing, neural networks, and financial time series prediction. He is a member of the editorial board of the Electronic Letters on Computer Vision and Image Analysis and an associate editor of the ACM Transactions on Internet Technology. He has been guest editor of a special issue of the ACM Transactions on Internet Technology on machine learning for the Internet. He has contributed to the organization of international and national scientific events. He is a member of the IAPR-IC and the IEEE Computer Society.
