IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 1, JANUARY 2004

A Unified Probabilistic Framework for Web Page Scoring Systems

Michelangelo Diligenti, Marco Gori, Fellow, IEEE, and Marco Maggini, Member, IEEE Computer Society

Abstract: The definition of efficient page ranking algorithms is becoming an important issue in the design of the query interface of Web search engines. Information flooding is a common experience, especially when broad topic queries are issued. Queries containing only one or two keywords usually match a huge number of documents, while users can only afford to visit the first positions of the returned list, which do not necessarily refer to the most appropriate answers. Some successful approaches to page ranking in a hyperlinked environment, like the Web, are based on link analysis. In this paper, we propose a general probabilistic framework for Web Page Scoring Systems (WPSS), which incorporates and extends many of the relevant models proposed in the literature. In particular, we introduce scoring systems for both generic (horizontal) and focused (vertical) search engines. Whereas horizontal scoring algorithms are based only on the topology of the Web graph, vertical rankings also take the page contents into account and are the basis for focused and user-adapted search interfaces. Experimental results are reported to show the properties of some of the proposed scoring systems, with special emphasis on vertical search.

Index Terms: Web page scoring systems, random walks, HITS, PageRank, focused PageRank.

The authors are with the Dipartimento di Ingegneria dell'Informazione, Università degli Studi di Siena, Via Roma 56, I-53100 Siena, Italy. E-mail: {michi,marco,maggini}@dii.unisi.it. Manuscript received 1 Sept. 2002; revised 1 Apr. 2003; accepted 10 Apr. For information on obtaining reprints of this article, please send e-mail to tkde@computer.org and reference the IEEECS Log Number.

1 INTRODUCTION

The Web, with its popularity, fast and uncontrollable growth, and heterogeneity, poses serious challenges to search engine designers, even if most of the techniques required for searching a collection of documents had already been studied in the related fields of databases and information retrieval. The new scenario is quite different from traditional information retrieval applications, which deal with more controlled environments. The Web is extremely dynamic, its rate of growth is impressive, and, above all, there is no central control over which documents are published, or where and when. Nowadays, almost any user of the Web can publish his/her own pages, with any contents, whether he/she is a good author and an expert on the topic he/she is writing about or just a spammer. Thus, there are three main challenges which a search engine has to face.

The first problem is how to find and index the documents on the Web. Search engines are de facto the only central indexes of the information available on the Web, which would otherwise be accessible only by navigating through hyperlinks. Unfortunately, search engines are not able to track the publication of new pages, and the only way they have to build their indexes is to collect the documents by crawling the Web graph. Crawling must be performed continuously and, nowadays, a complete crawl of the whole Web is not feasible. Both the size and the structure of the Web, as well as freshness requirements, force search engines to cover only a fraction of the whole Web [1], [2].

The second issue concerns the size and the heterogeneity of the information available on the Web. The different document formats, the various authoring styles used to write Web documents, and the huge quantity of information require accurate processing to create reliable and efficient indexes.

Finally, the user search interface is probably one of the principal keys to the success of a search engine. Even if most search engines offer advanced search interfaces, most users just use the simple keyword-based one. The user issues his/her query as a list of words, looking for documents which contain all (or some) of these words. While the techniques to retrieve the documents which match the query are relatively simple using inverted indexes, it is difficult to provide high quality and relevant results to the user. Typical queries are based on only a few words (often just one or two) and, thus, they can be described as broad-topic queries. When thousands of documents match a query, the user is flooded by information and can typically only afford to check a very small fraction of the returned answers. Thus, the definition of good document ranking functions turns out to be a crucial issue in search engine design. Proper criteria must be devised to automatically compute a score which evaluates both the relevance of the document with respect to the query and the quality of its contents.

The analysis of the hyperlinks on the Web [3] has been proposed as a way to derive a quality measure for the information on the Web. The structure of the hyperlinks on the Web is the result of the collaborative activity of the community of Web authors. Authors usually like to link resources they consider authoritative, and authority emerges from the dynamics of popularity of the resources on the Web. Sophisticated algorithms have been studied to compute reliable measures of authority from the topological structure of the interconnections among the Web pages. A simple count of the number of links to a page does not take into account the fact that not all the citations have the same authority. PageRank [4], used by the Google search engine, is a noticeable example of a topology-based ranking criterion. The authority of a page is computed recursively as a function of the authorities of the pages that link the target page. HITS [5] is another well-known algorithm which computes two values related to topological properties of the Web pages, the authority and the hubness. The HITS scheme is query-dependent. User queries are issued to a search engine in order to create a set of seed pages. Crawling the Web forward and backward from that seed is performed to mirror the Web portion containing the information which is likely to be useful. A ranking criterion based on topological analysis can then be applied to the pages belonging to the selected Web portion. Very interesting results in this direction have been proposed in [6], [7], [8]. In [9], a Bayesian approach is used to compute hubs and authorities, whereas, in [10], both topological information and information about the page content are included in the distillation of information sources performed by a Bayesian approach. Recently, other approaches which also include the page contents in the score computation have been proposed to define focused measures of document quality [11], [12].

In this paper, we propose a general probabilistic framework for Web Page Scoring Systems (WPSS) which incorporates and extends many of the relevant models proposed in the literature. A first report of the research described in this paper can be found in [13]. Here, we propose a further extension of WPSS and provide additional experimental results. The general Web page scoring model proposed in this paper extends both PageRank [4] and the HITS scheme [5]. In addition, the proposed model exhibits a number of novel features which turn out to be very useful, especially for focused (vertical) search. The content of the pages is combined with the graph structure of the Web, giving rise to scoring mechanisms which are focused on a specific topic. Moreover, in the proposed model, vertical search schemes can take into account the mutual relationships among different topics. In so doing, the discovery of pages with a high score for a given topic affects the score of pages with related topics. Experiments were carried out to assess the features of the proposed scoring systems, with special emphasis on vertical search. The very promising experimental results reported in the paper provide a clear validation of the proposed general scheme for Web page scoring systems.

The paper is organized as follows: The next section introduces the general probabilistic framework based on random walks which can be used to describe the different WPSSs. Section 3 describes the horizontal ranking schemes using the proposed framework. Horizontal rankings are based only on the graph topology and do not consider the page contents. In particular, the well-known PageRank and HITS algorithms and some extensions are derived from the common framework. Vertical scoring systems are described in Section 4. Vertical ranking functions are useful for focused search interfaces and for user-adapted applications. Some different models are described as examples of focused ranking functions which can be derived in the proposed framework. Finally, Section 5 presents a set of experimental evaluations of both horizontal and vertical WPSSs and, in Section 6, the conclusions are drawn.
2 RANDOM WALKS AND PAGE RANKING

The Web can be viewed as a graph $G$ whose nodes correspond to the pages and whose arcs are defined by the hyperlinks between the pages. If $p$ and $q$ are the nodes corresponding to the pages $D_p$ and $D_q$, then there is the arc $(p, q)$ in $G$ if the page $D_p$ contains a hyperlink to the page $D_q$. The topology of the Web graph is quite complex and is the result of the behavior of the community of Web authors. Thus, the graph topology carries much information related to the cooperative interaction of many agents. One of the emerging properties of the resulting graph is that high quality resources tend to be referenced by many Web authors. The idea of using the collaborative judgments on Web resources hidden in the structure of the Web topology has been proposed as the basis for defining page ranking criteria. In particular, random walk theory has been proposed as a framework to define models that compute the absolute relevance of a page [4], [8]. The relevance $x_p$ of the page $p$ is computed as the probability of visiting that page in a random walk on the Web graph. The most popular pages (i.e., the most linked ones) are the most likely to be visited during the random walk on the Web.

2.1 The Single-Surfer Walk

In order to define a general probabilistic framework for random walks, we model the actions of a generic Web surfer. At each step of the walk, the surfer can perform one of the following atomic actions: jump to any node of the graph (action $j$), follow a hyperlink from the current page (action $l$), follow a hyperlink in the inverse direction (action $b$), or stay in the current node (action $s$). Thus, the set of atomic actions is $O = \{j, l, b, s\}$.

At each step, the behavior of the surfer depends on the current page. For example, if the surfer considers the current page relevant, a hyperlink contained in that page will likely be followed, whereas, if the page is not interesting, the surfer is likely to jump to a page not linked by the current one. Thus, the surfer's behavior can be modeled by a set of conditional probabilities which depend on the current page $q$:
- $x(l|q)$: the probability of following a hyperlink from $q$,
- $x(b|q)$: the probability of following a back-link from $q$,
- $x(j|q)$: the probability of jumping from $q$, and
- $x(s|q)$: the probability of remaining in $q$.
These values must satisfy the normalization constraint $\sum_{o \in O} x(o|q) = 1$.

Most of these actions need to specify their targets. We assume that the surfer's behavior is time-invariant and that the model can assign a specific weight to each link of a page (as in [14]). Thus, we can model the targets for jumps, hyperlinks, and back-links by using the following parameters:
- $x(p|q, j)$: the probability of jumping from page $q$ to page $p$,
- $x(p|q, l)$: the probability of selecting a hyperlink from page $q$ to page $p$; this value is not null only for those pages $p$ linked directly by page $q$, i.e., $p \in ch(q)$, $ch(q)$ being the set of the children of node $q$ in the graph $G$, and
- $x(p|q, b)$: the probability of going back from page $q$ to page $p$; this value is not null only for the pages $p$ which directly link page $q$, i.e., $p \in pa(q)$, $pa(q)$ being the set of the parents of node $q$ in the graph $G$.
These sets of values must satisfy the following probability normalization constraints for each page $q \in G$:
$$\sum_{p \in G} x(p|q, j) = 1, \qquad \sum_{p \in ch(q)} x(p|q, l) = 1, \qquad \sum_{p \in pa(q)} x(p|q, b) = 1.$$

The random walk is defined by the sequence of actions performed by the surfer. The probabilistic model can be used to compute the probability $x_p(t)$ that the surfer is located in page $p$ at time $t$. The probability distribution on all the pages is represented by the vector $x(t) = [x_1(t), \ldots, x_N(t)]'$, $N$ being the total number of pages. The probabilities $x_p(t)$ are updated at each step of the random walk, taking into account the surfer model and, in particular, the probabilities associated to the actions that can be taken, using the following equation:
$$x_p(t+1) = \sum_{q \in G} x(p|q)\, x_q(t) = \sum_{q \in G} x(p|q, j)\, x(j|q)\, x_q(t) + \sum_{q \in pa(p)} x(p|q, l)\, x(l|q)\, x_q(t) + \sum_{q \in ch(p)} x(p|q, b)\, x(b|q)\, x_q(t) + x(s|p)\, x_p(t), \quad (1)$$
where the probability $x(p|q)$ of moving from page $q$ to page $p$ is expanded by considering the surfer's actions. The probabilities defining the surfer model can be organized in the following $N \times N$ matrices:
- the forward matrix $\Delta$, whose element $(p, q)$ is the probability $x(p|q, l)$;
- the backward matrix $\Gamma$, collecting the probabilities $x(p|q, b)$; and
- the jump matrix $\Sigma$, which is defined by the jump probabilities $x(p|q, j)$.
The forward and backward matrices are related to the Web adjacency matrix $W$, whose entry $(p, q)$ is one if page $p$ links page $q$ and zero otherwise. In particular, the forward matrix $\Delta$ can have nonnull entries only in the positions corresponding to the entries equal to 1 in the matrix $W'$, and the backward matrix $\Gamma$ can have nonnull entries only in the positions corresponding to the entries equal to 1 in the matrix $W$.

We can also define the set of action matrices, which collect the probabilities of taking one of the possible actions from a given page $q$. These are $N \times N$ diagonal matrices defined as follows: $D_j$, whose diagonal values $(q, q)$ are the probabilities $x(j|q)$; $D_l$, collecting the probabilities $x(l|q)$; $D_b$, containing the values $x(b|q)$; and $D_s$, having the probabilities $x(s|q)$ on its diagonal. Hence, (1) can be written in matrix form as
$$x(t+1) = \Sigma D_j\, x(t) + \Delta D_l\, x(t) + \Gamma D_b\, x(t) + D_s\, x(t). \quad (2)$$
By defining the transition matrix as $T = \Sigma D_j + \Delta D_l + \Gamma D_b + D_s$, (2) can be written as
$$x(t+1) = T\, x(t). \quad (3)$$
Given the initial distribution $x(0)$, (3) can be applied recursively to compute the probability distribution at a given time step $t$, yielding
$$x(t) = T^t\, x(0). \quad (4)$$
The absolute page rank for the pages on the Web is obtained by considering the stationary distribution of the Markov chain defined by the previous equation. $T'$ is the state transition matrix of this Markov chain. The computation is stable, since $T$ is a stochastic matrix (each of its columns sums to one) having its maximum eigenvalue equal to 1. Since the state vector $x(t)$ evolves following the equation of a Markov chain, it is guaranteed that, if $\sum_{q \in G} x_q(0) = 1$, then $\sum_{q \in G} x_q(t) = 1$, $t = 1, 2, \ldots$

Proposition 1. If $x(j|q) \neq 0$ and $x(p|q, j) \neq 0$, $\forall p, q \in G$, then there exists $x^\star$ such that $\lim_{t \to \infty} x(t) = x^\star$ and $x^\star$ does not depend on the initial state vector $x(0)$.

Proof. Because of the hypotheses, the matrix $\Sigma D_j$ is strictly positive, i.e., all its entries are greater than 0. Since the transition matrix $T$ is obtained by adding nonnegative matrices to $\Sigma D_j$, $T$ is strictly positive as well. Thus, the resulting Markov chain is irreducible and, consequently, it has a unique stationary distribution given by the solution of the equation $x^\star = T x^\star$, where $x^\star$ satisfies $(x^\star)' e = 1$, $e$ being the $N$-dimensional vector whose entries are all equal to 1 (see, e.g., [15]).
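To make the computation of (1)-(4) concrete, the following short Python sketch builds the transition matrix $T$ for a small hand-made graph and iterates $x(t+1) = T x(t)$ until convergence. The graph, the action probabilities, and the stopping threshold are illustrative assumptions and are not taken from the paper.

```python
# A minimal sketch of the single-surfer walk of Section 2.1 on a toy graph.
# The matrix names (Sigma, Delta, Gamma, D_j, ...) follow the text; all
# numerical values below are illustrative assumptions.
import numpy as np

# Adjacency matrix W: W[p, q] = 1 if page p links page q.
W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N = W.shape[0]

out_deg = W.sum(axis=1)            # |ch(q)| for each page q
in_deg = W.sum(axis=0)             # |pa(q)| for each page q

# Target distributions: element (p, q) is the probability of reaching p from q.
Sigma = np.full((N, N), 1.0 / N)            # x(p|q, j): uniform jump
Delta = W.T / np.maximum(out_deg, 1)        # x(p|q, l): uniform over ch(q)
Gamma = W / np.maximum(in_deg, 1)           # x(p|q, b): uniform over pa(q)

# Action probabilities x(j|q), x(l|q), x(b|q), x(s|q) (they sum to 1 per page).
D_j = np.diag(np.full(N, 0.2))
D_l = np.diag(np.full(N, 0.5))
D_b = np.diag(np.full(N, 0.2))
D_s = np.diag(np.full(N, 0.1))

# Transition matrix of (3): column q holds the distribution of the next page.
T = Sigma @ D_j + Delta @ D_l + Gamma @ D_b + D_s

x = np.full(N, 1.0 / N)            # x(0): uniform initial distribution
for _ in range(1000):
    x_new = T @ x                  # x(t+1) = T x(t), as in (3)
    if np.abs(x_new - x).sum() < 1e-10:
        break
    x = x_new
print("stationary scores:", x_new)
```

Since the jump component makes every entry of $T$ strictly positive, the iteration converges to the same stationary vector regardless of $x(0)$, in agreement with Proposition 1.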

2.2 The Multisurfer Walk

A model based on a single variable may not be able to capture the complex relationships and dependencies among Web pages needed to compute their absolute relevance. Ranking schemes based on multiple variables have been proposed in [5], [8], where two variables are used to measure the hubness and the authority of each page. The random walk framework can be extended by considering a pool of Web surfers having different behaviors, in order to capture different properties of the Web. Each surfer can be modeled by using different values for the parameters in the random walk (2), in order to define different policies for evaluating the absolute importance of the pages. Moreover, the surfers may interact by accepting suggestions from each other.

The multisurfer model considers $M$ different surfers. For each surfer $i$, $i = 1, \ldots, M$, $x_q^{(i)}(t)$ represents the probability that surfer $i$ is visiting page $q$ at time $t$. Each surfer may accept the suggestion of another surfer before taking an action. The interaction among the surfers is modeled by a set of parameters $b(i|k)$ which define the probability that surfer $k$ jumps to the page currently visited by surfer $i$. Thus, in this model, we hypothesize that the interaction does not depend on the pages which the surfers are currently visiting, but only on how much the surfers trust each other. The values $b(i|k)$ must satisfy the probability normalization constraint $\sum_{s=1}^{M} b(s|k) = 1$, $\forall k = 1, \ldots, M$.

Hence, before taking any action, surfer $i$ moves to page $p$ with probability $v_p^{(i)}(t)$ due to the suggestions of the other surfers. This probability is computed as
$$v_p^{(i)}(t) = \sum_{s=1}^{M} b(s|i)\, x_p^{(s)}(t). \quad (5)$$
The intermediate distribution $v_p^{(i)}$ is computed before taking the action that generates the new probability distribution $x_p^{(i)}$ at the following time step. This intermediate step is introduced to synchronize the pool of surfers.

We can organize the probability distributions on the pages for the $M$ surfers, $x^{(i)}(t)$, as the columns of an $N \times M$ matrix $X(t)$. Moreover, we can define the $M \times M$ matrix $A$ whose $(i, k)$ element is $b(i|k)$. The matrix $A$ will be referred to as the interaction matrix. The modified probability distributions $v_p^{(i)}(t)$, due to the interaction among the surfers, can be collected in an $N \times M$ matrix $V(t)$, which is obtained as $V(t) = X(t) A$. Finally, the behavior of surfer $i$ is modeled by the set of jump, forward, and backward matrices $\Sigma^{(i)}$, $\Delta^{(i)}$, $\Gamma^{(i)}$, and by the action matrices $D_j^{(i)}$, $D_l^{(i)}$, $D_b^{(i)}$, $D_s^{(i)}$, as in (2). Thus, the transition matrix for the Markov chain associated to surfer $i$ is
$$T^{(i)} = \Sigma^{(i)} D_j^{(i)} + \Delta^{(i)} D_l^{(i)} + \Gamma^{(i)} D_b^{(i)} + D_s^{(i)}.$$
The set of the $M$ interacting surfers can be described by the following equations:
$$x^{(i)}(t+1) = T^{(i)}\, X(t)\, A^{(i)}, \qquad i = 1, \ldots, M, \quad (6)$$
where $A^{(i)}$ denotes the $i$th column of the matrix $A$. When the surfers are independent of each other (i.e., $b(i|i) = 1$ and $b(i|j) = 0$ for $j \neq i$, $i, j = 1, \ldots, M$), the model reduces to a set of $M$ independent models as described by (4).

The general model described herein gives rise to many different ranking schemes, some of which are classified in Table 1 and will be analyzed in detail in the remainder of the paper.

TABLE 1: Main Features of the Proposed Ranking Functions. The H (V) labels refer to functions for, respectively, horizontal (vertical) scoring systems. The S and M labels indicate whether the ranking function is underlaid by, respectively, a single surfer or a pool of collaborative surfers. The jump, back, and forward columns indicate whether the corresponding parameter, describing a surfer behavior, is focused (F) or uniform (U) for each proposed ranking function. This table is not exhaustive: other ranking functions (with specific features) could be derived from the proposed general framework by choosing appropriate settings.
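The interacting-surfer update of (5) and (6) can be sketched along the same lines as the single-surfer walk. In the Python fragment below, the toy graph, the two surfer behaviors, and the interaction matrix $A$ are illustrative assumptions chosen only to show how the columns of $X(t)$ are mixed through $A$ before each surfer applies its own transition matrix.

```python
# A toy sketch of the multisurfer update (6): two surfers, one mostly
# following links and one mostly following back-links. All values are
# illustrative assumptions, not taken from the paper.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N, M = W.shape[0], 2

def transition(d_link, d_back, d_jump):
    """Build T^(i) = Sigma D_j + Delta D_l + Gamma D_b + D_s for one surfer,
    with page-independent action probabilities and uniform target choices."""
    Sigma = np.full((N, N), 1.0 / N)
    Delta = W.T / np.maximum(W.sum(axis=1), 1)
    Gamma = W / np.maximum(W.sum(axis=0), 1)
    d_stay = 1.0 - d_link - d_back - d_jump
    return (Sigma * d_jump + Delta * d_link + Gamma * d_back
            + np.eye(N) * d_stay)

# Surfer 1 mostly follows links, surfer 2 mostly follows back-links.
T = [transition(0.7, 0.1, 0.2), transition(0.1, 0.7, 0.2)]

# Interaction matrix A: element (i, k) is b(i|k); each column sums to 1.
A = np.array([[0.8, 0.3],
              [0.2, 0.7]])

X = np.full((N, M), 1.0 / N)        # column i = distribution of surfer i
for _ in range(1000):
    V = X @ A                        # suggestions, V(t) = X(t) A as in (5)
    X_new = np.column_stack([T[i] @ V[:, i] for i in range(M)])  # eq. (6)
    if np.abs(X_new - X).max() < 1e-12:
        break
    X = X_new
print("surfer scores:\n", X_new)
```

Because each column of $A$ sums to one and each $T^{(i)}$ is column-stochastic, every column of $X(t)$ keeps summing to one, so the pool of surfers preserves the probabilistic interpretation of the scores.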
3 HORIZONTAL WPSS

Horizontal WPSSs define an absolute ranking on a set of Web pages using only the topological information represented by the Web graph. These approaches are motivated by the idea that, in a hyperlinked environment, the structure of the interconnections should reflect the quality of the resources, i.e., scarcely linked pages are low quality pages, whereas highly referenced pages are relevant sources of information. Different criteria can be defined by refining this simple idea in order to define the authority of a page in the hyperlinked environment. In the proposed probabilistic framework, horizontal WPSSs are characterized by the fact that the parameters used in the probability computations are independent of the page contents.

In particular, in this section, we show how the two most popular WPSSs, PageRank and HITS, can be described as special cases of the random walk framework, even if the original HITS algorithm violates the probabilistic assumptions. We also introduce a hybrid version of these two algorithms.

3.1 PageRank

The computation of PageRank [16] can be modeled by a single-surfer random walk by choosing a surfer model based only on two actions: the surfer jumps to a new random page with probability $x(j|p) = 1 - d$ or he follows one link from the current page with probability $x(l|p) = d$. The probabilities of the other two actions, considered in the general model, are null, i.e., $x(b|p) = 0$ and $x(s|p) = 0$. All these values are clearly independent of the page $p$. Given that a jump is taken, its target is selected using a uniform probability distribution over all the $N$ Web pages, i.e., $x(p|j) = 1/N$, $\forall p \in G$. Finally, the probability of following the hyperlink from page $q$ to page $p$ does not depend on the page $p$, i.e., $x(p|q, l) = \sigma_q$. In order to meet the normalization constraint, $\sigma_q = 1/h_q$, where the hubness of page $q$, $h_q = |ch(q)|$, is the number of links exiting from page $q$ (the number of children of the node $q$ in $G$).

This requirement cannot be met by sink pages, i.e., pages which do not contain any link to other pages. In order to keep the probabilistic interpretation of PageRank, all sink nodes must be removed, unless the computation is slightly modified as described further.

By using the PageRank surfer model, (1) can be rewritten as
$$x_p(t+1) = \frac{1-d}{N} \sum_{q \in G} x_q(t) + d \sum_{q \in pa(p)} \sigma_q\, x_q(t) = \frac{1-d}{N} + d \sum_{q \in pa(p)} \frac{x_q(t)}{h_q}, \quad (7)$$
where $\sum_{q \in G} x_q(t) = 1$ because the probabilistic interpretation is valid. The fact that $0 < d < 1$ and, thus, $x(j|p) = 1 - d > 0$ guarantees that the PageRank vector converges to a distribution of page scores that does not depend on the initial distribution. The matrix form of the PageRank equation is
$$x(t+1) = \frac{1-d}{N}\, e + d\, W' \Omega\, x(t), \quad (8)$$
where $W$ is the adjacency matrix of the Web graph, $\Omega$ is the diagonal matrix whose $(p, p)$ element is the inverse of the hubness $h_p$ of page $p$, and $e$ is the $N$-dimensional vector of all ones. Thus, because of the hypothesis of independence of the parameters $x(p|q, l)$ from the page $p$, it follows that $\Delta D_l = d\, W' \Omega$, i.e., the forward matrix can be factorized into the product of the transposed adjacency matrix of the graph $W'$ and the hubness diagonal matrix $\Omega$.

Sink nodes violate the probabilistic constraints, since no link can actually be followed from a sink node, while the surfer model considers this possibility as a valid action (i.e., $x(l|q) = d \neq 0$). In order to overcome this problem, it should be $x(l|q) = 0$ and, consequently, $x(j|q) = 1$ for any sink node $q$. Thus, in order to consider also the sink nodes, the PageRank computation should be modified by using
$$x(j|q) = 1 - d \ \ \text{if } ch(q) \neq \emptyset, \qquad x(j|q) = 1 \ \ \text{if } ch(q) = \emptyset. \quad (9)$$
In this case, the contribution of the jump probabilities does not sum to a constant term as happens in (7), but the value $x(p|j, t) = \frac{1}{N} \sum_{q \in G} x(j|q)\, x_q(t)$, which represents the probability of jumping to page $p$ at time $t$, must be computed at the beginning of each iteration. This is the computational scheme we used in our experiments.
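A compact sketch of this computational scheme is given below: the jump term is recomputed at every iteration as described for (9), so sink pages simply redistribute their whole probability mass uniformly. The toy graph and the value $d = 0.85$ are illustrative assumptions.

```python
# A sketch of the PageRank surfer of Section 3.1 with the sink handling of (9).
# The graph and d = 0.85 are illustrative assumptions.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 0],
              [0, 0, 0, 1],
              [0, 0, 0, 0]], dtype=float)   # page 3 is a sink (no out-links)
N = W.shape[0]
d = 0.85

h = W.sum(axis=1)                            # hubness h_q = |ch(q)|
is_sink = (h == 0)
x_j = np.where(is_sink, 1.0, 1.0 - d)        # x(j|q) as in (9)

x = np.full(N, 1.0 / N)
for _ in range(100):
    # Probability mass routed through jumps at this iteration (uniform targets).
    jump_mass = np.dot(x_j, x)
    follow = np.where(is_sink, 0.0, d) * x / np.maximum(h, 1)
    x = jump_mass / N + W.T @ follow         # x_p(t+1), cf. (7)-(9)
    # x keeps summing to 1, so no renormalization is needed.
print("PageRank scores:", x)
```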
3.2 HITS

The HITS algorithm was proposed to identify authoritative documents relying only on the information hidden in the connections among them, due to cocitation or Web hyperlinks [5]. The algorithm assigns two values to each page $p$: the authority is a measure of the page relevance as an information source, while the hubness refers to the quality of a page as a link to authoritative resources. Thus, the two values computed by the HITS algorithm allow us to distinguish between pages which are authorities and pages which are hubs. In the original formulation, these values are computed by iteratively applying the following equations:
$$a_q(t+1) = \sum_{p \in pa(q)} h_p(t), \qquad h_q(t+1) = \sum_{p \in ch(q)} a_p(t), \quad (10)$$
where $a_q$ indicates the authority of page $q$ and $h_q$ its hubness. If $a(t)$ is the vector collecting all the authorities at step $t$ and $h(t)$ is the hubness vector at step $t$, the previous equations can be rewritten in matrix form as
$$a(t+1) = W'\, h(t), \qquad h(t+1) = W\, a(t), \quad (11)$$
where $W$ is the adjacency matrix of the Web graph. At each time step, the HITS algorithm requires normalizing the two vectors $a(t)$ and $h(t)$ to have unit length.

It can be demonstrated that, as $t$ tends to infinity, the direction of the authority vector tends to be parallel to the main eigenvector of the matrix $W'W$ (the cocitation matrix, whose entry $(p, q)$ is the number of pages which jointly link the pages $p$ and $q$ [17]), whereas the hubness vector tends to be parallel to the main eigenvector of the matrix $WW'$ (the bibliographic coupling matrix, whose entry $(p, q)$ is the number of pages jointly linked by the pages $p$ and $q$ [17]). See [5] for further details.

The HITS ranking scheme can be represented in the general Web surfer framework, even if some of the assumptions violate the probabilistic interpretation. Since HITS uses two state variables, the hubness and the authority of a page, the corresponding random walk model is a multisurfer scheme based on the activity of two surfers. Surfer 1 is associated to the authority of the pages, whereas surfer 2 is associated to their hubness. For both surfers, the probabilities of remaining in the same page, $x^{(i)}(s|p)$, and of jumping to a random page, $x^{(i)}(j|p)$, are null. Surfer 1 never follows a back-link, i.e., $x^{(1)}(b|p) = 0$, $\forall p \in G$, whereas he always follows a link, i.e., $x^{(1)}(l|p) = 1$, $\forall p \in G$. In order to obtain the original HITS computation, we must set $x^{(1)}(p|q, l) = 1$ for each page $p$ linked by page $q$. This assumption violates the probability normalization constraints, since $\sum_{p \in ch(q)} x^{(1)}(p|q, l) = |ch(q)| \geq 1$. Surfer 2 has the opposite behavior with respect to surfer 1: he always follows a back-link, i.e., $x^{(2)}(b|p) = 1$, $\forall p \in G$, and he never follows a link, i.e., $x^{(2)}(l|p) = 0$. In this case, the normalization constraint is violated for the values $x^{(2)}(p|q, b)$, because the HITS scheme defines $x^{(2)}(p|q, b) = 1$ for each page $p$ linking page $q$ and, thus, $\sum_{p \in pa(q)} x^{(2)}(p|q, b) = |pa(q)| \geq 1$.

The HITS equations can easily be modified in order to obtain a probabilistically coherent model. We just need to choose $x^{(1)}(p|q, l) = 1/|ch(q)|$ and $x^{(2)}(p|q, b) = 1/|pa(q)|$. This model is analyzed in [8]. Thus, the action matrices describing the HITS surfers are $D_l^{(1)} = I$ and $D_b^{(2)} = I$, $I$ being the identity matrix, whereas $D_j^{(1)}$, $D_b^{(1)}$, $D_s^{(1)}$, $D_j^{(2)}$, $D_l^{(2)}$, $D_s^{(2)}$ are all equal to the null matrix. Moreover, the interaction between the surfers is described by the matrix
$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}. \quad (12)$$
The interpretation of the interactions represented by this matrix is that surfer 1 considers surfer 2 as an expert in finding hubs and always moves to the position suggested by that surfer before taking his own action. On the other hand, surfer 2 considers surfer 1 as an expert in discovering authorities and, thus, he always moves to the position suggested by that surfer before choosing the next action. In this case, (6) becomes
$$x^{(1)}(t+1) = \Delta^{(1)}\, X(t)\, A^{(1)}, \qquad x^{(2)}(t+1) = \Gamma^{(2)}\, X(t)\, A^{(2)}. \quad (13)$$
Using (12) and the HITS assumptions $\Delta^{(1)} = W'$ and $\Gamma^{(2)} = W$, we obtain
$$x^{(1)}(t+1) = W'\, x^{(2)}(t), \qquad x^{(2)}(t+1) = W\, x^{(1)}(t), \quad (14)$$
which, redefining $a(t) = x^{(1)}(t)$ and $h(t) = x^{(2)}(t)$, is equivalent to the HITS computation of (11).

The HITS model violates the probabilistic interpretation and this makes the computation unstable, since the $W'W$ matrix has a principal eigenvalue larger than 1. Hence, the HITS algorithm requires the score vectors to be normalized at the end of each iteration. Finally, the HITS scheme suffers from other drawbacks. In particular, large, highly connected communities of Web pages tend to attract the principal eigenvector of $W'W$, thus pushing to zero the relevance of all the other pages. As a result, the page scores tend to decrease rapidly to zero for the pages outside those communities. In [8], this effect is analyzed in detail and referred to as the Tightly Knit Community effect. Because of this, the HITS algorithm can be reliably applied only to small subgraphs of the whole Web, after an accurate pruning of the links. In fact, the HITS computation has been proposed as a scoring algorithm to be applied to the result set of a query (the root set), augmented by the pages which link and are linked by those in the root set, and not to the whole Web. Recently, some heuristics have been proposed to reduce the problems affecting the original HITS algorithm, even if such behavior cannot be generally avoided because of the properties of the dynamic system associated to the HITS computation [18].
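For reference, the original HITS iteration of (10) and (11), with the per-step renormalization the algorithm requires, can be sketched as follows. The probabilistically coherent variant discussed above would instead weight each link by $1/|ch(q)|$ and each back-link by $1/|pa(q)|$. The small graph is an illustrative assumption; $W[p, q] = 1$ means that page $p$ links page $q$.

```python
# A short sketch of the basic HITS iteration (10)-(11) with renormalization.
# The graph below is an illustrative assumption.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1],
              [1, 0, 0, 0]], dtype=float)
N = W.shape[0]

a = np.ones(N)                      # authority scores
h = np.ones(N)                      # hub scores
for _ in range(100):
    a_new = W.T @ h                 # a_q = sum of the hubness of pages linking q
    h_new = W @ a                   # h_q = sum of the authorities q links to
    a = a_new / np.linalg.norm(a_new)   # normalization keeps the iteration bounded
    h = h_new / np.linalg.norm(h_new)
print("authority:", a)
print("hubness:  ", h)
```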
3.3 PageRank-HITS

The multisurfer model allows us to combine the properties of the PageRank and HITS algorithms. Both of these algorithms have benefits and limitations, and the aim of the PageRank-HITS model is to combine the positive characteristics of the two techniques. In fact, the computation of PageRank is stable and has a well-defined behavior because of its probabilistic interpretation. Moreover, it can be applied to large page collections, since small communities are not overwhelmed by larger ones but still influence the ranking. On the other hand, PageRank is too simple to take into account the complex relationships of Web page citations. The HITS algorithm is not stable, only the largest Web community influences the ranking and, thus, it cannot be applied to large page collections. On the other hand, the hub and authority model can capture the relationships among Web pages in more detail than PageRank.

The PageRank-HITS model employs two surfers. Surfer 1 follows a back-link with probability $x^{(1)}(b|q) = d^{(1)}$ or jumps to a random page with probability $x^{(1)}(j|q) = 1 - d^{(1)}$, $\forall q \in G$. In both cases, the target page $p$ is selected using a uniform probability distribution, i.e., $x^{(1)}(p|q, b) = 1/|pa(q)|$ and $x^{(1)}(p|q, j) = 1/N$. Surfer 2 follows a forward link with probability $x^{(2)}(l|q) = d^{(2)}$ or jumps to a random page with probability $x^{(2)}(j|q) = 1 - d^{(2)}$, $\forall q \in G$. In both cases, the target page $p$ is selected using a uniform probability distribution, i.e., $x^{(2)}(p|q, l) = 1/|ch(q)|$ and $x^{(2)}(p|q, j) = 1/N$. Thus, surfer 2 implements the PageRank model, while surfer 1 can be regarded as performing a backward PageRank computation. (As shown in Section 3.1, sink nodes violate the probabilistic constraints for surfer 2; in this model, supersource nodes, i.e., the nodes $q$ having $|pa(q)| = 0$, also violate the probabilistic constraints for surfer 1. The equations can be modified straightforwardly to eliminate these problems.) As in the HITS scheme, the interaction between the surfers is described by the matrix
$$A = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$
In this case, (6) becomes
$$x^{(1)}(t+1) = \frac{1 - d^{(1)}}{N}\, e + d^{(1)}\, W \Pi\, x^{(2)}(t), \qquad x^{(2)}(t+1) = \frac{1 - d^{(2)}}{N}\, e + d^{(2)}\, W' \Omega\, x^{(1)}(t), \quad (15)$$
where $\Gamma^{(1)} = W \Pi$ and $\Delta^{(2)} = W' \Omega$, $\Pi$ being the diagonal matrix with element $(p, p)$ equal to $1/|pa(p)|$ and $\Omega$ the diagonal matrix with element $(p, p)$ equal to $1/|ch(p)|$. This page rank is stable, the scores sum up to 1, and no normalization is required at the end of each iteration. Moreover, the two state variables can capture and process more complex relationships among pages. In particular, setting $d^{(1)} = d^{(2)} = 1$ yields a normalized version of HITS, which has been proposed in [6].

4 VERTICAL WPSS

Vertical WPSSs compute a relative ranking of pages when focusing on a specific topic. When applying scoring techniques to focused search, the page contents should be taken into account besides the graph topology. A vertical WPSS uses a set of features (e.g., a set of keywords) representing the page contents and a classifier which assigns to each page its degree of relevance with respect to the topic of interest. The general random walk framework for WPSSs proposed in this paper can be used to define a vertical approach to page scoring. Several models can be derived which combine the ideas of the topology-based criteria and the topic relevance measure provided by the text classifier. In particular, the text classifier can be used to compute the values of the probabilities needed by the random walk model. As shown by the experimental results, vertical WPSSs produce much more accurate results in ranking topic-specific pages.

4.1 Focused PageRank

In the PageRank framework, when choosing to follow a link from a page $q$, each link has the same probability $1/|ch(q)|$ of being followed. In the focused domain, we can consider the model of a surfer who follows the links according to the suggestions provided by a text classifier. Thus, this approach removes the assumption of complete randomness in the movements of the Web surfer. In this case, the surfer is aware of what he is searching and he will trust the classifier suggestions, following the links with a probability proportional to the topic relevance of the page which is the target of the link. This allows us to derive a topic-specific page ranking. For example, the Microsoft home page is highly authoritative according to the topic-generic PageRank, whereas it is not highly authoritative when searching for Perl language tutorials. In fact, even if that page is highly linked, most of the links are scarcely related to the target topic and their contribution will be negligible.

If the surfer is located at page $q$ and the pages linked by page $q$ are assigned the scores $s(ch_1(q)), \ldots, s(ch_{h_q}(q))$ by the classifier, the probability that the surfer follows the $i$th link is defined as
$$x(ch_i(q)|q, l) = \frac{s(ch_i(q))}{\sum_{j=1}^{h_q} s(ch_j(q))}. \quad (16)$$
Thus, the forward matrix depends on the classifier outputs on the pages in the data set. Hence, the modified equation to compute the combined page scores using a PageRank-like scheme is
$$x_p(t+1) = \frac{1-d}{N} + d \sum_{q \in pa(p)} x(p|q, l)\, x_q(t), \quad (17)$$
where $x(p|q, l)$ is computed as in (16).

4.2 Double Focused PageRank

The focused PageRank surfer, described in the previous section, uses a topic-specific distribution for selecting the link to follow, but the decision on the action to take does not depend on the contents of the current page. A more accurate model should consider that the decision about which action to take usually depends on the contents of the current page. For example, let us suppose that two surfers are searching for a Perl language tutorial and that the first one is located at a page related to that topic, while the second one is located at a page unrelated to it. Clearly, it is more likely that the first surfer will decide to follow a link from the current page, while the second one will prefer to jump to another page which is related to the topic he is interested in.

We can model this behavior by adapting the action probabilities using the contents of the current page, thus modeling a focused choice of the surfer's actions. In particular, the probability of following a hyperlink can be chosen to be proportional to the degree of relevance $s(p)$ of the current page $p$ with respect to the target topic, i.e.,
$$x(l|p) = d\, \frac{s(p)}{\max_{q \in G} s(q)}, \quad (18)$$
where $s(p)$ is computed by the text classifier. On the other hand, the probability of jumping away from a page decreases proportionally to $s(p)$, i.e.,
$$x(j|p) = 1 - d\, \frac{s(p)}{\max_{q \in G} s(q)}. \quad (19)$$
Finally, we assume that the probability of landing in a page $p$ after a jump is proportional to its relevance $s(p)$, i.e.,
$$x(p|j) = \frac{s(p)}{\sum_{q \in G} s(q)}. \quad (20)$$
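The following sketch combines the focused link choice of (16) with the content-dependent action probabilities (18)-(20) into a Double Focused PageRank iteration. The toy graph, the relevance scores $s(p)$, and the value $d = 0.85$ are illustrative assumptions; in the model described above, $s(p)$ would be produced by the text classifier.

```python
# A sketch of the Double Focused PageRank surfer of Section 4.2, eqs. (16)-(20).
# Graph, relevance scores s, and d = 0.85 are illustrative assumptions.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [0, 0, 1, 1],
              [1, 0, 0, 1],
              [0, 0, 1, 0]], dtype=float)
N = W.shape[0]
d = 0.85
s = np.array([0.9, 0.1, 0.7, 0.3])           # topic relevance of each page

# Focused forward matrix, eq. (16): links are weighted by the score of their target.
Delta = W.T * s[:, None]                      # element (p, q) ~ s(p) if q links p
Delta = Delta / np.maximum(Delta.sum(axis=0), 1e-12)   # normalize over ch(q)

x_l = d * s / s.max()                         # x(l|p), eq. (18)
x_j = 1.0 - x_l                               # x(j|p), eq. (19)
jump_target = s / s.sum()                     # x(p|j), eq. (20)

x = np.full(N, 1.0 / N)
for _ in range(200):
    jump_mass = np.dot(x_j, x)                # total probability of jumping
    x = jump_target * jump_mass + Delta @ (x_l * x)
print("double-focused scores:", x)
```

Since $x(l|p) + x(j|p) = 1$ for every page, the scores keep summing to one, and on-topic pages both attract jumps and retain the surfer, which is exactly the behavior the double focusing is meant to capture.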
4.3 Focused HITS

The multisurfer model may be used to derive a modification of the HITS algorithm similar to that proposed in [19], [20]. This model also takes into account textual information, in order to enforce the influence of the links pointing to on-topic pages and to filter the noise introduced by links to off-topic pages. The model is based on the two coupled surfers implementing the HITS algorithm, as shown in Section 3.2. In the focused version, each surfer selects the links (or back-links) to follow using the scores assigned by the text classifier to the target pages.

Given the scores $s(ch_1(q)), \ldots, s(ch_{h_q}(q))$ assigned by the text classifier to the pages $ch_i(q)$ linked by page $q$, the first surfer selects the link $i$ to follow from page $q$ by using the probability distribution defined by
$$x^{(1)}(ch_i(q)|q, l) = \frac{s(ch_i(q))}{\sum_{j=1}^{h_q} s(ch_j(q))}. \quad (21)$$
Likewise, given the scores $s(pa_1(q)), \ldots, s(pa_m(q))$ assigned by the text classifier to the $m$ pages $pa_i(q)$ linking page $q$, the second surfer selects the back-link $i$ to follow from page $q$ using the probability distribution
$$x^{(2)}(pa_i(q)|q, b) = \frac{s(pa_i(q))}{\sum_{j=1}^{m} s(pa_j(q))}. \quad (22)$$
The focused HITS model is thus represented by the two equations
$$a_p(t+1) = \sum_{q \in pa(p)} h_q(t)\, x^{(1)}(p|q, l), \qquad h_p(t+1) = \sum_{q \in ch(p)} a_q(t)\, x^{(2)}(p|q, b), \quad (23)$$
where $a_p(t) = x_p^{(1)}(t)$ represents the focused authority computed by the first surfer, $h_p(t) = x_p^{(2)}(t)$ is the focused hubness measured by the second surfer, and the probabilities $x^{(1)}(p|q, l)$ and $x^{(2)}(p|q, b)$ are computed as in (21) and (22).

4.4 Multitopic Rank

Topic hierarchies and topic correlations are of fundamental importance to perform focused search on the Web. Typically, pages on a specific topic may be reached through a path of pages belonging to correlated topics. In particular, Fig. 1 shows a typical scenario on the Web, where a set of researcher home pages can be reached following a path through pages belonging to different categories, such as the home pages of a university department, the home pages of a university faculty, and the home pages of a university. In order to enhance the ranking functions for focused information, a model based on a multisurfer walk can be devised. This model can capture the correlations among topics and reveal more complex properties of the pages due to the topological structure of the topics on the Web.

Fig. 1. Example of topic transitions among connected pages on the Web. In particular, we consider researcher home pages, which are likely to be connected to pages of different categories, among which we consider department home pages, faculty home pages, and university home pages. (a) An example of the neighborhood of a set of researcher home pages. (b) Transition probabilities estimated from the sample Web portion shown in (a), taking into account the four considered categories. Each table row collects the probabilities that a page in the corresponding category points to pages belonging to each considered category.

Fig. 2. The distribution of the page scores for two different topics: (a) Linux and (b) cooking recipes.

By analyzing Web portions densely populated by interesting documents, we can identify a set of topics related to the target one. For example, the set of correlated topics can be discovered automatically using a clustering algorithm. This is the approach we used in the experiments described further on. Once a set $T$ of topics is defined, the probability of the transition between each pair of topics can be estimated from a sample of the Web. $P(\tau'|\tau)$ indicates the probability that a page on topic $\tau$ links a page on topic $\tau'$. We can use these probability values, which reflect the correlations among the $|T|$ topics, to define the interaction matrix of a pool of $|T|$ surfers, where the $\tau$th surfer is focused on the $\tau$th topic. Thus, if topic $\tau'$ is highly correlated to topic $\tau$, then surfer $\tau'$ will be strongly influenced by the activity of surfer $\tau$. Formally, the probability $v_p^{(\tau')}(t)$ that surfer $\tau'$ moves to page $p$ due to the suggestions of the other surfers is
$$v_p^{(\tau')}(t) = \sum_{\tau=1}^{|T|} P(\tau'|\tau)\, x_p^{(\tau)}(t). \quad (24)$$
Thus, the multitopic rank considers an interaction matrix $A$ whose element $(\tau', \tau)$ is equal to $P(\tau'|\tau)$. This choice allows each surfer to move to a position which is more likely to lead to a page on his topic of interest. Finally, each surfer can be modeled using one of the focused approaches described in the previous sections.

5 EXPERIMENTAL RESULTS

We performed a set of experiments in order to analyze the properties of some of the proposed scoring systems and to compare the different rankings. Since we were mainly interested in evaluating the performance of scoring systems for vertical (topic-specific) applications, we based our tests on a set of single-topic data sets.

Fig. 3. The eight top score pages for the data set Linux.

Each data set was collected using the focus crawler described in [21]. In particular, the focus crawler employs a Naive Bayes classifier ([22], chapter 6), which computes the correlation between the content of each downloaded page and the considered topic. The classifier directs the crawl to the most promising Web regions by selecting the links starting from the pages having the highest scores. About 150,000 pages were downloaded for each single crawl. The topics of the page collections were selected to be not too specific, in order to cover many different subtopics in each data set. The selected topics were: pages on the Linux operating system (data set Linux), pages on cooking recipes (data set cooking recipes), pages concerning the sport of golf (data set golf), and documents related to wines (data set wine). For each selected topic, a relevance score was assigned to each page by the Naive Bayes classifiers which had previously been used to focus-crawl the Web. The scores produced by the models were stored to be used in the computation of the vertical page ranks. Considering the hyperlinks contained in each page, a Web subgraph was created from each data set to perform the evaluation of the different WPSSs proposed in the previous sections.

Besides the ranking systems described in the previous sections, we also report the results for the In-link surfer. Such a surfer is located in a page with probability proportional to the number of in-links of that page. For all the PageRank surfers (focused or not), the parameter d was set to the same value in all the experiments.

5.1 Score Distributions

We performed an analysis of the distribution of page scores using the different algorithms proposed in this paper. For each ranking function, we normalized the score values by their maximum over all the pages (thus yielding values in [0, 1]). Then, we sorted the pages according to their ranks and plotted the distribution of the normalized rank values. Fig. 2 reports the plots for the two categories Linux and cooking recipes. In both cases, the HITS surfer assigns a score value significantly greater than zero only to the small set of pages associated to the principal community of the subgraph. On the other hand, PageRank yields a smoother curve for the score distribution. This is the effect of the homogeneous term (1-d)/N in (7). The focused versions of PageRank are still smooth, but concentrate the scores on a smaller set of authoritative pages which are more specific for the considered topic. This reflects the fact that the vertical WPSSs are able to discriminate the authorities on the specific topic, whereas the classical PageRank scheme considers the authoritative pages regardless of their topic.

5.2 Top Lists

Figs. 3 and 4 show the eight top score pages for four different WPSSs on the data sets Linux and cooking recipes, respectively. For the HITS surfer pool, we report the pages with the top authority values.

10 DILIGENTI ET AL.: A UNIFIED PROBABILISTIC FRAMEWORK FOR WEB PAGE SCORING SYSTEMS 13 PageRank can filter many off-topic authoritative pages from the top list. In particular, the Double Focused PageRank WPSS pushes all the authorities on the relevant topic to the top positions. Fig. 4. The eight top score pages for the data set cooking recipes. As shown in Fig. 3, all pages selected by the HITS algorithm are from the same site. This is due to the wellknown property of the HITS algorithm which produces a score vector in the direction of the most interconnected communities. In many cases, it is difficult to reduce this undesirable behavior of the HITS ranking by properly pruning the links among pages. For example, in order to reduce the nepotism of Web pages, for the data set cooking recipes, the connectivity map of pages was pruned removing all the intrasite links. However, as shown in the HITS section of Fig. 4, the Web site com, which is subdivided into a collection of Web sites ( etc.) which are strongly interconnected, occupies all the top positions in the ranking list. In [18], the content of pages is considered in order to propagate relevance scores only over the subset of links pointing to pages on a specific topic. However, in this case, the performance cannot be improved even by using this approach since all the sites in the community are effectively on-topic and, thus, the interconnections are semantically coherent. The PageRank algorithm is not topic dependent, and, consequently, highly interconnected pages result in being authoritative regardless of the topic of interest. For example, pages like com, etc., are shown in the top list even if they are not closely related to the specific topic. The focused versions of 5.3 Comparison of the WPPSs In this section, compare the results obtained by the In-link surfer, the Page Rank surfer, the Focused Page Rank scheme, the Double Focused Page Rank scheme, and the HITS surfer pool. We follow a methodology similar to that one presented in [23]. For each topic, we created a collection of pages which were evaluated by a pool of 10 human experts. The experts independently labeled each page in the collection as authoritative or not authoritative for the specific topic. In particular, the top 15 pages for each ranking function were shown to a set of experts. Each expert provided either a positive, negative, or null feedback on each single page. The labeled pages were used to measure the percentage of positive (negative) results in the top list returned by each ranking function. The length of the top list was varied between 1 and 300. The evaluation was performed on the two data sets Linux and Golf. Fig. 5 reports the percentage of all pages labeled as authoritative by experts among the first N pages in the top list produced by five different WPSSs for the two data sets. In both cases, the HITS algorithm produces the worst ranking. This result confirms the fact that HITS can only be used as a query-dependent ranking schema [3]. As previously reported in [23], in spite of its simplicity, the In-link algorithm has a performance similar to PageRank. In our experiments, PageRank outperformed the In-Links algorithm on the category Golf, whereas it was outperformed on the category Linux. However, in both cases, the gap is small. The two focused ranking functions clearly outperformed all the not focused ones, demonstrating that when searching focused authorities, a higher accuracy is provided by taking into account the page contents. 
In both cases, more than 60 percent of the authoritative pages are in the top 50 pages suggested by the Double Focused PageRank. 5.4 The Multitopic WPSS We evaluated the scoring model which considers the correlation among different topics proposed in Section 4.4. Each surfer was associated to a subtopic correlated to the main topic by using a text classifier. The set of subtopics was determined automatically by the following procedure. First, a set of seed pages for the topic of interest was selected. Then, the context graph of each of these pages was built by back-crawling the Web up to three levels (i.e., the pages one, two, and three clicks away from the seed ones). A hierarchical clustering algorithm on the bag-of-words vectors representing the documents was used to split the set of the pages in the context graph into subsets corresponding to the subtopics. In the experiments, we fixed the maximum number of clusters to be 10. Each cluster obtained in the previous step is associated to a surfer. In order to facilitate the integration with the probabilistic model which is used to compute the page scores, a set of naive Bayes classifiers (see e.g., [22, chapter 6])

11 14 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 1, JANUARY 2004 Fig. 5. The percentage of the authoritative pages as defined by a set of 10 users in the best N pages returned by the various WPSSs, respectively, for the topic (a) Linux and (b) Golf (the highest the best). Vice-versa, in (c), the percentage of the nonauthoritative pages for topic Golf in the best N pages returned by the various WPSSs (the lowest the best) is shown. Similar results hold for the Linux topic. was trained using the documents in each cluster. Finally, the matrix of topic-transition probabilities was estimated from the context graph by counting the number of pages in cluster j which link a page in cluster i and by normalizing this value using the total number of links to the pages in cluster i. The estimated topic-transition matrix was used as the interaction matrix for the multiple surfer model used to compute the page scores. Each surfer was focused on the particular subtopic corresponding to the associated cluster and the surfer behavior was chosen to be focused in the choice of the links to follow, in the jumps to take, and in the bias among these two actions (the value of the parameter d). We performed a set of experiments on the three topics wine, golf, and cooking recipes. Fig. 6 shows the plots of the scores assigned to the pages by each of the 10 surfers for the three data sets. Surfer 0 is the one associated to the topic of interest as defined by the seed pages. Each curve is normalized with respect to the maximum score assigned to a page. As can be seen from the curve corresponding to surfer 0, which is mostly flat, for the topic wine (plot c), only one page in the data set receives a high score by surfer 0 (winelibrary.com), while many pages are assigned similar scores. The scores assigned by the other surfers correspond to the context subtopics and they show a less uniform distribution. For the other data sets, the distribution of the scores assigned by surfer 0 is less uniform. 6 CONCLUSIONS In this paper, we have proposed a general probabilistic framework based on random walks for the definition of ranking functions on a set of hyperlinked documents. The proposed framework allows us the definition of both horizontal (topology-based) and vertical (topic-topology based) rankings. The proposed scheme incorporates many relevant scoring models proposed in the literature. Moreover, it contains novel features which look very appropriate especially for vertical (focused) search engines. In particular, in some of the proposed ranking algorithms, the topological structure of the Web, as well as the content of the Web pages, jointly play a crucial role for the computation of the scores. The experimental results support the effectiveness of the proposal which clearly

12 DILIGENTI ET AL.: A UNIFIED PROBABILISTIC FRAMEWORK FOR WEB PAGE SCORING SYSTEMS 15 Fig. 6. The distribution of page scores for each surfer when using a multisurfer model with 10 surfers. (a) Data set golf. (b) Data set cooking recipes. (c) Data set wine. emerge especially for focused search. Finally, it is worth mentioning that the model described in this paper is very well-suited for the construction of learning-based WPSS, which can, in principle, incorporate the user information while surfing the Web. ACKNOWLEDGMENTS The authors would like to thank Ottavio Calzone and Francesco Scali (DII, University of Siena) who performed some of the experimental evaluations of the scoring systems. Some fruitful discussions with Nicola Baldini concerning the focuseek project ( were also very stimulating and useful for the development of the general framework described in the paper. Finally, the authors would like to thank the anonymous reviewers for the useful suggestions. REFERENCES [1] S. Lawrence and C.L. Giles, Searching the Web, Science, vol. 281, no. 5374, p. 175, [2] S. Lawrence and C.L. Giles, Accessibility of Information on the Web, Nature, vol. 400, no. 8, pp , [3] M. Henzinger, Hyperlink Analysis for the Web, IEEE Internet Computing, vol. 1, no. 5, pp , [4] L. Page, S. Brin, R. Motwani, and T. Winograd, The PageRank Citation Ranking: Bringing Order to the Web, technical report, Computer Science Dept., Stanford Univ., [5] J.M. Kleinberg, Authoritative Sources in a Hyperlinked Environment, J. ACM, vol. 46, no. 5, pp , [6] K. Bharat and M.R. Henzinger, Improved Algorithms for Topic Distillation in a Hyperlinked Environment, Proc. 21st Ann. Int l ACM SIGIR Conf. Research and Development in Information Retrieval, pp , [7] R. Lempel and S. Moran, The Stochastic Approach for Link- Structure Analysis (SALSA) and the TKC Effect, Proc. Ninth World Wide Web Conf. (WWW9), pp , [8] R. Lempel and S. Moran, Salsa: The Stochastic Approach for Link-Structure Analysis, ACM Trans. Information Systems, vol. 19, no. 2, pp , [9] D. Cohn and H. Chang, Learning to Probabilistically Identify Authoritative Documents, Proc. 17th Int l Conf. Machine Learning (ICML), pp , [10] D. Cohn and T. Hofmann, The Missing Link: A Probabilistic Model of Document Content and Hypertext Connectivity, Advances in Neural Information Processing Systems 13, pp , [11] M. Richardson and P. Domingos, The Intelligent Surfer: Probabilistic Combination of Link and Content Information in Pagerank, Advances in Neural Information Processing Systems 14, pp , [12] T. H. Haveliwala, Topic-Sensitive Pagerank, Proc. 11th World Wide Web Conf. (WWW2002), pp , [13] M. Diligenti, M. Gori, and M. Maggini, Web Page Scoring Systems for Horizontal and Vertical Search, Proc. 11th World Wide Web Conf. (WWW2002), pp , [14] G. Greco, S. Greco, and E. Zumpano, A Probabilistic Approach for Distillation and Ranking of Web Pages, World Wide Web, vol. 4, no. 3, pp , 2001.

13 16 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 1, JANUARY 2004 [15] E. Seneta, Non-Negative Matrices and Markov Chains. Springer- Verlag, [16] S. Brin and L. Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Proc. Seventh World Wide Web Conf. (WWW7), pp , [17] M.M. Kessler, Bibliographic Coupling between Scientific Papers, Am. Documentation, vol. 14, pp , [18] S. Chakrabarti, M. Joshi, and V. Tawde, Enhanced Topic Distillation Using Text, Markup Tags, and Hyperlinks, Proc. 24th Ann. Int l ACM SIGIR Conf. Research and Development in Information Retrieval, pp , [19] S. Chakrabarti, M. Van der Berg, and B. Dom, Focused Crawling: A New Approach to Topic-Specific Web Resource Discovery, Proc. Eighth Int l World Wide Web Conf. (WWW8), pp , [20] S. Chakrabarti, B. Dom, P. Raghavan, S. Rajagopalan, D. Gibson, and J. Kleinberg, Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text, Proc. Seventh World Wide Web Conf. (WWW7), pp , [21] M. Diligenti, F. Coetzee, S. Lawrence, L. Giles, and M. Gori, Focus Crawling by Context Graphs, Proc. 26th Int l Conf. Very Large Databases (VLDB 2000), pp , [22] T.M. Mitchell, Machine Learning. McGraw-Hill, [23] B. Amento, L. Terveen, and W. Hill, Does Authority Mean Quality? Predicting Expert Quality Ratings of Web Documents, Proc. 23rd Ann. Int l ACM SIGIR Conf. Research and Development in Information Retrieval, pp , Michelangelo Diligenti received the PhD degree in computer science and system engineering in 2002 from the University of Florence, Italy. Currently, he is a research associate at the University of Siena, Italy. He has collaborated with the University of Wollongong and the NEC Research Institute, Pricetown, New Jersey. His main research interests are pattern recognition, text categorization, visual databases, and machine learning applied to the World Wide Web. Marco Gori received the Laurea degree in electronic engineering from Università di Firenze, Italy, in 1984, and the PhD degree in 1990 from Università di Bologna, Italy. From October 1988 to June 1989, he was a visiting student at the School of Computer Science, McGill University, Montreal. In 1992, he became an associate professor of computer science at Università di Firenze and, in November 1995, he joined the University of Siena, where he is currently full professor. His main research interests are in neural networks, pattern recognition, and applications of machine learning to information retrieval on the Internet. He has led a number of research projects on these themes with either national or international partners, and has been involved in the organization of many scientific events, including the IEEE-INNS International Joint Conference on Neural Networks, for which he acted as the program chair (2000). Dr. Gori serves (served) as an associate editor of a number of technical journals related to his areas of expertise, including Pattern Recognition, the IEEE Transactions Neural Networks, Neurocomputing, and the International Journal on Pattern Recognition and Artificial Intelligence. He is the Italian chairman of the IEEE Neural Network Council (R.I.G.), is acting as the cochair of the TC3 technical committee of the IAPR (International Association for Pattern Recognition) on Neural Networks, and is the president of the Italian Association for Artificial Intelligence. Dr. Gori is a fellow of the IEEE. 
Marco Maggini received the Laurea degree (cum laude) in electronic engineering and the PhD degree in computer science and control systems from the University of Firenze in 1991 and 1995, respectively. In February 1996, he became an assistant professor of computer engineering in the School of Engineering at the University of Siena, where, since March 2001, he has been an associate professor. His main research interests are machine learning, neural networks, human-machine interaction, technologies for distributing and searching information on the Internet, and nonstructured databases. He has been collaborating with the NEC Research Institute, Princeton, New Jersey, on parallel processing, neural networks, and financial time series prediction. He is a member of the editorial board of the Electronic Letters on Computer Vision and Image Analysis and an associate editor of the ACM Transactions on Internet Technology. He has been guest editor of a special issue of the ACM Transactions on Internet Technology on machine learning for the Internet. He has contributed to the organization of international and national scientific events. He is a member of the IAPR-IC and the IEEE Computer Society.
