PageRank and related algorithms PageRank and HITS Jacob Kogan Department of Mathematics and Statistics University of Maryland, Baltimore County Baltimore, Maryland 21250 kogan@umbc.edu May 15, 2006
Basic References
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: bringing order to the web. Stanford Digital Library Technologies Project, 1998, citeseer.ist.psu.edu/page98pagerank.html.
J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46:5, pp. 604-632, 1999.
P. Berkhin. A survey on PageRank computing. Internet Mathematics, vol. 2, no. 1, pp. 73-120, 2005.
Jacob Kogan, UMBC PageRank and related algorithms, optimization 2/21
PageRank
PageRank is a global importance ranking of every web page. The method is based on the link graph of the web, and the model is inspired by academic citation analysis. If a page is linked from an important page (the Yahoo home page, for example), then this link should contribute more to the page's importance than links from obscure pages.
The graph and the matrix
$G(V, E)$ is a directed graph:
- $V$ is the set of vertices/nodes (say, $n$ HTML pages),
- $E$ is the set of directed edges (hyperlinks).
The $n \times n$ adjacency matrix is $A = (A_{ij})$, where
$$A_{ij} = \begin{cases} 1 & \text{if page } i \to j, \\ 0 & \text{otherwise.} \end{cases}$$
Transition matrix P
$P = (P_{ij})$, where
$$P_{ij} = \frac{A_{ij}}{\mathrm{odeg}(i)}$$
($\mathrm{odeg}(i)$, the out-degree of node $i$, is the number of outgoing links), so that $\sum_j P_{ij} = 1$ ($P$ is row stochastic).
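The row normalization above can be sketched in a few lines of plain Python. The function name `transition_matrix` and the three-page example graph are illustrative assumptions, not part of the slides; a page with no out-links is simply left as a zero row here.

```python
def transition_matrix(A):
    """Row-normalize a 0/1 adjacency matrix given as a list of lists."""
    P = []
    for row in A:
        odeg = sum(row)  # out-degree of this node
        if odeg == 0:
            # Assumption: a page without out-links keeps a zero row.
            P.append([0.0] * len(row))
        else:
            P.append([a / odeg for a in row])
    return P


# Hypothetical 3-page graph: 0 -> 1, 0 -> 2, 1 -> 0, 2 -> 1.
A = [
    [0, 1, 1],
    [1, 0, 0],
    [0, 1, 0],
]
P = transition_matrix(A)
row_sums = [sum(row) for row in P]
```

Every page in this example has at least one out-link, so each entry of `row_sums` equals 1.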
Random Surfer Model
A surfer travels along the directed graph $G$. $P_{ij}$, $j = 1, \dots, n$, is the probability that the surfer moves from node $i$ to node $j$. If at step $k$ the probability of the surfer being located at node $i$ is $p^{(k)}_i$, so that $p^{(k)} = \left(p^{(k)}_1, \dots, p^{(k)}_n\right)$, then
$$p^{(k+1)} = P^T p^{(k)}.$$
$p^{(k+1)}$ is a probability distribution!
$q = P^T p$
If $p = (p_1, \dots, p_n)$, $p_i \ge 0$, and $q = (q_1, \dots, q_n)$, $q = P^T p$, then $q_i \ge 0$, and $\sum p_i = 1 \Rightarrow \sum q_i = 1$. Indeed,
$$\sum_{i=1}^n q_i = \sum_{j=1}^n \left( \sum_{i=1}^n P_{ij}\, p_i \right) = \sum_{i=1}^n p_i \sum_{j=1}^n P_{ij} = \sum_{i=1}^n p_i = 1.$$
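A small numerical illustration of the step $q = P^T p$ may help. The helper `step` and the example matrix are assumptions made for this sketch; the matrix is row stochastic, so each iterate should remain a probability distribution.

```python
def step(P, p):
    """One move of the surfer: returns q = P^T p, i.e. q_j = sum_i P_ij p_i."""
    n = len(p)
    return [sum(P[i][j] * p[i] for i in range(n)) for j in range(n)]


# Hypothetical row-stochastic transition matrix on three pages.
P = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]
p = [1.0, 0.0, 0.0]  # the surfer starts at page 0
history = [p]
for _ in range(3):
    p = step(P, p)
    history.append(p)
```

Each vector in `history` has non-negative entries that sum to 1, as the argument above predicts.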
Dangling Pages
Pages that have no outgoing links are called dangling pages, sinks, or attractors. With dangling pages, the transition matrix $P$ has zero rows and fails to be stochastic.
PageRank
Definition. A PageRank vector is a non-negative stationary point of the transformation $q = P^T p$ (a stationary distribution for a Markov chain).
What can be done in the presence of dangling pages?
What can be done?
- remove the dangling pages,
- renormalize $P^T p^{(k+1)}$,
- add a self-link to each dangling page,
- introduce an ideal page with a self-link, to which each dangling page links,
- modify the matrix $P$ by introducing artificial links that uniformly connect dangling pages to all pages ($P' = P + d v^T$).
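PageRank
$$v = \left(\tfrac{1}{n}, \dots, \tfrac{1}{n}\right)^T, \qquad d = \left(\delta(\mathrm{odeg}(1), 0), \dots, \delta(\mathrm{odeg}(n), 0)\right)^T,$$
where $\delta$ is the Kronecker delta (so $d_i = 1$ exactly when page $i$ is dangling), and $e = (1, \dots, 1)^T$.
Consider
$$P'' = c\left[P + d v^T\right] + (1 - c)\, e v^T.$$
Then
$$y = P''^T x = c P^T x + c\, v \left(d^T x\right) + (1 - c)\, v \left(e^T x\right).$$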
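For a tiny graph the modified matrix can be formed explicitly, which makes it easy to see that it is row stochastic even when $P$ has a zero row. This is only a sketch: the function name `modified_matrix`, the example matrix, and the value $c = 0.85$ are assumptions, since the slides do not fix a value for $c$.

```python
def modified_matrix(P, c):
    """Dense form of c * (P + d v^T) + (1 - c) * e v^T with uniform v."""
    n = len(P)
    v = [1.0 / n] * n
    # d_i = 1 for a dangling page (zero row of P), 0 otherwise.
    d = [1.0 if sum(row) == 0 else 0.0 for row in P]
    return [
        [c * (P[i][j] + d[i] * v[j]) + (1.0 - c) * v[j] for j in range(n)]
        for i in range(n)
    ]


# Hypothetical example in which page 2 is dangling.
P = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
M = modified_matrix(P, c=0.85)  # c = 0.85 is an assumed value
row_sums = [sum(row) for row in M]
```

The row of the dangling page becomes uniform (each entry equals 1/3), and every row of `M` sums to 1.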
PageRank computation
Let $x$ be a vector in $R^n$ with non-negative entries, and let $P = (P_{ij})$ be an $n \times n$ matrix with non-negative entries such that either $\sum_j P_{ij} = 1$ or $\sum_j P_{ij} = 0$. Let $d \in R^n$ be such that $d_i = \delta(\mathrm{odeg}(i), 0)$. Then
$$\|P^T x\| = \|x\| - d^T x$$
(where $\|y\| = \|y\|_1 = |y_1| + \dots + |y_n|$).
PageRank computation
$$P^T x = \begin{pmatrix} P_{11} & P_{21} & \cdots & P_{n1} \\ P_{12} & P_{22} & \cdots & P_{n2} \\ \vdots & \vdots & & \vdots \\ P_{1n} & P_{2n} & \cdots & P_{nn} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{pmatrix} = x_1 \begin{pmatrix} P_{11} \\ P_{12} \\ \vdots \\ P_{1n} \end{pmatrix} + x_2 \begin{pmatrix} P_{21} \\ P_{22} \\ \vdots \\ P_{2n} \end{pmatrix} + \cdots + x_n \begin{pmatrix} P_{n1} \\ P_{n2} \\ \vdots \\ P_{nn} \end{pmatrix}.$$
Hence
$$\|P^T x\| = x_1 \sum_j P_{1j} + x_2 \sum_j P_{2j} + \cdots + x_n \sum_j P_{nj}.$$
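PageRank computation
$$\|P^T x\| = x_1 \Big( \sum_j P_{1j} \Big) + \cdots + x_n \Big( \sum_j P_{nj} \Big),$$
$$\|x\| - d^T x = x_1 + \cdots + x_n - \big[ \delta(\mathrm{odeg}(1), 0)\, x_1 + \cdots + \delta(\mathrm{odeg}(n), 0)\, x_n \big].$$
The two right-hand sides agree because $\sum_j P_{ij} = 1 - \delta(\mathrm{odeg}(i), 0)$ for every $i$.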
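The identity $\|P^T x\| = \|x\| - d^T x$ is easy to check numerically. The matrix and vector below are made-up values chosen for illustration (page 2 is dangling, and $x$ is non-negative).

```python
# Hypothetical transition matrix with one zero row (page 2 is dangling).
P = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
x = [0.2, 0.3, 0.5]  # any non-negative vector works
n = len(x)

# (P^T x)_j = sum_i P_ij x_i
PTx = [sum(P[i][j] * x[i] for i in range(n)) for j in range(n)]
# d_i = 1 for dangling pages, 0 otherwise
d = [1.0 if sum(row) == 0 else 0.0 for row in P]

lhs = sum(abs(t) for t in PTx)                                  # ||P^T x||_1
rhs = sum(abs(t) for t in x) - sum(d[i] * x[i] for i in range(n))  # ||x||_1 - d^T x
```

Both `lhs` and `rhs` evaluate to 0.5 here: the mass 0.5 sitting on the dangling page is exactly what $P^T x$ loses.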
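PageRank
With $P'' = c\left[P + d v^T\right] + (1 - c)\, e v^T$,
$$y = P''^T x = c P^T x + \underbrace{c\, v \left(d^T x\right) + (1 - c)\, v \left(e^T x\right)}_{v \left( \|x\| - c \left( \|x\| - d^T x \right) \right) \;=\; v \left( \|x\| - \|c P^T x\| \right)}.$$
Hence $y$ can be computed as follows:
1. $y \leftarrow c P^T x$,
2. $\gamma = \|x\| - \|y\|$,
3. $y \leftarrow y + \gamma v$.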
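The three steps can be wrapped in a simple fixed-point iteration. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function name `pagerank`, the value $c = 0.85$, the uniform starting vector, and the 1-norm stopping rule are all choices made for this example and do not come from the slides.

```python
def pagerank(P, c=0.85, tol=1e-12, max_iter=1000):
    """Iterate the three-step update; P may contain zero rows (dangling pages)."""
    n = len(P)
    v = [1.0 / n] * n  # uniform vector v
    x = v[:]           # assumed starting distribution
    for _ in range(max_iter):
        # Step 1: y <- c P^T x
        y = [c * sum(P[i][j] * x[i] for i in range(n)) for j in range(n)]
        # Step 2: gamma = ||x|| - ||y|| (all entries are non-negative)
        gamma = sum(x) - sum(y)
        # Step 3: y <- y + gamma v
        y = [y[j] + gamma * v[j] for j in range(n)]
        # Assumed stopping criterion: small change in the 1-norm.
        if sum(abs(y[j] - x[j]) for j in range(n)) < tol:
            return y
        x = y
    return x


# Hypothetical example in which page 2 is dangling.
P = [
    [0.0, 0.5, 0.5],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0],
]
rank = pagerank(P)
```

The point of the three-step form is that the dense matrix $P''$ is never built: only products with the (typically sparse) matrix $P$ are needed, and the dangling-page and uniform-jump terms are folded into the single scalar $\gamma$.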
Hyperlink Induced Topic Search (HITS)
- works with a subgraph specific to a particular query (rather than with the full graph),
- computes two weights (authority and hub) for each web page,
- allows clustering of results for multi-topic or polarized queries.
Root and Focused Sets
Root set: the top $t$ (around 200) results are recalled for a given query (the results are picked according to a text-based relevance criterion).
Focused set: all pages pointed to by out-links of the root set are added, along with up to $d$ (about 50) pages corresponding to in-links of each page in the root set.
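The expansion from the root set to the focused set could be sketched as follows. The function name `focused_set`, the edge-list representation, and the rule for choosing which in-linking pages to keep (the first $d$ in edge order) are assumptions; the slides do not say how those pages are selected.

```python
def focused_set(root, edges, d=50):
    """Expand a root set using out-links and at most d in-links per root page.

    edges is a list of (source, target) pairs.
    """
    pages = set(root)
    for p in root:
        # Add every page that p points to.
        pages.update(target for (source, target) in edges if source == p)
        # Add at most d pages that point to p (assumption: first d in edge order).
        inlinks = [source for (source, target) in edges if target == p]
        pages.update(inlinks[:d])
    return pages


# Hypothetical link structure: pages 3, 4, and 5 all point to page 1.
edges = [(1, 2), (3, 1), (4, 1), (5, 1), (2, 6)]
result = focused_set([1], edges, d=2)
```

With `d=2`, only two of the three pages linking to page 1 are kept, so `result` is `{1, 2, 3, 4}`.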
Hubs and Authorities
Define authorities and hubs as follows:
1. a page $p$ is an authority if it is pointed to by many pages,
2. a page $p$ is a hub if it points to many pages.
To measure the authority and hub quality of the pages we consider $L_2$ unit-norm vectors $a$ and $h$ of dimension $|V|$, so that $a[p]$ is the authority weight and $h[p]$ is the hub weight of the page $p$.
Hubs and Authorities
The following iterative process computes the vectors $a$ and $h$.
1. set $t = 0$,
2. assign initial values $a^{(t)}$ and $h^{(t)}$,
3. normalize the vectors $a^{(t)}$ and $h^{(t)}$, so that $\sum_p \left(a^{(t)}[p]\right)^2 = \sum_p \left(h^{(t)}[p]\right)^2 = 1$,
4. set $a^{(t+1)}[p] = \sum_{q \to p} h^{(t)}[q]$ and $h^{(t+1)}[p] = \sum_{p \to q} a^{(t+1)}[q]$,
5. if (stopping criterion fails) then increment $t$ by 1, goto Step 3, else stop.
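The iteration can be sketched directly on an edge list. The names `hits` and `normalize`, the all-ones initial values, and the stopping rule (largest change in an authority weight below a tolerance) are assumptions for this sketch, since the slides leave the initial values and the stopping criterion open.

```python
import math


def normalize(w):
    """Scale a dict of weights to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in w.values()))
    return {p: x / norm for p, x in w.items()}


def hits(pages, edges, tol=1e-10, max_iter=1000):
    """edges is a list of (q, p) pairs meaning that page q links to page p."""
    # Steps 1-3: assumed initial values (all ones), then normalize.
    a = normalize({p: 1.0 for p in pages})
    h = normalize({p: 1.0 for p in pages})
    for _ in range(max_iter):
        # Step 4: authority = sum of hub weights of pages linking in,
        # hub = sum of the new authority weights of pages linked to.
        new_a = {p: sum(h[q] for (q, r) in edges if r == p) for p in pages}
        new_h = {p: sum(new_a[r] for (q, r) in edges if q == p) for p in pages}
        # Step 3 of the next pass: normalize.
        new_a, new_h = normalize(new_a), normalize(new_h)
        # Step 5: assumed stopping criterion.
        change = max(abs(new_a[p] - a[p]) for p in pages)
        a, h = new_a, new_h
        if change < tol:
            break
    return a, h


# Hypothetical graph: pages 1 and 2 point to 3, and page 1 also points to 4.
pages = [1, 2, 3, 4]
edges = [(1, 3), (2, 3), (1, 4)]
a, h = hits(pages, edges)
```

In this example page 3 receives the largest authority weight and page 1 the largest hub weight, which matches the informal definitions of authorities and hubs.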
Adjacency Matrix
Let $A$ be the adjacency matrix of the graph $G$, i.e.
$$A_{ij} = \begin{cases} 1 & \text{if page } i \to j, \\ 0 & \text{otherwise.} \end{cases}$$
Note that
$$a^{(t+1)} = \frac{A^T h^{(t)}}{\|A^T h^{(t)}\|}, \quad \text{and} \quad h^{(t+1)} = \frac{A a^{(t+1)}}{\|A a^{(t+1)}\|}.$$
This yields
$$a^{(t+1)} = \frac{A^T A\, a^{(t)}}{\|A^T A\, a^{(t)}\|}, \quad \text{and} \quad h^{(t+1)} = \frac{A A^T h^{(t)}}{\|A A^T h^{(t)}\|}.$$
Eigenvectors
$$a^{(t)} = \frac{\left(A^T A\right)^t a^{(0)}}{\left\| \left(A^T A\right)^t a^{(0)} \right\|}, \quad \text{and} \quad h^{(t)} = \frac{\left(A A^T\right)^t h^{(0)}}{\left\| \left(A A^T\right)^t h^{(0)} \right\|}.$$
Let $v$ and $w$ be unit eigenvectors corresponding to the maximal eigenvalues of the symmetric matrices $A^T A$ and $A A^T$, respectively. The above arguments lead to the following result:
$$\lim_{t \to \infty} a^{(t)} = v, \qquad \lim_{t \to \infty} h^{(t)} = w.$$
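The limit statement can be checked numerically with a plain power iteration on $A^T A$. The adjacency matrix below is a made-up example, and the helper names `matvec` and `unit` are assumptions. The all-ones starting vector has a positive component along the dominant eigenvector, which the convergence argument needs.

```python
import math


def matvec(M, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]


def unit(x):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(t * t for t in x))
    return [t / norm for t in x]


# Hypothetical graph on 4 pages: 0 -> 2, 0 -> 3, 1 -> 2.
A = [
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
n = len(A)
# (A^T A)_ij = sum_k A_ki A_kj
AtA = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

a = unit([1.0] * n)
for _ in range(100):
    a = unit(matvec(AtA, a))

Ma = matvec(AtA, a)
lam = sum(a[i] * Ma[i] for i in range(n))  # Rayleigh quotient a^T (A^T A) a
residual = max(abs(Ma[i] - lam * a[i]) for i in range(n))
```

For this graph the nonzero block of $A^T A$ is $\begin{pmatrix} 2 & 1 \\ 1 & 1 \end{pmatrix}$, whose largest eigenvalue is $(3 + \sqrt{5})/2$. After the iteration `lam` agrees with that value and `residual` is close to zero, so `a` is (numerically) the unit eigenvector $v$.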