Page Rank Algorithm. May 12, Abstract


Page Rank Algorithm

Catherine Benincasa, Adena Calden, Emily Hanlon, Matthew Kindzerske, Kody Law, Eddery Lam, John Rhoades, Ishani Roy, Michael Satz, Eric Valentine and Nathaniel Whitaker
Department of Mathematics and Statistics
University of Massachusetts, Amherst

May 12, 2006

Abstract

PageRank is the algorithm used by the Google search engine, originally formulated by Sergey Brin and Larry Page in their paper The Anatomy of a Large-Scale Hypertextual Web Search Engine. It is based on the premise, prevalent in the world of academia, that the importance of a research paper can be judged by the number of citations the paper has from other research papers. Brin and Page have simply transferred this premise to its web equivalent: the importance of a web page can be judged by the number of hyperlinks pointing to it from other web pages.

1 Introduction

There are various methods of information retrieval (IR), such as Latent Semantic Indexing (LSI). LSI uses the singular value decomposition (SVD) of a term-by-document matrix to capture latent semantic associations. The LSI method can efficiently handle difficult query terms involving synonyms and polysemes. The SVD enables LSI to cluster documents and terms into concepts (e.g., car and automobile should belong to the same category). Unfortunately, computing and storing the SVD of the term-by-document matrix is costly. Secondly, there is an enormous number of documents on the web, and these documents are not subject to an editorial review process. The web therefore contains redundant documents, broken links, and poor-quality documents. Moreover, the web changes continuously as pages are modified, added, and deleted, so any index of it needs to be kept up to date. The final feature of an IR system, which has proven to be worthwhile, is the web's hyperlink structure. The PageRank algorithm introduced by Google effectively represents the link structure of the internet, assigning each page a credibility based on this structure. Our focus here will be on the analysis and implementation of this algorithm.

2 PageRank Algorithm

PageRank uses the hyperlink structure of the web to view inlinks into a page as a recommendation of that page from the author of the inlinking page. Since inlinks from good pages should carry more weight than inlinks from marginal pages, each webpage is assigned an appropriate rank score, which measures the importance of the page. The PageRank algorithm was formulated by Google founders Larry Page and Sergey Brin as a basis for their search engine. After webpages retrieved by robot crawlers are indexed and cataloged (which will be discussed in Section 3), PageRank values are assigned prior to query time according to perceived importance. The importance of each page is determined by the links to that page.
The importance of any page is increased by the number of sites which link to it. Thus the rank $r(P)$ of a given page $P$ is given by

\[ r(P) = \sum_{Q \in B_P} \frac{r(Q)}{|Q|} \tag{1} \]

where $B_P$ is the set of all pages pointing to $P$ and $|Q|$ is the number of outlinks from $Q$. The entries of the matrix $P$ are usually

\[ p_{i,j} = \begin{cases} \dfrac{1}{|P_i|} & \text{if } P_i \text{ links to } P_j; \\ 0 & \text{otherwise.} \end{cases} \]

(These weights can be distributed in a non-uniform fashion as well, which will be explored in the application section. For this particular application, a uniform distribution will suffice.) For theoretical and practical reasons, such as convergence and convergence rates, the matrix $P$ is adjusted. The raw Google matrix $P$ is nonnegative with row sums equal to one or zero. Zero row sums correspond to pages that have no outlinks; these are referred to as dangling nodes. We eliminate the dangling nodes using one of two techniques, so that the rows artificially sum to 1. $P$ is then a row-stochastic matrix, which in turn means that the PageRank iteration represents the evolution of a Markov chain.

2.1 Markov Model

[Figure 1]

Figure 1 is a simple example of the stationary distribution of a Markov model. This structure accurately represents the probability that a random surfer is at each of the three pages at any point in time. The Markov model represents the web's directed graph as a transition probability matrix $P$ whose element $p_{ij}$ is the probability of moving from page $i$ to page $j$ in one step (click). This is accomplished through a few steps. Step one is to create a binary adjacency matrix to represent the link structure:

\[ \begin{array}{c|ccc} & A & B & C \\ \hline A & 0 & 1 & 1 \\ B & 0 & 0 & 1 \\ C & 1 & 0 & 0 \end{array} \]

The second step is to transform this adjacency matrix into a probability matrix by normalizing each row so that it sums to 1:

\[ \begin{array}{c|ccc} & A & B & C \\ \hline A & 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ B & 0 & 0 & 1 \\ C & 1 & 0 & 0 \end{array} \]

This matrix is the unadjusted or raw Google matrix. The dominant eigenvalue of every stochastic matrix $P$ is $\lambda = 1$. Therefore, if the PageRank iteration converges, it converges to the normalized left eigenvector $v^T$ satisfying

\[ v^T = v^T P \tag{2} \]

where $v^T e = 1$, which is the stationary or steady-state distribution of the Markov chain. Thus Google intuitively characterizes the PageRank value of each site as the long-run proportion of time spent at the site by a web surfer eternally clicking on links at random. In this model we have not yet taken into account clicking back or entering URLs on the command line. In our basic example, we have

\[ (R(A) \;\; R(B) \;\; R(C)) \, A = (R(A) \;\; R(B) \;\; R(C)) \]

where $A$ is

\[ A = \begin{array}{c|ccc} & A & B & C \\ \hline A & 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ B & 0 & 0 & 1 \\ C & 1 & 0 & 0 \end{array} \]

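Written out componentwise, together with the normalization condition, this is the linear system

\[ R(A) = R(C) \]
\[ R(B) = \tfrac{1}{2} R(A) \]
\[ R(C) = \tfrac{1}{2} R(A) + R(B) \]
\[ R(A) + R(B) + R(C) = 1 \]

and the solution of this linear system is $R(A) = 0.4$, $R(B) = 0.2$, $R(C) = 0.4$, that is,

\[ (0.4 \;\; 0.2 \;\; 0.4) \, A = (0.4 \;\; 0.2 \;\; 0.4) \]

where $A$ is

\[ A = \begin{array}{c|ccc} & A & B & C \\ \hline A & 0 & \tfrac{1}{2} & \tfrac{1}{2} \\ B & 0 & 0 & 1 \\ C & 1 & 0 & 0 \end{array} \]

Let us consider a larger network, represented by Figure 2.

[Figure 2]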
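The two steps of the Markov model (binary adjacency matrix, then row normalization) and the stationarity condition of equation (2) can be checked with a short sketch. The link structure used here (A links to B and C, B links to C, C links to A) is an assumption inferred from the three rank equations above, since Figure 1 is not reproduced; the function names are illustrative only.

```python
from fractions import Fraction


def row_normalize(adjacency):
    """Turn a 0/1 adjacency matrix into a row-stochastic matrix.

    Rows that are all zero (pages without outlinks) are left as zeros.
    """
    stochastic = []
    for row in adjacency:
        outlinks = sum(row)
        if outlinks == 0:
            stochastic.append([Fraction(0)] * len(row))
        else:
            stochastic.append([Fraction(a, outlinks) for a in row])
    return stochastic


def left_multiply(v, matrix):
    """Compute the row vector v^T * matrix."""
    n = len(matrix)
    return [sum(v[i] * matrix[i][j] for i in range(n)) for j in range(n)]


# Assumed link structure: A -> B, A -> C, B -> C, C -> A.
ADJACENCY = [
    [0, 1, 1],  # outlinks of A
    [0, 0, 1],  # outlinks of B
    [1, 0, 0],  # outlinks of C
]

P = row_normalize(ADJACENCY)

# Candidate solution R(A), R(B), R(C) of the linear system, as exact fractions.
ranks = [Fraction(2, 5), Fraction(1, 5), Fraction(2, 5)]

# The candidate sums to 1 and is unchanged by one step of the chain.
print(sum(ranks) == 1)
print(left_multiply(ranks, P) == ranks)
```

Exact fractions are used so that the check of $v^T = v^T P$ is not blurred by floating-point rounding.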

This network has 8 nodes, and therefore the corresponding adjacency matrix has size 8 x 8, as shown in Figure 3.

[Figure 3]

Again, we can transform it into a stochastic matrix, and the result is the following:

[stochastic matrix for the 8-node network]

2.1.1 Generalization

Before going into the logistics of calculating this PageRank vector, we generalize to an $N$-dimensional system. Let $A_i$ be the binary vector of outlinks from page $i$,

\[ A_i = (a_{i1}, a_{i2}, \ldots, a_{iN}) \]

and

\[ \|A_i\|_1 = \sum_{j=1}^{N} a_{ij} \tag{3} \]

\[ P = \begin{pmatrix} A_1 / \|A_1\|_1 \\ A_2 / \|A_2\|_1 \\ \vdots \\ A_N / \|A_N\|_1 \end{pmatrix} = \begin{pmatrix} p_{11} & \cdots & p_{1N} \\ \vdots & & \vdots \\ p_{N1} & \cdots & p_{NN} \end{pmatrix} \]

With $P_i = (p_{i1}, p_{i2}, \ldots, p_{iN})$ we have

\[ \|P_i\|_1 = \sum_{j=1}^{N} p_{ij} = 1 \tag{4} \]

We now have a row-stochastic probability matrix, unless, of course, a page (node) points to no others: $A_i = P_i = 0$. Now let $w_i = \frac{1}{N}$ for $i = 1, \ldots, N$, so that $w^T = (\frac{1}{N}, \ldots, \frac{1}{N})$. Furthermore, let

\[ d_i = \begin{cases} 0 & \text{if } i \text{ is not a dead end;} \\ 1 & \text{if } i \text{ is a dead end.} \end{cases} \]

So

\[ W = d \, w^T, \qquad S = W + P \]

and $S$ is a stochastic matrix. It should be noted that there is more than one way to deal with dead ends, such as removing them altogether or adding an extra link which points to all the others (a so-called master node). We explore qualitatively the effects these methods have in the results analysis section. (See Figure 10 for a dead end.)

2.2 Computing PageRank

The computation of PageRank is essentially an eigenvector problem, namely solving the linear system

\[ v^T (I - P) = 0 \tag{5} \]

with $v^T e = 1$. There are several methods which can be utilized in this calculation. Provided our matrix is irreducible, we are able to utilize the power method.
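The dead-end correction $S = P + d\,w^T$ and the power iteration $x^T \leftarrow x^T S$ can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the tolerance, and the second test matrix (a variant of the three-page example in which page C has no outlinks) are assumptions made for the example. The three-page chain itself is irreducible and aperiodic, so the plain iteration converges on it.

```python
def fix_dead_ends(P):
    """Return S = P + d w^T: each all-zero row of P becomes w^T = (1/N, ..., 1/N)."""
    n = len(P)
    uniform = [1.0 / n] * n
    return [list(row) if sum(row) > 0 else list(uniform) for row in P]


def power_method(S, tolerance=1e-12, max_iterations=10_000):
    """Iterate x^T <- x^T S from the uniform vector until the 1-norm change is small."""
    n = len(S)
    x = [1.0 / n] * n
    for _ in range(max_iterations):
        x_next = [sum(x[i] * S[i][j] for i in range(n)) for j in range(n)]
        change = sum(abs(a - b) for a, b in zip(x, x_next))
        x = x_next
        if change < tolerance:
            break
    return x


# Three-page example: A -> B, A -> C, B -> C, C -> A (row-stochastic).
P_EXAMPLE = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]

# Hypothetical variant in which page C is a dead end (its row is all zeros).
P_DEAD_END = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [0.0, 0.0, 0.0],
]

S = fix_dead_ends(P_DEAD_END)
print([sum(row) for row in S])  # every row of S now sums to 1
print([round(r, 6) for r in power_method(P_EXAMPLE)])
```

Replacing a zero row by the uniform vector is only one of the treatments of dead ends mentioned above; removing the node or adding a master node would need different code.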

2.2.1 Power Method

We are interested in the convergence of the iteration $x_m^T G = x_{m+1}^T$. For convenience we convert this expression to $G^T x_m = x_{m+1}$. The eigenvalues of $G^T$ satisfy $1 = \lambda_1 > |\lambda_2| \geq \cdots \geq |\lambda_n|$. Let $v_1, \ldots, v_n$ be the corresponding eigenvectors. Let $x_0 \in \mathbb{R}^n$ be such that $\|x_0\|_1 = 1$, so that for some $a \in \mathbb{R}^n$

\[ x_0 = \sum_{i=1}^{n} a_i v_i . \]

Then

\[ G^T x_0 = \sum_{i=1}^{n} a_i G^T v_i = \sum_{i=1}^{n} a_i \lambda_i v_i = a_1 v_1 + \sum_{i=2}^{n} a_i \lambda_i v_i = x_1 \]

\[ G^T x_1 = a_1 v_1 + \sum_{i=2}^{n} a_i \lambda_i^{2} v_i = x_2 \]

\[ G^T x_m = a_1 v_1 + \sum_{i=2}^{n} a_i \lambda_i^{m+1} v_i = x_{m+1} \]

so

\[ \lim_{m \to \infty} G^T x_m = a_1 v_1 = \pi \]

(the stationary state of the Markov chain).

2.3 Irreducibility and Convergence of Markov Chain

A difficulty that arises in computation is that $S$ can be a reducible matrix when the underlying chain is reducible. Reducible chains are those that contain sets of states in which the chain eventually becomes trapped. For example, if webpage $S_i$ contains only a link to $S_j$, and $S_j$ contains only a link to $S_i$, then a random surfer who hits either $S_i$ or $S_j$ is trapped into bouncing between the two pages endlessly, which is the essence of reducibility. The definition of irreducibility is the following: for each pair $i, j$, there exists an $m$ such that $(S^m)_{ij} \neq 0$. In the case of an undirected graph, reducibility is equivalent to the graph splitting into disjoint, non-empty subsets (see Figure 11). The subsets can be ranked separately; however, the issue of meshing these rankings back together in a meaningful way still remains.

Sink

So far we have been dealing with a directed graph; however, we also have to be concerned with the elusive sink. (Figures 16 and 17 are missing.) A Markov chain in

which every state is eventually reachable from every other state is guaranteed to possess a unique positive stationary distribution by the Perron-Frobenius theorem. Hence the raw Google matrix $P$ is first modified to produce a stochastic matrix $S$. Due to the structure of the World Wide Web and the nature of gathering the web structure, such as our breadth-first method (which will be explained in the section on implementation), a stochastic matrix is almost certainly reducible. One way to force irreducibility is to displace the stochastic matrix $S$ using a scalar $\alpha$ between 0 and 1. In our computation we choose $\alpha$ to be 0.85. For $\alpha$ between 0 and 1, consider the following:

\[ R(u) = \alpha \sum_{v \in B_u} \frac{R(v)}{n_v} + (1 - \alpha) \]

where $B_u$ is the set of pages linking to $u$, $n_v$ is the number of outlinks from $v$, and $\alpha = 0.85$. Then the new stochastic matrix $G$ becomes

\[ G = \alpha S + (1 - \alpha) D \tag{6} \]

where

\[ D = e \, w^T, \qquad e = \langle 1, 1, \ldots, 1, 1 \rangle^T, \qquad w^T = \left\langle \tfrac{1}{N}, \tfrac{1}{N}, \ldots, \tfrac{1}{N} \right\rangle \]

Again, it should be noted that $w^T$ can be any unit vector in the 1-norm with nonnegative entries, that is, any probability vector. In our basic example, this amounts to

\[ 0.85 \, A + 0.15 \, B = C \]

where $A$ is our usual 3 x 3 stochastic matrix, $B$ is a 3 x 3 matrix with $\frac{1}{3}$ in every entry, and $C$ is

\[ C = \begin{pmatrix} 0.05 & 0.475 & 0.475 \\ 0.05 & 0.05 & 0.9 \\ 0.9 & 0.05 & 0.05 \end{pmatrix} \]

This method allows for additional accuracy in our particular model, since it accounts for the possibility of arriving at a particular page by means other

than via a link. This certainly occurs in reality, and hence this method improves the accuracy of our model, as well as providing us with the irreducibility we need and, as we will see, improving the rate of convergence of the power method.

3 Data Management

Up to this point, we have assumed that we are always able to discover the networks or websites that contain the information we search for. However, careful readers may notice that we have not really discussed how to determine the structure of the network. In this section, we turn our attention to more technical features. How are we going to determine the structure of our networks? Furthermore, supposing we are able to come up with the list of websites, is there any way to find the ranks more efficiently and economically?

3.1 Breadth First Search

The Breadth First Search method is our main approach to identifying the structure of a network, and its algorithm is the following. Let us begin with one single node (webpage) in our network and assign it the number 1, as in Figure a.

[Figure a]

This node links to several nodes, and we assign each of these nodes a number, as in Figure b.

[Figure b]

From Figure b, we observe that there is one node linked to node 2, so we assign this node the next number. Then we switch to node 3, assigning a number to each unnumbered node connected to node 3, and so on. Figure c gives the final result.
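The numbering procedure described above can be sketched with a queue. This is a minimal illustration under stated assumptions: the link table `LINKS` is a hypothetical five-page site (the networks of Figures a to c are not reproduced), links are followed in the outlink direction, and the function names are invented for the example.

```python
from collections import deque


def breadth_first_numbering(links, start):
    """Number pages in the order breadth-first search discovers them, starting at 1."""
    number = {start: 1}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in number:
                number[target] = len(number) + 1
                queue.append(target)
    return number


def adjacency_matrix(links, number):
    """Build the binary adjacency matrix, with rows and columns ordered by page number."""
    n = len(number)
    matrix = [[0] * n for _ in range(n)]
    for page, targets in links.items():
        if page not in number:
            continue  # page was never reached from the start node
        for target in targets:
            if target in number:
                matrix[number[page] - 1][number[target] - 1] = 1
    return matrix


# Hypothetical site: each key links to the pages in its list.
LINKS = {
    "home": ["about", "news"],
    "about": ["contact"],
    "news": ["home", "archive"],
    "contact": [],
    "archive": ["news"],
}

numbering = breadth_first_numbering(LINKS, "home")
matrix = adjacency_matrix(LINKS, numbering)
print(numbering)
for row in matrix:
    print(row)
```

All neighbors of page 1 receive their numbers before any neighbor of page 2 is examined, which is the breadth-first order. The "contact" page has no outlinks, so its row in the matrix is all zeros.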

[Figure c]

As you can see, by using the Breadth First Search method, we are able to complete the graph structure, and therefore we are able to create our adjacency matrix.

3.2 Sparse Matrix

We are now able to form our adjacency matrix, knowing the structure of the network through the Breadth First Search method. In reality, however, the network contains millions or even billions of pages, and these matrices will be huge. If we apply our power method directly to these matrices, even with the fastest computer in the world it will take a long time to compute the dominant eigenvector. Therefore, it is economical to develop ways to reduce the storage these matrices need without affecting the ranking of the pages. In this paper, the sparse matrix format and Compressed Row Storage (CRS) are the methods we use to accelerate the calculation. First, consider the following network:

[Figure d]

Link text formats this information into a from-to file, represented by the table next to the network. Then Sparse PR reads in a from-to file and computes the ranks. It outputs the pages in order of rank. Figure e shows the result for our sample.

[Figure e]

The sparse matrix format allows us to use less memory without compromising the final ranking. The full matrix format requires $N^2 + 2N$ memory locations (where $N$ is the number of nodes); for 60k nodes this is about 50 Gbytes of RAM. The sparse format requires $3N + 2L$ locations (where $L$ is the number of links); for 60k nodes and 1.6M links this is about 50 Mbytes of RAM. The sparse format therefore uses far less memory than the full matrix in the computation.

3.3 Compressed Row Vectors

In this section we develop a method to accelerate the matrix multiplication. We compress the row vectors, since we already know which nodes each node points to. Compressed Row Storage (CRS) requires two vectors of size $L$ (the number of links) and one of size $N$ (the number of nodes). Consider the following example, where we have 3 nodes and 6 links. First, we construct a column vector aa of size $L$. This vector holds the non-zero entries in reading order. Second, we construct a column vector ja

of size $L$. This vector holds the column indices of the non-zero entries. Finally, we create the ia vector of size $N$. This is a cumulative count of non-zero entries by row. For example, the first row has two non-zero entries, therefore the first element of the ia vector is 2. The second row has one non-zero entry, therefore the second element of this vector is 3, and so on.

[Figure f: CRS vectors]

CRS storage allows us to carry out the matrix-vector multiplication in the following concise form. The loop is written in the equivalent row-pointer convention, in which ia(i) is the position in aa of the first non-zero entry of row i and ia(N+1) = L + 1.

    // for each row in original matrix
    for i = 1 to N
        // for each nonzero entry in that row
        for j = ia(i) to ia(i+1) - 1
            // multiply that entry by the corresponding
            // entry in vector; accumulate in result
            result(i) = result(i) + aa(j) * vector(ja(j))

CRS is efficient, since we need only $L$ additions and $L$ multiplications, instead of roughly $N^2$ additions and $N^2$ multiplications. Now we can apply the power method and carry out those tedious matrix multiplications and additions in a more efficient way.
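As a runnable counterpart to the pseudocode, the sketch below builds the three CRS vectors from a dense matrix and uses them in a damped power iteration based on equation (6). Several choices here are assumptions rather than the authors' implementation: indices are 0-based, `ia` is stored as a row-pointer vector of length N + 1, the transpose of $S$ is stored so that the matrix-times-column-vector loop computes $S^T x$, $w$ is taken to be uniform, and the function names are invented for the example.

```python
def to_crs(dense):
    """Compress a dense matrix into (aa, ja, ia) with 0-based row pointers."""
    aa, ja, ia = [], [], [0]
    for row in dense:
        for column, value in enumerate(row):
            if value != 0:
                aa.append(value)   # non-zero entries in reading order
                ja.append(column)  # column index of each non-zero entry
        ia.append(len(aa))         # cumulative count of non-zero entries
    return aa, ja, ia


def crs_matvec(aa, ja, ia, vector):
    """Multiply the CRS matrix by a column vector using one pass over the non-zeros."""
    result = [0.0] * (len(ia) - 1)
    for i in range(len(ia) - 1):
        for j in range(ia[i], ia[i + 1]):
            result[i] += aa[j] * vector[ja[j]]
    return result


def damped_power_method(crs_of_s_transposed, n, alpha=0.85,
                        tolerance=1e-12, max_iterations=1000):
    """Iterate x <- alpha * S^T x + (1 - alpha) / n, i.e. x^T <- x^T G with uniform w."""
    aa, ja, ia = crs_of_s_transposed
    x = [1.0 / n] * n
    for _ in range(max_iterations):
        linked = crs_matvec(aa, ja, ia, x)
        x_next = [alpha * value + (1.0 - alpha) / n for value in linked]
        change = sum(abs(a - b) for a, b in zip(x, x_next))
        x = x_next
        if change < tolerance:
            break
    return x


# Three-page example S (A -> B, A -> C, B -> C, C -> A), stored transposed.
S = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
]
S_TRANSPOSED = [list(column) for column in zip(*S)]

crs = to_crs(S_TRANSPOSED)
ranks = damped_power_method(crs, n=3)
print(crs)
print([round(r, 4) for r in ranks])
```

Because $x$ always sums to 1, the term $(1 - \alpha)\, x^T e\, w^T$ reduces to adding $(1 - \alpha)/N$ to every component, so the dense matrix $D$ never has to be formed.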

4 Results

To apply the PageRank method, an adjacency matrix representing a directed graph is needed. The conventional use of PageRank is to rank a subset of the internet. A program called a webcrawler must be employed to crawl a desired domain and map its structure (i.e., its links). A simple approach to this problem is the breadth-first search technique. This technique involves starting at a particular node, say node 1, and discovering all of node 1's neighbors before beginning to search for the neighbors of node 1's first discovered neighbor. Figure 4 demonstrates this graphically. This technique can be contrasted with depth-first search, which starts on a path and continues until the path ends before beginning a second unique path. Breadth-first search is much more appropriate for webcrawlers because it is much more likely that close neighbors will not be excluded during a lengthy crawl.

[Figure 4]

A crawl in January of 2006 focused on the umass.edu domain and yielded an adjacency matrix of size 60,513 x 60,513. The PageRank method was implemented in conjunction with the CRS scheme to minimize the resources required. A final ranking was obtained, and a sample can be seen in Figure 5. Notice that the first and sixth ranked websites are the same. This is because the webcrawler did not differentiate between different aliases of a URL. This paper presents one possible way of ranking. However, it is clear that the matrices Google deals with are thousands of times larger than


the one we used. Therefore, it is safe to assume that Google has a more efficient way to compute and rank webpages. Furthermore, we have not introduced any method to confirm our results and algorithms. The results are easy to check if the network is small, but as networks get bigger and bigger, verifying them becomes extremely difficult. One potential solution to this problem is to simulate a web surfer and use a random number generator to choose which link is followed at each step. It would be interesting to see the result.

[Figure 5]

Another implementation can be applied to a network of airports, with flights representing directed edges. In this implementation, the notion of multilinking comes into play. More precisely, there may exist more than one flight from one airport to the next. In the internet application, the restriction was made to allow only one link from any particular node to another. Lifting this restriction requires only slight alterations to the working software to ensure a stochastic matrix. Figure 6 shows a sample of the results of a PageRank application on 1500 North American airports.

[Figure 6]
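One plausible way to handle multilinking, not necessarily the alteration the authors made, is to count the flights on each route and divide each row by its total, which keeps the matrix row-stochastic. The airport codes and flight list below are hypothetical illustration data, and the function name is invented for the example.

```python
from collections import Counter


def multilink_transition_matrix(flights, airports):
    """Row-stochastic matrix in which each entry is the share of departures on that route."""
    index = {airport: i for i, airport in enumerate(airports)}
    n = len(airports)
    matrix = [[0.0] * n for _ in range(n)]
    # Count repeated (origin, destination) pairs instead of collapsing them to one link.
    for (origin, destination), count in Counter(flights).items():
        matrix[index[origin]][index[destination]] += count
    # Normalize every row that has at least one departure.
    for row in matrix:
        total = sum(row)
        if total > 0:
            for j in range(n):
                row[j] /= total
    return matrix


# Hypothetical data: three flights BOS -> JFK, one BOS -> ORD, and so on.
AIRPORTS = ["BOS", "JFK", "ORD"]
FLIGHTS = [
    ("BOS", "JFK"), ("BOS", "JFK"), ("BOS", "JFK"), ("BOS", "ORD"),
    ("JFK", "BOS"), ("JFK", "ORD"),
    ("ORD", "BOS"),
]

transition = multilink_transition_matrix(FLIGHTS, AIRPORTS)
for airport, row in zip(AIRPORTS, transition):
    print(airport, row)
```

With these counts, a random traveler leaving BOS is three times as likely to fly to JFK as to ORD, whereas the single-link restriction would treat the two routes equally.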

A more visible application may be in a sports tournament setting. The methods used for ranking collegiate football teams are annually a hot topic of debate. Currently, an average of seven ranking systems is used by the BCS to select which teams are accepted to the appropriate bowl or title games. Five of these models are computer based and are arguably a special case of PageRank.

5 Conclusion

This paper presents one possible way of ranking. However, it is clear that the matrices Google deals with are thousands of times larger than the one we used. Therefore, it is safe to assume that Google has a more efficient way to compute and rank webpages. Furthermore, we have not introduced any method to confirm our results and algorithms. The results are easy to check if the network is small, but as networks get bigger and bigger, verifying them becomes extremely difficult. One potential solution to this problem is to simulate a web surfer and use a random number generator to choose which link is followed at each step. It would be interesting to see the result.

References

[1] Amy N. Langville, Carl D. Meyer. A Survey of Eigenvector Methods for Web Information Retrieval. SIAM Review, Vol. 47, No. 1.
[2] S. Brin, L. Page, et al. The PageRank Citation Ranking: Bringing Order to the Web.
