Divide and Conquer Approach for Efficient PageRank Computation


Prasanna Desikan, Dept. of Computer Science, University of Minnesota, Minneapolis, MN, USA
Nishith Pathak, Dept. of Computer Science, University of Minnesota, Minneapolis, MN, USA
Jaideep Srivastava, Dept. of Computer Science, University of Minnesota, Minneapolis, MN, USA
Vipin Kumar, Dept. of Computer Science, University of Minnesota, Minneapolis, MN, USA

ABSTRACT

PageRank is a popular ranking metric for large graphs such as the World Wide Web. Current research techniques for improving the computational efficiency of PageRank have focussed on improving the I/O cost, accelerating convergence, and parallelizing the computation process. In this paper, we propose a divide and conquer strategy for efficient computation of PageRank. The strategy is different from contemporary improvements in that it can be combined with any existing enhancements to PageRank, giving way to an entire class of more efficient algorithms. We present a novel graph-partitioning technique for dividing the graph into subgraphs, on which computation can be performed independently. This approach has two significant benefits. Firstly, since the approach focuses on work reduction, it can be combined with any existing enhancements to PageRank. Secondly, the proposed approach leads naturally into developing an incremental approach for computation of such ranking metrics, given that these large graphs evolve over a period of time. The partitioning technique is both lossless and independent of the type (variant) of PageRank computation algorithm used. The experimental results for a static single graph (graph at a single time instance), as well as for the incremental computation in the case of evolving graphs, illustrate the utility of our novel partitioning approach. The proposed approach can also be applied to the computation of any other metric based on a first order Markov chain model.

Categories and Subject Descriptors
G.4 [Mathematical Software]: Efficiency, Algorithm Design and Analysis.

General Terms
Algorithms, Performance, Design, Theory

Keywords
PageRank, Efficient Computation, Ranking Measures, Graph Partitioning

Copyright is held by the author/owner(s). ICWE'06, July 11-14, 2006, Palo Alto, California, USA. ACM /06/0007

1. INTRODUCTION

Link analysis techniques have been used widely for developing ranking metrics in large graphs such as the Web. The principal observation is that a hyperlink from a source page to a destination page serves as an endorsement of the destination page by the (author of the) source page on some topic. Link based metrics for Web graphs have been found to provide stable rankings for Web search, avoiding issues related to text spamming. Information on various link based metrics, such as Kleinberg's HITS algorithm [2], is discussed in the survey [3].

Among the different link based metrics on the Web graph, the PageRank metric [1] has gained significant prominence with the success of Google. The primary key to its success has been the dependence of a page's rank on the pages pointing to it, thus reducing the chances of biasing the rank of a page for which the user is the creator. Secondly, PageRank is precomputed for the whole Web graph and is query independent, making it a faster approach to rank results during a search operation. The popularity and stability of PageRank has led to a variety of modifications to the underlying PageRank model addressing different scenarios such as topic sensitivity [4], usage analysis [5], and bias among different clusters [6].

The issue of efficient computation for PageRank has also captured the attention of the research community. Haveliwala proposed an efficient computation approach for PageRank [6] using a block-based technique with efficient I/O. Improvements to such I/O efficient methods [7, 8] and accelerated convergence [9, 10] for PageRank have also been well studied.
The various issues related to PageRank computation have been extensively covered in recent surveys [11, 12].

In this paper our contribution is twofold. Firstly, we propose work reduction techniques through graph partitioning to break the problem of computation on a large graph into computation on smaller subgraphs. The partitioning approach is not an approximation method and hence does not result in loss of information. Also, this approach of work reduction is complementary to other approaches, and hence can be used in tandem with other efficient computation methods to further improve the efficiency.

In the second part of our work, we address the issue of computation on an evolving Web graph. A straightforward approach would be to compute these measures for the whole Web graph at each time instance. However, given the size of the Web graph, this is becoming increasingly infeasible. Furthermore, if the percentage of vertices that change during a typical time interval between crawls by search engines is not high, a large portion of the computation cost may be wasted on re-computing the scores for the unchanged portion. Hence, there is a need for computing metrics incrementally, to save on the computation costs. Chien et al. [13] propose an approximation approach to compute PageRank incrementally. Our approach, in contrast, relies on a sound theoretical partitioning criterion that results in a lossless incremental computation of PageRank. Initial work on the incremental approach was presented earlier [14]. Our results indicate that we achieve significant improvement in terms of computation time for such an approach.

This paper is organized as follows. In the next section, we give an overview of the PageRank metric and its underlying model. In Section 3, we describe the theoretical framework of our proposed divide and conquer approach. Section 4 discusses the methodology for using this approach to compute PageRank for a large graph at a single time instance, and the extension of this approach to the computation of PageRank on large evolving graphs. Experiments and results supporting the approach are presented in Section 5. Section 6 provides conclusions of our approach and discusses possible future work.

2. PAGERANK OVERVIEW

PageRank is a metric for ranking hypertext documents that determines their quality. It was originally developed by Page et al. [1] for the popular search engine, Google [14]. The key idea is that a page has high rank if it is pointed to by many highly ranked pages. Thus, the rank of a page depends upon the ranks of the pages pointing to it. The rank of a page p can thus be written as:

$$R(p) = \frac{d}{n} + (1 - d) \sum_{(q,p) \in G} \frac{R(q)}{\mathrm{OutDegree}(q)} \qquad (1)$$

Here, n is the number of vertices in the graph and OutDegree(q) is the number of hyperlinks on page q. Intuitively, the approach can be viewed as a stochastic analysis of a random walk on the Web graph.
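As a rough illustration of Eq. (1), the following Python sketch iterates the update R(p) = d/n + (1 - d) * sum of R(q)/OutDegree(q) until the L1 change falls below a threshold. The function name `pagerank`, the edge-list input format, the stopping rule and `max_iter` are assumptions made for illustration and are not taken from the paper; the default d = 0.1 and threshold 1e-8 mirror the values reported in Section 5, and dangling vertices receive a self-loop as the paper chooses to do.

```python
def pagerank(edges, vertices, d=0.1, tol=1e-8, max_iter=1000):
    """Iterate Eq. (1): R(p) = d/n + (1 - d) * sum_{(q,p) in G} R(q) / OutDegree(q).

    `edges` is a list of (q, p) pairs meaning "page q links to page p".
    """
    vertices = list(vertices)
    n = len(vertices)
    out_links = {v: [] for v in vertices}
    for q, p in edges:
        out_links[q].append(p)
    # Dangling vertices (no outgoing edge) get a self-loop.
    for v in vertices:
        if not out_links[v]:
            out_links[v].append(v)

    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        # First term of Eq. (1): the random jump contributes d/n to every page.
        new_rank = {v: d / n for v in vertices}
        # Second term: each page q passes (1 - d) * R(q) / OutDegree(q) along each link.
        for q in vertices:
            share = (1.0 - d) * rank[q] / len(out_links[q])
            for p in out_links[q]:
                new_rank[p] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Three pages pointing to a fourth page, which has no outgoing links.
ranks = pagerank([(1, 4), (2, 4), (3, 4)], [1, 2, 3, 4])
```

In this small example pages 1, 2 and 3 have no incoming links, so each keeps only the jump term d/n, while page 4 collects their contributions plus its own self-loop.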
The first term on the right hand side of the equation corresponds to the probability that a random Web surfer arrives at a page p from somewhere, i.e. (s)he could arrive at the page by typing the URL or from a bookmark, or may have a particular page as his/her homepage. d would then be the probability that a random surfer chooses a URL directly, i.e. by typing it, using the bookmark list, or by default, rather than traversing a link. Finally, 1/n is the uniform probability that a person chooses page p from the complete set of n pages on the Web.

[Figure 1. Illustrative example of PageRank: pages P1, P2 and P3 point to page P, so that R(P) = d/N + (1 - d)(R(P1)/OutDeg(P1) + R(P2)/OutDeg(P2) + R(P3)/OutDeg(P3)).]

The second term on the right hand side of the equation corresponds to a factor contributed by arriving at a page by traversing a link. 1 - d is the probability that a person arrives at the page p by traversing a link. The summation corresponds to the sum of the rank contributions made by all the pages that point to the page p. The rank contribution is the PageRank of the page multiplied by the probability that a particular link on the page is traversed. So for any page q pointing to page p, the probability that the link pointing to page p is traversed would be 1/OutDegree(q), assuming all links on the page are chosen with uniform probability. Figure 1 illustrates an example of computing the PageRank of a page P from the pages P1, P2 and P3 pointing to it.

There are other computational challenges that arise in PageRank. Apart from the issue of scalability, the other important computational issues are the convergence of the PageRank iteration and the handling of dangling vertices. The convergence of PageRank is guaranteed only if the Web graph is strongly connected and aperiodic. To ensure the condition of strong connectedness, the dampening factor is introduced, which assigns a uniform probability to jumping to any page.
In a graph theoretic sense it is equivalent to adding an edge between every pair of vertices with a transition probability of d/n. The aperiodic property is also guaranteed for the Web graph.

Another important issue in the computation of PageRank is the handling of dangling vertices. Dangling vertices are vertices with no outgoing edge. These vertices tend to act as rank sinks, as there is no way for rank to be distributed among the other vertices. The suggestion made initially to address this problem was to iteratively remove all the vertices that have an outdegree of zero, and compute the PageRank on the remaining vertices [1]. The reasoning here was that dangling vertices do not affect the PageRank of other vertices. Another suggested approach was to remove the dangling vertices during the initial computation and add them back during the final iterations of the computation [15]. Other popular approaches to handling dangling vertices are to add self loops to dangling vertices [16, 17] and to add links from each dangling vertex to all vertices in the graph G, so as to distribute the PageRank of the dangling vertex uniformly among all vertices [1]. In this paper we handle dangling vertices by adding self loops to all dangling vertices.

3. PROPOSED APPROACH

In the proposed approach we make use of the fact that PageRank is based on a first order Markov model. In such a model, if a vertex belonging to one set cannot be reached from a vertex belonging to any other set, then the score of this vertex depends only on the vertices of the set to which it belongs. This is because in a first order Markov model the present state depends only upon the one previous state, and to arrive at the present state we need to have an incoming link from a previous vertex. This leads to the idea that the PageRank of the vertices belonging to a set A does not depend on the PageRank of the vertices from another set B if there are no incoming links from vertices in set B to vertices in set A.
In such a scenario, the PageRank of vertices in set A can be computed independently of the PageRank of vertices in set B. We make use of this criterion to divide the graph into partitions of red patches and yellow patches such that no link points into any red patch from another red patch or from a yellow patch; that is, there are no links from one red patch to another, and no outgoing links from a yellow patch to any of the red patches. Once we can partition the graph in such a manner into sets of red patches and yellow patches, we can compute the PageRank of vertices in the red patches independently and follow it with the computation of PageRank for vertices in the yellow patch.

Such an approach has two advantages. Firstly, it reduces the size of the problem by reducing the graph into smaller subgraphs of red patches and yellow patches. Such a reduced problem size helps in fitting the graph in main memory without requiring a machine of high RAM capacity. Secondly, since the computation on the red patches can be carried out independently, this process can be parallelized, leading to further optimization and savings in computation time. However, we do not deal with parallelization issues in this paper.

Let us consider a graph $G = \langle V, E \rangle$. The idea is to partition graph G into components $G_1, G_2, \ldots, G_k$ such that:

(a) $\bigcup_{i=1}^{k} V_i = V; \quad V_i \cap V_j = \varphi, \; i \neq j$

(b) $\bigcup_{i=1}^{k} E_i \cup E_{partition} = E; \quad E_i \cap E_j = \varphi, \; i \neq j$

For PageRank, which is based on a first order Markov model, further constraints apply to prevent cyclic flow of information. For a given partition $G_i$, no edge may enter $G_i$ from the rest of the graph:

$$\nexists \, e_{xy} \in E : \; v_x \in G_i \,\wedge\, v_y \in G - G_i$$

where $e_{xy}$ is read as an edge pointing to $v_x$ from $v_y$. Such a partition $G_i$ corresponds to the definition of a red patch described earlier. In the figure, the graph G is partitioned into four partitions such that $G_1$, $G_3$, $G_4$ correspond to the red patches discussed earlier, and $G_2$ corresponds to the yellow patch.

We will now describe the scheme to compute PageRank on such a partitioned graph. Let a graph be partitioned into k red patches, $G_{r1}$ to $G_{rk}$, and a single yellow patch $G_y$. The edges from the red patches to the yellow patch (represented as dotted edges in the figure) form the set $E_{partition}$. In a given red patch $G_{ri}$, let the vertices that are the source end of an edge belonging to $E_{partition}$ (represented by annular circles in the figure) be denoted as $V_{border,ri}$. Let us define a graph $G'$ such that:

$$G' = \langle V', E' \rangle \quad \text{where} \quad V' = \bigcup_{ri=r1}^{rk} V_{border,ri} \cup V_y; \quad E' = E_{partition} \cup E_y$$

This $G'$ corresponds to the yellow patch.
The PageRank for the whole graph G can now be computed by computing PageRank independently for all individual red patches, $G_{r1}$ to $G_{rk}$, and then computing it for the graph $G'$ as defined above.

The partitioning scheme presented above is based purely on the fact that the measure of a node is independent of the measure of other nodes that do not point to it, and depends only on the previous nodes pointing to it. Thus this scheme is not restricted or specific to PageRank computation, but can be used for computation of any measure that is based on such a first order Markov model. In this work, our contribution is the presentation of such a scheme, which can be used for improving computation on a single large graph and, more importantly, for efficient incremental computation on evolving graphs. A detailed overview of the incremental approach is given in our previous work [18]. The most significant property of our approach is that it does not result in loss of rank scores, as we make no approximations such as grouping a set of vertices as a single vertex.

In the following section, we present algorithms to make such partitions. However, we do not claim such partitions to be optimal, or the algorithm to be an optimal approach to partitioning the graph for the most efficient computation. We show the feasibility of our approach by presenting naïve algorithms that can still be used to improve the computation costs.

4. EFFICIENT COMPUTATION TECHNIQUES

In this section we present efficient computation methods for lossless parallelization of PageRank computation for a static single graph, as well as an approach for lossless incremental computation of PageRank for an evolving graph. Both these approaches are based on the partitioning scheme presented in the previous section.
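The scheme of Section 3 can be sketched in Python as follows, assuming a valid partition into red patches and one yellow patch is already given. All names here (`build_graph`, `solve_patch`, `dc_pagerank`) and the toy graph are hypothetical and chosen only for illustration. Instead of materializing the graph G' with its border vertices, this sketch simply treats the already computed ranks of the red border vertices as fixed inputs when iterating on the yellow patch, which plays the same role. The teleport term uses the order n of the whole graph, and dangling vertices get self-loops.

```python
def build_graph(vertices, edges):
    """Return (in_links, out_deg); dangling vertices receive a self-loop."""
    children = {v: [] for v in vertices}
    for q, p in edges:
        children[q].append(p)
    for v in vertices:
        if not children[v]:
            children[v].append(v)
    in_links = {v: [] for v in vertices}
    for q in vertices:
        for p in children[q]:
            in_links[p].append(q)
    return in_links, {v: len(children[v]) for v in vertices}


def solve_patch(patch, in_links, out_deg, n, known, d=0.1, tol=1e-12, max_iter=10000):
    """Iterate Eq. (1) on `patch` only; ranks of outside vertices are read from `known`."""
    rank = {v: 1.0 / n for v in patch}
    for _ in range(max_iter):
        new = {}
        for p in patch:
            total = 0.0
            for q in in_links[p]:
                # A KeyError here means the patch has an in-link from an unsolved vertex,
                # i.e. the partition does not satisfy the red/yellow property.
                total += (rank[q] if q in rank else known[q]) / out_deg[q]
            new[p] = d / n + (1.0 - d) * total
        delta = sum(abs(new[v] - rank[v]) for v in patch)
        rank = new
        if delta < tol:
            break
    return rank


def dc_pagerank(vertices, edges, red_patches, yellow_patch):
    in_links, out_deg = build_graph(vertices, edges)
    n = len(vertices)
    ranks = {}
    # Red patches have no in-links from outside, so each is solved on its own
    # (sequentially here; these calls are independent of one another).
    for patch in red_patches:
        ranks.update(solve_patch(patch, in_links, out_deg, n, {}))
    # The yellow patch uses the finished red ranks as fixed border inputs.
    ranks.update(solve_patch(yellow_patch, in_links, out_deg, n, ranks))
    return ranks


V = ["a", "b", "c", "y1", "y2"]
E = [("a", "b"), ("b", "a"), ("b", "y1"), ("c", "y1"), ("y1", "y2"), ("y2", "y1")]
dc = dc_pagerank(V, E, red_patches=[["a", "b"], ["c"]], yellow_patch=["y1", "y2"])

# Reference: the same iteration run on the whole graph at once.
in_links_all, out_deg_all = build_graph(V, E)
whole = solve_patch(V, in_links_all, out_deg_all, len(V), {})
```

On this toy graph the patch-wise ranks agree with the whole-graph ranks up to the iteration tolerance, which is the lossless property the section argues for.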
4.1 Static Single Graph

We present a methodology for parallelizing the PageRank computation for a Static Single Graph, using the partitioning technique presented in the previous section. Assume that we have a graph partitioned into a set of red patches surrounded by a yellow patch. PageRank scores for vertices in the red patches are computed first. The red patches can be treated as independent subgraphs and their PageRank computation is performed in parallel. Next, using the PageRank scores for the peripheral red colored vertices, i.e. red colored vertices with edges crossing over to the yellow patch, we compute the PageRank scores for the vertices in the yellow patch. For the PageRank computation of the red patches and the yellow patch, the user is free to choose any variant (or the naïve version itself) of the PageRank algorithm, as the partitioning scheme results in nothing but a reduction of the problem size and is independent of the algorithm used for PageRank computation. Thus, we have a lossless parallel PageRank computation method for a Static Single Graph based on our partitioning scheme.

The user is free to design and use any patch extraction algorithm, provided the patches extracted have the desired properties discussed in Section 3. One can perform the patch extraction during the crawling procedure or as a pre-processing step. For the purpose of showing that such patches indeed exist and can be extracted, we have implemented a naïve algorithm. The following steps briefly describe the algorithm.

Consider that we have three sets of vertices colored red, yellow and black. Red colored vertices are those which belong to some red patch. Yellow colored vertices belong to the yellow patch. The black colored vertices are the unexplored vertices. Initially, since all the vertices are unexplored, there are no red and yellow colored vertices; all the vertices in the graph are black colored.

Step 1 - Randomly pick a black colored vertex.

Step 2 - Perform a reverse BFS on this vertex, i.e. explore all the ancestors of this vertex by traversing along the incoming links. Color all the black vertices encountered red. The reverse BFS does not stop until no further traversal is possible or a red vertex is encountered.

Step 3 - Label this set of red vertices as a red patch. Select all peripheral vertices, i.e. vertices in this red patch that have edges crossing over to any vertex outside this red patch.

Step 4 - Perform a normal BFS on each of the peripheral vertices, coloring all the black vertices encountered yellow. Each of the BFSs continues until no further traversal is possible or a yellow vertex is encountered. It is not possible for any of the BFSs to encounter a red vertex. Thus, we color all the descendants of each of the peripheral vertices yellow.

Step 5 - We now have a red patch surrounded by a yellow patch region. Repeat the entire procedure from Steps 1-4 (illustrated in Figure 3), each time extracting a different red patch, until all the black vertices are exhausted.

Figure 3. Steps 1 to 4 of the approach

Note that in Step 2 it is not possible to encounter either a red vertex from a previously extracted red patch or a yellow colored vertex. Also, in Step 4 it is not possible to encounter a red colored vertex from a previously extracted red patch; however, one may encounter a yellow vertex.

It is observed that the red patches extracted in Steps 1-5 contain a small fraction of the vertices in the graph. An ideal situation would be one where we have the entire graph partitioned into equal sized red patches, i.e. no yellow patch. This is because we compute PageRank for the red patches in parallel, and the more vertices there are in the red patches, the greater the gain in efficiency. Even though we may not realize the ideal situation, it is still desirable to include as much of the graph as possible in red patches. Therefore, the second part of the algorithm expands the red patches obtained above, i.e. includes as many vertices as possible in them.

Step 6 - Pick a red patch, call it R.

Step 7 - Perform a reverse BFS on each of the yellow colored children of the peripheral vertices of this red patch. If during the traversal a red vertex belonging to red patch R is encountered, then all the vertices encountered so far are colored red and included in the red patch R. If a red colored vertex from some other red patch is encountered, then the red patch R cannot be expanded along this path and the reverse BFS is abandoned, leaving all vertices as they were. If one encounters a dead end, i.e. a yellow vertex with no incoming links, then again all vertices encountered so far are colored red and included in the red patch R.

Step 8 - Perform Steps 6 and 7 for each red patch (Steps 6-7 are illustrated in Figure 4).

Step 9 - Repeat Steps 6-8 until no change is observed in the size of any of the red patches, i.e. no red patch can be expanded any more.

Step 10 - Return the list of red patches and the yellow colored vertices.

Figure 4. Steps 6 to 7 of the approach
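Steps 1-5 of the naïve extraction can be sketched in Python as below. This is only an illustration under stated assumptions: the function name `extract_patches` is hypothetical, seeds are taken in list order rather than picked randomly (so the result is reproducible), and the expansion phase of Steps 6-9 is not included.

```python
from collections import deque


def extract_patches(vertices, edges):
    """Steps 1-5 only: return (red_patches, yellow_vertices)."""
    vertices = list(vertices)
    children = {v: [] for v in vertices}
    parents = {v: [] for v in vertices}
    for q, p in edges:
        children[q].append(p)
        parents[p].append(q)

    color = {v: "black" for v in vertices}  # every vertex starts unexplored
    red_patches = []
    for seed in vertices:  # Step 1 (assumption: fixed order instead of a random pick)
        if color[seed] != "black":
            continue
        # Step 2: reverse BFS along incoming links, coloring black ancestors red.
        patch = [seed]
        color[seed] = "red"
        queue = deque([seed])
        while queue:
            v = queue.popleft()
            for u in parents[v]:
                if color[u] == "black":
                    color[u] = "red"
                    patch.append(u)
                    queue.append(u)
        # Step 3: this set of red vertices is one red patch.
        red_patches.append(patch)
        # Step 4: forward BFS coloring black descendants yellow. Starting from every
        # patch vertex is equivalent to starting from the peripheral ones, because
        # children inside the patch are already red and are skipped.
        queue = deque(patch)
        while queue:
            v = queue.popleft()
            for c in children[v]:
                if color[c] == "black":
                    color[c] = "yellow"
                    queue.append(c)
        # Step 5: the loop repeats until no black vertex is left.
    yellow = [v for v in vertices if color[v] == "yellow"]
    return red_patches, yellow


V = ["a", "b", "c", "y1", "y2"]
E = [("a", "b"), ("b", "a"), ("b", "y1"), ("c", "y1"), ("y1", "y2"), ("y2", "y1")]
red_patches, yellow = extract_patches(V, E)
```

With the seed order above, `a` and `b` form one red patch, `c` forms another, and `y1`, `y2` end up yellow. Seeding from `y1` first would instead pull every vertex into a single red patch, which illustrates how the patch profile depends on the choice made in Step 1.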
These are then used for parallel PageRank computation as explained earlier.

After patch extraction, the graph can be treated for dangling vertices by adding a self-loop to each dangling vertex, or one may choose to delete these vertices. The first option will have no effect on the patches extracted. The second option may result in certain vertices being deleted; however, it will not result in a situation where the properties of the patches extracted are violated. One may also choose to perform these operations before the patch extraction process. Adding edges from a dangling vertex to every other vertex has an adverse effect on our partitioning scheme. In such a case, every vertex will have an incoming link from a dangling vertex. Therefore, every red patch will have to contain these dangling vertices. Since all red patches are disjoint, it is not possible to have more than one red patch. Therefore, in spite of its popularity, this technique for handling dangling vertices will not work with our partitioning scheme.

A good partitioning scheme would be one where we get patches of a good profile. By a good profile we mean a set of patches that maximizes the total number of vertices included in red patches, minimizes the size of the largest red patch, and minimizes the skew in the red patch sizes. Note that the patch extraction algorithm presented (and implemented by us) is quite naïve and not optimal. For instance, we do not take any steps to prevent skew in the red patch sizes. Also note that the profile of the patches (i.e. properties such as number of patches, sizes of patches and skew in patch size) that are extracted from a graph depends on how the black colored vertices are picked in Step 1. In our case we choose to pick a vertex randomly. This is because we expect a more optimal or intelligent picking scheme to come up with a better patch profile when compared to a random one such as ours. In Section 5 we show that in spite of the sub-optimal patch extraction algorithm used by us, we still get favorable results.

4.2 Evolving Graphs

In this section, we describe the incremental algorithm to compute PageRank. The initial step is to read the graph at a new time instance and determine the vertices that have changed. This does not require additional time, as it can be computed as we read the new graph. Thus, after reading the graph, we can assume that we are given two sets of vertices: one containing the vertices which have changed from the previous time instance, and the other containing the vertices that have remained unchanged. Hence, the input to the algorithm is the graph G and the two lists $V_c$ and $V_u$. The outline of the algorithm is shown in Figure 5. We will now briefly describe each step in the algorithm.

We start by initializing a list $V_Q$. Recall that a change in a vertex induces a change in the PageRank distribution of all its children, and all such changed vertices are available to us in the queue $V_c$. A simple traversal method is used to extend this list of changed vertices, such that it also includes all descendants of the initial list of changed vertices.
All of these vertices are pushed into the list Q2. For the remaining vertices there is no change in their PageRank distribution. The new PageRank is simply obtained by scaling the previous PageRank scores.

IPR(G, V_u, V_c):
Step 1 - Initialize the list V_Q.
Step 2 - Pop a vertex N from V_c.
  2.1 For all the children of N: if a child of N is in list V_u, remove it from V_u and push it in V_c.
  2.2 Push N in V_Q and repeat Step 2 till queue V_c is empty.
Step 3 - For each element in list V_u:
  3.1 Take the element and scale the previous PageRank value to get the new PageRank value.
  3.2 Look up whether any of the children of the element of V_u belong to V_Q; if so, remove this element of V_u and copy it in V_b.
Step 4 - Scale Border Nodes in V_b for stochastic property. Perform Original PageRank(V_Q U V_b).

Figure 5. Incremental PageRank Algorithm

The scaling factor is simply the order of the graph at the previous time instance divided by the order of the graph at the present time instance.

Also note that all those vertices from the set of unchanged vertices that point to a changed vertex will influence the PageRank value of that changed vertex; hence these too must be included in the list $V_Q$, as their PageRank scores will be required for computing the PageRank scores of the changed vertices. We now perform the original PageRank computation, along with steps taken to ensure the stochastic property of the transition matrix, on the vertices that are in Q2 and colored violet (i.e. vertices which have changed) to get the new PageRank values for these changed vertices. Thus, we end up localizing the changed partition to a certain sub-graph of the Web, which includes all vertices whose PageRank values are affected by the structural changes in the graph, and the basic PageRank algorithm is then performed only on this changed sub-graph. The PageRank value for the rest of the vertices is simply a matter of scaling the previous values.

Step 2 has a cost of E, where E is the number of edges in the partition Q.
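The incremental procedure of Figure 5 can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation: the names `build`, `iterate` and `incremental_pagerank` are hypothetical, the changed set V_c is assumed to be supplied by the caller, and the border-node handling of Step 4 is reduced to using the scaled ranks of unchanged vertices as fixed inputs. Ranks follow Eq. (1) with self-loops on dangling vertices.

```python
from collections import deque


def build(vertices, edges):
    """Return (in_links, out_deg, children); dangling vertices receive a self-loop."""
    children = {v: [] for v in vertices}
    for q, p in edges:
        children[q].append(p)
    for v in vertices:
        if not children[v]:
            children[v].append(v)
    in_links = {v: [] for v in vertices}
    for q in vertices:
        for p in children[q]:
            in_links[p].append(q)
    return in_links, {v: len(children[v]) for v in vertices}, children


def iterate(part, in_links, out_deg, n, known, d=0.1, tol=1e-12, max_iter=10000):
    """Iterate Eq. (1) on `part`; ranks of vertices outside it are read from `known`."""
    rank = {v: 1.0 / n for v in part}
    for _ in range(max_iter):
        new = {}
        for p in part:
            total = sum((rank[q] if q in rank else known[q]) / out_deg[q]
                        for q in in_links[p])
            new[p] = d / n + (1.0 - d) * total
        delta = sum(abs(new[v] - rank[v]) for v in part)
        rank = new
        if delta < tol:
            break
    return rank


def incremental_pagerank(vertices, edges, changed, old_rank, n_old):
    in_links, out_deg, children = build(vertices, edges)
    n_new = len(vertices)
    # Step 2: V_Q = the changed vertices plus all of their descendants.
    v_q = set(changed)
    queue = deque(changed)
    while queue:
        v = queue.popleft()
        for c in children[v]:
            if c not in v_q:
                v_q.add(c)
                queue.append(c)
    # Step 3: every remaining vertex just gets its old rank scaled by n_old / n_new.
    scale = n_old / n_new
    rank = {v: old_rank[v] * scale for v in vertices if v not in v_q}
    # Step 4 (simplified): run the ordinary iteration on V_Q only.
    rank.update(iterate(list(v_q), in_links, out_deg, n_new, rank))
    return rank


old_v = ["a", "b", "c", "y1", "y2"]
old_e = [("a", "b"), ("b", "a"), ("b", "y1"), ("c", "y1"), ("y1", "y2"), ("y2", "y1")]
il, od, _ = build(old_v, old_e)
old_rank = iterate(old_v, il, od, len(old_v), {})

# New time instance: page y3 appears and y2 gains a link to it.
new_v = old_v + ["y3"]
new_e = old_e + [("y2", "y3")]
inc = incremental_pagerank(new_v, new_e, ["y2", "y3"], old_rank, len(old_v))

il2, od2, _ = build(new_v, new_e)
full = iterate(new_v, il2, od2, len(new_v), {})
```

In this toy case only `y1`, `y2` and `y3` are iterated on; `a`, `b` and `c` are obtained by scaling, and the result matches a full recomputation on the new graph up to the iteration tolerance.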
Now the PageRank values for the partition P are obtained by scaling the PageRank values with respect to the ranks in the previous time instance. This step requires a cost of V, where V is the number of vertices in partition P. Now, using these scaled values and the naïve approach, PageRank for the vertices in partition Q is calculated. This step (including the cost required to scale the border vertices) requires a cost of $nE + E + V_b$, where n is the number of iterations required for the PageRank values to converge and E is again the number of edges in partition Q. Thus, the total cost for incremental PageRank can be summed up to be $O(2E + V + nE + V_b)$.

5. EXPERIMENTS AND RESULTS

In this section we present results for our graph partitioning scheme for a static single graph as well as the incremental PageRank computation for an evolving graph. For the graph partitioning scheme we present results on the link graph for website as crawled on July 19th, The graph contains a total of edges and vertices. Prior to partitioning, dangling vertices were taken care of by adding self-loops to each of them. The various statistics for patch extraction are presented in Table 1. The center column (labeled "Number") provides the actual number of vertices/edges and the next column (labeled "Percentage") provides the same information as a percentage of the total number of vertices/edges in the graph. In spite of using a website link graph, which one expects to be denser than the Web graph, and a naïve and suboptimal patch extraction algorithm, we still get a significant portion of the graph (i.e. 63.5%) in the form of red patches, with the largest patch containing 16.94% of the total edges. Note that the sizes in number of edges are more important than the ones in number of vertices, as the PageRank algorithm is O(E). The actual runtime of the patch extraction algorithm was 37 ms, which would reduce the overall gain in efficiency to 1.57 times faster.
However, the patch extraction is not an optimal implementation, and hence an improved patch extraction could further improve the efficiency. One might also perform patch extraction during the crawling procedure.

Table 1. Statistics for Patch Extraction (Number of Red Patches = 326)

Part | Number | Percentage
Total Edges in all Red Patches | | %
Total Edges in Yellow Patch | | %
Vertices in Red Patches | | %
Vertices in Yellow Patch | | %
Edges in Largest Red Patch | | %
Vertices in Largest Red Patch | | %

The parallel PageRank computation for a Static Single Graph was again run on the link graph for as crawled on July 19th, Due to the unavailability of a parallel platform for carrying out our experiments, we simulate the results for a parallel execution by using the maximum of the runtimes of the PageRank computations of all red patches in place of the runtime of computing PageRank of all red patches in parallel. This is a valid assumption as there is no inter-process communication. Any variant of the PageRank algorithm can be used with our partitioning scheme. For our experiments we chose to use the original, naïve PageRank algorithm. Table 2 summarizes the runtime results for the parallel PageRank computation compared with the same original PageRank. We used a convergence threshold of $10^{-8}$ and a dampening factor of 0.1 for all PageRank computations. To examine the experimental accuracy we computed the L1-norm of the difference between the PageRank score vectors returned by the two methods, which was found to be 4.4 x The small error could arise due to numerical computing issues and may also depend on the number of iterations and the convergence rate when computing on the whole graph versus computing to convergence on a small subgraph.

Table 2. Divide and Conquer Approach versus Naive Approach

Operation | Average Run Time (in ms)
Largest Red Patch PageRank | 20
Yellow Patch PageRank | 17
DC PageRank | 38
Original PageRank | 118

DC approach to PR ran 3.1 times faster than original PR.

From Table 2 we can see that our proposed parallel PageRank method ran about 3.1 times faster than the original PageRank. Again we point out that this is in spite of using a sub-optimal partitioning algorithm.
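The speedup figures quoted in this section follow from the runtimes in Table 2 and the 37 ms patch extraction time. The short Python check below reproduces that arithmetic; the variable names are illustrative, and the numbers are the ones reported in the text.

```python
# Runtimes reported in Table 2 and Section 5 (milliseconds).
largest_red_ms = 20    # PageRank on the largest red patch (stands in for all red patches in parallel)
yellow_ms = 17         # PageRank on the yellow patch
dc_ms = 38             # reported divide and conquer total
original_ms = 118      # original PageRank on the whole graph
extraction_ms = 37     # patch extraction runtime

# Speedup of the DC computation alone.
speedup = original_ms / dc_ms
# Speedup when the patch extraction time is charged to the DC approach as well.
speedup_with_extraction = original_ms / (dc_ms + extraction_ms)
```

The two component runtimes add up to 37 ms, one less than the reported DC total of 38 ms, which is consistent with the table listing averaged run times.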
The results clearly indicate that a combination of any PageRank algorithm and our partitioning scheme is more efficient than that same PageRank algorithm by itself. Along with the actual runtimes of the parallel and original PageRank, Table 2 also shows the actual runtimes for the PageRank computation of the largest red patch as well as the yellow patch.

To test our incremental PageRank approach, we performed experiments on two different web sites: the Computer Science website and the Institute of Technology website at the University of Minnesota. We performed the experiments at different time intervals to study the change and the effect of the incremental computation. For the Computer Science website our analysis was done at time intervals of two days, eight days and ten days. A time interval of two days was used for the Institute of Technology website. In our experiments we also simulated focused crawling, by not considering the Web pages that have very low PageRank in our graph construction and PageRank computation. This was to emulate the real world scenario where not all pages are crawled. We wanted to analyze how the incremental approach performs when pages with low PageRank are not crawled. We used the following approximate measure to compare the computational costs of our method versus the naïve method:

Number of Times Faster = Num of Iterations(PR) / (1 + (fraction of changed portion) * Num of Iterations(IPR))

The intuition behind the measure is how fast the convergence threshold will be reached when computing PageRank incrementally versus computing PageRank with the naïve method for the whole graph. The convergence threshold chosen for our experiments was $1 \times 10^{-8}$. The experimental results are presented in Table 3. These results are from actual experiments conducted on the Computer Science and Institute of Technology websites.
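The comparison measure above is easy to evaluate directly. The following Python helper is a small sketch with an assumed name (`times_faster`) and assumed parameter names; the example call uses the Institute of Technology figures reported in Table 3 (0% change, 0 iterations for the incremental run, 11 iterations for the full run).

```python
def times_faster(iters_full, iters_incremental, changed_fraction):
    """Number of Times Faster = iters(PR) / (1 + changed_fraction * iters(IPR))."""
    return iters_full / (1.0 + changed_fraction * iters_incremental)


# Institute of Technology website: no change between the two crawls.
it_website = times_faster(iters_full=11, iters_incremental=0, changed_fraction=0.0)
```

When nothing has changed, the denominator reduces to 1 and the measure equals the iteration count of the full computation, which is how the value of 11 for that website arises.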
For the Computer Science website, in the first time interval of eight days, there was a significant change in the structure of the website: about 60% of the pages had changed their link structure. We found that such a sea change occurred because a whole subgraph that contained the documentation for Matlab help was removed. The incremental approach nevertheless performed 1.86 times faster than the naïve PageRank. Similarly, for a period of ten days the incremental approach performed around 1.75 times faster. For a period of two days the improvement was 8.65 times. These results are for the case of an unfocussed crawl.

The results for the focused crawl of the CS website were better. In the first case, when the time interval was eight days, the improvement was 1.9 times, and when the time interval was ten days, the improvement was 1.76 times. For a period of two days the improvement with focused crawling was 9.88 times. This suggests that focused crawling can also improve the computational costs of the incremental algorithm.

The Institute of Technology website is representative of a website that doesn't change too often. There was no change in the Web structure over a period of two days. Since there was no change detected, there was no need to compute the PageRank for the graph at the new time instance, and by our measure the incremental approach was 11 times faster. Since there was no change in the graph structure, the improvements for the cases of focused crawling and unfocussed crawling remain the same.

Table 3. Comparison of results for the Incremental PageRank Algorithm versus the Naïve PageRank Algorithm

Computer Science Website, Focused Crawl
July 19th vs July 27th: percentage of change = …%; L1-norm: …e-05; NumTimes faster = …; … iteration(s) for inc_pagerank; 12 iteration(s) for actual pagerank
July 27th vs July 29th: percentage of change = …%; L1-norm: …e-07; NumTimes faster = …; … iteration(s) for inc_pagerank; 13 iteration(s) for actual pagerank
July 19th vs July 29th: percentage of change = …%; L1-norm: …e-05; NumTimes faster = …; … iteration(s) for inc_pagerank; 12 iteration(s) for actual pagerank

Computer Science Website, Unfocused Crawl
July 19th vs July 27th: percentage of change = …%; norm: …e-07; NumTimes faster = …; … iteration(s) for inc_pagerank; 12 iteration(s) for actual pagerank
July 27th vs July 29th: percentage of change = …%; norm: …e-07; NumTimes faster = …; … iteration(s) for inc_pagerank; 13 iteration(s) for actual pagerank
July 19th vs July 29th: percentage of change = …%; norm: …e-07; NumTimes faster = …; … iteration(s) for inc_pagerank; 12 iteration(s) for actual pagerank

Institute of Technology Website, Unfocused/Focused Crawl
July 30th vs Aug 1st: percentage of change = 0%; norm: …e-07; NumTimes faster = 11; 0 iteration(s) for inc_pagerank; 11 iteration(s) for actual pagerank

6. CONCLUSIONS AND FUTURE DIRECTIONS

In this work, we have followed a divide and conquer approach to partition a graph in a way that enables efficient computation of measures and metrics based on a first-order Markov model. We present a theoretical framework showing how such a partitioning scheme can be used to divide the problem into smaller problems. We extend this approach to another important dimension, that of evolving graphs, and show how it can significantly improve the efficiency of computation on large evolving graphs. Our experimental results also show that such an approach, even when using the naïve algorithm, improves the computation by a significant margin.
This approach also leads to other areas of interest and research directions for further optimizing the computation. As discussed earlier, the two key issues to consider are what constitutes an optimal partition and what algorithm can obtain such an optimal partitioning. Another emerging challenge is the class of pages, such as Wikipedia and blogs, that change very dynamically. While it is still not clear whether PageRank is the right metric for such changing pages, it should be noted that certain pages change more dynamically than others. Since different portions of the Web change at different rates, it is challenging to keep the PageRank values up to date while optimizing the computation frequency. The study of efficient computation for the changing Web is thus a challenging problem with a large scope for research.

7. ACKNOWLEDGMENTS

This work was supported by Army High Performance Computing Research Center contract number DAAD. The content of the work does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputing Institute.

8. REFERENCES

[1] L. Page, S. Brin, R. Motwani and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web," Stanford Digital Library Technologies, January
[2] J.M. Kleinberg, "Authoritative Sources in Hyperlinked Environment," 9th Annual ACM-SIAM Symposium on Discrete Algorithms, pages , 1998
[3] P. Desikan, J. Srivastava, V. Kumar, P.-N. Tan, "Hyperlink Analysis Techniques & Applications," Army High Performance Computing Center Technical Report,
[4] T. Haveliwala, "Topic-Sensitive PageRank," In Proceedings of 11th International WWW Conference, May
[5] D. Padmanabhan, P. Desikan, J. Srivastava and K. Riaz, "WICER: A Weighted Inter-Cluster Edge Ranking for Clustered Graphs," The 2005 IEEE/WIC/ACM International Conference on WI 2005 and IAT
[6] T. Haveliwala, "Efficient Computation of PageRank," Stanford University Technical Report, September
[7] Y. Chiang, M. Goodrich, E. Grove, R. Tamassia, D. Vengroff, and J. Vitter, "External-memory graph algorithms," In Proc.
of the 6th Annual ACM-SIAM Symposium on Discrete Algorithms, January
[8] Y. Chen, Q. Gan, and T. Suel, "I/O-efficient techniques for computing PageRank," In Proc. of the 11th International Conf. on Information and Knowledge Management, pages , November
[9] A. Arasu, J. Novak, A. Tomkins, and J. Tomlin, "PageRank computation and the structure of the web: Experiments and algorithms," In Poster presentation at the 11th Int. World Wide Web Conference, May
[10] S.D. Kamvar, T.H. Haveliwala, C.D. Manning, and G.H. Golub, "Extrapolation Methods for Accelerating PageRank Computations," In Proceedings of the 12th International WWW Conference, May,
[11] P. Berkhin, "A survey on PageRank computing," Internet Mathematics, Vol. 2, Issue 1
[12] A.N. Langville and C.D. Meyer, "Deeper inside PageRank," Internet Mathematics, 1 (2003-4),
[13] S. Chien, C. Dwork, R. Kumar, D. Sivakumar, D. Simon, "Link evolution: Analysis and algorithms," First Workshop on Algorithms and Models for the Web-graph
[14] Google,
[15] S.D. Kamvar, T.H. Haveliwala, C.D. Manning, and G.H. Golub, "Exploiting the Block Structure of the Web for Computing PageRank," Preprint, March, 2003
[16] G. Jeh and J. Widom, "Scaling personalized web search," In 12th Int. World Wide Web Conference,
[17] N. Eiron, K. McCurley, J. Tomlin, "Ranking the Web frontier," In: Proc. 13th conference on World Wide Web, ACM Press (2004)
[18] P. Desikan, N. Pathak, J. Srivastava and V. Kumar, "Incremental PageRank Computation on evolving graphs," In: Proc. 14th International World Wide Web Conference, May 10-14, 2005.
