Divide and Conquer Approach for Efficient PageRank Computation


Prasanna Desikan, Dept. of Computer Science, University of Minnesota, Minneapolis, MN USA
Nishith Pathak, Dept. of Computer Science, University of Minnesota, Minneapolis, MN USA
Jaideep Srivastava, Dept. of Computer Science, University of Minnesota, Minneapolis, MN USA
Vipin Kumar, Dept. of Computer Science, University of Minnesota, Minneapolis, MN USA

ABSTRACT
PageRank is a popular ranking metric for large graphs such as the World Wide Web. Current research techniques for improving the computational efficiency of PageRank have focused on reducing the I/O cost, accelerating convergence and parallelizing the computation. In this paper, we propose a divide and conquer strategy for efficient computation of PageRank. The strategy differs from contemporary improvements in that it can be combined with any existing enhancement to PageRank, giving rise to an entire class of more efficient algorithms. We present a novel graph-partitioning technique for dividing the graph into subgraphs on which computation can be performed independently. This approach has two significant benefits. Firstly, since the approach focuses on work reduction, it can be combined with any existing enhancements to PageRank. Secondly, the proposed approach leads naturally to an incremental approach for computing such ranking metrics, given that these large graphs evolve over time. The partitioning technique is both lossless and independent of the type (variant) of PageRank computation algorithm used. The experimental results for a static single graph (a graph at a single time instance), as well as for the incremental computation in the case of evolving graphs, illustrate the utility of our novel partitioning approach. The proposed approach can also be applied to the computation of any other metric based on a first order Markov chain model.

Categories and Subject Descriptors
G.4 [Mathematical Software]: Efficiency, Algorithm Design and Analysis.

General Terms
Algorithms, Performance, Design, Theory

Keywords
PageRank, Efficient Computation, Ranking Measures, Graph Partitioning

Copyright is held by the author/owner(s). ICWE'06, July 11-14, 2006, Palo Alto, California, USA. ACM /06/0007

1. INTRODUCTION
Link analysis techniques have been used widely for developing ranking metrics in large graphs such as the Web. The principal observation is that a hyperlink from a source page to a destination page serves as an endorsement of the destination page by the (author of the) source page on some topic. Link based metrics for Web graphs have been found to provide stable rankings for Web search, avoiding issues related to text spamming. Information on various link based metrics, such as Kleinberg's HITS algorithm [2], is discussed in the survey [3]. Among the different link based metrics on the Web graph, the PageRank metric [1] has gained significant prominence with the success of Google. A primary reason for its success is that the rank of a page depends on the pages pointing to it, which reduces the chances of a page's creator biasing its rank. Secondly, PageRank is precomputed for the whole Web graph and is query independent, making it a faster approach to rank results during a search operation. The popularity and stability of PageRank have led to a variety of modifications to the underlying PageRank model addressing different scenarios such as topic sensitivity [4], usage analysis [5], and bias among different clusters [6].

The issue of efficient computation of PageRank has also captured the attention of the research community. Haveliwala proposed an efficient computation approach for PageRank [6] using a block based technique with efficient I/O. Improvements to such I/O efficient methods [7, 8] and accelerated convergence [9, 10] for PageRank have also been well studied.
The various issues related to PageRank computation have been extensively covered in recent surveys [11, 12].

In this paper our contribution is twofold. Firstly, we propose work reduction techniques through graph partitioning to break the problem of computation on a large graph into computation on smaller subgraphs. The partitioning approach is not an approximation method and hence does not result in loss of information. Also, this approach of work reduction is complementary to other approaches, and hence can be used in tandem with other efficient computation methods to further improve efficiency. In the second part of our work, we address the issue of computation on an evolving Web graph. A straightforward approach would be to compute these measures for the whole Web graph at each time instance. However, given the size of the Web graph, this is becoming increasingly infeasible. Furthermore, if the percentage of vertices that change during a typical time interval between crawls by search engines is not high, a large portion of the computation cost may be wasted on re-computing the scores for the unchanged portion. Hence, there is a need for computing metrics incrementally, to save on computation costs. Chien et al. [13] propose an approximation approach to compute PageRank incrementally. Our approach, in contrast, relies on a sound theoretical partitioning criterion that results in a lossless incremental computation of PageRank. Initial work on the incremental approach was presented earlier [14]. Our results indicate that we achieve significant improvement in computation time with such an approach.

This paper is organized as follows. In the next section, we give an overview of the PageRank metric and its underlying model. In Section 3, we describe the theoretical framework of our proposed divide and conquer approach. Section 4 discusses the methodology for using this approach to compute PageRank for a large graph at a single time instance, and the extension of this approach to the computation of PageRank on large evolving graphs. Experiments and results supporting the approach are presented in Section 5. Section 6 provides conclusions and discusses possible future work.

2. PAGERANK OVERVIEW
PageRank is a metric for ranking hypertext documents that determines their quality. It was originally developed by Page et al. [1] for the popular search engine, Google [14]. The key idea is that a page has high rank if it is pointed to by many highly ranked pages. Thus, the rank of a page depends upon the ranks of the pages pointing to it. The rank of a page p can thus be written as:

$$R(p) = \frac{d}{n} + (1 - d) \sum_{(q,p) \in G} \frac{R(q)}{\mathrm{OutDegree}(q)} \qquad (1)$$

Here, n is the number of vertices in the graph and OutDegree(q) is the number of hyperlinks on page q. Intuitively, the approach can be viewed as a stochastic analysis of a random walk on the Web graph.
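Equation (1) can be evaluated by simple repeated substitution. The following is a minimal sketch, not the authors' implementation: the function name `pagerank`, the adjacency-dictionary input and the three-page example graph are illustrative assumptions, and the sketch assumes every vertex has at least one outgoing link. The default d = 0.1 matches the value reported in the experiments of this paper.

```python
def pagerank(out_links, d=0.1, tol=1e-8, max_iter=1000):
    """Iterate Equation (1) until the L1 change between sweeps drops below tol.

    out_links maps each vertex to the list of vertices it links to.
    Assumption: every vertex has at least one out-link (no dangling vertices).
    d is the probability of jumping to a random page, as in Equation (1).
    """
    vertices = list(out_links)
    n = len(vertices)
    rank = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        # First term of Equation (1): the uniform jump contribution d/n.
        new_rank = {v: d / n for v in vertices}
        # Second term: each page q passes R(q)/OutDegree(q) along every link.
        for q, targets in out_links.items():
            share = (1.0 - d) * rank[q] / len(targets)
            for p in targets:
                new_rank[p] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in vertices)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical three-page graph used only for illustration.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
scores = pagerank(graph)
```

Because no vertex in the example is dangling, the scores sum to one, and page c, which is pointed to by both a and b, ends up ranked above b.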
The first term on the right-hand side of Equation (1) corresponds to the probability that a random Web surfer arrives at a page p from somewhere, i.e. (s)he could arrive at the page by typing the URL or from a bookmark, or may have a particular page as his/her homepage. Here d is the probability that a random surfer chooses a URL directly, i.e. by typing it, using the bookmark list, or by default, rather than traversing a link. Finally, 1/n is the uniform probability that a person chooses page p from the complete set of n pages on the Web. The second term on the right-hand side of the equation corresponds to the factor contributed by arriving at a page by traversing a link. 1 - d is the probability that a person arrives at the page p by traversing a link. The summation corresponds to the sum of the rank contributions made by all the pages that point to the page p. The rank contribution is the PageRank of the page multiplied by the probability that a particular link on the page is traversed. So for any page q pointing to page p, the probability that the link pointing to page p is traversed is 1/OutDegree(q), assuming all links on the page are chosen with uniform probability. Figure 1 illustrates an example of computing the PageRank of a page P from the pages P1, P2, P3 pointing to it.

[Figure 1. Illustrative Example of PageRank: page P has incoming links from P1, P2 and P3, and $R(P) = \frac{d}{n} + (1 - d)\left(\frac{R(P_1)}{\mathrm{OutDeg}(P_1)} + \frac{R(P_2)}{\mathrm{OutDeg}(P_2)} + \frac{R(P_3)}{\mathrm{OutDeg}(P_3)}\right)$]

There are other computational challenges that arise in PageRank. Apart from the issue of scalability, the other important computational issues are the convergence of the PageRank iteration and the handling of dangling vertices. The convergence of PageRank is guaranteed only if the Web graph is strongly connected and aperiodic. To ensure strong connectedness, the dampening factor is introduced, which assigns a uniform probability to jumping to any page.
In a graph theoretic sense, this is equivalent to adding an edge between every pair of vertices with a transition probability of d/n. The aperiodic property is also guaranteed for the Web graph.

Another important issue in the computation of PageRank is the handling of dangling vertices. Dangling vertices are vertices with no outgoing edge. These vertices tend to act as rank sinks, as there is no way for their rank to be distributed among the other vertices. The suggestion made initially to address this problem was to iteratively remove all the vertices that have an outdegree of zero, and compute the PageRank on the remaining vertices [1]. The reasoning here was that dangling vertices do not affect the PageRank of other vertices. Another suggested approach was to remove the dangling vertices at the start of the computation and add them back during the final iterations [15]. Other popular approaches to handling dangling vertices are to add self loops to dangling vertices [16, 17], and to add links from each dangling vertex to all vertices in the graph G so as to distribute the PageRank of the dangling vertex uniformly among all vertices [1]. In this paper we handle dangling vertices by adding self loops to all dangling vertices.

3. PROPOSED APPROACH
In the proposed approach we make use of the fact that PageRank is based on a first order Markov model. In such a model, if a vertex belonging to one set cannot be reached from a vertex belonging to any other set, then the score of this vertex depends only on the vertices of the set to which it belongs. This is because in a first order Markov model the present state depends only upon the one previous state, and to arrive at the present state we need an incoming link from a previous vertex. This leads to the idea that the PageRank of the vertices belonging to a set A does not depend on the PageRank of the vertices from another set B if there are no incoming links from vertices in set B to vertices in set A.
In such a scenario, the PageRank of vertices in set A can be computed independently of the PageRank of vertices in set B. We make use of this criterion to divide the graph into partitions of red patches and yellow patches such that no link points from any red patch to another red patch, and there are no outgoing links from a yellow patch to any of the red patches. Links from a red patch to a yellow patch are allowed. Once we can partition the graph in such a manner into sets of red patches and yellow patches, we can compute the PageRank of vertices in the red patches independently and follow it with the computation of PageRank for vertices in the yellow patch. Such an approach has two advantages. Firstly, it reduces the size of the problem by breaking the graph into smaller subgraphs of red patches and yellow patches. Such a reduced problem size helps in fitting the graph in main memory without requiring a machine with high RAM capacity. Secondly, since the computation on red patches can be carried out independently, this process can be parallelized, leading to further savings in computation time. However, we do not deal with parallelization issues in this paper.

Let us consider a graph G = (V, E). The idea is to partition graph G into components G_1, G_2, ..., G_k such that:

(a) $\bigcup_{i=1}^{k} V_i = V; \quad V_i \cap V_j = \emptyset \ \text{for } i \neq j$

(b) $\left(\bigcup_{i=1}^{k} E_i\right) \cup E_{partition} = E; \quad E_i \cap E_j = \emptyset \ \text{for } i \neq j$

For PageRank, which is based on a first order Markov model, further constraints apply to prevent cyclic flow of information. For a given partition G_i, no edge may enter G_i from the rest of the graph:

$\nexists\, e_{xy} \in E$ (an edge from $v_x$ to $v_y$) with $v_x \in G - G_i$ and $v_y \in G_i$

Such a partition G_i corresponds to the definition of a red patch described earlier. In the figure, the graph G is partitioned into four partitions such that G_1, G_3, G_4 correspond to the red patches discussed earlier. G_2 corresponds to the yellow patch.

We will now describe the scheme to compute PageRank on such a partitioned graph. Let a graph be partitioned into k red patches, G_r1 to G_rk, and a single yellow patch G_y. The edges from the red patches to the yellow patch (represented as dotted edges in the figure) form the set E_partition. In a given red patch G_ri, let the vertices that are the source end of an edge belonging to E_partition (represented by annular circles in the figure) be denoted as V_border,ri. Let us define a graph G' such that:

$G' = (V', E') \ \text{where} \ V' = \left(\bigcup_{ri=r1}^{rk} V_{border,ri}\right) \cup V_y; \quad E' = E_{partition} \cup E_y$

This G' corresponds to the yellow patch.
The PageRank for the whole graph G can now be computed by computing PageRank independently for all individual red patches, G_r1 to G_rk, and then computing it for the graph G' as defined above. The partitioning scheme presented above is based purely on the fact that the measure of a node is independent of the measure of nodes that do not point to it, and depends only on the previous nodes pointing to it. Thus this scheme is not restricted or specific to PageRank computation, but can be used for the computation of any measure that is based on such a first order Markov model.

In this work, our contribution is the presentation of such a scheme, which can be used for improving computation on a single large graph and, more importantly, for efficient incremental computation on evolving graphs. A detailed overview of the incremental approach is given in our previous work [18]. The most significant feature of our approach is that it does not result in loss of rank scores, as we make no approximations such as grouping a set of vertices into a single vertex. In the following section, we present algorithms to make such partitions. However, we do not claim that such partitions are optimal, or that the algorithm is an optimal way to partition the graph for the most efficient computation. We show the feasibility of our approach by presenting naïve algorithms that can still be used to improve the computation costs.

4. EFFICIENT COMPUTATION TECHNIQUES
In this section we present efficient computation methods for lossless parallelization of PageRank computation for a static single graph, as well as an approach for lossless incremental computation of PageRank for an evolving graph. Both these approaches are based on the partitioning scheme presented in the previous section.
4.1 Static Single Graph
We present a methodology for parallelizing the PageRank computation for a Static Single Graph, using the partitioning technique presented in the previous section. Assume that we have a graph partitioned into a set of red patches surrounded by a yellow patch. PageRank scores for vertices in the red patches are computed first. The red patches can be treated as independent subgraphs and their PageRank computation is performed in parallel. Next, using the PageRank scores of the peripheral red colored vertices, i.e. red colored vertices with edges crossing over to the yellow patch, we compute the PageRank scores for the vertices in the yellow patch. For the PageRank computation on the red patches and the yellow patch, the user is free to choose any variant (or the naïve version itself) of the PageRank algorithm, as the partitioning scheme results in nothing but a reduction of the problem size and is independent of the algorithm used for PageRank computation. Thus, we have a lossless parallel PageRank computation method for a Static Single Graph based on our partitioning scheme.
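The two-phase computation described above can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the implementation used in the paper: the names `solve_patch` and `dc_pagerank`, the adjacency-dictionary input and the five-vertex example are hypothetical, the partition is assumed to be given and valid, and every vertex is assumed to have at least one out-link. Each patch is solved with the global n and the global out-degrees, so Equation (1) is unchanged; the ranks of border vertices enter the yellow patch as fixed values.

```python
def solve_patch(patch, in_links, out_degree, n, d, fixed, tol=1e-12, max_iter=10000):
    """Iterate Equation (1) only over `patch`; ranks found in `fixed` stay constant."""
    rank = {v: 1.0 / n for v in patch}
    for _ in range(max_iter):
        new_rank = {}
        for p in patch:
            total = 0.0
            for q in in_links.get(p, ()):
                r_q = rank[q] if q in rank else fixed[q]
                total += r_q / out_degree[q]
            new_rank[p] = d / n + (1.0 - d) * total
        delta = sum(abs(new_rank[v] - rank[v]) for v in patch)
        rank = new_rank
        if delta < tol:
            break
    return rank


def dc_pagerank(out_links, red_patches, yellow, d=0.1):
    """Solve each red patch independently, then the yellow patch."""
    n = len(out_links)  # global vertex count, shared by all patches
    out_degree = {v: len(ts) for v, ts in out_links.items()}
    in_links = {}
    for q, targets in out_links.items():
        for p in targets:
            in_links.setdefault(p, []).append(q)
    ranks = {}
    for patch in red_patches:
        # Red patches have no in-links from outside, so nothing is fixed here.
        # These calls are independent and could run in parallel.
        ranks.update(solve_patch(set(patch), in_links, out_degree, n, d, fixed={}))
    # The yellow patch reads the already computed red (border) ranks as constants.
    ranks.update(solve_patch(set(yellow), in_links, out_degree, n, d, fixed=ranks))
    return ranks


# Hypothetical graph: red patches {a, b} and {c}, yellow patch {y1, y2}.
graph = {"a": ["b"], "b": ["a", "y1"], "c": ["c", "y2"], "y1": ["y2"], "y2": ["y1"]}
dc = dc_pagerank(graph, red_patches=[["a", "b"], ["c"]], yellow=["y1", "y2"])
# Treating the whole graph as one patch gives the ordinary computation.
full = dc_pagerank(graph, red_patches=[], yellow=list(graph))
max_diff = max(abs(dc[v] - full[v]) for v in graph)
```

On this example the partitioned and whole-graph results agree up to the convergence tolerance, which is the sense in which the scheme is lossless.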

The user is free to design and use any patch extraction algorithm, provided the patches extracted have the desired properties discussed in Section 3. One can perform the patch extraction during the crawling procedure or as a pre-processing step. For the purpose of showing that such patches indeed exist and can be extracted, we have implemented a naïve algorithm. The following steps briefly describe the algorithm.

Consider that we have three sets of vertices colored red, yellow and black. Red colored vertices are those which belong to some red patch. Yellow colored vertices belong to the yellow patch. The black colored vertices are the unexplored vertices. Initially, since all the vertices are unexplored, there are no red or yellow colored vertices; all the vertices in the graph are black colored.

Step 1 - Randomly pick a black colored vertex.
Step 2 - Perform a reverse BFS on this vertex, i.e. explore all the ancestors of this vertex by traversing along the incoming links. Color all the black vertices encountered red. The reverse BFS continues until no further traversal is possible or a red vertex is encountered.
Step 3 - Label this set of red vertices as a red patch. Select all peripheral vertices, i.e. vertices in this red patch that have edges crossing over to any vertex outside this red patch.
Step 4 - Perform a normal BFS on each of the peripheral vertices, coloring all the black vertices encountered yellow. Each of the BFSs continues until no further traversal is possible or a yellow vertex is encountered. It is not possible for any of the BFSs to encounter a red vertex. Thus, we color all the descendants of each of the peripheral vertices yellow.
Step 5 - We now have a red patch surrounded by a yellow patch region. Repeat the entire procedure from Steps 1-4 (Steps 1-4 are illustrated in Figure 3), each time extracting a different red patch, until all the black vertices are exhausted.

Figure 3. Steps 1 to 4 of the approach

Note that in step 2 it is possible to encounter neither a red vertex from a previously extracted red patch nor a yellow colored vertex. Also, in step 4 it is not possible to encounter a red colored vertex from a previously extracted red patch; however, one may encounter a yellow vertex. It is observed that the red patches extracted in steps 1-5 contain a small fraction of the vertices in the graph. An ideal situation would be one where we have the entire graph partitioned into equal sized red patches, i.e. no yellow patch. This is because we compute PageRank for the red patches in parallel, and the more vertices there are in the red patches, the greater the gain in efficiency. Even though we may not realize the ideal situation, it is still desirable to include as much of the graph as possible in red patches. Therefore, the second part of the algorithm expands the red patches obtained above to include as many vertices as possible.

Step 6 - Pick a red patch, call it R.
Step 7 - Perform a reverse BFS on each of the yellow colored children of the peripheral vertices of this red patch. If during the traversal a red vertex belonging to red patch R is encountered, then all the vertices encountered so far are colored red and included in the red patch R. If a red colored vertex from some other red patch is encountered, then the red patch R cannot be expanded along this path and the reverse BFS is abandoned, leaving all vertices as they were. If one encounters a dead end, i.e. a yellow vertex with no incoming links, then again all vertices encountered so far are colored red and included in the red patch R.
Step 8 - Perform Steps 6 and 7 for each red patch (Steps 6-7 are illustrated in Figure 4).

Figure 4. Steps 6 to 7 of the approach

Step 9 - Repeat Steps 6-8 until no change is observed in the size of any of the red patches, i.e. no red patch can be expanded any more.
Step 10 - Return the list of red patches and the yellow colored vertices.
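The first half of the procedure (Steps 1-5) can be sketched in a few lines. This is a hedged illustration, not the authors' code: the function name `extract_patches` and the example graph are hypothetical, the expansion phase (Steps 6-9) is omitted, vertices are picked in sorted order instead of randomly so that the output is reproducible, and every link target is assumed to appear as a key of the adjacency dictionary.

```python
from collections import deque


def extract_patches(out_links):
    """Steps 1-5 of the naive patch extraction; expansion Steps 6-9 are omitted."""
    in_links = {v: [] for v in out_links}
    for q, targets in out_links.items():
        for p in targets:
            in_links[p].append(q)
    color = {v: "black" for v in out_links}
    red_patches = []
    # Step 1: sorted order is a deterministic stand-in for the random pick.
    for start in sorted(out_links):
        if color[start] != "black":
            continue
        # Step 2: reverse BFS colors every black ancestor red.
        patch = {start}
        color[start] = "red"
        queue = deque([start])
        while queue:
            v = queue.popleft()
            for q in in_links[v]:
                if color[q] == "black":
                    color[q] = "red"
                    patch.add(q)
                    queue.append(q)
        # Step 3: the red vertices found form one red patch.
        red_patches.append(patch)
        # Step 4: forward BFS colors black descendants yellow. Starting from the
        # whole patch is equivalent to starting from its peripheral vertices,
        # because non-peripheral vertices only have red children.
        queue = deque(patch)
        while queue:
            v = queue.popleft()
            for p in out_links[v]:
                if color[p] == "black":
                    color[p] = "yellow"
                    queue.append(p)
        # Step 5: the loop repeats until no black vertex is left.
    yellow = {v for v, c in color.items() if c == "yellow"}
    return red_patches, yellow


# Hypothetical example graph.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"], "e": ["d", "e"]}
red_patches, yellow = extract_patches(graph)
# Every edge that ends in a red patch must also start in that same patch.
closed = all(q in patch
             for patch in red_patches
             for q, targets in graph.items()
             for p in targets if p in patch)
```

For this graph the sketch yields the red patches {a, b} and {e} and the yellow patch {c, d}, and the final check confirms that no red patch receives a link from outside itself.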
The red patches and yellow vertices returned in Step 10 are then used for parallel PageRank computation as explained earlier. After patch extraction, the graph can be treated for dangling vertices by adding a self-loop to each dangling vertex, or one may choose to delete these vertices. The first option has no effect on the patches extracted. The second option may result in certain vertices being deleted; however, it will not violate the properties of the patches extracted. One may also choose to perform these operations before the patch extraction process. Adding edges from a dangling vertex to every other vertex, however, has an adverse effect on our partitioning scheme. In such a case, every vertex will have an incoming link from a dangling vertex. Therefore, every red patch will have to contain these dangling vertices. Since all red patches are disjoint, it is not possible to have more than one red patch. Therefore, in spite of its popularity, this technique for handling dangling vertices will not work with our partitioning scheme.

A good partitioning scheme is one that yields patches with a good profile. By a good profile we mean a set of patches that maximizes the total number of vertices included in red patches, minimizes the size of the largest red patch and minimizes the skew in the red patch sizes. Note that the patch extraction algorithm presented (and implemented by us) is quite naïve and not optimal. For instance, we do not take any steps to prevent skew in the red patch sizes. Also note that the profile of the patches (i.e. properties such as the number of patches, the sizes of patches and the skew in patch size) extracted from a graph depends on how the black colored vertices are picked in step 1. In our case we choose to pick a vertex randomly. We expect more intelligent picking schemes to produce a better patch profile than a random one such as ours. In Section 5 we show that in spite of the sub-optimal patch extraction algorithm used by us, we still get favorable results.

4.2 Evolving Graphs
In this section, we describe the incremental algorithm to compute PageRank. The initial step is to read the graph at a new time instance and determine the vertices that have changed. This does not require additional time, as it can be computed as we read the new graph. Thus, after reading the graph, we can assume that we are given two sets of vertices: one containing the vertices which have changed from the previous time instance and the other containing the vertices that have remained unchanged. Hence, the input to the algorithm is the graph G and the two lists V_c and V_u. The outline of the algorithm is shown in Figure 5. We will now briefly describe each step in the algorithm.

We start by initializing a list V_Q. Recall that a change in a vertex induces a change in the PageRank distribution of all its children, and all such changed vertices are available to us in the queue V_c. A simple traversal method is used to extend this list of changed vertices so that it also includes all descendants of the initial list of changed vertices.
All of these vertices are pushed into the list V_Q. For the remaining vertices there is no change in their PageRank distribution. Their new PageRank is simply obtained by scaling the previous PageRank scores.

IPR(G, V_u, V_c):
Step 1 - Initialize the list V_Q.
Step 2 - Pop a vertex N from V_c.
    2.1 For all the children of N: if a child of N is in list V_u, remove it from V_u and push it in V_c.
    2.2 Push N in V_Q and repeat step 2 till queue V_c is empty.
Step 3 - For each element in list V_u:
    3.1 Take the element and scale the previous PageRank value to get the new PageRank value.
    3.2 Look up whether any of the children of the element of V_u belong to V_Q; if so, remove this element of V_u and copy it in V_b.
Step 4 - Scale border nodes in V_b for stochastic property. Perform Original PageRank(V_Q ∪ V_b).

Figure 5. Incremental PageRank Algorithm

The scaling factor is simply:

$\frac{n(G_{previous})}{n(G_{present})}$ = order of the graph at the previous time instance / order of the graph at the present time instance.

Also note that all those vertices from the set of unchanged vertices that point to a changed vertex will influence the PageRank value of that changed vertex. Hence these too must be included in the computation (as the border list V_b), as their PageRank scores will be required for computing the PageRank scores of the changed vertices. We now perform the original PageRank computation, along with the steps taken to ensure the stochastic property of the transition matrix, on the vertices in V_Q (i.e. the vertices which have changed) to get the new PageRank values for these changed vertices. Thus, we end up localizing the change to a certain sub-graph of the Web which includes all vertices whose PageRank values are affected by the structural changes in the graph, and the basic PageRank algorithm is then performed only on this changed sub-graph. The PageRank value for the rest of the vertices is simply a matter of scaling the previous values.

In terms of cost, Step 2 has a cost of E, where E = E_Q is the number of edges in the partition Q.
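The procedure of Figure 5 can be sketched as follows. This is a simplified illustration under explicit assumptions, not the authors' implementation: the names `in_links_of`, `solve` and `incremental_pagerank` and the example graphs are hypothetical; a vertex is treated as changed if it is new or if its set of in-links or out-links differs between the two snapshots; adjacency lists are assumed to have no duplicate entries and no dangling vertices; and the border values are used directly as fixed inputs to Equation (1) instead of reproducing the border scaling of Step 4.

```python
from collections import deque


def in_links_of(out_links):
    ins = {v: set() for v in out_links}
    for q, targets in out_links.items():
        for p in targets:
            ins[p].add(q)
    return ins


def solve(subset, ins, out_links, n, d, fixed, tol=1e-12, max_iter=10000):
    """Iterate Equation (1) over `subset`; ranks in `fixed` are held constant."""
    rank = {v: 1.0 / n for v in subset}
    for _ in range(max_iter):
        new = {}
        for p in subset:
            s = sum((rank[q] if q in rank else fixed[q]) / len(out_links[q])
                    for q in ins[p])
            new[p] = d / n + (1.0 - d) * s
        delta = sum(abs(new[v] - rank[v]) for v in subset)
        rank = new
        if delta < tol:
            break
    return rank


def incremental_pagerank(old_graph, old_rank, new_graph, d=0.1):
    old_in, new_in = in_links_of(old_graph), in_links_of(new_graph)
    # V_c: new vertices, or vertices whose in- or out-links differ (assumption).
    v_c = deque(v for v in new_graph
                if v not in old_graph
                or set(new_graph[v]) != set(old_graph[v])
                or new_in[v] != old_in[v])
    v_u = set(new_graph) - set(v_c)
    v_q = set()                                  # Step 1
    while v_c:                                   # Step 2
        node = v_c.popleft()
        v_q.add(node)
        for child in new_graph[node]:            # Step 2.1
            if child in v_u:
                v_u.discard(child)
                v_c.append(child)
    # Step 3.1: unchanged vertices are rescaled by n(previous) / n(present).
    scale = len(old_graph) / len(new_graph)
    rank = {v: old_rank[v] * scale for v in v_u}
    # Step 4: PageRank on V_Q only; unchanged in-neighbours act as the border V_b.
    rank.update(solve(v_q, new_in, new_graph, len(new_graph), d, fixed=rank))
    return rank


# Hypothetical snapshots: page c gains a link to a new page e.
old_graph = {"a": ["b"], "b": ["a", "c"], "c": ["d"], "d": ["c"]}
new_graph = {"a": ["b"], "b": ["a", "c"], "c": ["d", "e"], "d": ["c"], "e": ["c"]}
old_rank = solve(set(old_graph), in_links_of(old_graph), old_graph, 4, 0.1, {})
inc = incremental_pagerank(old_graph, old_rank, new_graph)
full = solve(set(new_graph), in_links_of(new_graph), new_graph, 5, 0.1, {})
max_diff = max(abs(inc[v] - full[v]) for v in new_graph)
```

In this example only c, d and e are iterated, while a and b are obtained by multiplying their old scores by 4/5. The result agrees with a full recomputation up to the convergence tolerance, because Equation (1) is linear in the d/n term.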
Next, the PageRank values for the partition P are obtained by scaling the PageRank values with respect to the ranks at the previous time instance. This step requires a cost of V, where V is the number of vertices in partition P. Using these scaled values and the naïve approach, PageRank for the vertices in partition Q is then calculated. This step (including the work required to scale the border vertices) requires a cost of nE + E + V_b, where n is the number of iterations required for the PageRank values to converge and E is again the number of edges in partition Q. Thus, the total cost for incremental PageRank sums to O(2E + V + nE + V_b).

5. EXPERIMENTS AND RESULTS
In this section we present results for our graph partitioning scheme for a static single graph, as well as for the incremental PageRank computation for an evolving graph. For the graph partitioning scheme we present results on the link graph for website as crawled on July 19th, The graph contains a total of edges and vertices. Prior to partitioning, dangling vertices were taken care of by adding self-loops to each of them. The various statistics for patch extraction are presented in Table 1. The center column (labeled "number") provides the actual number of vertices/edges and the next column (labeled "percentage") provides the same information as a percentage of the total number of vertices/edges in the graph. In spite of using a website link graph, which one expects to be denser than the Web graph, and a naïve and suboptimal patch extraction algorithm, we still get a significant portion of the graph (i.e. 63.5%) in the form of red patches, with the largest patch containing 16.94% of the total edges. Note that the sizes in number of edges are more important than the ones in number of vertices, as the PageRank algorithm is O(E). The actual runtime of the patch extraction algorithm was 37 ms, which, if included, would reduce the overall speedup to 1.57 times.
However, the patch extraction is not an optimal implementation, and hence an improved patch extraction could further improve the efficiency. One might also perform patch extraction during the crawling procedure.

Table 1. Statistics for Patch Extraction
Number of Red Patches = 326

                                   Number    Percentage
Total Edges in all Red Patches               %
Total Edges in Yellow Patch                  %
Vertices in Red Patches                      %
Vertices in Yellow Patch                     %
Edges in Largest Red Patch                   %
Vertices in Largest Red Patch                %

The parallel PageRank computation for a Static Single Graph was again run on the link graph for as crawled on July 19th, Due to the unavailability of a parallel platform for carrying out our experiments, we simulate the results of a parallel execution by using the maximum of the runtimes of the PageRank computations of all red patches in place of the runtime of computing PageRank of all red patches in parallel. This is a valid assumption as there is no inter-process communication. Any variant of the PageRank algorithm can be used with our partitioning scheme. For our experiments we chose to use the original, naïve PageRank algorithm. Table 2 summarizes the runtime results for the parallel PageRank computation compared with the same original PageRank. We used a convergence threshold of 10^-8 and a dampening factor of 0.1 for all PageRank computations. To examine the experimental accuracy we computed the L1-norm of the difference between the PageRank score vectors returned by the two methods, which was found to be 4.4 x The small error could arise due to numerical computing issues, and may also depend on the number of iterations and the convergence rate when computing on the whole graph versus computing to convergence on a small subgraph.

Table 2. Divide and Conquer Approach versus Naive Approach

Operation                        Average Run Time (in ms)
Largest Red Patch PageRank       20
Yellow Patch PageRank            17
DC PageRank                      38
Original PageRank                118

DC approach to PR ran 3.1 times faster than original PR.

From Table 2 we can see that our proposed parallel PageRank method ran about 3.1 times faster than the original PageRank. Again we point out that this is in spite of using a sub-optimal partitioning algorithm.
The results indicate that a combination of any PageRank algorithm and our partitioning scheme is more efficient than that same PageRank algorithm by itself. Along with the actual runtimes of the parallel and original PageRank, Table 2 also shows the actual runtimes for the PageRank computation of the largest red patch as well as of the yellow patch.

To test our incremental PageRank approach, we performed experiments on two different web sites: the Computer Science website and the Institute of Technology website at the University of Minnesota. We performed the experiments at different time intervals to study the change and the effect of the incremental computation. For the Computer Science website our analysis was done at time intervals of two days, eight days and ten days. A time interval of two days was used for the Institute of Technology website. In our experiments we also simulated focused crawling by excluding the Web pages that have very low PageRank from our graph construction and PageRank computation. This was to emulate the real world scenario where not all pages are crawled. We wanted to analyze how the incremental approach performs when pages with low PageRank are not crawled. We used the following approximate measure to compare the computational cost of our method versus the naïve method.

Number of Times Faster = Number of Iterations(PR) / (1 + (fraction of changed portion) * Number of Iterations(IPR))

The intuition behind the measure is how fast the convergence threshold is reached when computing PageRank incrementally versus computing PageRank with the naïve method for the whole graph. The convergence threshold chosen in our experiments was 1 x 10^-8. The experimental results are presented in Table 3. These results are from actual experiments conducted on the Computer Science and Institute of Technology websites.
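As a small worked example, the measure above can be written as a one-line function. The function name `times_faster` and its parameter names are illustrative assumptions. The sample call uses the Institute of Technology entry of Table 3, where nothing changed, the incremental method needed 0 iterations and the naïve method needed 11.

```python
def times_faster(iterations_pr, iterations_ipr, changed_fraction):
    """Approximate speedup: iterations(PR) / (1 + changed fraction * iterations(IPR))."""
    return iterations_pr / (1.0 + changed_fraction * iterations_ipr)


# Institute of Technology website: 0% change, 0 incremental iterations, 11 naive ones.
speedup = times_faster(iterations_pr=11, iterations_ipr=0, changed_fraction=0.0)
```

With no change in the graph the denominator reduces to 1, so the measure equals the number of iterations of the naïve method, which gives the factor of 11 reported for that website.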
For the Computer Science website, in the first time interval of eight days there was a significant change in the structure of the website: about 60% of the pages had changed their link structure. We found that such a sea change occurred because a whole subgraph that contained the documentation for Matlab help was removed. The incremental approach nevertheless performed 1.86 times faster than the naïve PageRank. Similarly, for a period of ten days the incremental approach performed around 1.75 times faster. For a period of two days the improvement was 8.65 times. These results are for the case of an unfocused crawl. The results for the focused crawl of the CS website were better. In the first case, when the time interval was eight days, the improvement was 1.9 times, and when the time interval was ten days, the improvement was 1.76 times. For a period of two days the improvement with focused crawling was 9.88 times. This suggests that focused crawling can also improve the computational cost of the incremental algorithm.

The Institute of Technology website is representative of a website that does not change too often. There was no change in the Web structure over a period of two days. Since no change was detected, there was no need to compute the PageRank for the graph at the new time instance, and by our measure the incremental approach was 11 times faster. Since there was no change in the graph structure, the improvements for focused and unfocused crawling are the same.

Table 3. Comparison of results for the Incremental PageRank Algorithm versus the Naïve PageRank Algorithm

Computer Science Website
Focused Crawl
July 19th vs July 27th: percentage of change = %; L1-norm: e-05; NumTimes faster = ; iteration(s) for inc_pagerank; 12 iteration(s) for actual pagerank
July 27th vs July 29th: percentage of change = %; L1-norm: e-07; NumTimes faster = ; iteration(s) for inc_pagerank; 13 iteration(s) for actual pagerank
July 19th vs July 29th: percentage of change = %; L1-norm: e-05; NumTimes faster = ; iteration(s) for inc_pagerank; 12 iteration(s) for actual pagerank
Unfocused Crawl
July 19th vs July 27th: percentage of change = %; L1-norm: e-07; NumTimes faster = ; iteration(s) for inc_pagerank; 12 iteration(s) for actual pagerank
July 27th vs July 29th: percentage of change = %; L1-norm: e-07; NumTimes faster = ; iteration(s) for inc_pagerank; 13 iteration(s) for actual pagerank
July 19th vs July 29th: percentage of change = %; L1-norm: e-07; NumTimes faster = ; iteration(s) for inc_pagerank; 12 iteration(s) for actual pagerank

Institute of Technology Website
Unfocused/Focused Crawl
July 30th vs Aug 1st: percentage of change = 0%; L1-norm: e-07; NumTimes faster = 11; 0 iteration(s) for inc_pagerank; 11 iteration(s) for actual pagerank

6. CONCLUSIONS AND FUTURE DIRECTIONS
In this work, we have followed a divide and conquer approach to partition a graph in a way that enables efficient computation of measures and metrics based on a first order Markov model. We present a theoretical framework to show how such a partitioning scheme can be used to divide the problem into smaller problems. We extend this approach to another important dimension, that of evolving graphs, and show how it can significantly improve the efficiency of computation on large evolving graphs. Our experimental results also show that this approach, even when used with a naïve algorithm, improves the computation significantly.
This approach also leads to other areas of interest and research directions for optimizing the computation further. As we discussed earlier, the two key issues that need to be considered are an optimal partition and an algorithm to obtain such an optimal partitioning. Another emerging challenge is the class of pages, such as Wikipedia and blogs, that change very dynamically. While it is still not clear whether PageRank is the right metric for such changing pages, it should be noted that certain pages change more dynamically than others. Since different portions of the Web change at different rates, it is challenging to keep the PageRank values up to date while optimizing the computation frequency. The study of efficient computation for the changing Web is thus a challenging problem with a large scope for research to address the various issues.

7. ACKNOWLEDGMENTS
This work was supported by Army High Performance Computing Research Center contract number DAAD. The content of the work does not necessarily reflect the position or policy of the government and no official endorsement should be inferred. Access to computing facilities was provided by the AHPCRC and the Minnesota Supercomputing Institute.

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal

More information

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015 CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects

More information

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION International Journal of Computer Engineering and Applications, Volume IX, Issue VIII, Sep. 15 www.ijcea.com ISSN 2321-3469 COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

More information

PageRank and related algorithms

PageRank and related algorithms PageRank and related algorithms PageRank and HITS Jacob Kogan Department of Mathematics and Statistics University of Maryland, Baltimore County Baltimore, Maryland 21250 kogan@umbc.edu May 15, 2006 Basic

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Personalizing PageRank Based on Domain Profiles

Personalizing PageRank Based on Domain Profiles Personalizing PageRank Based on Domain Profiles Mehmet S. Aktas, Mehmet A. Nacar, and Filippo Menczer Computer Science Department Indiana University Bloomington, IN 47405 USA {maktas,mnacar,fil}@indiana.edu

More information

A P2P-based Incremental Web Ranking Algorithm

A P2P-based Incremental Web Ranking Algorithm A P2P-based Incremental Web Ranking Algorithm Sumalee Sangamuang Pruet Boonma Juggapong Natwichai Computer Engineering Department Faculty of Engineering, Chiang Mai University, Thailand sangamuang.s@gmail.com,

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information

Link Analysis. Hongning Wang

Link Analysis. Hongning Wang Link Analysis Hongning Wang CS@UVa Structured v.s. unstructured data Our claim before IR v.s. DB = unstructured data v.s. structured data As a result, we have assumed Document = a sequence of words Query

More information

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Einführung in Web und Data Science Community Analysis Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Today s lecture Anchor text Link analysis for ranking Pagerank and variants

More information

An Improved Computation of the PageRank Algorithm 1

An Improved Computation of the PageRank Algorithm 1 An Improved Computation of the PageRank Algorithm Sung Jin Kim, Sang Ho Lee School of Computing, Soongsil University, Korea ace@nowuri.net, shlee@computing.ssu.ac.kr http://orion.soongsil.ac.kr/ Abstract.

More information

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page Link Analysis Links Web consists of web pages and hyperlinks between pages A page receiving many links from other pages may be a hint of the authority of the page Links are also popular in some other information

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

Local Methods for Estimating PageRank Values

Local Methods for Estimating PageRank Values Local Methods for Estimating PageRank Values Yen-Yu Chen Qingqing Gan Torsten Suel CIS Department Polytechnic University Brooklyn, NY 11201 yenyu, qq gan, suel @photon.poly.edu Abstract The Google search

More information

Searching the Web [Arasu 01]

Searching the Web [Arasu 01] Searching the Web [Arasu 01] Most user simply browse the web Google, Yahoo, Lycos, Ask Others do more specialized searches web search engines submit queries by specifying lists of keywords receive web

More information

Adaptive methods for the computation of PageRank

Adaptive methods for the computation of PageRank Linear Algebra and its Applications 386 (24) 51 65 www.elsevier.com/locate/laa Adaptive methods for the computation of PageRank Sepandar Kamvar a,, Taher Haveliwala b,genegolub a a Scientific omputing

More information

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW ISSN: 9 694 (ONLINE) ICTACT JOURNAL ON COMMUNICATION TECHNOLOGY, MARCH, VOL:, ISSUE: WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW V Lakshmi Praba and T Vasantha Department of Computer

More information

COMP5331: Knowledge Discovery and Data Mining

COMP5331: Knowledge Discovery and Data Mining COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd, Jon M. Kleinberg 1 1 PageRank

More information

Web Structure Mining using Link Analysis Algorithms

Web Structure Mining using Link Analysis Algorithms Web Structure Mining using Link Analysis Algorithms Ronak Jain Aditya Chavan Sindhu Nair Assistant Professor Abstract- The World Wide Web is a huge repository of data which includes audio, text and video.

More information

Lecture 8: Linkage algorithms and web search

Lecture 8: Linkage algorithms and web search Lecture 8: Linkage algorithms and web search Information Retrieval Computer Science Tripos Part II Ronan Cummins 1 Natural Language and Information Processing (NLIP) Group ronan.cummins@cl.cam.ac.uk 2017

More information

Mining Temporally Evolving Graphs

Mining Temporally Evolving Graphs Mining Temporally Evolving Graphs Prasanna Desikan and Jaideep Srivastava Department of Computer Science University of Minnesota, Minneapolis, MN 55414, U.S.A {desikan,srivastava}@cs.umn.edu Abstract Web

More information

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a !"#$ %#& ' Introduction ' Social network analysis ' Co-citation and bibliographic coupling ' PageRank ' HIS ' Summary ()*+,-/*,) Early search engines mainly compare content similarity of the query and

More information

Recent Researches on Web Page Ranking

Recent Researches on Web Page Ranking Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through

More information

An Application of Personalized PageRank Vectors: Personalized Search Engine

An Application of Personalized PageRank Vectors: Personalized Search Engine An Application of Personalized PageRank Vectors: Personalized Search Engine Mehmet S. Aktas 1,2, Mehmet A. Nacar 1,2, and Filippo Menczer 1,3 1 Indiana University, Computer Science Department Lindley Hall

More information

COMP 4601 Hubs and Authorities

COMP 4601 Hubs and Authorities COMP 4601 Hubs and Authorities 1 Motivation PageRank gives a way to compute the value of a page given its position and connectivity w.r.t. the rest of the Web. Is it the only algorithm: No! It s just one

More information

Link Structure Analysis

Link Structure Analysis Link Structure Analysis Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!) Link Analysis In the Lecture HITS: topic-specific algorithm Assigns each page two scores a hub score

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank

More information

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank Page rank computation HPC course project a.y. 2012-13 Compute efficient and scalable Pagerank 1 PageRank PageRank is a link analysis algorithm, named after Brin & Page [1], and used by the Google Internet

More information

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld Link Analysis CSE 454 Advanced Internet Systems University of Washington 1/26/12 16:36 1 Ranking Search Results TF / IDF or BM25 Tag Information Title, headers Font Size / Capitalization Anchor Text on

More information

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages An Enhanced Page Ranking Algorithm Based on eights and Third level Ranking of the ebpages Prahlad Kumar Sharma* 1, Sanjay Tiwari #2 M.Tech Scholar, Department of C.S.E, A.I.E.T Jaipur Raj.(India) Asst.

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material. 1 Contents Introduction Network properties Social network analysis Co-citation

More information

Popularity of Twitter Accounts: PageRank on a Social Network

Popularity of Twitter Accounts: PageRank on a Social Network Popularity of Twitter Accounts: PageRank on a Social Network A.D-A December 8, 2017 1 Problem Statement Twitter is a social networking service, where users can create and interact with 140 character messages,

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Bruno Martins. 1 st Semester 2012/2013

Bruno Martins. 1 st Semester 2012/2013 Link Analysis Departamento de Engenharia Informática Instituto Superior Técnico 1 st Semester 2012/2013 Slides baseados nos slides oficiais do livro Mining the Web c Soumen Chakrabarti. Outline 1 2 3 4

More information

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) ' Sta306b May 11, 2012 $ PageRank: 1 Web search before Google (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.) & % Sta306b May 11, 2012 PageRank: 2 Web search

More information

Using Hyperlink Features to Personalize Web Search

Using Hyperlink Features to Personalize Web Search Using Hyperlink Features to Personalize Web Search Mehmet S. Aktas, Mehmet A. Nacar, and Filippo Menczer Computer Science Department School of Informatics Indiana University Bloomington, IN 47405 USA {maktas,mnacar,fil}@indiana.edu

More information

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks CS224W Final Report Emergence of Global Status Hierarchy in Social Networks Group 0: Yue Chen, Jia Ji, Yizheng Liao December 0, 202 Introduction Social network analysis provides insights into a wide range

More information

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge Centralities (4) By: Ralucca Gera, NPS Excellence Through Knowledge Some slide from last week that we didn t talk about in class: 2 PageRank algorithm Eigenvector centrality: i s Rank score is the sum

More information

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page International Journal of Soft Computing and Engineering (IJSCE) ISSN: 31-307, Volume-, Issue-3, July 01 Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page Neelam Tyagi, Simple

More information

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5) INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)

More information

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group Information Retrieval Lecture 4: Web Search Computer Science Tripos Part II Simone Teufel Natural Language and Information Processing (NLIP) Group sht25@cl.cam.ac.uk (Lecture Notes after Stephen Clark)

More information

Advanced Computer Architecture: A Google Search Engine

Advanced Computer Architecture: A Google Search Engine Advanced Computer Architecture: A Google Search Engine Jeremy Bradley Room 372. Office hour - Thursdays at 3pm. Email: jb@doc.ic.ac.uk Course notes: http://www.doc.ic.ac.uk/ jb/ Department of Computing,

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies Large-Scale Networks PageRank Dr Vincent Gramoli Lecturer School of Information Technologies Introduction Last week we talked about: - Hubs whose scores depend on the authority of the nodes they point

More information

Brief (non-technical) history

Brief (non-technical) history Web Data Management Part 2 Advanced Topics in Database Management (INFSCI 2711) Textbooks: Database System Concepts - 2010 Introduction to Information Retrieval - 2008 Vladimir Zadorozhny, DINS, SCI, University

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank Hans De Sterck Department of Applied Mathematics University of Waterloo, Ontario, Canada joint work with Steve McCormick,

More information

Lecture 17 November 7

Lecture 17 November 7 CS 559: Algorithmic Aspects of Computer Networks Fall 2007 Lecture 17 November 7 Lecturer: John Byers BOSTON UNIVERSITY Scribe: Flavio Esposito In this lecture, the last part of the PageRank paper has

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

Combinatorial Algorithms for Web Search Engines - Three Success Stories

Combinatorial Algorithms for Web Search Engines - Three Success Stories Combinatorial Algorithms for Web Search Engines - Three Success Stories Monika Henzinger Abstract How much can smart combinatorial algorithms improve web search engines? To address this question we will

More information

Mathematical Analysis of Google PageRank

Mathematical Analysis of Google PageRank INRIA Sophia Antipolis, France Ranking Answers to User Query Ranking Answers to User Query How a search engine should sort the retrieved answers? Possible solutions: (a) use the frequency of the searched

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford

More information

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases Roadmap Random Walks in Ranking Query in Vagelis Hristidis Roadmap Ranking Web Pages Rank according to Relevance of page to query Quality of page Roadmap PageRank Stanford project Lawrence Page, Sergey

More information

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Pagerank Scoring. Imagine a browser doing a random walk on web pages: Ranking Sec. 21.2 Pagerank Scoring Imagine a browser doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably

More information

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE Bohar Singh 1, Gursewak Singh 2 1, 2 Computer Science and Application, Govt College Sri Muktsar sahib Abstract The World Wide Web is a popular

More information

Finding Neighbor Communities in the Web using Inter-Site Graph

Finding Neighbor Communities in the Web using Inter-Site Graph Finding Neighbor Communities in the Web using Inter-Site Graph Yasuhito Asano 1, Hiroshi Imai 2, Masashi Toyoda 3, and Masaru Kitsuregawa 3 1 Graduate School of Information Sciences, Tohoku University

More information

CS 6604: Data Mining Large Networks and Time-Series

CS 6604: Data Mining Large Networks and Time-Series CS 6604: Data Mining Large Networks and Time-Series Soumya Vundekode Lecture #12: Centrality Metrics Prof. B Aditya Prakash Agenda Link Analysis and Web Search Searching the Web: The Problem of Ranking

More information

COMP Page Rank

COMP Page Rank COMP 4601 Page Rank 1 Motivation Remember, we were interested in giving back the most relevant documents to a user. Importance is measured by reference as well as content. Think of this like academic paper

More information

Similarity Ranking in Large- Scale Bipartite Graphs

Similarity Ranking in Large- Scale Bipartite Graphs Similarity Ranking in Large- Scale Bipartite Graphs Alessandro Epasto Brown University - 20 th March 2014 1 Joint work with J. Feldman, S. Lattanzi, S. Leonardi, V. Mirrokni [WWW, 2014] 2 AdWords Ads Ads

More information

c 2006 Society for Industrial and Applied Mathematics

c 2006 Society for Industrial and Applied Mathematics SIAM J. SCI. COMPUT. Vol. 27, No. 6, pp. 2112 212 c 26 Society for Industrial and Applied Mathematics A REORDERING FOR THE PAGERANK PROBLEM AMY N. LANGVILLE AND CARL D. MEYER Abstract. We describe a reordering

More information

A project report submitted to Indiana University

A project report submitted to Indiana University Sequential Page Rank Algorithm Indiana University, Bloomington Fall-2012 A project report submitted to Indiana University By Shubhada Karavinkoppa and Jayesh Kawli Under supervision of Prof. Judy Qiu 1

More information

A brief history of Google

A brief history of Google the math behind Sat 25 March 2006 A brief history of Google 1995-7 The Stanford days (aka Backrub(!?)) 1998 Yahoo! wouldn't buy (but they might invest...) 1999 Finally out of beta! Sergey Brin Larry Page

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Reading Time: A Method for Improving the Ranking Scores of Web Pages

Reading Time: A Method for Improving the Ranking Scores of Web Pages Reading Time: A Method for Improving the Ranking Scores of Web Pages Shweta Agarwal Asst. Prof., CS&IT Deptt. MIT, Moradabad, U.P. India Bharat Bhushan Agarwal Asst. Prof., CS&IT Deptt. IFTM, Moradabad,

More information

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 19, Issue 1, Ver. III (Jan.-Feb. 2017), PP 01-07 www.iosrjournals.org PageRank Algorithm Albi Dode 1, Silvester

More information

Social Network Analysis

Social Network Analysis Social Network Analysis Giri Iyengar Cornell University gi43@cornell.edu March 14, 2018 Giri Iyengar (Cornell Tech) Social Network Analysis March 14, 2018 1 / 24 Overview 1 Social Networks 2 HITS 3 Page

More information

Part 1: Link Analysis & Page Rank

Part 1: Link Analysis & Page Rank Chapter 8: Graph Data Part 1: Link Analysis & Page Rank Based on Leskovec, Rajaraman, Ullman 214: Mining of Massive Datasets 1 Graph Data: Social Networks [Source: 4-degrees of separation, Backstrom-Boldi-Rosa-Ugander-Vigna,

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Slides based on those in:

Slides based on those in: Spyros Kontogiannis & Christos Zaroliagis Slides based on those in: http://www.mmds.org A 3.3 B 38.4 C 34.3 D 3.9 E 8.1 F 3.9 1.6 1.6 1.6 1.6 1.6 2 y 0.8 ½+0.2 ⅓ M 1/2 1/2 0 0.8 1/2 0 0 + 0.2 0 1/2 1 [1/N]

More information
