On the peninsula phenomenon in web graph and its implications on web search


Computer Networks 51 (2007)

Tao Meng *, Hong-Fei Yan
Lab of Computer Networks and Distributed System, Department of Computer Science and Technology, Peking University, Beijing, China

Received 26 April 2005; received in revised form 22 March 2006; accepted 17 April 2006. Available online 22 May 2006.
Responsible Editor: R. Boutaba
* Corresponding author. E-mail addresses: mengtao@net.pku.edu.cn (T. Meng), yhf@net.pku.edu.cn (H.-F. Yan).

Abstract

Web masters usually place certain web pages, such as home pages and index pages, in front of others. Under such a design, it is necessary to go through some pages to reach the destination pages, much as an inner town of a peninsula can only be reached through the towns at its edge. In this paper, we validate that peninsulas are a universal phenomenon in the World-Wide Web, and clarify how this phenomenon can be used to enhance web search and to study web connectivity problems. For this purpose, we model the web as a directed graph and give a precise definition of peninsulas based on this graph. We also present an efficient algorithm to find web peninsulas. Using data collected from the Chinese web by the Tianwang search engine, we examine the distribution of peninsula sizes and their correlations with the PageRank values, outdegrees, and indegrees of the taches that tie them to outside vertices. The results show that the peninsula structure of a web graph can greatly expedite the computation of PageRank values, and that it significantly affects the link extraction capability and information coverage of web crawlers.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Peninsula; Web graph; Web search; Power law; PageRank; Link extraction capability; Information coverage

1. Introduction

The World-Wide Web has continued its remarkable, seemingly exponential growth since its inception. The rapid increase in the number of hosts, web pages, and link relations has put a great deal of pressure on web information systems such as general search engines and web archives, which encourages us to find good tools and policies for efficiently managing the huge amount of Internet information.

The basic graph model of the web views web pages as nodes and hyperlinks as arcs, so the web is in effect a directed graph. In recent years, many researchers have studied the link structures of web graphs, for example, the shape of the web and the distribution of indegrees of web pages.

These studies have already played a key role in the design and implementation of web-based applications, and many useful tools have been built on them, such as web spiders that follow the arcs to download web pages, and the Google PageRank algorithm [13], which analyzes the relative importance of pages using the intrinsic link structure.

This paper discusses the phenomenon known as peninsula in web graphs, which we observed in our work on the Tianwang search engine [22] and the Web Infomall system [23]. A peninsula in a web graph is a set of vertices, each of which is reachable only through a common node in the set called the tache, just as a geographical byland is accessible only from where it joins the mainland. We believe the universal existence of peninsulas has two main causes. First, many websites deliberately place content behind front pages so that people can only access it through those front pages. For example, many BBS-style websites require a session identifier in URLs to visit inner pages, and many sites use a fragment of JavaScript to jump from their home pages to the content inside; it is usually too hard for spiders to parse out such script-style URLs. Second, web resources are usually organized in a tree-like hierarchical directory, and most web pages outside the directory link only to its index page instead of to the contents inside the directory.

In our research on search engines, we studied the peninsula phenomenon for the following two reasons:

1. Information coverage: One of the important metrics of a search engine is information coverage, so it is important to understand how a loss of link extraction capability affects it; in particular, how an unparsable URL eventually harms information coverage. The peninsula phenomenon is the key to answering this question: if a lost page happens to be the tache of a peninsula, all other vertices in the peninsula are lost too.

2. PageRank computation: Peninsulas can be viewed as relatively isolated sets of nodes. Both the PageRank algorithm and the HITS algorithm [9] are based on the link structure of a web graph, and both are solved by power iteration for the principal eigenvector of a matrix. Based on the peninsula concept, a web graph can be grouped into natural blocks, so the power iteration can be greatly expedited when the web graph is huge.

The rest of this paper is structured as follows. In Section 2, we give a precise definition of peninsulas in web graphs and present an efficient method for finding the peninsula of a given base node. In Section 3, we discuss experimental results on the distribution of peninsula sizes and their relationships with three link-attributes (PageRank values, outdegrees, and indegrees) of taches; the experiments are based on 251 million web pages collected by the Tianwang search engine. In Section 4, we explain the applications of peninsulas in web search. In Section 5, we conclude and discuss future work.

1.1. Related work

1.1.1. Structure of web graphs

The structure of web graphs has been deeply analyzed both theoretically and experimentally. Levene et al. [14] presented a stochastic model for the evolution of the web and proved that both indegrees and outdegrees of web pages obey a power law: f_i ∝ C·i^(−(1+q)).
Another similar theoretical study [11] used the rate-equation method to prove the power law. Several authors have reported experimental results on the power law based on web crawls: Kumar et al. [10] examined a web crawl of about 40 million pages from 1997 and verified the power law of indegrees and outdegrees; similar work was done by Barabasi and Albert [3]. Our earlier work [16] also showed experimentally that many link-attributes, such as PageRank values and indegrees, obey the power law. The experiments reported in this paper show that the sizes of peninsulas obey the power law as well.

1.1.2. Web graph shape and PageRank

The shape of a web graph was studied in [1], which proposed the famous bow-tie structure depicted in Fig. 1. In that work, the web is divided into five main parts: SCC (the strongly connected component), IN (pages that can reach the SCC but cannot be reached from it), OUT (pages that are reachable from the SCC but do not link back to it), TENDRILS (pages that can neither reach the SCC nor be reached from it), and DISC (disconnected components). Their proportions are 27.7%, 21.3%, 21.2%, 21.5%, and 8.2%, respectively.

[Fig. 1. Connectivity of the web.]

This means that if we follow links to crawl the web, we can reach at most only about 70% of all pages. The bow-tie structure is also used in [2] to speed up the power iteration in PageRank computation. Kamvar et al. [20] presented a method that divides the web graph into host-blocks according to the strong local link relations inside websites, which expedites the iteration as well. This paper introduces a method that outperforms both of the above.
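For readers who want to reproduce the decomposition in miniature, the sketch below (an illustration of the idea in [1], not the original measurement code) classifies the nodes of a small made-up graph into the SCC core, IN, OUT, and the remainder; it assumes the networkx Python library.

```python
# A minimal sketch of the bow-tie decomposition of [1] on a toy graph:
# SCC core, IN (can reach the core), OUT (reachable from the core),
# and OTHER (tendrils plus disconnected pieces). Edges are illustrative.
import networkx as nx

G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"),  # 3-node core cycle
                ("x", "a"),                           # x reaches the core: IN
                ("c", "y"),                           # reachable from core: OUT
                ("z", "w")])                          # disconnected piece

scc = max(nx.strongly_connected_components(G), key=len)  # largest SCC = core
core = next(iter(scc))
out_side = nx.descendants(G, core) - scc  # reachable from the SCC
in_side = nx.ancestors(G, core) - scc     # can reach the SCC
other = set(G) - scc - out_side - in_side # tendrils + disconnected

print(sorted(scc), sorted(in_side), sorted(out_side), sorted(other))
```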

1.1.3. Information coverage of crawlers

In our previous work [15], we proposed a model for evaluating the information coverage of search engines, defining quantity coverage for ordinary pages and quality coverage for important pages. Cho and Garcia-Molina [5] examined the relationship between crawling modes and information coverage. They suggested that in a firewall mode, where URLs are not transparent to servers and only a small number of crawling processes run in parallel, information coverage is not significantly affected by the number of seeds (the URLs from which the crawl starts). This result is incomplete, since it used only 40 million of the roughly 2 billion static pages [7] and did not directly consider link extraction capability.

1.2. Our main results

In this paper, we report a large-scale experiment on peninsulas based on the web data collected by our crawlers. We achieve the following results:

1. We discover the universal peninsula phenomenon in web graphs and give it a precise definition. We propose an efficient algorithm to find the peninsula with a given node as the tache.

2. Based on about 251 million pages and the hyperlinks among them, we study the distribution of peninsula sizes and find that it follows the power law. We also perform experiments on the relationships between the sizes of peninsulas and the indegrees, outdegrees, and PageRank values of their taches.

3. For peninsulas of average size, we find that a slight loss in link extraction capability significantly affects the final information coverage. The higher the PageRank values of pages, the less significant the effect on their coverage.

4. The locality of link relations in peninsulas can be used to greatly expedite the PageRank computation. Our analysis shows that peninsulas can be used to divide web graphs into blocks and complete the power iteration more efficiently than previous work.

2. Modeling the peninsula structure of the web

The definition of a web peninsula differs from that of a geographical one. First, since a web graph is directed, a set of pages forming a peninsula means that each page is reachable only from the tache, irrespective of its own outgoing links. Second, we limit the connection between a peninsula and outside vertices to the tache, a single node, whereas the border of a real peninsula on the earth is usually a curve.

2.1. Basic definition

We model the web as a directed graph; Fig. 2 is a simple example. All the symbols related to peninsulas and web graphs are listed in Table 1. Since each peninsula has a tache, we define it with the tache given beforehand. In Table 1, we denote the peninsula with tache v as P_v, and define it as follows:

[Fig. 2. A web graph example.]

Table 1
Symbols of peninsula and web graph

Symbol   Definition
G        Directed web graph
V        Set of vertices in G
od_v     Outdegree of node v
id_v     Indegree of node v
ol_v     Set of vertices that node v links to
il_v     Set of vertices linking to node v
P_v      Peninsula whose tache is node v
P^m_v    The maximal peninsula with tache node v

Definition 1. A peninsula associated with a given node v is a set of nodes that v can exclusively reach by following hyperlinks. Assuming P is a collection of vertices and G is the web graph, if P satisfies the four conditions

1. P ⊆ G;
2. v ∈ P;
3. for all q ∈ G − P and all p ∈ P with p ≠ v, there is no path in G from q to p that does not go through v;
4. for all p ∈ P with p ≠ v, there is a path from v to p;

we call P a peninsula and v the tache of P. We refer to the peninsula whose tache is v as the peninsula of v (equivalently, the peninsula with v), and define the size of a peninsula as the number of nodes it contains. For example, in Fig. 2, peninsulas with taches A and B are: P_A = {A, C, F}, {A, C}, or {A}; P_B = {B, E, H, J} or {B, E, H}, and so on.

As the example shows, there may be more than one peninsula for an appointed tache. We therefore define the maximal peninsula as follows:

Definition 2. Given a vertex v as the tache, let its peninsulas be P_v1, P_v2, ..., P_vn; we call the one with the largest size the maximal peninsula of v, and denote it P^m_v.

For example, in Fig. 2, the maximal peninsulas with the taches A, B, D, F, G, I are: P^m_A = {A, C, F}; P^m_B = {B, E, H, J}; P^m_D = {D}; P^m_F = {F}; P^m_G = {G}; P^m_I = {I}.
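Definition 1 can be checked mechanically. The following minimal Python sketch (ours, not the authors'; the toy graph merely imitates the flavor of Fig. 2) tests the four conditions directly with breadth-first reachability.

```python
# Checks Definition 1: P is a peninsula with tache v iff v reaches every node
# of P, and no node of P except v is reachable from outside P without v.
from collections import deque

def reachable(graph, start, banned=frozenset()):
    """Nodes reachable from `start` by BFS, never stepping onto `banned`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt not in banned:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def is_peninsula(graph, v, P):
    P = set(P)
    if v not in P:                         # condition 2
        return False
    if not P <= reachable(graph, v):       # condition 4: v reaches all of P
        return False
    # condition 3: no outside node may reach P - {v} on a path avoiding v
    for q in set(graph) - P:
        if (P - {v}) & reachable(graph, q, banned={v}):
            return False
    return True

web = {"A": ["C"], "C": ["F"], "F": ["C"], "B": ["E"],
       "E": ["H"], "H": ["J"], "J": []}
print(is_peninsula(web, "A", {"A", "C", "F"}))   # True
print(is_peninsula(web, "B", {"B", "E", "H"}))   # True (non-maximal)
print(is_peninsula(web, "A", {"A", "C"}))        # False: F links back to C
```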

2.2. Some properties of peninsula

Several important properties of the peninsula of an appointed node are listed below.

Theorem 1. The maximal peninsula of a node is unique.

This is easy to prove: if a given tache had multiple maximal peninsulas, their union would obviously be a peninsula as well, contradicting the fact that a maximal peninsula has the largest size.

Theorem 2. All peninsulas of a node are subsets of its maximal peninsula.

The same argument applies, because the maximal peninsula must be the union of all the other peninsulas. When peninsula is used in the rest of the paper, we mean the maximal peninsula unless otherwise specified.

Theorem 3. If a node p links to another node q, and ol_p ⊆ P^m_q, then P^m_q ⊆ P^m_p.

Since p links to q, from p we can reach every node in the maximal peninsula of q, going through p or through nodes of that peninsula, so the proposition holds. It is useful for estimating the relationship between the peninsulas of nodes with link relations. Due to the universal locality of hyperlinks between web pages [8,4], that is, pages tend to link to others nearby or in the same website, the maximal peninsula of a vertex is tightly related to the peninsulas of the nodes that link to it.

2.3. Algorithms to find peninsulas

There are two key issues in finding the maximal peninsula of an appointed tache by following links:

1. How to deal with back links? This issue is illustrated in Fig. 3: when trying to find the peninsula of A, we find that C is already linked by a temporarily outer node F; however, the peninsula of A actually contains F.

[Fig. 3. There is a back-link from F to C.]

2. Finding the peninsula of a given node usually requires exhausting all the nodes it can reach, called its out-component in [11]. Due to the enormous size of the web, the out-component is usually very large, so a fast algorithm is required. As shown in Fig. 4, we would have to consider every node to find the peninsula of X.

[Fig. 4. From X, each node is reachable.]

To solve the first problem, we use a back-trace method: take all nodes that the tache reaches as a base collection, and then remove the nodes that are accessible from outside. This approach is described in Algorithm 1.

Algorithm 1. An algorithm to find the maximal peninsula P^m_v of a node v.

1. Initialize S and S′ as empty sets.
2. Make a breadth-first walk from v in G and put all the reached nodes into S:
   (a) S = {v};
   (b) for each node p in S: S = S + ol_p; if the size of S increases, go to (b).
3. S′ = G − S; S = S − {v}.
4. Main loop: while S is not empty, for each node p in S: if p is linked by some node in S′, set S = S − {p} and S′ = S′ + {p}, and go to (4).
5. P^m_v = S + {v}; return it.

There are several ways to speed this algorithm up. For example, in step (4), to check whether p is linked by some node in S′, we can construct in step (2) a sub-graph G′ of G containing all the nodes that v reaches, and simply compare the indegrees of p in G′ and in G: if they are equal, p is linked only by nodes in S.

To solve the second problem, we present an improved algorithm based on Algorithm 1.

Algorithm 2. An approximate algorithm to find the maximal peninsula P^m_v of a node v, given a limit N.

1. Initialize S and S′ as empty sets.
2. Make a breadth-first walk from v in G and put the first N nodes reached into S.
3. S′ = G − S; S = S − {v}.
4. Main loop: while S is not empty, for each node p in S: if p is linked by some node in S′, set S = S − {p} and S′ = S′ + {p}, and go to (4).
5. P^m_v = S + {v}; return it.
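To make the pseudocode concrete, here is a minimal Python sketch of Algorithm 2 as we read it (with N large enough to exhaust the out-component, it degenerates into Algorithm 1). It is an illustrative reconstruction, not the authors' implementation, and the toy graph is made up.

```python
# Algorithm 2 sketch: BFS from the tache v collects at most N candidates,
# then any candidate linked from outside the candidate set is evicted
# until a fixpoint is reached; the tache itself is never evicted.
from collections import deque

def maximal_peninsula(graph, v, N=200_000):
    # Step 2: breadth-first walk from v, keeping the first N nodes in S.
    S, queue = {v}, deque([v])
    while queue and len(S) < N:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in S and len(S) < N:
                S.add(nxt)
                queue.append(nxt)
    # Step 4: evict nodes linked from outside S, until nothing changes.
    changed = True
    while changed:
        changed = False
        outside = set(graph) - S
        for p in list(S - {v}):
            if any(p in graph.get(q, []) for q in outside):
                S.remove(p)
                outside.add(p)
                changed = True
    return S  # an estimate of (usually exactly) the maximal peninsula P^m_v

web = {"A": ["C"], "C": ["F"], "F": ["C"], "B": ["C"], "X": []}
print(maximal_peninsula(web, "A"))  # {'A'}: C is linked from outside (B)
```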

In Algorithm 2, we set a limit on the initial vertex collection to be considered, for two main reasons. First, because the main cause of a peninsula is that websites place their content behind home pages or other index-type pages, the size of a peninsula usually does not exceed the number of pages in the tache's site, so we can limit the search area to a website. Second, our main purpose is to prove the universal existence of peninsulas; if we cannot find the maximal peninsula, we obtain a non-maximal peninsula as a trade-off between accuracy and performance. It is an estimate of the maximal peninsula, and its size is a lower bound on the size of the maximal peninsula. Therefore, using Algorithm 2, we find the maximal peninsula most of the time, and non-maximal ones when the trade-off between speed and accuracy requires it.

The limit should be set to the typical size of a large website. Based on the data given in the next section, most hosts in China have fewer than 200,000 static pages, so 200,000 is set as the limit for Algorithm 2.

To evaluate the performance of the two algorithms, 1533 web pages were selected from the 251 million, and the peninsula of each was computed with both algorithms. The results show that Algorithm 1 has very poor efficiency, as we predicted: it is far slower on average than Algorithm 2, and its memory usage is on average 71.4 times that of Algorithm 2. When precision is considered, Algorithm 1 finds the maximal peninsula for every node, while Algorithm 2 finds 1439 (93.87%) of the maximal peninsulas, plus 59 (3.85%) non-maximal peninsulas whose sizes are very close to those of the latent maximal peninsulas. The accuracy of Algorithm 2 is thus about 97.72%, which is tolerable for most applications; even for the remaining 2.28%, Algorithm 2 still computes a lower bound on the size of the latent maximal peninsula. Therefore, Algorithm 2 is used to find peninsulas hereafter.

3. Web experiment setup and results

We reconstructed the original web graph with the help of crawlers. In this section, we first introduce the crawlers we use and the data collection sampled from the Chinese web, then present our methodology for studying peninsulas, and finally the results.

3.1. Tianwang crawlers and experimental data

The Tianwang system [22] is the largest non-commercial search engine in China; it has been running since 1997 and serves over 200,000 daily users. Its basic design and implementation are described in [24]. Currently, its crawler is powered by five Dell PowerEdge 1650 servers, each with two Intel Pentium III 1.1 GHz processors and 2 GB of memory. It can download about 1 million pages per day in incremental mode and 12 million pages per day at peak speed. All pages and other digital contents are stored in Web Infomall [23], as described in [12].

The experimental data were collected as follows. In the December 2002 crawl, we collected about 105 million pages starting from about 1 million seed pages. To cover isolated information unreachable through hyperlinks, we automatically scanned large numbers of IP addresses that supplied a web service on port 80 and used them as seeds. According to our evaluation [15], such a crawl covers at least 37% of all the information in the Chinese web. In April 2004, we used the 105 million pages as seeds to download pages from the Chinese web, and collected 251,763,574 static pages, i.e., pages whose URLs do not contain question marks.
These data form the basis of our experiments. According to the CNNIC report [6], there were about 312 million static pages in the Chinese web by December 2003, so our data accounted for about 80.5% of them. It is therefore a fair sample, large enough to represent the World-Wide Web in China.

The analysis of the data stored in Web Infomall was performed on a machine with 4 Intel Xeon 1.9 GHz CPUs and 16 GB of memory. In total we stored about 4 billion hyperlinks, occupying 15 gigabytes after the URLs were transformed into integer identifiers. All these link relations were kept in main memory using TMPFS (Temporary File System), which supports 64-bit files. We reconstructed the web graph and computed the indegree, outdegree, and PageRank value of each node. For any given node, we can then find its maximal peninsula using Algorithm 2.

3.2. Experiments and results

We randomly selected about 1 million nodes and found their peninsulas using Algorithm 2. The nodes were selected with a pseudo-random integer generator, as introduced in [19], producing pseudo-random integers between 0 and 251,763,574.
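For reference, [19] describes the Park-Miller "minimal standard" generator, a Lehmer generator with modulus 2^31 − 1 and multiplier 16807. The sketch below is a plain reading of that generator; the seed value and the modulo mapping onto the node-id range are our own simplifications.

```python
# Park-Miller minimal standard generator: x_{k+1} = 16807 * x_k mod (2^31 - 1).
# Raw outputs lie in [1, M-1]; we fold them onto the node-id range here,
# which introduces a tiny bias that is harmless for this illustration.
M = 2**31 - 1      # Mersenne prime modulus 2147483647
A = 16807          # multiplier recommended by Park and Miller

def minstd(seed=12345):
    x = seed
    while True:
        x = (A * x) % M
        yield x

gen = minstd()
node_ids = [next(gen) % 251_763_574 for _ in range(5)]  # sampled node ids
print(node_ids)
```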

The statistical results of the experiment are summarized below.

3.2.1. Size distribution of peninsulas

As mentioned before, even with Algorithm 2 it is very time consuming to find the maximal peninsulas of all 251,763,574 nodes, so a subset of samples is used for the analysis. Selecting the samples and finding their peninsulas took about 10 days; in the end, 997,965 pages were selected and each of their peninsulas was found. The running average of the peninsula size is illustrated in Fig. 5, which shows that the mean of the peninsula sizes converges as the sample size approaches 1 million; similar results hold for the variance of the sizes. Therefore, we take such a sample as representative of all the nodes.

[Fig. 5. Relation between sample size (200K to 1M) and average peninsula size. As the sample size increases, the average peninsula size converges.]

The size distribution of peninsulas is shown in Fig. 6. The distribution obviously obeys a power law, and we fit the plot as

f(x) = 0.9106 · x^(−1.913),

where x stands for the peninsula size and f(x) for its proportion. The 95% confidence bounds place the exponent between about 1.910 and 1.916. We also find that the average size of a peninsula is 15.6; given the randomness of the sample selection, this value shows that the peninsula is a universal phenomenon.

[Fig. 6. Distribution of sizes of peninsulas. The vertical axis stands for the quantity of peninsulas, and the horizontal axis for the size of peninsulas.]

In fact, the actual average peninsula size in the real web is larger than 15.6, and the real exponent in the fitting function is also larger than 1.913. This is because (1) as stated before, Algorithm 2 sometimes computes only a lower bound on the size of a maximal peninsula; and (2) because of the difficulty of parsing URLs tokenized in JavaScript or other scripts, the dataset misses those pages as well as the URLs they link to, so some real peninsulas are not in our dataset. The peninsula size is therefore underestimated.

As shown in Table 2, we further divide the peninsulas into five classes, with proportions of 91.06%, 5.4%, 2.25%, 0.95%, and 0.34%, respectively. The table shows that nearly 9% of all nodes in a web graph link exclusively to some other nodes.

Table 2
Proportions of peninsulas with different sizes

Peninsula size:  1 (no)   2–10 (small)   11–100 (medium)   101–1000 (large)   >1000 (huge)
Proportion:      91.06%   5.4%           2.25%             0.95%              0.34%

In summary, the experimental results show that the peninsula phenomenon is universal: nearly 9% of all pages on the web link exclusively to a page collection, and the sizes of peninsulas obey a power-law distribution whose exponent is at least 1.913.
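The fitting step can be illustrated as follows: a least-squares line in log-log space recovers the coefficient and exponent of f(x) = c · x^(−b). The sketch uses a small made-up sample rather than the experimental data.

```python
# Fit f(x) = c * x^(-b) to a peninsula-size histogram by linear regression
# in log-log space. `sizes` is a tiny synthetic sample; the real experiment
# used ~1M sampled nodes and obtained f(x) = 0.9106 * x^(-1.913).
import numpy as np

sizes = np.array([1]*9106 + [2]*1800 + [5]*300 + [20]*60 + [100]*10)
values, counts = np.unique(sizes, return_counts=True)
freqs = counts / counts.sum()                 # empirical proportions f(x)

slope, intercept = np.polyfit(np.log(values), np.log(freqs), 1)
c, b = np.exp(intercept), -slope
print(f"f(x) ~= {c:.4f} * x^(-{b:.3f})")
```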

The average size of peninsulas is at least 15.6, which implies that one page missed during crawling results in 15.6 missing pages on average. Only about 0.34% of all peninsulas are larger than 1000 nodes, and the mean size is about 15.6 or more; these sizes are far below 200,000, the limit we normally use for Algorithm 2. Therefore, we can reduce the limit of Algorithm 2 to improve its speed with an acceptable loss of precision.

3.2.2. Correlation between the size of a peninsula and the three link-attributes of its tache

It is important to check whether there is any relationship between the peninsula of a node and its link-attributes. For example, it is very likely that the home pages of websites or directories are the taches of large peninsulas, and such pages usually have large PageRank values, so we want to test this relationship. We obtained the three link-attributes of the one million randomly selected nodes and computed their correlations with the sizes of the peninsulas; the result is shown in Table 3.

[Table 3. Correlations between the three link-attributes (PageRank value, indegree, outdegree) of a tache and the size of its peninsula; all three values are positive and close to 0.]

From Table 3 we can see that the correlations are close to 0, indicating that there is almost no overall relationship between the PageRank values, indegrees, or outdegrees of taches and their peninsula sizes. Furthermore, the correlations are positive, suggesting that peninsula size tends to increase along with the three link-attributes. This positive relationship is further analyzed by plotting the three link-attributes against peninsula size, as shown in Fig. 7.

[Fig. 7. Change of the three link-attributes (average outdegree, average indegree, average PageRank value) as the peninsula becomes large.]

We can conclude from Fig. 7 that as the size of a peninsula increases, all three link-attributes of its tache increase at the same time. However, once the size of a peninsula reaches a certain threshold, about 50 in the middle subplot for example, the positive correlation begins to disappear. That is, for small peninsulas the sizes are strongly related to the three link-attributes of their taches, but for very large peninsulas there is no obvious relationship. Such a threshold probably arises because a very high PageRank value, outdegree, or indegree is no longer significant when comparing the relative importance of highly ranked pages.

3.2.3. Relationship between peninsulas of nodes linking to each other

To test the possibly tight relationship between the peninsulas of linking nodes, we randomly selected 923,572 link relations out of the 4 billion and computed peninsulas for all the linking-from and linking-to nodes using Algorithm 2. The results are given in Table 4.

[Table 4. Correlations between the size of a peninsula and the three link-attributes of the nodes its tache links to, and between the size of a peninsula and the three link-attributes of the nodes that link to its tache. If A links to B, we call A a parent node and B a child node.]

From Table 4, we can see that the size of the peninsula of a node is highly related to the sizes of the peninsulas of the nodes it links to, with a strong positive correlation. In addition, there is little correlation between the PageRank values, indegrees, or outdegrees of parent nodes and the peninsula sizes of child nodes, or vice versa; these correlations are negative.

In summary, the experimental results show the following. The peninsula sizes of linking nodes are highly correlated.
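The correlation computation itself is straightforward; the sketch below shows the shape of such an analysis with synthetic, independent arrays (so the correlations come out near zero, qualitatively like Table 3). It is an illustration, not the experimental pipeline.

```python
# Pearson correlations between peninsula size and the tache's PageRank,
# indegree, and outdegree, on synthetic heavy-tailed data standing in for
# the one million sampled nodes of the experiment.
import numpy as np

rng = np.random.default_rng(0)
size = rng.pareto(1.9, 10_000) + 1           # synthetic peninsula sizes
pagerank = rng.pareto(2.1, 10_000) + 1e-6    # synthetic link-attributes
indegree = rng.pareto(2.1, 10_000)
outdegree = rng.pareto(2.5, 10_000)

for name, attr in [("pagerank", pagerank), ("indegree", indegree),
                   ("outdegree", outdegree)]:
    r = np.corrcoef(size, attr)[0, 1]
    print(f"corr(peninsula size, {name}) = {r:+.3f}")
```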

This can be understood through an example: if A is the tache of a large peninsula and B links to A, then B probably also has a large peninsula. Due to the locality of hyperlinks, the pages linked by B or linking to A are probably in the peninsula of A. This implies that we can expand a peninsula by trying to add the nodes that link to its tache.

The correlations between peninsula size and the PageRank value, indegree, and outdegree of a node containing links to its tache are all negative. As high PageRank values or degrees indicate the fame or importance of pages, we conclude that the more famous or important a page is, the smaller the peninsulas of the pages it links to or is linked from tend to be. In particular, pages linking to a famous page tend to have small peninsulas.

4. Implications on web search

The universal existence of peninsulas has two direct applications: first, we can study the loss in information coverage caused by a quantitative disability in link extraction capability; second, the local link structure of a peninsula can expedite the computation of PageRank values.

4.1. Implications on web crawling and information coverage

If every parsed URL is accessed, the final information coverage of a web spider is determined solely by its link extraction capability. Because of the difficulty of exactly parsing URLs out of a fragment of script, most spiders give up on them. Additionally, as link parsers such as the DUE in Compaq's Mercator [18] are usually overloaded, spiders often have to discard some URLs at the end of a crawl. So if a small portion of pages is not parsed out or downloaded, we should ask how much loss this causes to the final information coverage. The problem is: if 1% of the link extraction capability is lost, how much loss results in the information coverage for pages of different importance?

The importance of pages is difficult to rate exactly. For example, peninsula web pages cannot be accessed directly without going through the front page, which implies that they should not be accorded the same importance as the front page in web search. We use the PageRank algorithm, a widely used global method, to rank the pages in our experiment as an estimate of their relative importance, and we divide the pages into six categories by PageRank value: the top 1%, top 5%, top 10%, top 20%, top 50%, and top 100%.

To simulate a p% loss of link extraction capability, we randomly select ⌊od × (100 − p)%⌋ URLs out of each page with outdegree od. We perform a breadth-first walk in the reconstructed web graph; the experimental results of the effect of a 2% link extraction capability loss are shown in Fig. 8.
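The simulation can be sketched as a lossy breadth-first crawl in which a page of outdegree od yields only ⌊od × (100 − p)%⌋ randomly chosen links. The code below is our illustration with a made-up graph; note that pages with outdegree 1 then lose their single link, which is exactly the effect discussed below.

```python
# Breadth-first crawl with a simulated p% loss of link extraction
# capability: each crawled page yields floor(od * (100 - p) / 100)
# randomly chosen out-links instead of all od of them.
import random
from collections import deque

def crawl_with_loss(graph, seeds, p, seed=0):
    rng = random.Random(seed)
    crawled, frontier = set(seeds), deque(seeds)
    while frontier:
        page = frontier.popleft()
        links = graph.get(page, [])
        kept = rng.sample(links, len(links) * (100 - p) // 100)  # lossy parse
        for url in kept:
            if url not in crawled:
                crawled.add(url)
                frontier.append(url)
    return crawled

web = {0: [1, 2, 3, 4], 1: [5], 2: [6], 5: [7], 6: [7]}
full = crawl_with_loss(web, [0], p=0)
lossy = crawl_with_loss(web, [0], p=25)
print(len(lossy) / len(full))  # fraction of quantity coverage retained
```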
[Fig. 8. Effect on information coverage of link extraction capability loss. Horizontal axis: number of pages crawled; vertical axis: number of URLs extracted; one curve for each of the top 1%, 5%, 10%, 20%, 50%, and 100% PageRank categories.]

In Fig. 8, we use 98% link extraction capability to do a breadth-first walk in the web graph. The horizontal axis is the proportion of crawled pages to the total collection crawled without any link extraction capability loss (actually 118,581,361 pages).

The vertical axis is the proportion of extracted URLs to all URLs extracted without any link extraction capability loss (also 118,581,361). In the figure, a 2% loss in link extraction results in a 51.7% loss of quantity coverage. To confirm this, we selected five other seed sets and repeated the breadth-first walk five times, with similar results each time.

We analyze the result as follows. Indegrees of pages are known to obey the power law [1]:

f_i ∝ C·i^(−2.1).

If we fail to parse out a URL because of the 2% disability, we can still extract it when other pages are downloaded and analyzed later. However, the probability of missing such a URL forever (denoted P_tache) satisfies

P_tache > P_{indegree=1} = 1 / (Σ_{i=1..∞} i^(−2.1)) ≈ 64.112%.

This is because a page with indegree 1, once lost in the crawl, is lost forever, while a page with indegree larger than 1 can still be extracted from another page later. Thus the loss of final information coverage is 2% × P_tache × 15.6, i.e., at least 20.0%. The remaining large gap to the observed 51.7% can be explained in four ways:

1. The simulation of link extraction capability loss is an underestimate. Because most pages have an outdegree of no more than 20 [7], a 2% loss has almost the same impact as a 5% loss: by ⌊od × (100 − p)%⌋, both drop exactly one URL. So the simulation is not precise.

2. The average size of peninsulas is underestimated, since 15.6 is a lower bound on the real value by the construction of Algorithm 2.

3. A breadth-first walk in the reconstructed web graph covers only a portion of all nodes. In Fig. 8, only 118,581,361 (47.1%) of the 251,763,574 nodes are covered even with 100% link extraction capability. The actual average peninsula size in this subgraph of 118,581,361 nodes is larger than 15.6, but we underestimate it with the value computed from its superset.

4. We use only a lower bound for P_tache; this is not crucial, since the bound is already 64.1% and P_tache is at most 1.

We regard the first point as the main factor, so we performed another breadth-first walk with a 5% loss of link extraction capability to reduce the error introduced by the first factor. This walk reached 46.2% of the 118,581,361 nodes, close to the previous 48.3%. Moreover, the theoretical estimate of the loss, 5% × P_tache × 15.6, is at least 50.0%, broadly in agreement with the experimental value of 53.8%. It is thus feasible to simulate a small loss in link extraction with a 2% or 5% link extraction capability loss. This experiment proves that a slight loss in link extraction capability causes a great loss in final quantity coverage.

We also notice that a small loss in link extraction capability has less effect on the quality coverage of important pages. For example, coverage of the pages whose PageRank values are in the top 1% remains at about 80%. Fig. 8 shows that the more important the web pages are, the less link extraction capability loss affects their coverage. This is because important pages with high PageRank values seldom appear inside the peninsulas of other nodes: they usually have many incoming links and are seldom exclusively reached through others, so they are less sensitive to link extraction capability.
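The arithmetic of this estimate is easy to check numerically; the snippet below (an illustration, not from the paper) approximates the zeta-function sum and the resulting coverage-loss bounds for 2% and 5% losses.

```python
# Under f_i ~ C * i^(-2.1), the probability that a page has indegree 1 is
# 1 / sum_{i>=1} i^(-2.1), and the expected coverage loss per the paper's
# estimate is p * P_tache * 15.6 for a p link-extraction loss.
zeta_21 = sum(i ** -2.1 for i in range(1, 1_000_000))   # ~= 1.560
p_tache = 1 / zeta_21                                   # paper: 64.112%
print(f"P(indegree = 1) ~= {p_tache:.5f}")
print(f"2% loss -> coverage loss >= {0.02 * p_tache * 15.6:.3f}")  # ~0.200
print(f"5% loss -> coverage loss >= {0.05 * p_tache * 15.6:.3f}")  # ~0.500
```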
In the breadth-first walk experiment, we also verified that the peninsula size of a node is almost independent of its PageRank value, as already indicated by the correlation analysis in Section 3.2. We changed the simulation of link extraction capability as follows: to simulate a p% loss, we randomly select ⌊od × (100 − p)%⌋ URLs out of each page with outdegree od only once the number of crawled pages exceeds 5 million (about 2% of all); before that, no URL is dropped. As is known, breadth-first crawling yields high-quality pages first [17]. So if there were a tight relation between peninsula sizes and PageRank values, we would reach the taches of large peninsulas first, few large peninsulas would be missed, and the link extraction capability loss would have almost no effect on coverage. However, the experimental results contradict this, as shown in Fig. 9: there is no crucial increase in information coverage (only from 48.3% to 56.2%) under the modified link extraction capability loss.

In summary, a 5% loss in link extraction capability causes a 53.8% loss of quantity coverage. This implies that a powerful link parser is necessary for web spiders despite its cost. For quality coverage aimed at important pages, however, such a loss has little effect. Our experiment also indirectly illustrates the independence between the peninsula size and the PageRank value of the tache.

[Fig. 9. Effect on information coverage of a different link extraction capability loss. Horizontal axis: number of pages crawled; vertical axis: number of URLs extracted; one curve for each of the top 1%, 5%, 10%, 20%, 50%, and 100% PageRank categories.]

4.2. Implications on computation of PageRank

The Google PageRank algorithm is based on the link relations between pages and is now widely used in search engines. If a page Q is linked by nodes P_1, P_2, ..., P_m, then according to this algorithm the PageRank value of Q, denoted PR(Q), is usually written as

PR(Q) = k · Σ_{i=1..m} PR(P_i)/od_{P_i} + (1 − k).    (1)

If the PageRank values of all N pages P_1, P_2, ..., P_N are regarded as a vector PR, and a matrix M is defined by M_{i,j} = 1/od_i whenever P_i links to P_j (and 0 otherwise), where od_i is the outdegree of P_i, we can transform Eq. (1) into

PR = k·M·PR + (1 − k)·I,    (2)

where I is the vector of all 1s. The computation of PR is usually completed through power iteration:

PR_{i+1} = k·M·PR_i + (1 − k)·I.    (3)

Or, if we scale PR so that I^T·PR = N, we get

PR_{i+1} = (k·M + (1 − k)·I·I^T/N)·PR_i.    (4)

In fact, PR is then an eigenvector of the matrix k·M + (1 − k)·I·I^T/N.

According to Eq. (3) or (4), PR eventually converges after enough iterations. However, the web graph is too huge for M to fit in memory, so M has to be divided into blocks for the calculation, as described in [2], where depth-first search [21] is used to find irreducible diagonal blocks in the graph structure; the SCC of Fig. 1 is then simply the largest block. Kamvar et al. [20] give a better solution, the BlockRank algorithm, which divides the web graph into host-blocks according to the strong local link relations inside a single website. This method first computes a local PageRank value for each page, and then uses the vector of local PageRank values as the initial vector of the iteration for the global PageRank values. Because the initial vector is provably close to the global one, it decreases the number of iterations required for convergence.

Our definition of peninsula confines link relations to a small local area, so we can also use it to divide the web graph and decrease the scale of M. It can find not only blocks inside a single website, but also blocks spanning pages from many different websites. We first give a theorem:

Theorem 4. The PageRank value of each node in a peninsula is determined by the PageRank value of the tache.

Assume the peninsula is P_v = {v, N_1, N_2, ..., N_A}. By the definition, each N_i has no incoming links from outside and is reachable only from other nodes in P_v, so we can write the PageRank values as

PR(N_i) = a_i·PR(v) + a_{1i}·PR(N_1) + a_{2i}·PR(N_2) + ... + a_{Ai}·PR(N_A) + b_i,  i = 1, 2, ..., A,    (5)

where a_{i,j} = k/od_i if there is a link from N_i to N_j, and a_{i,j} = 0 otherwise. If PR(v) is known, we now have A linear equations in the A variables PR(N_1), PR(N_2), ..., PR(N_A), each of which can therefore be expressed as a function of PR(v). This proves Theorem 4.
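Before turning to the block algorithm, the plain power iteration of Eq. (3) can be sketched as follows. Note that with the convention M_{i,j} = 1/od_i, row Q of the iteration matrix must sum PR(P_i)/od_{P_i} over the in-links of Q, so the sketch builds that (transposed) matrix directly; the toy graph and the damping factor k = 0.85 are our assumptions.

```python
# Power iteration of Eq. (3): PR_{i+1} = k*M*PR_i + (1-k)*I, with M built
# so that entry [Q, P] = 1/od_P whenever P links to Q (i.e., Eq. (1)
# holds row-wise). PR is initialized to all ones, so sum(PR) = N as in
# the scaling of Eq. (4).
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # toy graph: i -> links[i]
n, k = 4, 0.85
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)             # M[Q, P] = 1 / od_P

pr = np.ones(n)
for _ in range(100):
    pr = k * M @ pr + (1 - k) * np.ones(n)    # Eq. (3)
print(pr)
```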

According to Theorem 4, we can compute the PageRank values as follows:

Algorithm 3. An improved algorithm for computing PageRank values.

1. Divide the collection of vertices V of the web graph into S disjoint subsets, each denoting a peninsula.
2. For each of the S subsets, express all PageRank values in terms of the tache's value according to Eq. (5).
3. Perform the power iteration according to Eq. (4), replacing the PageRank values of all non-tache nodes by those of their taches.
4. Using the solution of Eq. (5), compute the PageRank values of all non-tache nodes.

In Algorithm 3, steps (2) and (4) can be computed locally, which means they can run concurrently on different machines without a global view of M. Step (1) might be difficult, but with an approximate method like Algorithm 2 it is fast enough (in fact, the limit can be decreased further for better speed, as mentioned before). The most demanding step is step (3); since the average peninsula size is 15.6, the scale of M can be decreased to about 6.41% of its original size. The detailed implementation of Algorithm 3 is similar to [20], with two differences: (1) our division of the web graph is based on Algorithm 2; (2) the local PageRank value computed within a block by Algorithm 3 is already the global one, whereas in [20] the local values must be iterated again to obtain the global ones.

For example, in a web graph with four nodes in which the first three form a peninsula with node 1 as the tache, M can be shrunk as follows (f_2 and f_3 are the coefficients that express PR(N_2) and PR(N_3) through the tache by Eq. (5)):

M = | m_{1,1}  m_{1,2}  m_{1,3}  m_{1,4} |
    | m_{2,1}  m_{2,2}  m_{2,3}  m_{2,4} |
    | m_{3,1}  m_{3,2}  m_{3,3}  m_{3,4} |
    | m_{4,1}  0        0        m_{4,4} |

  → | m_{1,1}  0  0  (1 + f_2 + f_3)·m_{1,4} |
    | 0        0  0  0                       |
    | 0        0  0  0                       |
    | m_{4,1}  0  0  m_{4,4}                 |

  → M′ = | m_{1,1}  (1 + f_2 + f_3)·m_{1,4} |
         | m_{4,1}  m_{4,4}                 |

In summary, we find an efficient way to split M into many small, relatively independent blocks, which greatly expedites the power iteration by shrinking the matrix and relaxing the memory requirement. The proposed algorithm can handle some special cases more efficiently than previously reported algorithms.
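The shrinking step can be mirrored in code. In the sketch below, the matrix entries and the Eq. (5) coefficients f_2 and f_3 are hypothetical placeholders; the point is only the structural collapse of the 4×4 matrix onto the tache and the outside node.

```python
# Structural illustration of the matrix shrinking: nodes 1..3 form a
# peninsula with tache 1, so the peninsula rows/columns fold into the
# tache's, scaled by (1 + f2 + f3), and M collapses from 4x4 to 2x2.
import numpy as np

M = np.array([[0.0, 0.5, 0.5, 0.3],
              [0.5, 0.0, 0.5, 0.3],
              [0.5, 0.5, 0.0, 0.4],
              [0.2, 0.0, 0.0, 0.1]])   # columns 2, 3 never feed node 4

f2, f3 = 0.6, 0.4                      # hypothetical Eq. (5) coefficients
M_shrunk = np.array([[M[0, 0], (1 + f2 + f3) * M[0, 3]],
                     [M[3, 0], M[3, 3]]])
print(M_shrunk)
```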
5. Conclusion and future work

In this paper, we performed a comprehensive study of peninsulas in web graphs, a non-obvious but universal phenomenon. We first analyzed the reasons for the existence of peninsulas: some site administrators place content behind home pages so that there are no links to it from outside, and web resources are usually organized in a directory tree structure. We then presented the definition of peninsulas, described some of their characteristics, and proposed two search algorithms: one finds the exact maximal peninsula with moderate efficiency, the other is very fast with a tolerable compromise in accuracy.

We conducted a large-scale web experiment using data collected by the Tianwang system. We found that the sizes of peninsulas obey a power-law distribution and fitted a regression function for it. We also studied the correlations between the sizes of peninsulas and the indegrees, outdegrees, and PageRank values of nodes. The conclusion is that there is little correlation between the size of a peninsula and the indegree, outdegree, or PageRank value of its tache, while there is a strong relation between the peninsula of a node and those of the nodes that link to it.

Based on these experimental results, we quantitatively analyzed the implications for web search. On one hand, the peninsula structure of a web graph can be used to greatly expedite the computation of PageRank values by helping to divide the web graph into blocks; on the other hand, it has a prominent effect on the information coverage of search engines: a slight loss of link extraction capability can cause a great loss in information coverage.

For future work, we plan to prove the existence of peninsulas and analyze their characteristics theoretically, from the viewpoint of the web generation models proposed by Levene et al. [14] and Krapivsky and Redner [11].

Acknowledgements

We thank Professor Xiaoming Li, Dr. Weihong Wang, Dr. Bo Peng, Zhifeng Yang, Lianen Huang, Jiaji Zhu, Jiajing Li, Bihong Gong, and Xiubo Zhao from Peking University, and Qu Li from China University of Geosciences, for their comments. Thanks also to Jing Zhao from Hong Kong University of Science and Technology, Xiaojie Gao from California Institute of Technology, and Hang Su from Vanderbilt University for their proofreading.

The work of Tao Meng was supported by NSFC grants.

References

[1] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet Wiener, Graph structure in the web: experiments and models, in: Proceedings of the 9th WWW Conference, 2000.
[2] Arvind Arasu, Jasmine Novak, Andrew S. Tomkins, John A. Tomlin, PageRank computation and the structure of the web: experiments and algorithms, in: Poster Proceedings of WWW2002, Honolulu, 2002.
[3] A. Barabasi, R. Albert, Emergence of scaling in random networks, Science 286 (1999) 509.
[4] P. Boldi, S. Vigna, The WebGraph framework I: compression techniques, in: Proceedings of the 13th International WWW Conference, 2004.
[5] Junghoo Cho, Hector Garcia-Molina, Parallel crawlers, in: Proceedings of the 11th World Wide Web Conference, Hawaii, USA, May 2002.
[6] The China Internet Network Information Center, China's Internet Development and Usage Report, 2003.
[7] Cyveillance, Inc., Sizing the Internet, white paper, July 2000.
[8] N. Eiron, K.S. McCurley, Locality, hierarchy, and bidirectionality on the web, in: Workshop on Web Algorithms and Models, 2003.
[9] J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, The web as a graph: measurements, models and methods, in: Proceedings of the International Conference on Combinatorics and Computing, July 26-28, 1999.
[10] R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, Trawling the web for cyber communities, in: Proceedings of the 8th WWW Conference, 1999.
[11] P.L. Krapivsky, S. Redner, A statistical physics perspective on web growth, Computer Networks 39 (2002).
[12] Lianen Huang, Hongfei Yan, Xiaoming Li, Engineering of web Infomall: the Chinese web archive, in: Proceedings of the World Engineers Convention.
[13] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: bringing order to the web, Stanford Digital Libraries Working Paper, 1998.
[14] Mark Levene, Trevor Fenner, George Loizou, Richard Wheeldon, A stochastic model for the evolution of the web, Computer Networks 39 (2002).
[15] Tao Meng, Hongfei Yan, Xiaoming Li, An evaluation model on information coverage of search engines, Chinese Journal of Electronics 31 (8) (2003) (in Chinese).
[16] Tao Meng, Hongfei Yan, Jimin Wang, Xiaoming Li, The evolution of link-attributes for pages and its implications on web crawling, in: Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, 2004.
[17] Marc Najork, Janet L. Wiener, Breadth-first search crawling yields high-quality pages, in: Proceedings of the 10th World Wide Web Conference, May 2001.
[18] Marc Najork, Allan Heydon, High-performance web crawling, COMPAQ SRC Research Report 173, September 26, 2001.
[19] S.K. Park, K.W. Miller, Random number generators: good ones are hard to find, Communications of the ACM 31 (10) (1988).
[20] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, Gene H. Golub, Exploiting the block structure of the web for computing PageRank, in: Proceedings of the 12th World Wide Web Conference, 2003.
[21] R.E.
Tarjan, Depth-first search and linear graph algorithms, SIAM Journal on Computing 1 (2) (1972) 146-160.
[22] Tianwang search engine, by the Lab of Computer Networks and Distributed System, Peking University, since 1997.
[23] Web Infomall, a web archive based on the Tianwang search engine, which has stored web pages in China since 1997, by the Lab of Computer Networks and Distributed System, Peking University. Available from: <http://www.infomall.cn/>.
[24] Hongfei Yan, Jianyong Wang, Xiaoming Li, Lin Guo, Architectural design and evaluation of an efficient web-crawling system, Journal of Systems and Software (March) (2002).

Tao Meng received his bachelor's degree in computer science from Peking University. He is currently a Ph.D. candidate at Peking University, supervised by Prof. Xiaoming Li. His research interests mainly include search engines and web mining. Meng joined the Tianwang Search Engine group in 2001 and has worked primarily on web crawlers and link analysis since then. His design and implementation work in 2005 made the Tianwang system capable of downloading more than one billion web pages and performing link analysis, such as PageRank computation, on them in a short period.

Hong-Fei Yan received his Ph.D. in computer science from Peking University. He is currently an associate professor at Peking University. His research interests involve information retrieval and distributed systems. He was in charge of the parallel upgrade of the Tianwang Search Engine, scaling it from one million pages to tens of millions of pages. He also pioneered the deployment of the first large-scale Chinese web test collection, with 100 GB of web pages (CWT100g), and has organized the annual Workshop on Chinese Web Information Retrieval Evaluation since 2004.


Popularity of Twitter Accounts: PageRank on a Social Network

Popularity of Twitter Accounts: PageRank on a Social Network Popularity of Twitter Accounts: PageRank on a Social Network A.D-A December 8, 2017 1 Problem Statement Twitter is a social networking service, where users can create and interact with 140 character messages,

More information

Exploring both Content and Link Quality for Anti-Spamming

Exploring both Content and Link Quality for Anti-Spamming Exploring both Content and Link Quality for Anti-Spamming Lei Zhang, Yi Zhang, Yan Zhang National Laboratory on Machine Perception Peking University 100871 Beijing, China zhangl, zhangyi, zhy @cis.pku.edu.cn

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

A scalable lightweight distributed crawler for crawling with limited resources

A scalable lightweight distributed crawler for crawling with limited resources University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 A scalable lightweight distributed crawler for crawling with limited

More information

Adaptive methods for the computation of PageRank

Adaptive methods for the computation of PageRank Linear Algebra and its Applications 386 (24) 51 65 www.elsevier.com/locate/laa Adaptive methods for the computation of PageRank Sepandar Kamvar a,, Taher Haveliwala b,genegolub a a Scientific omputing

More information

A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A TEST CASE. Antonios Kogias Dimosthenis Anagnostopoulos

A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A TEST CASE. Antonios Kogias Dimosthenis Anagnostopoulos Proceedings of the 2006 Winter Simulation Conference L. F. Perrone, F. P. Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, eds. A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A

More information

RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee

RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee Crawler: A Crawler with High Personalized PageRank Coverage Guarantee Junghoo Cho University of California Los Angeles 4732 Boelter Hall Los Angeles, CA 90095 cho@cs.ucla.edu Uri Schonfeld University of

More information

An Improved PageRank Method based on Genetic Algorithm for Web Search

An Improved PageRank Method based on Genetic Algorithm for Web Search Available online at www.sciencedirect.com Procedia Engineering 15 (2011) 2983 2987 Advanced in Control Engineeringand Information Science An Improved PageRank Method based on Genetic Algorithm for Web

More information

A P2P-based Incremental Web Ranking Algorithm

A P2P-based Incremental Web Ranking Algorithm A P2P-based Incremental Web Ranking Algorithm Sumalee Sangamuang Pruet Boonma Juggapong Natwichai Computer Engineering Department Faculty of Engineering, Chiang Mai University, Thailand sangamuang.s@gmail.com,

More information

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule 1 How big is the Web How big is the Web? In the past, this question

More information

Ranking web pages using machine learning approaches

Ranking web pages using machine learning approaches University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 Ranking web pages using machine learning approaches Sweah Liang Yong

More information

Motivation. Motivation

Motivation. Motivation COMS11 Motivation PageRank Department of Computer Science, University of Bristol Bristol, UK 1 November 1 The World-Wide Web was invented by Tim Berners-Lee circa 1991. By the late 199s, the amount of

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015 CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

The Mobile Web Is Structurally Different

The Mobile Web Is Structurally Different The Mobile Web Is Structurally Different Apoorva Jindal Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089 Email: apoorvaj@usc.edu Christopher Crutchfield CSAIL Massachusetts

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

News Page Discovery Policy for Instant Crawlers

News Page Discovery Policy for Instant Crawlers News Page Discovery Policy for Instant Crawlers Yong Wang, Yiqun Liu, Min Zhang, Shaoping Ma State Key Lab of Intelligent Tech. & Sys., Tsinghua University wang-yong05@mails.tsinghua.edu.cn Abstract. Many

More information

Link Analysis in Web Information Retrieval

Link Analysis in Web Information Retrieval Link Analysis in Web Information Retrieval Monika Henzinger Google Incorporated Mountain View, California monika@google.com Abstract The analysis of the hyperlink structure of the web has led to significant

More information

Identify Temporal Websites Based on User Behavior Analysis

Identify Temporal Websites Based on User Behavior Analysis Identify Temporal Websites Based on User Behavior Analysis Yong Wang, Yiqun Liu, Min Zhang, Shaoping Ma State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

On Finding Power Method in Spreading Activation Search

On Finding Power Method in Spreading Activation Search On Finding Power Method in Spreading Activation Search Ján Suchal Slovak University of Technology Faculty of Informatics and Information Technologies Institute of Informatics and Software Engineering Ilkovičova

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis CS276B Text Retrieval and Mining Winter 2005 Lecture 7 Plan for today Review search engine history (slightly more technically than in the first lecture) Web crawling/corpus construction Distributed crawling

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Finding Strongly Connected Components

Finding Strongly Connected Components Yufei Tao ITEE University of Queensland We just can t get enough of the beautiful algorithm of DFS! In this lecture, we will use it to solve a problem finding strongly connected components that seems to

More information

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to

More information

Algorithms for minimum m-connected k-tuple dominating set problem

Algorithms for minimum m-connected k-tuple dominating set problem Theoretical Computer Science 381 (2007) 241 247 www.elsevier.com/locate/tcs Algorithms for minimum m-connected k-tuple dominating set problem Weiping Shang a,c,, Pengjun Wan b, Frances Yao c, Xiaodong

More information

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Mukesh Kumar and Renu Vig University Institute of Engineering and Technology, Panjab University, Chandigarh,

More information

An Algorithm of Parking Planning for Smart Parking System

An Algorithm of Parking Planning for Smart Parking System An Algorithm of Parking Planning for Smart Parking System Xuejian Zhao Wuhan University Hubei, China Email: xuejian zhao@sina.com Kui Zhao Zhejiang University Zhejiang, China Email: zhaokui@zju.edu.cn

More information

arxiv:cs/ v1 [cs.ir] 26 Apr 2002

arxiv:cs/ v1 [cs.ir] 26 Apr 2002 Navigating the Small World Web by Textual Cues arxiv:cs/0204054v1 [cs.ir] 26 Apr 2002 Filippo Menczer Department of Management Sciences The University of Iowa Iowa City, IA 52242 Phone: (319) 335-0884

More information

Mathematical Analysis of Google PageRank

Mathematical Analysis of Google PageRank INRIA Sophia Antipolis, France Ranking Answers to User Query Ranking Answers to User Query How a search engine should sort the retrieved answers? Possible solutions: (a) use the frequency of the searched

More information

The application of Randomized HITS algorithm in the fund trading network

The application of Randomized HITS algorithm in the fund trading network The application of Randomized HITS algorithm in the fund trading network Xingyu Xu 1, Zhen Wang 1,Chunhe Tao 1,Haifeng He 1 1 The Third Research Institute of Ministry of Public Security,China Abstract.

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM Masahito Yamamoto, Hidenori Kawamura and Azuma Ohuchi Graduate School of Information Science and Technology, Hokkaido University, Japan

More information

Improving Range Query Performance on Historic Web Page Data

Improving Range Query Performance on Historic Web Page Data Improving Range Query Performance on Historic Web Page Data Geng LI Lab of Computer Networks and Distributed Systems, Peking University Beijing, China ligeng@net.pku.edu.cn Bo Peng Lab of Computer Networks

More information

COMMUNITY SHELL S EFFECT ON THE DISINTEGRATION OF SOCIAL NETWORKS

COMMUNITY SHELL S EFFECT ON THE DISINTEGRATION OF SOCIAL NETWORKS Annales Univ. Sci. Budapest., Sect. Comp. 43 (2014) 57 68 COMMUNITY SHELL S EFFECT ON THE DISINTEGRATION OF SOCIAL NETWORKS Imre Szücs (Budapest, Hungary) Attila Kiss (Budapest, Hungary) Dedicated to András

More information

Hypergraph-Theoretic Partitioning Models for Parallel Web Crawling

Hypergraph-Theoretic Partitioning Models for Parallel Web Crawling Hypergraph-Theoretic Partitioning Models for Parallel Web Crawling Ata Turk, B. Barla Cambazoglu and Cevdet Aykanat Abstract Parallel web crawling is an important technique employed by large-scale search

More information

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

A BEGINNER S GUIDE TO THE WEBGRAPH: PROPERTIES, MODELS AND ALGORITHMS.

A BEGINNER S GUIDE TO THE WEBGRAPH: PROPERTIES, MODELS AND ALGORITHMS. A BEGINNER S GUIDE TO THE WEBGRAPH: PROPERTIES, MODELS AND ALGORITHMS. Debora Donato Luigi Laura Stefano Millozzi Dipartimento di Informatica e Sistemistica Università di Roma La Sapienza {donato,laura,millozzi}@dis.uniroma1.it

More information

On Compressing Social Networks. Ravi Kumar. Yahoo! Research, Sunnyvale, CA. Jun 30, 2009 KDD 1

On Compressing Social Networks. Ravi Kumar. Yahoo! Research, Sunnyvale, CA. Jun 30, 2009 KDD 1 On Compressing Social Networks Ravi Kumar Yahoo! Research, Sunnyvale, CA KDD 1 Joint work with Flavio Chierichetti, University of Rome Silvio Lattanzi, University of Rome Michael Mitzenmacher, Harvard

More information

Research and Improvement of Apriori Algorithm Based on Hadoop

Research and Improvement of Apriori Algorithm Based on Hadoop Research and Improvement of Apriori Algorithm Based on Hadoop Gao Pengfei a, Wang Jianguo b and Liu Pengcheng c School of Computer Science and Engineering Xi'an Technological University Xi'an, 710021,

More information

Reading Time: A Method for Improving the Ranking Scores of Web Pages

Reading Time: A Method for Improving the Ranking Scores of Web Pages Reading Time: A Method for Improving the Ranking Scores of Web Pages Shweta Agarwal Asst. Prof., CS&IT Deptt. MIT, Moradabad, U.P. India Bharat Bhushan Agarwal Asst. Prof., CS&IT Deptt. IFTM, Moradabad,

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 Figures are taken from: M.E.J. Newman, Networks: An Introduction 2

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Massively Parallel Approximation Algorithms for the Knapsack Problem

Massively Parallel Approximation Algorithms for the Knapsack Problem Massively Parallel Approximation Algorithms for the Knapsack Problem Zhenkuang He Rochester Institute of Technology Department of Computer Science zxh3909@g.rit.edu Committee: Chair: Prof. Alan Kaminsky

More information

LET:Towards More Precise Clustering of Search Results

LET:Towards More Precise Clustering of Search Results LET:Towards More Precise Clustering of Search Results Yi Zhang, Lidong Bing,Yexin Wang, Yan Zhang State Key Laboratory on Machine Perception Peking University,100871 Beijing, China {zhangyi, bingld,wangyx,zhy}@cis.pku.edu.cn

More information

Constructing Websites toward High Ranking Using Search Engine Optimization SEO

Constructing Websites toward High Ranking Using Search Engine Optimization SEO Constructing Websites toward High Ranking Using Search Engine Optimization SEO Pre-Publishing Paper Jasour Obeidat 1 Dr. Raed Hanandeh 2 Master Student CIS PhD in E-Business Middle East University of Jordan

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments Send Orders for Reprints to reprints@benthamscience.ae 368 The Open Automation and Control Systems Journal, 2014, 6, 368-373 Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing

More information

Comparative Study of Subspace Clustering Algorithms

Comparative Study of Subspace Clustering Algorithms Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that

More information

5. Lecture notes on matroid intersection

5. Lecture notes on matroid intersection Massachusetts Institute of Technology Handout 14 18.433: Combinatorial Optimization April 1st, 2009 Michel X. Goemans 5. Lecture notes on matroid intersection One nice feature about matroids is that a

More information

An Application of Personalized PageRank Vectors: Personalized Search Engine

An Application of Personalized PageRank Vectors: Personalized Search Engine An Application of Personalized PageRank Vectors: Personalized Search Engine Mehmet S. Aktas 1,2, Mehmet A. Nacar 1,2, and Filippo Menczer 1,3 1 Indiana University, Computer Science Department Lindley Hall

More information

Page Rank Link Farm Detection

Page Rank Link Farm Detection International Journal of Engineering Inventions e-issn: 2278-7461, p-issn: 2319-6491 Volume 4, Issue 1 (July 2014) PP: 55-59 Page Rank Link Farm Detection Akshay Saxena 1, Rohit Nigam 2 1, 2 Department

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Efficient extraction of news articles based on RSS crawling

Efficient extraction of news articles based on RSS crawling Efficient extraction of news articles based on RSS crawling George Adam, Christos Bouras and Vassilis Poulopoulos Research Academic Computer Technology Institute, and Computer and Informatics Engineer

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

Fast trajectory matching using small binary images

Fast trajectory matching using small binary images Title Fast trajectory matching using small binary images Author(s) Zhuo, W; Schnieders, D; Wong, KKY Citation The 3rd International Conference on Multimedia Technology (ICMT 2013), Guangzhou, China, 29

More information

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul 1 CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul Introduction Our problem is crawling a static social graph (snapshot). Given

More information