On the peninsula phenomenon in web graph and its implications on web search


Computer Networks 51 (2007)

Tao Meng *, Hong-Fei Yan
Lab of Computer Networks and Distributed System, Department of Computer Science and Technology, Peking University, Beijing, China

Received 26 April 2005; received in revised form 22 March 2006; accepted 17 April 2006. Available online 22 May 2006.
Responsible Editor: R. Boutaba
* Corresponding author. E-mail addresses: mengtao@net.pku.edu.cn (T. Meng), yhf@net.pku.edu.cn (H.-F. Yan).

Abstract

Web masters usually place certain web pages, such as home pages and index pages, in front of others. Under such a design, it is necessary to go through some pages to reach the destination pages, much as an inner town of a peninsula can only be reached through the towns at its edge. In this paper, we validate that peninsulas are a universal phenomenon in the World-Wide Web, and clarify how this phenomenon can be used to enhance web search and to study web connectivity problems. For this purpose, we model the web as a directed graph and give a precise definition of peninsulas based on this graph. We also present an efficient algorithm to find web peninsulas. Using data collected from the Chinese web by the Tianwang search engine, we examine the distribution of peninsula sizes and their correlations with the PageRank values, outdegrees, and indegrees of the taches that tie them to outside vertices. The results show that the peninsula structure of a web graph can greatly expedite the computation of PageRank values, and that it significantly affects the link extraction capability and information coverage of web crawlers.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Peninsula; Web graph; Web search; Power law; PageRank; Link extraction capability; Information coverage

1. Introduction

The World-Wide Web has continued its remarkable, seemingly exponential growth since its inception. The rapid increase in the number of hosts, web pages, and link relations has put a great deal of pressure on web information systems such as general search engines and web archives, which encourages us to find good tools and policies for efficiently managing the huge amount of Internet information.

The basic graph model of the web views web pages as nodes and hyperlinks as arcs, so the web is in effect a directed graph. In recent years, many researchers have studied the link structures of web graphs, for example, the shape of the web and the distribution of indegrees of web pages.

These studies have already played a key role in the design and implementation of web-based applications, and many useful tools have been built on them, such as web spiders that follow the arcs to download web pages, and the Google PageRank algorithm [13], which analyzes the relative importance of pages using the intrinsic link structure.

This paper discusses the phenomenon known as peninsula in web graphs, which we observed in our work on the Tianwang search engine [22] and the Web Infomall system [23]. A peninsula in a web graph is a set of vertices, each of which is reachable only through a common node in the set called the tache, just as a geographical byland is accessible only from where it joins the mainland. We believe the universal existence of peninsulas has two main causes. First, many websites deliberately place content behind front pages so that people can only access it through those front pages. For example, many BBS-style websites require a session identifier in URLs to visit inner pages, and many sites use a fragment of JavaScript to jump from their home pages to the content inside; it is usually too hard for spiders to parse out such script-style URLs. Second, web resources are usually organized in a tree-like hierarchical directory, and most web pages outside the directory link only to its index page instead of to the contents inside the directory.

In our research on search engines, we studied the peninsula phenomenon for the following two reasons:

1. Information coverage: One of the important metrics of a search engine is information coverage, so it is important to understand how a loss of link extraction capability affects it; in particular, how an unparsable URL eventually harms information coverage. The peninsula phenomenon is the key to answering this question: if a lost page happens to be the tache of a peninsula, all other vertices in the peninsula are lost too.

2. PageRank computation: Peninsulas can be viewed as relatively isolated sets of nodes. Both the PageRank algorithm and the HITS algorithm [9] are based on the link structure of a web graph, and both are solved by power iteration for the principal eigenvector of a matrix. Based on the peninsula concept, a web graph can be grouped into natural blocks, so the power iteration can be greatly expedited when the web graph is huge.

The rest of this paper is structured as follows. In Section 2, we give a precise definition of peninsulas in web graphs and present an efficient method for finding the peninsula of a given base node. In Section 3, we discuss experimental results on the distribution of peninsula sizes and their relationships with three link-attributes (PageRank values, outdegrees, and indegrees) of taches; the experiments are based on 251 million web pages collected by the Tianwang search engine. In Section 4, we explain the applications of peninsulas in web search. In Section 5, we conclude and discuss future work.

1.1. Related work

1.1.1. Structure of web graphs

The structure of web graphs has been deeply analyzed both theoretically and experimentally. Levene et al. [14] presented a stochastic model for the evolution of the web and proved that both indegrees and outdegrees of web pages obey a power law: f_i ∝ C·i^(−(1+q)).
Another similar theoretical study [11] used the rate-equation method to prove the power law. Several authors have reported experimental results on the power law based on web crawls: Kumar et al. [10] examined a web crawl of about 40 million pages from 1997 and verified the power law of indegrees and outdegrees; similar work was done by Barabasi and Albert [3]. Our earlier work [16] also showed experimentally that many link-attributes, such as PageRank values and indegrees, obey the power law. The experiments reported in this paper show that the sizes of peninsulas obey the power law as well.

1.1.2. Web graph shape and PageRank

The shape of a web graph was studied in [1], which proposed the famous bow-tie structure depicted in Fig. 1. In that work, the web is divided into five main parts: SCC (the strongly connected component), IN (pages that can reach the SCC but cannot be reached from it), OUT (pages that are reachable from the SCC but do not link back to it), TENDRILS (pages that can neither reach the SCC nor be reached from it), and DISC (disconnected components). Their proportions are 27.7%, 21.3%, 21.2%, 21.5%, and 8.2%, respectively.

[Fig. 1. Connectivity of the web.]

This means that if we follow links to crawl the web, we can reach at most only about 70% of all pages. The bow-tie structure is also used in [2] to speed up the power iteration in PageRank computation. Kamvar et al. [20] presented a method that divides the web graph into host-blocks according to the strong local link relations inside websites, which expedites the iteration as well. This paper introduces a method that outperforms both of the above.
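For readers who want to reproduce the decomposition in miniature, the sketch below (an illustration of the idea in [1], not the original measurement code) classifies the nodes of a small made-up graph into the SCC core, IN, OUT, and the remainder; it assumes the networkx Python library.

```python
# A minimal sketch of the bow-tie decomposition of [1] on a toy graph:
# SCC core, IN (can reach the core), OUT (reachable from the core),
# and OTHER (tendrils plus disconnected pieces). Edges are illustrative.
import networkx as nx

G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"),  # 3-node core cycle
                ("x", "a"),                           # x reaches the core: IN
                ("c", "y"),                           # reachable from core: OUT
                ("z", "w")])                          # disconnected piece

scc = max(nx.strongly_connected_components(G), key=len)  # largest SCC = core
core = next(iter(scc))
out_side = nx.descendants(G, core) - scc  # reachable from the SCC
in_side = nx.ancestors(G, core) - scc     # can reach the SCC
other = set(G) - scc - out_side - in_side # tendrils + disconnected

print(sorted(scc), sorted(in_side), sorted(out_side), sorted(other))
```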

1.1.3. Information coverage of crawlers

In our previous work [15], we proposed a model for evaluating the information coverage of search engines, defining quantity coverage for ordinary pages and quality coverage for important pages. Cho and Garcia-Molina [5] examined the relationship between crawling modes and information coverage. They suggested that in a firewall mode, where URLs are not transparent to servers and only a small number of crawling processes run in parallel, information coverage is not significantly affected by the number of seeds (the URLs from which the crawl starts). This result is incomplete, since it used only 40 million of the roughly 2 billion static pages [7] and did not directly consider link extraction capability.

1.2. Our main results

In this paper, we report a large-scale experiment on peninsulas based on the web data collected by our crawlers. We achieve the following results:

1. We discover the universal peninsula phenomenon in web graphs and give it a precise definition. We propose an efficient algorithm to find the peninsula with a given node as the tache.

2. Based on about 251 million pages and the hyperlinks among them, we study the distribution of peninsula sizes and find that it follows the power law. We also perform experiments on the relationships between the sizes of peninsulas and the indegrees, outdegrees, and PageRank values of their taches.

3. For peninsulas of average size, we find that a slight loss in link extraction capability significantly affects the final information coverage. The higher the PageRank values of pages, the less significant the effect on their coverage.

4. The locality of link relations in peninsulas can be used to greatly expedite the PageRank computation. Our analysis shows that peninsulas can be used to divide web graphs into blocks and complete the power iteration more efficiently than previous work.

2. Modeling the peninsula structure of the web

The definition of a web peninsula differs from that of a geographical one. First, since a web graph is directed, a set of pages forming a peninsula means that each page is reachable only from the tache, irrespective of its own outgoing links. Second, we limit the connection between a peninsula and outside vertices to the tache, a single node, whereas the border of a real peninsula on the earth is usually a curve.

2.1. Basic definition

We model the web as a directed graph; Fig. 2 is a simple example. All the symbols related to peninsulas and web graphs are listed in Table 1. Since each peninsula has a tache, we define it with the tache given beforehand. In Table 1, we denote the peninsula with tache v as P_v, and define it as follows:

[Fig. 2. A web graph example.]

Table 1
Symbols of peninsula and web graph

Symbol   Definition
G        Directed web graph
V        Set of vertices in G
od_v     Outdegree of node v
id_v     Indegree of node v
ol_v     Set of vertices that node v links to
il_v     Set of vertices linking to node v
P_v      Peninsula whose tache is node v
P^m_v    The maximal peninsula with tache node v

Definition 1. A peninsula associated with a given node v is a set of nodes that v can exclusively reach by following hyperlinks. Assuming P is a collection of vertices and G is the web graph, if P satisfies the four conditions

1. P ⊆ G;
2. v ∈ P;
3. for all q ∈ G − P and all p ∈ P with p ≠ v, there is no path in G from q to p that does not go through v;
4. for all p ∈ P with p ≠ v, there is a path from v to p;

we call P a peninsula and v the tache of P. We refer to the peninsula whose tache is v as the peninsula of v (equivalently, the peninsula with v), and define the size of a peninsula as the number of nodes it contains. For example, in Fig. 2, peninsulas with taches A and B are: P_A = {A, C, F}, {A, C}, or {A}; P_B = {B, E, H, J} or {B, E, H}, and so on.

As the example shows, there may be more than one peninsula for an appointed tache. We therefore define the maximal peninsula as follows:

Definition 2. Given a vertex v as the tache, let its peninsulas be P_v1, P_v2, ..., P_vn; we call the one with the largest size the maximal peninsula of v, and denote it P^m_v.

For example, in Fig. 2, the maximal peninsulas with the taches A, B, D, F, G, I are: P^m_A = {A, C, F}; P^m_B = {B, E, H, J}; P^m_D = {D}; P^m_F = {F}; P^m_G = {G}; P^m_I = {I}.
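Definition 1 can be checked mechanically. The following minimal Python sketch (ours, not the authors'; the toy graph merely imitates the flavor of Fig. 2) tests the four conditions directly with breadth-first reachability.

```python
# Checks Definition 1: P is a peninsula with tache v iff v reaches every node
# of P, and no node of P except v is reachable from outside P without v.
from collections import deque

def reachable(graph, start, banned=frozenset()):
    """Nodes reachable from `start` by BFS, never stepping onto `banned`."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen and nxt not in banned:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def is_peninsula(graph, v, P):
    P = set(P)
    if v not in P:                         # condition 2
        return False
    if not P <= reachable(graph, v):       # condition 4: v reaches all of P
        return False
    # condition 3: no outside node may reach P - {v} on a path avoiding v
    for q in set(graph) - P:
        if (P - {v}) & reachable(graph, q, banned={v}):
            return False
    return True

web = {"A": ["C"], "C": ["F"], "F": ["C"], "B": ["E"],
       "E": ["H"], "H": ["J"], "J": []}
print(is_peninsula(web, "A", {"A", "C", "F"}))   # True
print(is_peninsula(web, "B", {"B", "E", "H"}))   # True (non-maximal)
print(is_peninsula(web, "A", {"A", "C"}))        # False: F links back to C
```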

2.2. Some properties of peninsula

Several important properties of the peninsula of an appointed node are listed below.

Theorem 1. The maximal peninsula of a node is unique.

This is easy to prove: if a given tache had multiple maximal peninsulas, their union would obviously be a peninsula as well, contradicting the fact that a maximal peninsula has the largest size.

Theorem 2. All peninsulas of a node are subsets of its maximal peninsula.

The same argument applies, because the maximal peninsula must be the union of all the other peninsulas. When peninsula is used in the rest of the paper, we mean the maximal peninsula unless otherwise specified.

Theorem 3. If a node p links to another node q, and ol_p ⊆ P^m_q, then P^m_q ⊆ P^m_p.

Since p links to q, from p we can reach every node in the maximal peninsula of q, going through p or through nodes of that peninsula, so the proposition holds. It is useful for estimating the relationship between the peninsulas of nodes with link relations. Due to the universal locality of hyperlinks between web pages [8,4], that is, pages tend to link to others nearby or in the same website, the maximal peninsula of a vertex is tightly related to the peninsulas of the nodes that link to it.

2.3. Algorithms to find peninsulas

There are two key issues in finding the maximal peninsula of an appointed tache by following links:

1. How to deal with back links? This issue is illustrated in Fig. 3: when trying to find the peninsula of A, we find that C is already linked by a temporarily outer node F; however, the peninsula of A actually contains F.

[Fig. 3. There is a back-link from F to C.]

2. Finding the peninsula of a given node usually requires exhausting all the nodes it can reach, called its out-component in [11]. Due to the enormous size of the web, the out-component is usually very large, so a fast algorithm is required. As shown in Fig. 4, we would have to consider every node to find the peninsula of X.

[Fig. 4. From X, each node is reachable.]

To solve the first problem, we use a back-trace method: take all nodes that the tache reaches as a base collection, and then remove the nodes that are accessible from outside. This approach is described in Algorithm 1.

Algorithm 1. An algorithm to find the maximal peninsula P^m_v of a node v.

1. Initialize S and S′ as empty sets.
2. Make a breadth-first walk from v in G and put all the reached nodes into S:
   (a) S = {v};
   (b) for each node p in S: S = S + ol_p; if the size of S increases, go to (b).
3. S′ = G − S; S = S − {v}.
4. Main loop: while S is not empty, for each node p in S: if p is linked by some node in S′, set S = S − {p} and S′ = S′ + {p}, and go to (4).
5. P^m_v = S + {v}; return it.

There are several ways to speed this algorithm up. For example, in step (4), to check whether p is linked by some node in S′, we can construct in step (2) a sub-graph G′ of G containing all the nodes that v reaches, and simply compare the indegrees of p in G′ and in G: if they are equal, p is linked only by nodes in S.

To solve the second problem, we present an improved algorithm based on Algorithm 1.

Algorithm 2. An approximate algorithm to find the maximal peninsula P^m_v of a node v, given a limit N.

1. Initialize S and S′ as empty sets.
2. Make a breadth-first walk from v in G and put the first N nodes reached into S.
3. S′ = G − S; S = S − {v}.
4. Main loop: while S is not empty, for each node p in S: if p is linked by some node in S′, set S = S − {p} and S′ = S′ + {p}, and go to (4).
5. P^m_v = S + {v}; return it.
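To make the pseudocode concrete, here is a minimal Python sketch of Algorithm 2 as we read it (with N large enough to exhaust the out-component, it degenerates into Algorithm 1). It is an illustrative reconstruction, not the authors' implementation, and the toy graph is made up.

```python
# Algorithm 2 sketch: BFS from the tache v collects at most N candidates,
# then any candidate linked from outside the candidate set is evicted
# until a fixpoint is reached; the tache itself is never evicted.
from collections import deque

def maximal_peninsula(graph, v, N=200_000):
    # Step 2: breadth-first walk from v, keeping the first N nodes in S.
    S, queue = {v}, deque([v])
    while queue and len(S) < N:
        for nxt in graph.get(queue.popleft(), []):
            if nxt not in S and len(S) < N:
                S.add(nxt)
                queue.append(nxt)
    # Step 4: evict nodes linked from outside S, until nothing changes.
    changed = True
    while changed:
        changed = False
        outside = set(graph) - S
        for p in list(S - {v}):
            if any(p in graph.get(q, []) for q in outside):
                S.remove(p)
                outside.add(p)
                changed = True
    return S  # an estimate of (usually exactly) the maximal peninsula P^m_v

web = {"A": ["C"], "C": ["F"], "F": ["C"], "B": ["C"], "X": []}
print(maximal_peninsula(web, "A"))  # {'A'}: C is linked from outside (B)
```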

In Algorithm 2, we set a limit on the initial vertex collection to be considered, for two main reasons. First, because the main cause of a peninsula is that websites place their content behind home pages or other index-type pages, the size of a peninsula usually does not exceed the number of pages in the tache's site, so we can limit the search area to a website. Second, our main purpose is to prove the universal existence of peninsulas; if we cannot find the maximal peninsula, we obtain a non-maximal peninsula as a trade-off between accuracy and performance. It is an estimate of the maximal peninsula, and its size is a lower bound on the size of the maximal peninsula. Therefore, using Algorithm 2, we find the maximal peninsula most of the time, and non-maximal ones when the trade-off between speed and accuracy requires it.

The limit should be set to the typical size of a large website. Based on the data given in the next section, most hosts in China have fewer than 200,000 static pages, so 200,000 is set as the limit for Algorithm 2.

To evaluate the performance of the two algorithms, 1533 web pages were selected from the 251 million, and the peninsula of each was computed with both algorithms. The results show that Algorithm 1 has very poor efficiency, as we predicted: it is far slower on average than Algorithm 2, and its memory usage is on average 71.4 times that of Algorithm 2. When precision is considered, Algorithm 1 finds the maximal peninsula for every node, while Algorithm 2 finds 1439 (93.87%) of the maximal peninsulas, plus 59 (3.85%) non-maximal peninsulas whose sizes are very close to those of the latent maximal peninsulas. The accuracy of Algorithm 2 is thus about 97.72%, which is tolerable for most applications; even for the remaining 2.28%, Algorithm 2 still computes a lower bound on the size of the latent maximal peninsula. Therefore, Algorithm 2 is used to find peninsulas hereafter.

3. Web experiment setup and results

We reconstructed the original web graph with the help of crawlers. In this section, we first introduce the crawlers we use and the data collection sampled from the Chinese web, then present our methodology for studying peninsulas, and finally the results.

3.1. Tianwang crawlers and experimental data

The Tianwang system [22] is the largest non-commercial search engine in China; it has been running since 1997 and serves over 200,000 daily users. Its basic design and implementation are described in [24]. Currently, its crawler is powered by five Dell PowerEdge 1650 servers, each with two Intel Pentium III 1.1 GHz processors and 2 GB of memory. It can download about 1 million pages per day in incremental mode and 12 million pages per day at peak speed. All pages and other digital contents are stored in Web Infomall [23], as described in [12].

The experimental data were collected as follows. In the December 2002 crawl, we collected about 105 million pages starting from about 1 million seed pages. To cover isolated information unreachable through hyperlinks, we automatically scanned large numbers of IP addresses that supplied a web service on port 80 and used them as seeds. According to our evaluation [15], such a crawl covers at least 37% of all the information in the Chinese web. In April 2004, we used the 105 million pages as seeds to download pages from the Chinese web, and collected 251,763,574 static pages, i.e., pages whose URLs do not contain question marks.
These data form the basis of our experiments. According to the CNNIC report [6], there were about 312 million static pages in the Chinese web by December 2003, so our data accounted for about 80.5% of them. It is therefore a fair sample, large enough to represent the World-Wide Web in China.

The analysis of the data stored in Web Infomall was performed on a machine with 4 Intel Xeon 1.9 GHz CPUs and 16 GB of memory. In total we stored about 4 billion hyperlinks, occupying 15 gigabytes after the URLs were transformed into integer identifiers. All these link relations were kept in main memory using TMPFS (Temporary File System), which supports 64-bit files. We reconstructed the web graph and computed the indegree, outdegree, and PageRank value of each node. For any given node, we can then find its maximal peninsula using Algorithm 2.

3.2. Experiments and results

We randomly selected about 1 million nodes and found their peninsulas using Algorithm 2. The nodes were selected with a pseudo-random integer generator, as introduced in [19], producing pseudo-random integers between 0 and 251,763,574.
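For reference, [19] describes the Park-Miller "minimal standard" generator, a Lehmer generator with modulus 2^31 − 1 and multiplier 16807. The sketch below is a plain reading of that generator; the seed value and the modulo mapping onto the node-id range are our own simplifications.

```python
# Park-Miller minimal standard generator: x_{k+1} = 16807 * x_k mod (2^31 - 1).
# Raw outputs lie in [1, M-1]; we fold them onto the node-id range here,
# which introduces a tiny bias that is harmless for this illustration.
M = 2**31 - 1      # Mersenne prime modulus 2147483647
A = 16807          # multiplier recommended by Park and Miller

def minstd(seed=12345):
    x = seed
    while True:
        x = (A * x) % M
        yield x

gen = minstd()
node_ids = [next(gen) % 251_763_574 for _ in range(5)]  # sampled node ids
print(node_ids)
```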

The statistical results of the experiment are summarized below.

3.2.1. Size distribution of peninsulas

As mentioned before, even with Algorithm 2 it is very time consuming to find the maximal peninsulas of all 251,763,574 nodes, so a subset of samples is used for the analysis. Selecting the samples and finding their peninsulas took about 10 days; in the end, 997,965 pages were selected and each of their peninsulas was found. The running average of the peninsula size is illustrated in Fig. 5, which shows that the mean of the peninsula sizes converges as the sample size approaches 1 million; similar results hold for the variance of the sizes. Therefore, we take such a sample as representative of all the nodes.

[Fig. 5. Relation between sample size (200K to 1M) and average peninsula size. As the sample size increases, the average peninsula size converges.]

The size distribution of peninsulas is shown in Fig. 6. The distribution obviously obeys a power law, and we fit the plot as

f(x) = 0.9106 · x^(−1.913),

where x stands for the peninsula size and f(x) for its proportion. The 95% confidence bounds place the exponent between about 1.910 and 1.916. We also find that the average size of a peninsula is 15.6; given the randomness of the sample selection, this value shows that the peninsula is a universal phenomenon.

[Fig. 6. Distribution of sizes of peninsulas. The vertical axis stands for the quantity of peninsulas, and the horizontal axis for the size of peninsulas.]

In fact, the actual average peninsula size in the real web is larger than 15.6, and the real exponent in the fitting function is also larger than 1.913. This is because (1) as stated before, Algorithm 2 sometimes computes only a lower bound on the size of a maximal peninsula; and (2) because of the difficulty of parsing URLs tokenized in JavaScript or other scripts, the dataset misses those pages as well as the URLs they link to, so some real peninsulas are not in our dataset. The peninsula size is therefore underestimated.

As shown in Table 2, we further divide the peninsulas into five classes, with proportions of 91.06%, 5.4%, 2.25%, 0.95%, and 0.34%, respectively. The table shows that nearly 9% of all nodes in a web graph link exclusively to some other nodes.

Table 2
Proportions of peninsulas with different sizes

Peninsula size:  1 (no)   2–10 (small)   11–100 (medium)   101–1000 (large)   >1000 (huge)
Proportion:      91.06%   5.4%           2.25%             0.95%              0.34%

In summary, the experimental results show that the peninsula phenomenon is universal: nearly 9% of all pages on the web link exclusively to a page collection, and the sizes of peninsulas obey a power-law distribution whose exponent is at least 1.913.
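The fitting step can be illustrated as follows: a least-squares line in log-log space recovers the coefficient and exponent of f(x) = c · x^(−b). The sketch uses a small made-up sample rather than the experimental data.

```python
# Fit f(x) = c * x^(-b) to a peninsula-size histogram by linear regression
# in log-log space. `sizes` is a tiny synthetic sample; the real experiment
# used ~1M sampled nodes and obtained f(x) = 0.9106 * x^(-1.913).
import numpy as np

sizes = np.array([1]*9106 + [2]*1800 + [5]*300 + [20]*60 + [100]*10)
values, counts = np.unique(sizes, return_counts=True)
freqs = counts / counts.sum()                 # empirical proportions f(x)

slope, intercept = np.polyfit(np.log(values), np.log(freqs), 1)
c, b = np.exp(intercept), -slope
print(f"f(x) ~= {c:.4f} * x^(-{b:.3f})")
```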

The average size of peninsulas is at least 15.6, which implies that one page missed during crawling results in 15.6 missing pages on average. Only about 0.34% of all peninsulas are larger than 1000 nodes, and the mean size is about 15.6 or more; these sizes are far below 200,000, the limit we normally use for Algorithm 2. Therefore, we can reduce the limit of Algorithm 2 to improve its speed with an acceptable loss of precision.

3.2.2. Correlation between the size of a peninsula and the three link-attributes of its tache

It is important to check whether there is any relationship between the peninsula of a node and its link-attributes. For example, it is very likely that the home pages of websites or directories are the taches of large peninsulas, and such pages usually have large PageRank values, so we want to test this relationship. We obtained the three link-attributes of the one million randomly selected nodes and computed their correlations with the sizes of the peninsulas; the result is shown in Table 3.

[Table 3. Correlations between the three link-attributes (PageRank value, indegree, outdegree) of a tache and the size of its peninsula; all three values are positive and close to 0.]

From Table 3 we can see that the correlations are close to 0, indicating that there is almost no overall relationship between the PageRank values, indegrees, or outdegrees of taches and their peninsula sizes. Furthermore, the correlations are positive, suggesting that peninsula size tends to increase along with the three link-attributes. This positive relationship is further analyzed by plotting the three link-attributes against peninsula size, as shown in Fig. 7.

[Fig. 7. Change of the three link-attributes (average outdegree, average indegree, average PageRank value) as the peninsula becomes large.]

We can conclude from Fig. 7 that as the size of a peninsula increases, all three link-attributes of its tache increase at the same time. However, once the size of a peninsula reaches a certain threshold, about 50 in the middle subplot for example, the positive correlation begins to disappear. That is, for small peninsulas the sizes are strongly related to the three link-attributes of their taches, but for very large peninsulas there is no obvious relationship. Such a threshold probably arises because a very high PageRank value, outdegree, or indegree is no longer significant when comparing the relative importance of highly ranked pages.

3.2.3. Relationship between peninsulas of nodes linking to each other

To test the possibly tight relationship between the peninsulas of linking nodes, we randomly selected 923,572 link relations out of the 4 billion and computed peninsulas for all the linking-from and linking-to nodes using Algorithm 2. The results are given in Table 4.

[Table 4. Correlations between the size of a peninsula and the three link-attributes of the nodes its tache links to, and between the size of a peninsula and the three link-attributes of the nodes that link to its tache. If A links to B, we call A a parent node and B a child node.]

From Table 4, we can see that the size of the peninsula of a node is highly related to the sizes of the peninsulas of the nodes it links to, with a strong positive correlation. In addition, there is little correlation between the PageRank values, indegrees, or outdegrees of parent nodes and the peninsula sizes of child nodes, or vice versa; these correlations are negative.

In summary, the experimental results show the following. The peninsula sizes of linking nodes are highly correlated.
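The correlation computation itself is straightforward; the sketch below shows the shape of such an analysis with synthetic, independent arrays (so the correlations come out near zero, qualitatively like Table 3). It is an illustration, not the experimental pipeline.

```python
# Pearson correlations between peninsula size and the tache's PageRank,
# indegree, and outdegree, on synthetic heavy-tailed data standing in for
# the one million sampled nodes of the experiment.
import numpy as np

rng = np.random.default_rng(0)
size = rng.pareto(1.9, 10_000) + 1           # synthetic peninsula sizes
pagerank = rng.pareto(2.1, 10_000) + 1e-6    # synthetic link-attributes
indegree = rng.pareto(2.1, 10_000)
outdegree = rng.pareto(2.5, 10_000)

for name, attr in [("pagerank", pagerank), ("indegree", indegree),
                   ("outdegree", outdegree)]:
    r = np.corrcoef(size, attr)[0, 1]
    print(f"corr(peninsula size, {name}) = {r:+.3f}")
```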

This can be understood through an example: if A is the tache of a large peninsula and B links to A, then B probably also has a large peninsula. Due to the locality of hyperlinks, the pages linked by B or linking to A are probably in the peninsula of A. This implies that we can expand a peninsula by trying to add the nodes that link to its tache.

The correlations between peninsula size and the PageRank value, indegree, and outdegree of a node containing links to its tache are all negative. As high PageRank values or degrees indicate the fame or importance of pages, we conclude that the more famous or important a page is, the smaller the peninsulas of the pages it links to or is linked from tend to be. In particular, pages linking to a famous page tend to have small peninsulas.

4. Implications on web search

The universal existence of peninsulas has two direct applications: first, we can study the loss in information coverage caused by a quantitative disability in link extraction capability; second, the local link structure of a peninsula can expedite the computation of PageRank values.

4.1. Implications on web crawling and information coverage

If every parsed URL is accessed, the final information coverage of a web spider is determined solely by its link extraction capability. Because of the difficulty of exactly parsing URLs out of a fragment of script, most spiders give up on them. Additionally, as link parsers such as the DUE in Compaq's Mercator [18] are usually overloaded, spiders often have to discard some URLs at the end of a crawl. So if a small portion of pages is not parsed out or downloaded, we should ask how much loss this causes to the final information coverage. The problem is: if 1% of the link extraction capability is lost, how much loss results in the information coverage for pages of different importance?

The importance of pages is difficult to rate exactly. For example, peninsula web pages cannot be accessed directly without going through the front page, which implies that they should not be accorded the same importance as the front page in web search. We use the PageRank algorithm, a widely used global method, to rank the pages in our experiment as an estimate of their relative importance, and we divide the pages into six categories by PageRank value: the top 1%, top 5%, top 10%, top 20%, top 50%, and top 100%.

To simulate a p% loss of link extraction capability, we randomly select ⌊od × (100 − p)%⌋ URLs out of each page with outdegree od. We perform a breadth-first walk in the reconstructed web graph; the experimental results of the effect of a 2% link extraction capability loss are shown in Fig. 8.
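The simulation can be sketched as a lossy breadth-first crawl in which a page of outdegree od yields only ⌊od × (100 − p)%⌋ randomly chosen links. The code below is our illustration with a made-up graph; note that pages with outdegree 1 then lose their single link, which is exactly the effect discussed below.

```python
# Breadth-first crawl with a simulated p% loss of link extraction
# capability: each crawled page yields floor(od * (100 - p) / 100)
# randomly chosen out-links instead of all od of them.
import random
from collections import deque

def crawl_with_loss(graph, seeds, p, seed=0):
    rng = random.Random(seed)
    crawled, frontier = set(seeds), deque(seeds)
    while frontier:
        page = frontier.popleft()
        links = graph.get(page, [])
        kept = rng.sample(links, len(links) * (100 - p) // 100)  # lossy parse
        for url in kept:
            if url not in crawled:
                crawled.add(url)
                frontier.append(url)
    return crawled

web = {0: [1, 2, 3, 4], 1: [5], 2: [6], 5: [7], 6: [7]}
full = crawl_with_loss(web, [0], p=0)
lossy = crawl_with_loss(web, [0], p=25)
print(len(lossy) / len(full))  # fraction of quantity coverage retained
```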
[Fig. 8. Effect on information coverage of link extraction capability loss. Horizontal axis: number of pages crawled; vertical axis: number of URLs extracted; one curve for each of the top 1%, 5%, 10%, 20%, 50%, and 100% PageRank categories.]

In Fig. 8, we use 98% link extraction capability to do a breadth-first walk in the web graph. The horizontal axis is the proportion of crawled pages to the total collection crawled without any link extraction capability loss (actually 118,581,361 pages).

The vertical axis is the proportion of extracted URLs to all URLs extracted without any link extraction capability loss (also 118,581,361). In the figure, a 2% loss in link extraction results in a 51.7% loss of quantity coverage. To confirm this, we selected five other seed sets and repeated the breadth-first walk five times, with similar results each time.

We analyze the result as follows. Indegrees of pages are known to obey the power law [1]:

f_i ∝ C·i^(−2.1).

If we fail to parse out a URL because of the 2% disability, we can still extract it when other pages are downloaded and analyzed later. However, the probability of missing such a URL forever (denoted P_tache) satisfies

P_tache > P_{indegree=1} = 1 / (Σ_{i=1..∞} i^(−2.1)) ≈ 64.112%.

This is because a page with indegree 1, once lost in the crawl, is lost forever, while a page with indegree larger than 1 can still be extracted from another page later. Thus the loss of final information coverage is 2% × P_tache × 15.6, i.e., at least 20.0%. The remaining large gap to the observed 51.7% can be explained in four ways:

1. The simulation of link extraction capability loss is an underestimate. Because most pages have an outdegree of no more than 20 [7], a 2% loss has almost the same impact as a 5% loss: by ⌊od × (100 − p)%⌋, both drop exactly one URL. So the simulation is not precise.

2. The average size of peninsulas is underestimated, since 15.6 is a lower bound on the real value by the construction of Algorithm 2.

3. A breadth-first walk in the reconstructed web graph covers only a portion of all nodes. In Fig. 8, only 118,581,361 (47.1%) of the 251,763,574 nodes are covered even with 100% link extraction capability. The actual average peninsula size in this subgraph of 118,581,361 nodes is larger than 15.6, but we underestimate it with the value computed from its superset.

4. We use only a lower bound for P_tache; this is not crucial, since the bound is already 64.1% and P_tache is at most 1.

We regard the first point as the main factor, so we performed another breadth-first walk with a 5% loss of link extraction capability to reduce the error introduced by the first factor. This walk reached 46.2% of the 118,581,361 nodes, close to the previous 48.3%. Moreover, the theoretical estimate of the loss, 5% × P_tache × 15.6, is at least 50.0%, broadly in agreement with the experimental value of 53.8%. It is thus feasible to simulate a small loss in link extraction with a 2% or 5% link extraction capability loss. This experiment proves that a slight loss in link extraction capability causes a great loss in final quantity coverage.

We also notice that a small loss in link extraction capability has less effect on the quality coverage of important pages. For example, coverage of the pages whose PageRank values are in the top 1% remains at about 80%. Fig. 8 shows that the more important the web pages are, the less link extraction capability loss affects their coverage. This is because important pages with high PageRank values seldom appear inside the peninsulas of other nodes: they usually have many incoming links and are seldom exclusively reached through others, so they are less sensitive to link extraction capability.
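The arithmetic of this estimate is easy to check numerically; the snippet below (an illustration, not from the paper) approximates the zeta-function sum and the resulting coverage-loss bounds for 2% and 5% losses.

```python
# Under f_i ~ C * i^(-2.1), the probability that a page has indegree 1 is
# 1 / sum_{i>=1} i^(-2.1), and the expected coverage loss per the paper's
# estimate is p * P_tache * 15.6 for a p link-extraction loss.
zeta_21 = sum(i ** -2.1 for i in range(1, 1_000_000))   # ~= 1.560
p_tache = 1 / zeta_21                                   # paper: 64.112%
print(f"P(indegree = 1) ~= {p_tache:.5f}")
print(f"2% loss -> coverage loss >= {0.02 * p_tache * 15.6:.3f}")  # ~0.200
print(f"5% loss -> coverage loss >= {0.05 * p_tache * 15.6:.3f}")  # ~0.500
```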
In the breadth-first walk experiment, we also verified that the peninsula size of a node is almost independent of its PageRank value, as already indicated by the correlation analysis in Section 3.2. We changed the simulation of link extraction capability as follows: to simulate a p% loss, we randomly select ⌊od × (100 − p)%⌋ URLs out of each page with outdegree od only once the number of crawled pages exceeds 5 million (about 2% of all); before that, no URL is dropped. As is known, breadth-first crawling yields high-quality pages first [17]. So if there were a tight relation between peninsula sizes and PageRank values, we would reach the taches of large peninsulas first, few large peninsulas would be missed, and the link extraction capability loss would have almost no effect on coverage. However, the experimental results contradict this, as shown in Fig. 9: there is no crucial increase in information coverage (only from 48.3% to 56.2%) under the modified link extraction capability loss.

In summary, a 5% loss in link extraction capability causes a 53.8% loss of quantity coverage. This implies that a powerful link parser is necessary for web spiders despite its cost. For quality coverage aimed at important pages, however, such a loss has little effect. Our experiment also indirectly illustrates the independence between the peninsula size and the PageRank value of the tache.

[Fig. 9. Effect on information coverage of a different link extraction capability loss. Horizontal axis: number of pages crawled; vertical axis: number of URLs extracted; one curve for each of the top 1%, 5%, 10%, 20%, 50%, and 100% PageRank categories.]

4.2. Implications on computation of PageRank

The Google PageRank algorithm is based on the link relations between pages and is now widely used in search engines. If a page Q is linked by nodes P_1, P_2, ..., P_m, then according to this algorithm the PageRank value of Q, denoted PR(Q), is usually written as

PR(Q) = k · Σ_{i=1..m} PR(P_i)/od_{P_i} + (1 − k).    (1)

If the PageRank values of all N pages P_1, P_2, ..., P_N are regarded as a vector PR, and a matrix M is defined by M_{i,j} = 1/od_i whenever P_i links to P_j (and 0 otherwise), where od_i is the outdegree of P_i, we can transform Eq. (1) into

PR = k·M·PR + (1 − k)·I,    (2)

where I is the vector of all 1s. The computation of PR is usually completed through power iteration:

PR_{i+1} = k·M·PR_i + (1 − k)·I.    (3)

Or, if we scale PR so that I^T·PR = N, we get

PR_{i+1} = (k·M + (1 − k)·I·I^T/N)·PR_i.    (4)

In fact, PR is then an eigenvector of the matrix k·M + (1 − k)·I·I^T/N.

According to Eq. (3) or (4), PR eventually converges after enough iterations. However, the web graph is too huge for M to fit in memory, so M has to be divided into blocks for the calculation, as described in [2], where depth-first search [21] is used to find irreducible diagonal blocks in the graph structure; the SCC of Fig. 1 is then simply the largest block. Kamvar et al. [20] give a better solution, the BlockRank algorithm, which divides the web graph into host-blocks according to the strong local link relations inside a single website. This method first computes a local PageRank value for each page, and then uses the vector of local PageRank values as the initial vector of the iteration for the global PageRank values. Because the initial vector is provably close to the global one, it decreases the number of iterations required for convergence.

Our definition of peninsula confines link relations to a small local area, so we can also use it to divide the web graph and decrease the scale of M. It can find not only blocks inside a single website, but also blocks spanning pages from many different websites. We first give a theorem:

Theorem 4. The PageRank value of each node in a peninsula is determined by the PageRank value of the tache.

Assume the peninsula is P_v = {v, N_1, N_2, ..., N_A}. By the definition, each N_i has no incoming links from outside and is reachable only from other nodes in P_v, so we can write the PageRank values as

PR(N_i) = a_i·PR(v) + a_{1i}·PR(N_1) + a_{2i}·PR(N_2) + ... + a_{Ai}·PR(N_A) + b_i,  i = 1, 2, ..., A,    (5)

where a_{i,j} = k/od_i if there is a link from N_i to N_j, and a_{i,j} = 0 otherwise. If PR(v) is known, we now have A linear equations in the A variables PR(N_1), PR(N_2), ..., PR(N_A), each of which can therefore be expressed as a function of PR(v). This proves Theorem 4.
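Before turning to the block algorithm, the plain power iteration of Eq. (3) can be sketched as follows. Note that with the convention M_{i,j} = 1/od_i, row Q of the iteration matrix must sum PR(P_i)/od_{P_i} over the in-links of Q, so the sketch builds that (transposed) matrix directly; the toy graph and the damping factor k = 0.85 are our assumptions.

```python
# Power iteration of Eq. (3): PR_{i+1} = k*M*PR_i + (1-k)*I, with M built
# so that entry [Q, P] = 1/od_P whenever P links to Q (i.e., Eq. (1)
# holds row-wise). PR is initialized to all ones, so sum(PR) = N as in
# the scaling of Eq. (4).
import numpy as np

links = {0: [1, 2], 1: [2], 2: [0], 3: [2]}   # toy graph: i -> links[i]
n, k = 4, 0.85
M = np.zeros((n, n))
for i, outs in links.items():
    for j in outs:
        M[j, i] = 1.0 / len(outs)             # M[Q, P] = 1 / od_P

pr = np.ones(n)
for _ in range(100):
    pr = k * M @ pr + (1 - k) * np.ones(n)    # Eq. (3)
print(pr)
```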

According to Theorem 4, we can compute the PageRank values as follows:

Algorithm 3. An improved algorithm for computing PageRank values.

1. Divide the collection of vertices V of the web graph into S disjoint subsets, each denoting a peninsula.
2. For each of the S subsets, express all PageRank values in terms of the tache's value according to Eq. (5).
3. Perform the power iteration according to Eq. (4), replacing the PageRank values of all non-tache nodes by those of their taches.
4. Using the solution of Eq. (5), compute the PageRank values of all non-tache nodes.

In Algorithm 3, steps (2) and (4) can be computed locally, which means they can run concurrently on different machines without a global view of M. Step (1) might be difficult, but with an approximate method like Algorithm 2 it is fast enough (in fact, the limit can be decreased further for better speed, as mentioned before). The most demanding step is step (3); since the average peninsula size is 15.6, the scale of M can be decreased to about 6.41% of its original size. The detailed implementation of Algorithm 3 is similar to [20], with two differences: (1) our division of the web graph is based on Algorithm 2; (2) the local PageRank value computed within a block by Algorithm 3 is already the global one, whereas in [20] the local values must be iterated again to obtain the global ones.

For example, in a web graph with four nodes in which the first three form a peninsula with node 1 as the tache, M can be shrunk as follows (f_2 and f_3 are the coefficients that express PR(N_2) and PR(N_3) through the tache by Eq. (5)):

M = | m_{1,1}  m_{1,2}  m_{1,3}  m_{1,4} |
    | m_{2,1}  m_{2,2}  m_{2,3}  m_{2,4} |
    | m_{3,1}  m_{3,2}  m_{3,3}  m_{3,4} |
    | m_{4,1}  0        0        m_{4,4} |

  → | m_{1,1}  0  0  (1 + f_2 + f_3)·m_{1,4} |
    | 0        0  0  0                       |
    | 0        0  0  0                       |
    | m_{4,1}  0  0  m_{4,4}                 |

  → M′ = | m_{1,1}  (1 + f_2 + f_3)·m_{1,4} |
         | m_{4,1}  m_{4,4}                 |

In summary, we find an efficient way to split M into many small, relatively independent blocks, which greatly expedites the power iteration by shrinking the matrix and relaxing the memory requirement. The proposed algorithm can handle some special cases more efficiently than previously reported algorithms.
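The shrinking step can be mirrored in code. In the sketch below, the matrix entries and the Eq. (5) coefficients f_2 and f_3 are hypothetical placeholders; the point is only the structural collapse of the 4×4 matrix onto the tache and the outside node.

```python
# Structural illustration of the matrix shrinking: nodes 1..3 form a
# peninsula with tache 1, so the peninsula rows/columns fold into the
# tache's, scaled by (1 + f2 + f3), and M collapses from 4x4 to 2x2.
import numpy as np

M = np.array([[0.0, 0.5, 0.5, 0.3],
              [0.5, 0.0, 0.5, 0.3],
              [0.5, 0.5, 0.0, 0.4],
              [0.2, 0.0, 0.0, 0.1]])   # columns 2, 3 never feed node 4

f2, f3 = 0.6, 0.4                      # hypothetical Eq. (5) coefficients
M_shrunk = np.array([[M[0, 0], (1 + f2 + f3) * M[0, 3]],
                     [M[3, 0], M[3, 3]]])
print(M_shrunk)
```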
5. Conclusion and future work

In this paper, we performed a comprehensive study of peninsulas in web graphs, a non-obvious but universal phenomenon. We first analyzed the reasons for the existence of peninsulas: some site administrators place content behind home pages so that there are no links to it from outside, and web resources are usually organized in a directory tree structure. We then presented the definition of peninsulas, described some of their characteristics, and proposed two search algorithms: one finds the exact maximal peninsula with moderate efficiency, the other is very fast with a tolerable compromise in accuracy.

We conducted a large-scale web experiment using data collected by the Tianwang system. We found that the sizes of peninsulas obey a power-law distribution and fitted a regression function for it. We also studied the correlations between the sizes of peninsulas and the indegrees, outdegrees, and PageRank values of nodes. The conclusion is that there is little correlation between the size of a peninsula and the indegree, outdegree, or PageRank value of its tache, while there is a strong relation between the peninsula of a node and those of the nodes that link to it.

Based on these experimental results, we quantitatively analyzed the implications for web search. On one hand, the peninsula structure of a web graph can be used to greatly expedite the computation of PageRank values by helping to divide the web graph into blocks; on the other hand, it has a prominent effect on the information coverage of search engines: a slight loss of link extraction capability can cause a great loss in information coverage.

For future work, we plan to prove the existence of peninsulas and analyze their characteristics theoretically, from the viewpoint of the web generation models proposed by Levene et al. [14] and Krapivsky and Redner [11].

Acknowledgements

We thank Professor Xiaoming Li, Dr. Weihong Wang, Dr. Bo Peng, Zhifeng Yang, Lianen Huang, Jiaji Zhu, Jiajing Li, Bihong Gong, and Xiubo Zhao from Peking University, and Qu Li from China University of Geosciences, for their comments. Thanks also to Jing Zhao from Hong Kong University of Science and Technology, Xiaojie Gao from California Institute of Technology, and Hang Su from Vanderbilt University for their proofreading.

The work of Tao Meng was supported by NSFC grants.

References

[1] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, Janet Wiener, Graph structure in the web: experiments and models, in: Proceedings of the 9th WWW Conference, 2000.
[2] Arvind Arasu, Jasmine Novak, Andrew S. Tomkins, John A. Tomlin, PageRank computation and the structure of the web: experiments and algorithms, in: Poster Proceedings of WWW2002, Honolulu, 2002.
[3] A. Barabasi, R. Albert, Emergence of scaling in random networks, Science 286 (1999) 509.
[4] P. Boldi, S. Vigna, The WebGraph framework I: compression techniques, in: Proceedings of the 13th International WWW Conference, 2004.
[5] Junghoo Cho, Hector Garcia-Molina, Parallel crawlers, in: Proceedings of the 11th World Wide Web Conference, Hawaii, USA, May 2002.
[6] The China Internet Network Information Center, China's Internet Development and Usage Report, 2003.
[7] Cyveillance, Inc., Sizing the Internet, white paper, July 2000.
[8] N. Eiron, K.S. McCurley, Locality, hierarchy, and bidirectionality on the web, in: Workshop on Web Algorithms and Models, 2003.
[9] J. Kleinberg, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, The web as a graph: measurements, models and methods, in: Proceedings of the International Conference on Combinatorics and Computing, July 26-28, 1999.
[10] R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, Trawling the web for cyber communities, in: Proceedings of the 8th WWW Conference, 1999.
[11] P.L. Krapivsky, S. Redner, A statistical physics perspective on web growth, Computer Networks 39 (2002).
[12] Lianen Huang, Hongfei Yan, Xiaoming Li, Engineering of web Infomall: the Chinese web archive, in: Proceedings of the World Engineers Convention.
[13] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: bringing order to the web, Stanford Digital Libraries Working Paper, 1998.
[14] Mark Levene, Trevor Fenner, George Loizou, Richard Wheeldon, A stochastic model for the evolution of the web, Computer Networks 39 (2002).
[15] Tao Meng, Hongfei Yan, Xiaoming Li, An evaluation model on information coverage of search engines, Chinese Journal of Electronics 31 (8) (2003) (in Chinese).
[16] Tao Meng, Hongfei Yan, Jimin Wang, Xiaoming Li, The evolution of link-attributes for pages and its implications on web crawling, in: Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence, 2004.
[17] Marc Najork, Janet L. Wiener, Breadth-first search crawling yields high-quality pages, in: Proceedings of the 10th World Wide Web Conference, May 2001.
[18] Marc Najork, Allan Heydon, High-performance web crawling, COMPAQ SRC Research Report 173, September 26, 2001.
[19] S.K. Park, K.W. Miller, Random number generators: good ones are hard to find, Communications of the ACM 31 (10) (1988).
[20] Sepandar D. Kamvar, Taher H. Haveliwala, Christopher D. Manning, Gene H. Golub, Exploiting the block structure of the web for computing PageRank, in: Proceedings of the 12th World Wide Web Conference, 2003.
[21] R.E.
Tarjan, Depth-first search and linear graph algorithms, SIAM Journal on Computing 1 (2) (1972) 146-160.
[22] Tianwang search engine, by the Lab of Computer Networks and Distributed System, Peking University, since 1997.
[23] Web Infomall, a web archive based on the Tianwang search engine, which has stored web pages in China since 1997, by the Lab of Computer Networks and Distributed System, Peking University. Available from: <http://www.infomall.cn/>.
[24] Hongfei Yan, Jianyong Wang, Xiaoming Li, Lin Guo, Architectural design and evaluation of an efficient web-crawling system, Journal of Systems and Software (March) (2002).

Tao Meng received his bachelor's degree in computer science from Peking University. He is currently a Ph.D. candidate at Peking University, supervised by Prof. Xiaoming Li. His research interests mainly include search engines and web mining. Meng joined the Tianwang Search Engine group in 2001 and has worked primarily on web crawlers and link analysis since then. His design and implementation work in 2005 made the Tianwang system capable of downloading more than one billion web pages and performing link analysis, such as PageRank computation, on them in a short period.

Hong-Fei Yan received his Ph.D. in computer science from Peking University. He is currently an associate professor at Peking University. His research interests involve information retrieval and distributed systems. He was in charge of the parallel upgrade of the Tianwang Search Engine, scaling it from one million pages to tens of millions of pages. He also pioneered the deployment of the first large-scale Chinese web test collection, with 100 GB of web pages (CWT100g), and has organized the annual Workshop on Chinese Web Information Retrieval Evaluation since 2004.


Popularity of Twitter Accounts: PageRank on a Social Network

Popularity of Twitter Accounts: PageRank on a Social Network Popularity of Twitter Accounts: PageRank on a Social Network A.D-A December 8, 2017 1 Problem Statement Twitter is a social networking service, where users can create and interact with 140 character messages,

More information

Exploring both Content and Link Quality for Anti-Spamming

Exploring both Content and Link Quality for Anti-Spamming Exploring both Content and Link Quality for Anti-Spamming Lei Zhang, Yi Zhang, Yan Zhang National Laboratory on Machine Perception Peking University 100871 Beijing, China zhangl, zhangyi, zhy @cis.pku.edu.cn

More information

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION

CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION CHAPTER 6 MODIFIED FUZZY TECHNIQUES BASED IMAGE SEGMENTATION 6.1 INTRODUCTION Fuzzy logic based computational techniques are becoming increasingly important in the medical image analysis arena. The significant

More information

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science Lecture 9: I: Web Retrieval II: Webology Johan Bollen Old Dominion University Department of Computer Science jbollen@cs.odu.edu http://www.cs.odu.edu/ jbollen April 10, 2003 Page 1 WWW retrieval Two approaches

More information

A scalable lightweight distributed crawler for crawling with limited resources

A scalable lightweight distributed crawler for crawling with limited resources University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 A scalable lightweight distributed crawler for crawling with limited

More information

Adaptive methods for the computation of PageRank

Adaptive methods for the computation of PageRank Linear Algebra and its Applications 386 (24) 51 65 www.elsevier.com/locate/laa Adaptive methods for the computation of PageRank Sepandar Kamvar a,, Taher Haveliwala b,genegolub a a Scientific omputing

More information

A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A TEST CASE. Antonios Kogias Dimosthenis Anagnostopoulos

A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A TEST CASE. Antonios Kogias Dimosthenis Anagnostopoulos Proceedings of the 2006 Winter Simulation Conference L. F. Perrone, F. P. Wieland, J. Liu, B. G. Lawson, D. M. Nicol, and R. M. Fujimoto, eds. A METHODOLOGY FOR THE EVALUATION OF WEB GRAPH MODELS AND A

More information

RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee

RankMass Crawler: A Crawler with High Personalized PageRank Coverage Guarantee Crawler: A Crawler with High Personalized PageRank Coverage Guarantee Junghoo Cho University of California Los Angeles 4732 Boelter Hall Los Angeles, CA 90095 cho@cs.ucla.edu Uri Schonfeld University of

More information

An Improved PageRank Method based on Genetic Algorithm for Web Search

An Improved PageRank Method based on Genetic Algorithm for Web Search Available online at www.sciencedirect.com Procedia Engineering 15 (2011) 2983 2987 Advanced in Control Engineeringand Information Science An Improved PageRank Method based on Genetic Algorithm for Web

More information

A P2P-based Incremental Web Ranking Algorithm

A P2P-based Incremental Web Ranking Algorithm A P2P-based Incremental Web Ranking Algorithm Sumalee Sangamuang Pruet Boonma Juggapong Natwichai Computer Engineering Department Faculty of Engineering, Chiang Mai University, Thailand sangamuang.s@gmail.com,

More information

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule 1 How big is the Web How big is the Web? In the past, this question

More information

Ranking web pages using machine learning approaches

Ranking web pages using machine learning approaches University of Wollongong Research Online Faculty of Informatics - Papers (Archive) Faculty of Engineering and Information Sciences 2008 Ranking web pages using machine learning approaches Sweah Liang Yong

More information

Motivation. Motivation

Motivation. Motivation COMS11 Motivation PageRank Department of Computer Science, University of Bristol Bristol, UK 1 November 1 The World-Wide Web was invented by Tim Berners-Lee circa 1991. By the late 199s, the amount of

More information

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine

International Journal of Scientific & Engineering Research Volume 2, Issue 12, December ISSN Web Search Engine International Journal of Scientific & Engineering Research Volume 2, Issue 12, December-2011 1 Web Search Engine G.Hanumantha Rao*, G.NarenderΨ, B.Srinivasa Rao+, M.Srilatha* Abstract This paper explains

More information

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015 CS6200 Information Retreival The WebGraph The WebGraph July 13, 2015 1 Web Graph: pages and links The WebGraph describes the directed links between pages of the World Wide Web. A directed edge connects

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

The Mobile Web Is Structurally Different

The Mobile Web Is Structurally Different The Mobile Web Is Structurally Different Apoorva Jindal Dept. of Electrical Engineering University of Southern California Los Angeles, CA 90089 Email: apoorvaj@usc.edu Christopher Crutchfield CSAIL Massachusetts

More information

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Proximity Prestige using Incremental Iteration in Page Rank Algorithm Indian Journal of Science and Technology, Vol 9(48), DOI: 10.17485/ijst/2016/v9i48/107962, December 2016 ISSN (Print) : 0974-6846 ISSN (Online) : 0974-5645 Proximity Prestige using Incremental Iteration

More information

News Page Discovery Policy for Instant Crawlers

News Page Discovery Policy for Instant Crawlers News Page Discovery Policy for Instant Crawlers Yong Wang, Yiqun Liu, Min Zhang, Shaoping Ma State Key Lab of Intelligent Tech. & Sys., Tsinghua University wang-yong05@mails.tsinghua.edu.cn Abstract. Many

More information

Link Analysis in Web Information Retrieval

Link Analysis in Web Information Retrieval Link Analysis in Web Information Retrieval Monika Henzinger Google Incorporated Mountain View, California monika@google.com Abstract The analysis of the hyperlink structure of the web has led to significant

More information

Identify Temporal Websites Based on User Behavior Analysis

Identify Temporal Websites Based on User Behavior Analysis Identify Temporal Websites Based on User Behavior Analysis Yong Wang, Yiqun Liu, Min Zhang, Shaoping Ma State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for Information

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

On Finding Power Method in Spreading Activation Search

On Finding Power Method in Spreading Activation Search On Finding Power Method in Spreading Activation Search Ján Suchal Slovak University of Technology Faculty of Informatics and Information Technologies Institute of Informatics and Software Engineering Ilkovičova

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis

Plan for today. CS276B Text Retrieval and Mining Winter Evolution of search engines. Connectivity analysis CS276B Text Retrieval and Mining Winter 2005 Lecture 7 Plan for today Review search engine history (slightly more technically than in the first lecture) Web crawling/corpus construction Distributed crawling

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information

Finding Strongly Connected Components

Finding Strongly Connected Components Yufei Tao ITEE University of Queensland We just can t get enough of the beautiful algorithm of DFS! In this lecture, we will use it to solve a problem finding strongly connected components that seems to

More information

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to

More information

Algorithms for minimum m-connected k-tuple dominating set problem

Algorithms for minimum m-connected k-tuple dominating set problem Theoretical Computer Science 381 (2007) 241 247 www.elsevier.com/locate/tcs Algorithms for minimum m-connected k-tuple dominating set problem Weiping Shang a,c,, Pengjun Wan b, Frances Yao c, Xiaodong

More information

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler

Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Term-Frequency Inverse-Document Frequency Definition Semantic (TIDS) Based Focused Web Crawler Mukesh Kumar and Renu Vig University Institute of Engineering and Technology, Panjab University, Chandigarh,

More information

An Algorithm of Parking Planning for Smart Parking System

An Algorithm of Parking Planning for Smart Parking System An Algorithm of Parking Planning for Smart Parking System Xuejian Zhao Wuhan University Hubei, China Email: xuejian zhao@sina.com Kui Zhao Zhejiang University Zhejiang, China Email: zhaokui@zju.edu.cn

More information

arxiv:cs/ v1 [cs.ir] 26 Apr 2002

arxiv:cs/ v1 [cs.ir] 26 Apr 2002 Navigating the Small World Web by Textual Cues arxiv:cs/0204054v1 [cs.ir] 26 Apr 2002 Filippo Menczer Department of Management Sciences The University of Iowa Iowa City, IA 52242 Phone: (319) 335-0884

More information

Mathematical Analysis of Google PageRank

Mathematical Analysis of Google PageRank INRIA Sophia Antipolis, France Ranking Answers to User Query Ranking Answers to User Query How a search engine should sort the retrieved answers? Possible solutions: (a) use the frequency of the searched

More information

The application of Randomized HITS algorithm in the fund trading network

The application of Randomized HITS algorithm in the fund trading network The application of Randomized HITS algorithm in the fund trading network Xingyu Xu 1, Zhen Wang 1,Chunhe Tao 1,Haifeng He 1 1 The Third Research Institute of Ministry of Public Security,China Abstract.

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM Masahito Yamamoto, Hidenori Kawamura and Azuma Ohuchi Graduate School of Information Science and Technology, Hokkaido University, Japan

More information

Improving Range Query Performance on Historic Web Page Data

Improving Range Query Performance on Historic Web Page Data Improving Range Query Performance on Historic Web Page Data Geng LI Lab of Computer Networks and Distributed Systems, Peking University Beijing, China ligeng@net.pku.edu.cn Bo Peng Lab of Computer Networks

More information

COMMUNITY SHELL S EFFECT ON THE DISINTEGRATION OF SOCIAL NETWORKS

COMMUNITY SHELL S EFFECT ON THE DISINTEGRATION OF SOCIAL NETWORKS Annales Univ. Sci. Budapest., Sect. Comp. 43 (2014) 57 68 COMMUNITY SHELL S EFFECT ON THE DISINTEGRATION OF SOCIAL NETWORKS Imre Szücs (Budapest, Hungary) Attila Kiss (Budapest, Hungary) Dedicated to András

More information

Hypergraph-Theoretic Partitioning Models for Parallel Web Crawling

Hypergraph-Theoretic Partitioning Models for Parallel Web Crawling Hypergraph-Theoretic Partitioning Models for Parallel Web Crawling Ata Turk, B. Barla Cambazoglu and Cevdet Aykanat Abstract Parallel web crawling is an important technique employed by large-scale search

More information

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler

Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler IJCSNS International Journal of Computer Science and Network Security, VOL.8 No.12, December 2008 349 Self Adjusting Refresh Time Based Architecture for Incremental Web Crawler A.K. Sharma 1, Ashutosh

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

A BEGINNER S GUIDE TO THE WEBGRAPH: PROPERTIES, MODELS AND ALGORITHMS.

A BEGINNER S GUIDE TO THE WEBGRAPH: PROPERTIES, MODELS AND ALGORITHMS. A BEGINNER S GUIDE TO THE WEBGRAPH: PROPERTIES, MODELS AND ALGORITHMS. Debora Donato Luigi Laura Stefano Millozzi Dipartimento di Informatica e Sistemistica Università di Roma La Sapienza {donato,laura,millozzi}@dis.uniroma1.it

More information

On Compressing Social Networks. Ravi Kumar. Yahoo! Research, Sunnyvale, CA. Jun 30, 2009 KDD 1

On Compressing Social Networks. Ravi Kumar. Yahoo! Research, Sunnyvale, CA. Jun 30, 2009 KDD 1 On Compressing Social Networks Ravi Kumar Yahoo! Research, Sunnyvale, CA KDD 1 Joint work with Flavio Chierichetti, University of Rome Silvio Lattanzi, University of Rome Michael Mitzenmacher, Harvard

More information

Research and Improvement of Apriori Algorithm Based on Hadoop

Research and Improvement of Apriori Algorithm Based on Hadoop Research and Improvement of Apriori Algorithm Based on Hadoop Gao Pengfei a, Wang Jianguo b and Liu Pengcheng c School of Computer Science and Engineering Xi'an Technological University Xi'an, 710021,

More information

Reading Time: A Method for Improving the Ranking Scores of Web Pages

Reading Time: A Method for Improving the Ranking Scores of Web Pages Reading Time: A Method for Improving the Ranking Scores of Web Pages Shweta Agarwal Asst. Prof., CS&IT Deptt. MIT, Moradabad, U.P. India Bharat Bhushan Agarwal Asst. Prof., CS&IT Deptt. IFTM, Moradabad,

More information

TELCOM2125: Network Science and Analysis

TELCOM2125: Network Science and Analysis School of Information Sciences University of Pittsburgh TELCOM2125: Network Science and Analysis Konstantinos Pelechrinis Spring 2015 Figures are taken from: M.E.J. Newman, Networks: An Introduction 2

More information

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES

AN OVERVIEW OF SEARCHING AND DISCOVERING WEB BASED INFORMATION RESOURCES Journal of Defense Resources Management No. 1 (1) / 2010 AN OVERVIEW OF SEARCHING AND DISCOVERING Cezar VASILESCU Regional Department of Defense Resources Management Studies Abstract: The Internet becomes

More information

Massively Parallel Approximation Algorithms for the Knapsack Problem

Massively Parallel Approximation Algorithms for the Knapsack Problem Massively Parallel Approximation Algorithms for the Knapsack Problem Zhenkuang He Rochester Institute of Technology Department of Computer Science zxh3909@g.rit.edu Committee: Chair: Prof. Alan Kaminsky

More information

LET:Towards More Precise Clustering of Search Results

LET:Towards More Precise Clustering of Search Results LET:Towards More Precise Clustering of Search Results Yi Zhang, Lidong Bing,Yexin Wang, Yan Zhang State Key Laboratory on Machine Perception Peking University,100871 Beijing, China {zhangyi, bingld,wangyx,zhy}@cis.pku.edu.cn

More information

Constructing Websites toward High Ranking Using Search Engine Optimization SEO

Constructing Websites toward High Ranking Using Search Engine Optimization SEO Constructing Websites toward High Ranking Using Search Engine Optimization SEO Pre-Publishing Paper Jasour Obeidat 1 Dr. Raed Hanandeh 2 Master Student CIS PhD in E-Business Middle East University of Jordan

More information

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches

Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Incorporating Satellite Documents into Co-citation Networks for Scientific Paper Searches Masaki Eto Gakushuin Women s College Tokyo, Japan masaki.eto@gakushuin.ac.jp Abstract. To improve the search performance

More information

Searching the Web What is this Page Known for? Luis De Alba

Searching the Web What is this Page Known for? Luis De Alba Searching the Web What is this Page Known for? Luis De Alba ldealbar@cc.hut.fi Searching the Web Arasu, Cho, Garcia-Molina, Paepcke, Raghavan August, 2001. Stanford University Introduction People browse

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments

Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing Environments Send Orders for Reprints to reprints@benthamscience.ae 368 The Open Automation and Control Systems Journal, 2014, 6, 368-373 Open Access Apriori Algorithm Research Based on Map-Reduce in Cloud Computing

More information

Comparative Study of Subspace Clustering Algorithms

Comparative Study of Subspace Clustering Algorithms Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that

More information

5. Lecture notes on matroid intersection

5. Lecture notes on matroid intersection Massachusetts Institute of Technology Handout 14 18.433: Combinatorial Optimization April 1st, 2009 Michel X. Goemans 5. Lecture notes on matroid intersection One nice feature about matroids is that a

More information

An Application of Personalized PageRank Vectors: Personalized Search Engine

An Application of Personalized PageRank Vectors: Personalized Search Engine An Application of Personalized PageRank Vectors: Personalized Search Engine Mehmet S. Aktas 1,2, Mehmet A. Nacar 1,2, and Filippo Menczer 1,3 1 Indiana University, Computer Science Department Lindley Hall

More information

Page Rank Link Farm Detection

Page Rank Link Farm Detection International Journal of Engineering Inventions e-issn: 2278-7461, p-issn: 2319-6491 Volume 4, Issue 1 (July 2014) PP: 55-59 Page Rank Link Farm Detection Akshay Saxena 1, Rohit Nigam 2 1, 2 Department

More information

Link Analysis and Web Search

Link Analysis and Web Search Link Analysis and Web Search Moreno Marzolla Dip. di Informatica Scienza e Ingegneria (DISI) Università di Bologna http://www.moreno.marzolla.name/ based on material by prof. Bing Liu http://www.cs.uic.edu/~liub/webminingbook.html

More information

Efficient extraction of news articles based on RSS crawling

Efficient extraction of news articles based on RSS crawling Efficient extraction of news articles based on RSS crawling George Adam, Christos Bouras and Vassilis Poulopoulos Research Academic Computer Technology Institute, and Computer and Informatics Engineer

More information

Domain Specific Search Engine for Students

Domain Specific Search Engine for Students Domain Specific Search Engine for Students Domain Specific Search Engine for Students Wai Yuen Tang The Department of Computer Science City University of Hong Kong, Hong Kong wytang@cs.cityu.edu.hk Lam

More information

Searching the Deep Web

Searching the Deep Web Searching the Deep Web 1 What is Deep Web? Information accessed only through HTML form pages database queries results embedded in HTML pages Also can included other information on Web can t directly index

More information

Fast trajectory matching using small binary images

Fast trajectory matching using small binary images Title Fast trajectory matching using small binary images Author(s) Zhuo, W; Schnieders, D; Wong, KKY Citation The 3rd International Conference on Multimedia Technology (ICMT 2013), Guangzhou, China, 29

More information

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul

CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul 1 CS224W Project Write-up Static Crawling on Social Graph Chantat Eksombatchai Norases Vesdapunt Phumchanit Watanaprakornkul Introduction Our problem is crawling a static social graph (snapshot). Given

More information