Web Structure Mining: Exploring Hyperlinks and Algorithms for Information Retrieval

Size: px

Start display at page:

Download "Web Structure Mining: Exploring Hyperlinks and Algorithms for Information Retrieval"

Bathsheba Holt
6 years ago
Views:

1 eb Structure Mg: Explorg Hyperlks and Algorithms for Information Retrieval Ravi Kumar P Department of ECEC, Curt University of Technology, Sarawak Campus, Miri, Malaysia ravi2266@gmail.com Ashutosh Kumar Sgh Department of ECEC, Curt University of Technology, Sarawak Campus, Miri, Malaysia ashutosh.s@curt.edu.my Abstract This paper focus on the Hyperlk analysis, the algorithms used for lk analysis, compare those algorithms and the role of hyperlk analysis eb searchg. In the hyperlk analysis, the number of comg lks to a page and the number of gog lks from that page will be analyzed and the reliability of the lkg will be analyzed. Authorities and Hubs concept of eb pages will be explored. The different algorithms used for Lk analysis like PageRank, HITS (Hyperlk-Induced Topic Search) and other algorithms will be discussed and compared. The formula used by those algorithms will be explored. Keywords - eb Mg, eb Content, eb Structure, eb Graph, Information Retrieval, Hyperlk Analysis, PageRank, eighted PageRank and HITS. I. INTRODUCTION The eb is a massive, explosive, diverse, dynamic and mostly unstructured data repository, which delivers an credible amount of formation, and also creases the complexity of dealg with the formation from the different perspectives of knowledge seekers, eb service providers and busess analysts. The followg are considered as challenges [1] the eb mg: eb is huge and eb Pages are semi-structured eb formation tends to be diversity meang Degree of quality of the formation extracted Conclusion of the knowledge from the formation extracted eb mg techniques along with other areas like Database (DB), Information Retrieval (IR), Natural Language Processg usage of the eb data used as put the data mg process, namely, eb Content Mg (CM), eb Usage Mg (UM) and eb Structure Mg(SM)eb content mg is concerned with the retrieval of formation from to more structured forms and dexg the formation to retrieve it quicklyeb usage mg is the process of identifyg the browsg patterns by analyzg the user s navigational behavioreb structure mg is to discover the model underlyg the lk structures of the eb pages, catalog them and generate formation such as the similarity and (NLP), Mache Learng etc. can be used to solve the above challengeseb mg is the use of data mg techniques to automatically discover and extract formation from the orld ide eb ()eb structure mg helps the users to retrieve the relevant documents by analyzg the lk structure of the eb. This paper is organized as follows. Next Section provides concepts of eb Structure mg and eb Graph. Section III provides Hyperlk analysis, algorithms and their comparisons. Paper is concluded Section IV. II. EB STRUCTURE MINING A. Overview Accordg to Kosala et al [2], eb mg consists of the followg tasks: Resource fdg: the task of retrievg tended eb documents. Information selection and pre-processg: automatically selectg and pre-processg specific formation from retrieved eb resources. Generalization: automatically discovers general patterns at dividual eb sites as well as across multiple sites. Analysis: validation and/or terpretation of the med patterns. There are three areas of eb mg accordg to the relationship between them, takg advantage of their hyperlk topology. Hyperlk analysis and the algorithms discussed here are related to eb Structure mg. Even though there are three areas of eb mg, the differences between them are narrowg because they are all terconnected. B. How big is eb A Google report [3] on 25 th July 2008 says that there are 1 trillion (1,000,000,000,000) unique URLs (Universal Resource Locator) on the eb. The actual number could be more than that 1 ICT_09_curt

2 and Google could not dex all the pageshen Google first created the dex 1998 there were 26 million pages and 2000 Google dex reached 1 billion pages. In the last 9 years, eb has grown tremendously and the usage of the web is unimagable. So it is important to understand and analyze the underlyg data structure of the eb for effective Information Retrieval. Ceb Data Structure The traditional formation retrieval system basically focuses on formation provided by the text of eb documentseb mg technique provides additional formation through hyperlks where different documents are connected. The eb may be viewed as a directed labeled graph whose nodes are the documents or pages and the edges are the hyperlks between them. This directed graph structure the eb is called as eb Graph. A graph G consists of two sets V and E, Horowitz et al [4]. The set V is a fite, nonempty set of vertices. The set E is a set of pairs of vertices; these pairs are called edges. The notation V(G) and E(G) represent the sets of vertices and edges, respectively of graph G. It can also be expressed G = (V, E) to represent a graph. The graph Fig. 1 is a directed graph with 3 Vertices and 3 edges. Figure 1 A Directed Graph G The vertices V of G, V(G) = {A, B, C}. The Edges E of G, E(G) ={(A, B), (B, A), (B, C)}. In a directed graph with n vertices, the maximum number of edges is n(n-1)ith 3 vertices, the maximum number of edges can be 3(3-1) = 6. In the above example, there is no lk from (C, B), (A, C) and (C, A). A directed graph is said to be strongly connected if for every pair of distct vertices u and v V(G), there is a directed path from u to v and also from v to u. The above graph Fig. 1 is not strongly connected, as there is no path from vertex C to B. Accordg to Broader et al. [5], a eb can be imaged as a large graph contag several hundred million or billion of nodes or vertices, and a few billion arcs or edges. The followg section explas the hyperlk analysis and the algorithms used the hyperlk analysis for formation retrieval. III. A B C HYPERLINK ANALYSIS Many eb Pages do not clude words that are descriptive of their basic purpose (for example rarely a search enge portal cludes the word search its home page), and there exist eb pages which conta very little text (such as image, music, video resources), makg a text-based search techniques difficult. However, how others exemplify this page may be useful. This type of characterization is cluded the text that surrounds the hyperlk potg to the page. Many researches [6, 7, 8, 11, 12] have done and solutions have suggested to the problem of searchg, dexg or queryg the eb, takg to account its structure as well as the meta-formation cluded the hyperlks and the text surroundg them. There are a number of algorithms proposed based on the Lk Analysis. Usg citation analysis, Co-citation algorithm [13] and Extended Co-citation algorithm [14] are proposed. These algorithms are simple and deeper relationships among the pages can not be discovered. Three important algorithms PageRank[15], eighted PageRank (PR)[16] and Hypertext Induced Topic Search HITS[17] are discussed below detail and compared. A. PageRank Br and Page developed PageRank [15] algorithm durg their Ph D at Stanford University based on the citation analysis [9, 10]. PageRank algorithm is used by the famous search enge, Google. They applied the citation analysis eb search by treatg the comg lks as citations to the eb pages. However, by simply applyg the citation analysis techniques to the diverse set of eb documents did not result efficient comes. Therefore, PageRank provides a more advanced way to compute the importance or relevance of a eb page than simply countg the number of pages that are lkg to it (called as backlks ). If a backlk comes from an important page, then that backlk is given a higher weightg than those backlks comes from non-important pages. In a simple way, lk from one page to another page may be considered as a vote. However, not only the number of votes a page receives is considered important, but the importance or the relevance of the ones that cast these votes as well. Assume any arbitrary page A has pages T 1 to T n potg to it (comg lk). PageRank can be calculated by the followg equation (1). PR A) PR( T ) / C( T ) PR( T / C( T )) (1) ( 1 1 n n The parameter d is a dampg factor, usually sets it to 0.85 (to stop the other pages havg too much fluence, this total vote is damped down by multiplyg it by 0.85). C(A) is defed as the number of lks gog of page A. The PageRanks form a probability distribution over the eb pages, so the sum of all eb pages PageRank will be one. PageRank can be calculated usg a simple iterative algorithm, and corresponds to the prcipal eigenvector of the normalized lk matrix of the eb. Let us take an example of hyperlk structure of three pages A, B and C as shown Fig. 2. The PageRank for pages A, B and C can be calculated by usg equation (1). 2 ICT_09_curt

3 Page A Page B Convergence of PageRank Calculation Figure 2 Hyperlk Structure for 3 pages Let us assume the itial PageRank as 1.0 and do the calculation. The dampg factor d is set to PR(A) = (1-d) + d (PR(C)/C(C)) = (1-0.85) (1/2) = = (1a) PR(B) = (1-d) (PR(A)/C(A) + (PR(C)/C(C)) = (1b) PR(C) = (1-d) (PR(A)/C(A) + (PR(B)/C(B)) = (1c) Do the second iteration by takg the above PageRank values from (1a), (1b) and (1c). PR(A) = (1.091/2)) = PR(B) = ((0.614/2)+(1.091/2)) = PR(C) = ((0.614/2)+(0.875/1)) = (1d) (1e) (1f) After dog many more iterations of the above calculation, the followg PageRanks arrived as shown Table I. TABLE I. Page C ITERATIVE CALCULATION FOR PAGERANK Iteration PR(A) PR(B) PR(C) For a smaller set of pages, it is easy to calculate and fd the PageRank values but for a eb havg billions of pages, it is not easy to do the calculation like above. In the above Table I, you can notice that PageRank of C is higher than PageRank of B and A. It is because Page C has 2 comg lks and 2 gog lks as shown Fig. 2. Page B has 2 comg lks and 1 gog lk. Page A has the lowest PageRank because Page A has only one comg lk and 2 gog lks. From the Table I, after the iteration 15, the PageRank for the pages gets normalized. Previous experiments [18, 19] shows that the PageRank gets converged to a reasonable tolerance. The convergence of PageRank calculation for the Table I is shown as a graph Fig. 3. PageRank Values Figure 3 Iteration Convergence of PageRank Calculation Page A Page B Page C Beighted PageRank Algorithm enpu Xg and Ali Ghorbani [16] proposed a eighted PageRank (PR) algorithm which is an extension of the PageRank algorithm. This algorithm assigns a larger rank values to the more important pages rather than dividg the rank value of a page evenly among its gog lked pages. Each gog lk gets a value proportional to its importance. The importance is assigned terms of weight values to the comg and gog lks and are denoted as (m, n) and (m, n) respectively (m, n) as shown equation (2) is the weight of lk(m, n) calculated based on the number of comg lks of page n and the number of comg lks of all reference pages of page m. I n ( m, n ) = (2) I p R( m) p On ( m, n ) = (3) O p R( m) p here I n and I p are the number of comg lks of page n and page p respectively. R(m) denotes the reference page list of page m (m, n) is as shown equation (3) is the weight of lk(m, n) calculated based on the number of gog lks of page n and the number of gog lks of all reference pages of mhere O n and O p are the number of gog lks of page n and p respectively. The formula as proposed by enpu et al for the PR is as shown equation (4) which is a modification of the PageRank formula (equation 1). PR ( n ) = (1 d ) + d PR ( m ) ( m, n) ( m, n) (4) m B ( n) 3 ICT_09_curt

4 Use the same hyperlk structure as shown Fig. 2 and do the PR Calculation. The PR equations for Pages A, B and C are as follows. PR A) C). ( ) (4a) ( C, A ) ( C, A ) PR ( B ) = (1 d ) + d ( PR ( A). + PR ( C ) + B). ( C, B ) C) A). ( B, C ) ( C, B ) ) ( B, C ) ( A, C ) ( A, B ) ) ( A, C ) ( A, B ) (4b) (4c) The comg lk and gog lk weights are calculated as follows: ( C, A ) = I A /( I A + I B) = 1/(1 + 2) = 1/ 3 (4d) ( C, A ) = OA /( OA + OB) = 2 /(2 + 1) = 2/ 3 (4e) By substitutg the values of equations (4d) and (4e) to equation (4a), you will get the PR of Page A by takg a value of 0.85 for d and the itial value of C) = 1. PR ( A) = (1 0.85) (1*1/ 3* 2 / 3) = 0.69 (4f) as a directed graph G(V,E), where V is a set of Vertices representg pages and E is a set of edges that correspond to lks. There are two major steps the HITS algorithm. The first step is the Samplg Step and the second step is the Iterative step. In the Samplg step, a set of relevant pages for the given query are collected i.e. a sub-graph S of G is retrieved which is high authority pages. This algorithm starts with a root set R, a set of S is obtaed, keepg md that S is relatively small, rich relevant pages ab the query and contas most of the good authorities. The second step, Iterative step, fds hubs and authorities usg the put of the samplg step usg equations (5) and (6). H = (5) p A q q I ( p) A = (6) p H q q B( p) here H p is the hub weight, A p is the Authority weight, I(p) and B(p) denotes the set of reference and referrer pages of page p. The page s authority weight is proportional to the sum of the hub weights of pages that it lks to it, Kleberg [20]. Similarly, a page s hub weight is proportional to the sum of the authority weights of pages that it lks to. Fig. 4 shows an example of the calculation of authority and hub scores. B) = ( ((0.69*1/ 2*1/ 3) = 0.44 (4g) Q1 R1 C) = (1 0.85) ((0.69*1/ 2*2 / 3) = 0.47 (4h) Q2 P R2 The values of A), B) and C) are shown equations (4f), (4g) and (4h) respectively. In this, A)>C)>B). This results shows that the page rank order is different from PageRank. C. The HITS Algorithm - Hubs & Authorities Kleberg [17] identifies two different forms of eb pages called hubs and authorities. Authorities are pages havg important contents. Hubs are pages that act as resource lists, guidg users to authorities. Thus, a good hub page for a subject pots to many authoritative pages on that content, and a good authority page is poted by many good hub pages on the same subject. Hubs and Authorities and their calculations are shown Fig. 4. Kleberg says that a page may be a good hub and a good authority at the same time. This circular relationship leads to the defition of an iterative algorithm called HITS (Hyperlk Induced Topic Search). The HITS algorithm treats Q3 A P = H Q1 + H Q2 + H Q3 R3 H P = A R1 + A R2 + A R3 Figure 4. Calculation of hubs and Authorities The followg are the constrats of HITS algorithm [6]: Hubs and authorities: It is not easy to distguish between hubs and authorities because many sites are hubs as well as authorities. Topic drift: Sometime HITS may not produce the most relevant documents to the user queries because of equivalent weights. 4 ICT_09_curt

5 Automatically generated lks: HITS gives equal importance for automatically generated lks which may not produce relevant topics for the user query. Efficiency: HITS algorithm is not efficient real time. Table II shows the comparison [21] of all the algorithms discussed above. TABLE II. Algorithm Criteria Mg technique used orkg I/P Parameters COMPARISON OF HYPERLINK ALGORITHMS PageRank eighted PageRank HITS SM SM SM & CM Computes scores at dex time. Results are sorted on the importance of pages. Backlks Computes scores at dex time. Results are sorted on the Page importance. Backlks, Forward lks Computes scores of n highly relevant pages on the fly. Backlks, Forward Lks & content Complexity O(log N) <O(log N) <O(log N) Limitations Query dependent Query dependent Topic drift and efficiency problem Search Enge Google Research model Clever IV. CONCLUSION This paper covers the basics of eb mg. The importance of the eb structure mg Information retrieval is explaed. The ma purpose of this paper is to explore the hyperlk structure and understand the eb graph a simple way. This paper also focuses on the important algorithms used for hyperlk analysis, explore those algorithms and compare them. [6] S. Chakrabarti, B.Dom, D.Gibson, J. Kleberg, R. Kumar, P. Raghavan,S. Rajagopalan, A. Tomks, Mg the Lk Structure of the orld ide eb, IEEE Computer, Vol. 32, pp , [7] T.H. Haveliwala, A. Gionis, D. Kle, P. Indyk, Evaluatg Strategies for Similarity Search on the eb, Proc. of the 11, Hawaii, USA, [8] I. Varlamis, M. Vazirgiannis, M. Halkidi, B. Nguyen, THESUS, A Closer View on eb Content Management Enhanced with Lk Semantics, IEEE Transaction on Knowledge and Data Engeerg Journal. [9] Eugene Garfield, Citation Analysis as a tool journal evaluation, Science 178, pp , [10] G. Pski and F. Nar, Citation fluence for journal aggregates of scientific publications: Theory, with application to the literature of physics, Information Processg and Management, [11] D. Gibson, J. Kleberg, P. Raghavan, Inferrg eb Communities from Lk Topology, Proc. of the 9 th ACM Conference on Hypertext and Hypermedia, [12] R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomks, Trawlg the eb for Emergg Cyber-Communities, Proc. of the 8 th Conference(8), [13] J. Dean and M. Henzger, Fdg Related Pages the orld ide eb, Proc. Eight Int l orld ide eb Conf., pp , [14] J. Hou and Y. Zhang, Effectively Fdg Relevant eb Pages from Lkage Information, IEEE Transactions on Knowledge and Data Engeerg, Vol. 15, No. 4, [15] S. Br, L. Page, The Anatomy of a Large Scale Hypertextual eb search enge,, Computer Network and ISDN Systems, Vol. 30, Issue 1-7, pp , [16] enpu Xg and Ali Ghorbani, eighted PageRank Algorithm, Proc. of the Second Annual Conference on Communication Networks and Services Research (CNSR 04), IEEE, [17] J. Kleberg, Authoritative Sources a Hyper-Lked Environment, Journal of the ACM 46(5), pp , [18] L. Page, S. Br, R. Motwani, and Tograd, The Pagerank Citation Rankg: Brgg order to the eb. Technical Report, Stanford Digital Libraries SIDL-P , [19] C. Ridgs and M. Shishig, PageRank Convered. Technical Report, [20] J. Kleberg, Hubs, Authorities, and Communities, ACM Computg Surveys, 31(4), [21] N. Duhan, A.K. Sharma and K.K. Bhatia, Page Rankg Algorithms: A Survey, Proceedgs of the IEEE International Conference on Advance Computg, REFERENCES [1] M.G. da Gomes Jr. and Z. Gong, eb Structure Mg: An Introduction, Proceedgs of the IEEE International Conference on Information Acquisition, [2] R. Kosala, H. Blockeel, eb Mg Research: A Survey, SIGKDD Explorations, Newsletter of the ACM Special Interest Group on Knowledge Discovery and Data Mg Vol. 2, No. 1 pp 1-15, [3] [4] E. Horowitz, S. Sahni and S. Rajasekaran, Fundamentals of Computer Algorithms, Galgotia Publications Pvt. Ltd., pp , [5] A. Broder, R. Kumar, F Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomks, Jiener, Graph Structure the eb, Computer Networks : The International Journal of Computer and telecommunications Networkg, Vol. 33, Issue 1-6, pp , ICT_09_curt

A Comparative Study of Page Ranking Algorithms for Information Retrieval

orld Academy of Science, Engeerg and Technology International Journal of Computer and Information Engeerg A Comparative Study of Page Rankg Algorithms for Information Retrieval Ashutosh Kumar Sgh, Ravi