A Comparative Study of Page Ranking Algorithms for Information Retrieval


World Academy of Science, Engineering and Technology
International Journal of Computer and Information Engineering
International Science Index, Computer and Information Engineering, waset.org/publication/3153

Ashutosh Kumar Singh, Ravi Kumar P

Abstract: This paper gives an introduction to Web mining, then describes Web Structure mining in detail, and explores the data structure used by the Web. It also explores different Page Rank algorithms and compares those algorithms used for Information Retrieval. In the section on Web Mining, the basics of Web mining and the Web mining categories are explained. Different Page Rank based algorithms like PageRank (PR), Weighted PageRank (WPR), HITS (Hyperlink-Induced Topic Search), DistanceRank and DirichletRank are discussed and compared. PageRanks are calculated for the PageRank and Weighted PageRank algorithms for a given hyperlink structure. A simulation program was developed for the PageRank algorithm because PageRank is the only ranking algorithm implemented in the search engine (Google). The outputs are shown in table and chart format.

Keywords: Web Mining, Web Structure, Web Graph, Link Analysis, PageRank, Weighted PageRank, HITS, DistanceRank, DirichletRank

I. INTRODUCTION

The World Wide Web (WWW) is growing tremendously in all aspects and is a massive, explosive, diverse, dynamic and mostly unstructured data repository. As of today, the WWW is the largest information repository available for knowledge reference. There are a lot of challenges [1] in the Web: the Web is huge, Web pages are semi-structured, Web information tends to be diverse in meaning, and the quality of the information extracted and of the knowledge concluded from it varies. A Google report [5] on 25th July 2008 says that there are 1 trillion (1,000,000,000,000) unique URLs (Universal Resource Locators) on the Web. The actual number could be even higher, and Google could not index all the pages.
When Google first created its index in 1998, there were 26 million pages, and in 2000 the Google index reached 1 billion pages. In the last 9 years the Web has grown tremendously, and the usage of the Web is unimaginable. So it is important to understand and analyze the underlying data structure of the Web for effective Information Retrieval. Web mining techniques, along with techniques from other areas like Databases (DB), Information Retrieval (IR), Natural Language Processing (NLP), Machine Learning etc., can be used to address the above challenges.

Ashutosh Kumar Singh is with the Department of Electrical and Computer Engineering, Curtin University of Technology, Miri, Sarawak, Malaysia (ashutosh.s@curtin.edu.my). Ravi Kumar P is with the Department of Computer & I.T., Jefri Bolkiah College of Engineering, Brunei, doing his PhD at Curtin University of Technology, Miri, Malaysia (ravi2266@gmail.com).

With the rapid growth of the WWW and the users' demand for knowledge, it is becoming more difficult to manage the information on the WWW and satisfy the users' needs. Therefore, users are looking for better information retrieval techniques and tools to locate, extract, filter and find the necessary information. Most users rely on information retrieval tools like search engines to find information on the WWW. There are hundreds of search engines available, but some, like Google, Yahoo and Bing, are popular because of their crawling and ranking methodologies. The search engines download, index and store hundreds of millions of Web pages and answer tens of millions of queries every day. So Web mining and the ranking mechanism become very important for effective information retrieval. The sample architecture [2] of a search engine is shown in Fig. 1.

Fig. 1 Sample Architecture of a Search Engine (Web Crawler, Indexer, Index, Query Interface, Query Processor, Web Mining)

There are three important components in a search engine: the Crawler, the Indexer and the Ranking mechanism. The crawler, also called a robot or spider, traverses the Web and downloads the Web pages.
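As a concrete illustration of these components, the crawl, index and query flow can be sketched as a toy pipeline. The page names and contents below are hypothetical stand-ins for downloaded documents, and query answering is reduced to exact keyword intersection:

```python
# Toy sketch of the crawler -> indexer -> query-processor flow described
# above. The three pages and their text are hypothetical; a real crawler
# would download them from the Web.
pages = {
    "p1.html": "web mining extracts information from the world wide web",
    "p2.html": "page ranking algorithms order search results",
    "p3.html": "web structure mining analyses the hyperlink graph",
}

def build_index(corpus):
    """Indexer: map each keyword to the set of URLs containing it."""
    index = {}
    for url, text in corpus.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

def answer(index, query):
    """Query processor: return the URLs matching every query keyword."""
    hits = [index.get(word, set()) for word in query.split()]
    return set.intersection(*hits) if hits else set()

index = build_index(pages)
matches = answer(index, "web mining")
print(sorted(matches))
```

A real engine would follow this with the ranking step discussed later in the paper; here the query processor simply intersects the posting sets.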
The downloaded pages are sent to an indexing module that parses the Web pages and builds the index based on the keywords in those pages. An alphabetical index is generally maintained using the keywords. When a user types a query using keywords on the interface of a search engine, the query processor component matches the query keywords with the index and returns the URLs of the matching pages to the user. But before presenting the pages to the user, a ranking mechanism is applied by the search engine so that the most relevant pages are presented at the top and less relevant ones at the bottom. This makes navigating the search results easier for the user. The ranking mechanism is explained in detail later in this paper.

This paper is organized as follows. Section II provides the basic Web mining concepts and the three areas of Web mining; in that section, Web Structure mining is described in detail because most of the Page Rank algorithms are based on

the Web Structure Mining. Section III describes the data structure used for the Web, in particular the Web Graph. Section IV explores important algorithms based on Page Rank and compares those algorithms. Section V shows the simulation results and Section VI concludes this paper.

II. WEB MINING

A. Overview

Web mining is the use of data mining techniques to automatically discover and extract information from the World Wide Web (WWW). According to Kosala et al. [3], Web mining consists of the following tasks:

Resource finding: the task of retrieving intended Web documents.
Information selection and pre-processing: automatically selecting and pre-processing specific information from retrieved Web resources.
Generalization: automatically discovering general patterns at individual Web sites as well as across multiple sites.
Analysis: validation and/or interpretation of the mined patterns.

Resource finding is the process of retrieving data, either online or offline, from electronic newsgroups, newsletters, newswires, libraries and HTML documents available as text sources on the Web. Information selection and pre-processing is selecting the HTML documents and transforming them by removing HTML tags and stop words, applying stemming, etc. Generalization is the process of discovering general patterns at individual Web sites as well as across multiple sites. Analysis refers to the validation and/or interpretation of the mined patterns. Humans play an important role in the information or knowledge discovery process on the Web, since the Web is an interactive medium; this is especially important for validation and/or interpretation.

Fig. 2 Web Classification: Web Mining divides into Web Content Mining (database approach, agent based approach), Web Structure Mining, and Web Usage Mining (general access pattern tracking, customized usage tracking)
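The information selection and pre-processing step described above (removing HTML tags and stop words, then stemming) can be sketched as follows. The stop-word list and the crude suffix-stripping rule are simplifying assumptions, standing in for a full stop-word lexicon and a real stemmer such as Porter's:

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to"}  # tiny example list

def preprocess(html):
    """Pre-processing as described above: strip HTML tags, drop stop
    words, and apply a crude suffix stemmer (a stand-in for a real one)."""
    text = re.sub(r"<[^>]+>", " ", html)            # remove HTML tags
    tokens = re.findall(r"[a-z]+", text.lower())    # tokenize
    tokens = [t for t in tokens if t not in STOP_WORDS]
    stemmed = []
    for t in tokens:
        for suffix in ("ing", "ed", "s"):           # naive stemming
            if t.endswith(suffix) and len(t) > len(suffix) + 2:
                t = t[: -len(suffix)]
                break
        stemmed.append(t)
    return stemmed

tokens = preprocess("<p>The miner is mining linked pages</p>")
print(tokens)
```

The output tokens are what an indexer would store, which is why this step directly affects retrieval quality.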
There are three areas of Web mining, classified according to the Web data used as input in the data mining process: Web Content Mining (WCM), Web Usage Mining (WUM) and Web Structure Mining (WSM). Web content mining is concerned with the retrieval of information from the WWW into a more structured form and with indexing the information so it can be retrieved quickly. Web usage mining is the process of identifying browsing patterns by analyzing the user's navigational behavior. Web structure mining is to discover the model underlying the link structure of the Web pages, catalog them and generate information such as the similarity and relationships between them, taking advantage of their hyperlink topology. The Web classification [4] is shown in Fig. 2. Even though there are three areas of Web mining, the differences between them are narrowing because they are all interconnected. Web Content mining and Web Structure mining are related: both are used to extract knowledge from the World Wide Web, Web content mining from the retrieved information and Web structure mining from the link structure. Most researchers now focus on a combination of the three areas to produce better research.

Definitions

Web Content Mining (WCM): Web Content Mining is the process of extracting useful information from the contents of Web documents. The Web documents may consist of text, images, audio, video or structured records like tables and lists. Mining can be applied to the Web documents themselves as well as to the result pages produced by a search engine. There are two approaches in content mining, called the agent based approach and the database approach. The agent based approach concentrates on searching for relevant information using the characteristics of a particular domain to interpret and organize the collected information. The database approach is used for retrieving semi-structured data from the Web.
Web Usage Mining (WUM): Web Usage Mining is the process of extracting useful information from the secondary data derived from the interactions of the user while surfing the Web. It extracts data stored in server access logs, referrer logs, agent logs, client-side cookies, user profiles and metadata.

Web Structure Mining (WSM): The goal of Web Structure Mining is to generate a structural summary about the Web site and Web pages. It tries to discover the link structure of the hyperlinks at the inter-document level. Based on the topology of the hyperlinks, Web Structure mining categorizes the Web pages and generates information like the similarity and relationships between different Web sites. This type of mining can be performed at the document level (intra-page) or at the hyperlink level (inter-page). It is important to understand the Web data structure for Information Retrieval.

III. DATA STRUCTURE FOR THE WEB

The traditional information retrieval system basically focuses on information provided by the text of Web

documents. Web mining techniques provide additional information through the hyperlinks by which different documents are connected. The Web may be viewed as a directed labeled graph whose nodes are the documents or pages and whose edges are the hyperlinks between them. This directed graph structure of the Web is called the Web Graph. A graph G consists of two sets V and E, Horowitz et al. [6]. The set V is a finite, nonempty set of vertices. The set E is a set of pairs of vertices; these pairs are called edges. The notations V(G) and E(G) represent the sets of vertices and edges, respectively, of graph G. A graph can also be expressed as G = (V, E). In an undirected graph, the pair of vertices representing any edge is unordered; thus the pairs (u, v) and (v, u) represent the same edge. In a directed graph, each edge is represented by a directed pair (u, v); u is the tail and v is the head of the edge. Therefore, (v, u) and (u, v) represent two different edges. The graph in Fig. 3 is a directed graph with 3 vertices and 6 edges.

Fig. 3 A Directed Graph, G

The vertices of G are V(G) = {A, B, C}. The edges of G are E(G) = {(A, B), (B, A), (B, C), (C, B), (A, C), (C, A)}. In a directed graph with n vertices, the maximum number of edges is n(n-1); with 3 vertices, the maximum number of edges is 3(3-1) = 6. A directed graph is said to be strongly connected if for every pair of distinct vertices u and v in V(G) there is a directed path from u to v and also from v to u. The graph in Fig. 3 is strongly connected, because all the vertices are connected in both directions. According to Broder et al. [7], the Web can be imagined as a large graph containing several hundred million or billion nodes or vertices, and a few billion arcs or edges. The degree of a vertex is the number of edges incident to that vertex.
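The graph G of Fig. 3 and the properties just described (degrees, the n(n-1) edge bound, strong connectivity) can be checked with a minimal sketch:

```python
from itertools import permutations

# The directed graph G of Fig. 3: three vertices and all six possible
# directed edges, stored as (tail, head) pairs.
V = {"A", "B", "C"}
E = {("A", "B"), ("B", "A"), ("B", "C"), ("C", "B"), ("A", "C"), ("C", "A")}

def in_degree(v):   # number of edges for which v is the head
    return sum(1 for (_, head) in E if head == v)

def out_degree(v):  # number of edges for which v is the tail
    return sum(1 for (tail, _) in E if tail == v)

def reachable(src, dst):
    """Depth-first search along directed edges."""
    stack, seen = [src], set()
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        if u not in seen:
            seen.add(u)
            stack.extend(head for (tail, head) in E if tail == u)
    return False

# Strongly connected: every ordered pair of distinct vertices is reachable.
strongly_connected = all(reachable(u, v) for u, v in permutations(V, 2))
print(in_degree("B"), out_degree("B"), strongly_connected)
```

Vertex B comes out with in-degree 2 and out-degree 2, and the graph is confirmed strongly connected, matching the discussion of Fig. 3.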
If G is a directed graph, we define the in-degree of a vertex v to be the number of edges for which v is the head, and the out-degree of v to be the number of edges for which v is the tail. In Fig. 3, vertex B of graph G has in-degree 2, out-degree 2 and degree 4. If d_i is the degree of vertex i in a graph G with n vertices and e edges, then the number of edges is as shown in (1):

e = (Σ_{i=1}^{n} d_i) / 2   (1)

Several researchers have analyzed the properties of the Web graph [8, 9]. Broder et al. showed that the structure of the Web graph looks like a giant bow tie, as shown in Fig. 4. In this view, the Web's macroscopic structure has four pieces. The first piece is a central core, all of whose pages can reach one another along directed links; this "giant strongly connected component" (SCC) is at the heart of the Web. The second and third pieces are called IN and OUT. IN consists of pages that can reach the SCC but cannot be reached from it, possibly new sites that people have not yet discovered and linked to. OUT consists of pages that are accessible from the SCC but do not link back to it, such as corporate websites that contain only internal links. Finally, the TENDRILS contain pages that cannot reach the SCC and cannot be reached from the SCC. The macroscopic structure of the Web [7] is shown in Fig. 4. According to Broder et al., the size of the SCC is relatively small compared with IN, OUT and TENDRILS taken together, and almost all of these sets have roughly the same size. So it is evident that the Web is growing rapidly and is a huge structure. The following section explains important page ranking algorithms and compares those algorithms used for information retrieval.

Fig. 4 Macroscopic Structure of the Web [7]

IV. PAGE RANK ALGORITHMS

With the increasing number of Web pages and users on the Web, the number of queries submitted to the search engines is also increasing rapidly. Therefore, the search engines need to be more efficient in their processing.
Web mining techniques are employed by the search engines to extract relevant documents from the Web database and provide the necessary information to the users. Search engines become very successful and popular if they use efficient ranking mechanisms; the Google search engine is very successful because of its PageRank algorithm. Page ranking algorithms are used by the search engines to present the search results by considering the relevance, importance and content score, and they use Web mining techniques to order the results according to the user's interest. Some ranking algorithms depend only on the link structure of the documents, i.e. their popularity scores (Web structure mining); others look at the actual content of the documents (Web content mining); while some use a combination of both, i.e. they use the content of the document as well as the link structure to assign a rank value to a given document. If the search results are not displayed according to the user's interest, the search engine will lose its popularity. So the ranking algorithms are very important.

Some of the popular page ranking algorithms are discussed in the following sections.

A. Citation Analysis

Link analysis is similar to social network and citation analysis. Citation analysis was developed in information science as a tool to identify core sets of articles, authors or journals of a particular field of study. The impact factor [10], developed by Eugene Garfield, is used to measure the importance of a publication. This metric takes into account the number of citations received by a publication: the impact factor is proportional to the total number of citations a publication has. This treats all references equally; however, important references that are cited many times should be given additional weight. Pinski et al. [11] proposed a model to overcome this problem, called influence weights, in which the weight of each publication is equal to the sum of its citations, scaled by the importance of those citations. The same principle is applied to the Web for ranking Web pages, where the notion of a citation corresponds to the links pointing to a Web page. The simplest ranking of a Web page could be done by summing up the number of links pointing to it. This favors only the most popular Web sites, such as universally known portals, news pages, news broadcasters etc. On the Web, the quality of the page and the diversity of its content should also be considered. In the scientific literature, co-citations occur within the same network of knowledge; the Web, on the other hand, contains a lot of information serving different purposes. Hyperlinked communities that appear to span a wide range of interests and disciplines are called Web communities, Gibson et al. [12], and the process of identifying them is called trawling, R. Kumar et al. [13]. A number of algorithms have been proposed based on link analysis.
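The simplest citation-style ranking mentioned above, scoring each page by its raw number of backlinks, can be sketched as follows. The link pairs are hypothetical:

```python
# Citation-count ranking: score each page by the raw number of backlinks
# it receives. The (source, target) link pairs below are hypothetical.
# Every backlink counts the same, which is exactly the weakness the
# influence-weight model and PageRank address.
links = [("b", "a"), ("c", "a"), ("d", "a"), ("a", "b"), ("c", "b")]

def backlink_counts(edges):
    counts = {}
    for _, target in edges:
        counts[target] = counts.get(target, 0) + 1
    return counts

counts = backlink_counts(links)
print(sorted(counts, key=counts.get, reverse=True))
```

Here page "a" wins simply because it has the most backlinks, regardless of who links to it.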
Using citation analysis, the Co-citation algorithm [14] and the Extended Co-citation algorithm [15] were proposed. These algorithms are simple, but deeper relationships among the pages cannot be discovered with them. Five Page Rank based algorithms, PageRank (PR) [16], Weighted PageRank (WPR) [17], Hypertext Induced Topic Search (HITS) [18], DistanceRank [23] and DirichletRank [27], are discussed below in detail and compared.

B. PageRank

Brin and Page developed the PageRank algorithm during their PhD at Stanford University, based on citation analysis. The PageRank algorithm is used by the famous search engine Google. They applied citation analysis to Web search by treating the incoming links as citations to the Web pages. However, simply applying the citation analysis techniques to the diverse set of Web documents did not result in efficient outcomes. Therefore, PageRank provides a more advanced way to compute the importance or relevance of a Web page than simply counting the number of pages linking to it (called "backlinks"). If a backlink comes from an important page, that backlink is given a higher weighting than backlinks coming from non-important pages. In a simple way, a link from one page to another may be considered as a vote. However, not only the number of votes a page receives is considered important, but also the importance or relevance of the pages that cast those votes. Assume an arbitrary page A has pages T_1 to T_n pointing to it (incoming links). The PageRank of A can be calculated by (2):

PR(A) = (1 - d) + d (PR(T_1)/C(T_1) + ... + PR(T_n)/C(T_n))   (2)

The parameter d is a damping factor, usually set to 0.85 (to stop the other pages having too much influence, this total vote is damped down by multiplying it by 0.85). C(A) is defined as the number of links going out of page A. The PageRanks form a probability distribution over the Web pages, so the sum of the PageRanks of all Web pages will be one.
PageRank can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the Web. Let us take as an example the hyperlink structure of four pages A, B, C and D, as shown in Fig. 5. The PageRank for pages A, B, C and D can be calculated using (2).

Fig. 5 Hyperlink Structure for 4 pages (Page A, Page B, Page C, Page D)

Let us assume the initial PageRank of every page is 1 and do the calculation, with the damping factor d set to 0.85:

PR(A) = (1-d) + d (PR(B)/C(B) + PR(C)/C(C) + PR(D)/C(D)) = 0.15 + 0.85 (1/3 + 1/3 + 1/1)   (2a)
PR(B) = (1-d) + d (PR(A)/C(A) + PR(C)/C(C))   (2b)
PR(C) = (1-d) + d (PR(A)/C(A) + PR(B)/C(B))   (2c)
PR(D) = (1-d) + d (PR(B)/C(B) + PR(C)/C(C))   (2d)

The second iteration repeats the same equations, taking the PageRank values obtained from (2a), (2b), (2c) and (2d):

PR(A) = 0.15 + 0.85 (PR(B)/3 + PR(C)/3 + PR(D)/1)   (2e)
PR(B) = 0.15 + 0.85 (PR(A)/2 + PR(C)/3)   (2f)
PR(C) = 0.15 + 0.85 (PR(A)/2 + PR(B)/3)   (2g)
PR(D) = 0.15 + 0.85 (PR(B)/3 + PR(C)/3)   (2h)

By the 34th iteration, the PageRank values converge, as shown in Table I. The table with the graph is shown in the Simulation Results section.
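The iterative calculation above can be reproduced programmatically. The sketch below applies equation (2) to the Fig. 5 graph, whose out-link lists are reconstructed from equations (2a)-(2d); it updates all pages simultaneously in each pass, so intermediate values may differ slightly from the hand calculation, but the iteration converges to stable values with PR(A) highest, consistent with Table I:

```python
# PageRank iteration of equation (2) on the Fig. 5 graph.
# Out-link lists reconstructed from equations (2a)-(2d).
out_links = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["A"],
}

def pagerank(graph, d=0.85, iterations=50):
    """Iterate PR(p) = (1-d) + d * sum(PR(m)/C(m)) over backlinks m of p."""
    pr = {page: 1.0 for page in graph}              # initial PageRank = 1
    for _ in range(iterations):
        new = {}
        for page in graph:
            backlinks = [m for m, outs in graph.items() if page in outs]
            new[page] = (1 - d) + d * sum(pr[m] / len(graph[m]) for m in backlinks)
        pr = new
    return pr

ranks = pagerank(out_links)
print({p: round(v, 3) for p, v in sorted(ranks.items())})
```

With this formulation the ranks sum to the number of pages, and B and C come out equal because their link neighbourhoods are symmetric.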

TABLE I ITERATIVE CALCULATION FOR PAGERANK (iteration number versus the PageRank values of pages A, B, C and D)

For a smaller set of pages it is easy to calculate and find the PageRank values, but for a Web having billions of pages it is not easy to do the calculation as above. In Table I you can notice that the PageRank of A is higher than the PageRanks of B, C and D. This is because page A has 3 incoming links, while pages B, C and D have 2 incoming links each, as shown in Fig. 5. Page B has 2 incoming links and 3 outgoing links, page C has 2 incoming links and 3 outgoing links, and page D has 2 incoming links and 1 outgoing link. From Table I, after iteration 34 the PageRank values become stable. Previous experiments [19, 20] showed that PageRank converges to within a reasonable tolerance. The convergence of the PageRank calculation is shown as a graph in Fig. 11 in the Simulation Results section. The PageRank is displayed on the toolbar of the browser if the Google Toolbar [32] is installed. The Toolbar PageRank goes from 0 to 10, like a logarithmic scale, with 0 the lowest page rank and 10 the highest. The PageRank of all the pages on the Web changes every month when Google does its re-indexing. PageRank says nothing about the content or size of a page, the language it is written in, or the text used in the anchor of a link.

C. Weighted PageRank Algorithm

Wenpu Xing and Ali Ghorbani [17] proposed a Weighted PageRank (WPR) algorithm, which is an extension of the PageRank algorithm. Instead of dividing the rank value of a page evenly among its outgoing linked pages, this algorithm assigns a larger rank value to the more important pages: each outgoing link gets a value proportional to its importance. The importance is assigned in terms of weight values to the incoming and outgoing links, denoted W_in(m, n) and W_out(m, n) respectively.
W_in(m, n), shown in (3), is the weight of link (m, n), calculated from the number of incoming links of page n and the number of incoming links of all the reference pages of page m:

W_in(m, n) = I_n / Σ_{p ∈ R(m)} I_p   (3)

W_out(m, n) = O_n / Σ_{p ∈ R(m)} O_p   (4)

where I_n and I_p are the numbers of incoming links of pages n and p respectively, O_n and O_p are the numbers of outgoing links of pages n and p respectively, and R(m) denotes the reference page list of page m (the pages m links to). W_out(m, n), shown in (4), is the weight of link (m, n), calculated from the number of outgoing links of page n and the number of outgoing links of all the reference pages of m. The formula proposed by Wenpu et al. for WPR, a modification of the PageRank formula, is shown in (5), where B(n) is the set of pages pointing to n:

WPR(n) = (1 - d) + d Σ_{m ∈ B(n)} WPR(m) W_in(m, n) W_out(m, n)   (5)

Using the same hyperlink structure as shown in Fig. 5, the WPR equations for pages A, B, C and D are as follows:

WPR(A) = (1-d) + d (WPR(B) W_in(B,A) W_out(B,A) + WPR(C) W_in(C,A) W_out(C,A) + WPR(D) W_in(D,A) W_out(D,A))   (5a)
WPR(B) = (1-d) + d (WPR(A) W_in(A,B) W_out(A,B) + WPR(C) W_in(C,B) W_out(C,B))   (5b)
WPR(C) = (1-d) + d (WPR(A) W_in(A,C) W_out(A,C) + WPR(B) W_in(B,C) W_out(B,C))   (5c)
WPR(D) = (1-d) + d (WPR(B) W_in(B,D) W_out(B,D) + WPR(C) W_in(C,D) W_out(C,D))   (5d)

The incoming and outgoing link weights for page A are calculated as follows:

W_in(B,A) = I_A/(I_A + I_C) = 3/(3+2) = 3/5   (5e)
W_out(B,A) = O_A/(O_A + O_C + O_D) = 2/(2+3+1) = 2/6 = 1/3   (5f)
W_in(C,A) = I_A/(I_A + I_B) = 3/(3+2) = 3/5   (5g)
W_out(C,A) = O_A/(O_A + O_B + O_D) = 2/(2+3+1) = 2/6 = 1/3   (5h)
W_in(D,A) = I_A/(I_B + I_C) = 3/(2+2) = 3/4   (5i)
W_out(D,A) = O_A/O_A = 2/2 = 1   (5j)

By substituting the values of (5e)-(5j) into (5a), taking d = 0.85 and the initial values WPR(B) = WPR(C) = WPR(D) = 1, we get the WPR of page A.
WPR(A) = 0.15 + 0.85 (1 · (3/5) · (1/3) + 1 · (3/5) · (1/3) + 1 · (3/4) · 1) = 1.127   (5k)

W_in(A,B) = I_B/(I_B + I_C + I_D) = 2/(2+2+2) = 2/6 = 1/3   (5l)
W_out(A,B) = O_B/(O_B + O_C) = 3/(3+3) = 3/6 = 1/2   (5m)
W_in(C,B) = I_B/(I_A + I_B) = 2/(3+2) = 2/5   (5n)
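The remaining weights and ranks are computed below; as a cross-check, equations (3)-(5) can also be iterated to convergence programmatically. Note this sketch reads (3) and (4) literally, summing the denominators over the full reference list R(m), whereas some denominators in the hand-worked example differ, so the converged numbers are not directly comparable with the single-pass values above:

```python
# Weighted PageRank, equations (3)-(5), on the Fig. 5 graph, iterated to
# convergence. Denominators are summed over the full reference list R(m),
# a literal reading of (3) and (4).
out_links = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B", "D"], "D": ["A"]}
in_links = {n: [m for m in out_links if n in out_links[m]] for n in out_links}

def w_in(m, n):
    """W_in(m, n) of (3): n's share of the in-link counts over R(m)."""
    return len(in_links[n]) / sum(len(in_links[p]) for p in out_links[m])

def w_out(m, n):
    """W_out(m, n) of (4): n's share of the out-link counts over R(m)."""
    return len(out_links[n]) / sum(len(out_links[p]) for p in out_links[m])

def wpr(d=0.85, iterations=50):
    rank = {p: 1.0 for p in out_links}
    for _ in range(iterations):
        rank = {
            n: (1 - d) + d * sum(rank[m] * w_in(m, n) * w_out(m, n)
                                 for m in in_links[n])
            for n in out_links
        }
    return rank

wpr_ranks = wpr()
print({p: round(v, 3) for p, v in sorted(wpr_ranks.items())})
```

Under this literal reading, page A still ranks highest and B and C tie by symmetry; the ranks no longer sum to the number of pages, since each page's vote is deliberately not split evenly.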

W_out(C,B) = O_B/(O_A + O_B + O_D) = 3/(2+3+1) = 3/6 = 1/2   (5o)

By substituting the values of (5k)-(5o) into (5b), taking d = 0.85 and the initial value WPR(C) = 1, we get the WPR of page B:

WPR(B) = 0.15 + 0.85 (1.127 · (1/3) · (1/2) + 1 · (2/5) · (1/2)) = 0.499   (5p)

W_in(A,C) = I_C/(I_B + I_C + I_D) = 2/(2+2+2) = 2/6 = 1/3   (5q)
W_out(A,C) = O_C/(O_B + O_C) = 3/(3+3) = 3/6 = 1/2   (5r)
W_in(B,C) = I_C/(I_A + I_B) = 2/(3+2) = 2/5   (5s)
W_out(B,C) = O_C/(O_A + O_C + O_D) = 3/(2+3+1) = 3/6 = 1/2   (5t)

By substituting the values of (5k) and (5p)-(5t) into (5c), taking d = 0.85, we get the WPR of page C:

WPR(C) = 0.15 + 0.85 ((1.127 · (1/3) · (1/2)) + (0.499 · (2/5) · (1/2))) = 0.392   (5u)

W_in(B,D) = I_D/(I_B + I_C) = 2/(2+2) = 2/4 = 1/2   (5v)
W_out(B,D) = O_D/O_A = 2/2 = 1   (5w)
W_in(C,D) = I_D/(I_A + I_B) = 2/(2+3) = 2/5   (5x)
W_out(C,D) = O_D/(O_A + O_B + O_D) = 2/6 = 1/3   (5y)

By substituting the values of (5p), (5u) and (5v)-(5y) into (5d), taking d = 0.85, we get the WPR of page D:

WPR(D) = 0.15 + 0.85 ((0.499 · (1/2) · 1) + (0.392 · (2/5) · (1/3))) = 0.406   (5z)

The values of WPR(A), WPR(B), WPR(C) and WPR(D) are shown in (5k), (5p), (5u) and (5z) respectively. Here WPR(A) > WPR(B) > WPR(D) > WPR(C). This result shows that the page rank order of WPR is different from that of PageRank. To differentiate WPR from PageRank, Wenpu et al. categorized the pages returned for a query into four categories based on their relevancy to the given query:

1. Very Relevant Pages (VR): pages containing very important information related to the given query.
2. Relevant Pages (R): pages that are relevant but do not contain important information about the given query.
3. Weakly Relevant Pages (WR): pages that may contain the query keywords but do not have the relevant information.
4.
Irrelevant Pages (IR): pages not having any relevant information or the query keywords.

The PageRank and WPR algorithms both provide ranked pages in sorted order to users based on the given query. So, in the resultant list, the number of relevant pages and their order are very important for users. Wenpu et al. proposed a Relevance Rule to calculate the relevancy value of each page in the list of pages; this is what makes WPR different from PageRank.

Relevancy Rule: the Relevancy Rule is shown in (6). The relevancy of a page to a given query depends on its category and its position in the page list. The larger the relevancy value, the better the result:

k = Σ_{i ∈ R(p)} (n - i) · W_i   (6)

where i denotes the i-th page in the result page list R(p), n represents the first n pages chosen from the list R(p), and W_i is the weight of the i-th page:

W_i ∈ {v1, v2, v3, v4}

where v1, v2, v3 and v4 are the values assigned to a page if the page is VR, R, WR or IR respectively, and always v1 > v2 > v3 > v4. Experimental studies by Wenpu et al. showed that WPR produces larger relevancy values than PageRank.

D. The HITS Algorithm: Hubs and Authorities

Kleinberg [18] identifies two different forms of Web pages called hubs and authorities. Authorities are pages having important contents. Hubs are pages that act as resource lists, guiding users to authorities. Thus, a good hub page for a subject points to many authoritative pages on that content, and a good authority page is pointed to by many good hub pages on the same subject. Hubs and authorities are shown in Fig. 6. Kleinberg states that a page may be a good hub and a good authority at the same time. This circular relationship leads to the definition of an iterative algorithm called HITS (Hyperlink Induced Topic Search).

Fig. 6 Hubs and Authorities

The HITS algorithm treats the WWW as a directed graph G(V, E), where V is a set of vertices representing pages and E is a set of edges that correspond to links.

HITS Working Method: there are two major steps in the HITS algorithm, the sampling step and the iterative step. In the sampling step, a set of relevant pages for the given query is collected, i.e. a sub-graph S of G containing high-authority pages is retrieved. Starting from a root set R, the set S is obtained, keeping in mind that S should be relatively small, rich in relevant pages about the query, and should contain most of the good authorities. In the second, iterative step, hubs and authorities are found using the output of the sampling step and equations (7) and (8):

H_p = Σ_{q ∈ I(p)} A_q   (7)

A_p = Σ_{q ∈ B(p)} H_q   (8)

where H_p is the hub weight of page p, A_p is its authority weight, and I(p) and B(p) denote the sets of reference and referrer pages of page p. A page's authority weight is proportional to the sum of the hub weights of the pages that link to it, Kleinberg [21]. Similarly, a page's hub weight is proportional to the sum of the authority weights of the pages that it links to. Fig. 7 shows an example of the calculation of authority and hub scores.

Fig. 7 Calculation of Hubs and Authorities (A_P = H_Q1 + H_Q2 + H_Q3; H_P = A_R1 + A_R2)

Constraints of HITS: the following are the constraints of the HITS algorithm [22]:

Hubs and authorities: it is not easy to distinguish between hubs and authorities, because many sites are hubs as well as authorities.
Topic drift: sometimes HITS may not produce the most relevant documents for the user query, because of equivalent weights.
Automatically generated links: HITS gives equal importance to automatically generated links, which may not be relevant to the user query.
Efficiency: the HITS algorithm is not efficient in real time.

HITS was used in a prototype search engine called Clever [22] for an IBM research project.
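The iterative step of equations (7) and (8) can be sketched on a small hypothetical neighbourhood graph; the score normalization in each round is a standard addition (not spelled out above) that keeps the values from growing without bound:

```python
# Sketch of the HITS iterative step, equations (7) and (8), on a small
# hypothetical neighbourhood graph (h* act as hubs, a* as authorities).
links = {
    "h1": ["a1", "a2"],
    "h2": ["a1", "a2", "a3"],
    "h3": ["a2"],
    "a1": [], "a2": [], "a3": [],
}

def hits(graph, iterations=20):
    auth = {p: 1.0 for p in graph}
    hub = {p: 1.0 for p in graph}
    for _ in range(iterations):
        # (8): authority weight = sum of hub weights of referrer pages B(p)
        auth = {p: sum(hub[q] for q in graph if p in graph[q]) for p in graph}
        # (7): hub weight = sum of authority weights of reference pages I(p)
        hub = {p: sum(auth[q] for q in graph[p]) for p in graph}
        # normalize so the scores stay bounded
        a_norm = sum(v * v for v in auth.values()) ** 0.5
        h_norm = sum(v * v for v in hub.values()) ** 0.5
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return auth, hub

auth, hub = hits(links)
print(max(auth, key=auth.get), max(hub, key=hub.get))
```

As the mutual reinforcement suggests, the page with the most hub endorsements (a2) ends up as the top authority, and the hub pointing to the most authorities (h2) as the top hub.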
Because of the above constraints, HITS could not be implemented in a real-time search engine.

E. DistanceRank Algorithm

The DistanceRank algorithm was proposed by Ali Mohammad Zareh Bidoki and Nasser Yazdani [23]. It is a novel recursive method based on reinforcement learning [24] that treats the distance between pages as punishment and computes the ranks of Web pages; the average number of clicks between two pages is defined as their distance. The main objective of this algorithm is to minimize the distance (punishment), so that a page with a smaller distance receives a higher rank. Most current ranking algorithms suffer from the rich-get-richer problem [25]: popular, high-rank Web pages become more and more popular, while young, high-quality pages are not picked up by the ranking algorithms. Many solutions [23, 25] have been suggested for this problem. The DistanceRank algorithm is less sensitive to the rich-get-richer problem and finds important pages faster than others. The algorithm is based on reinforcement learning, with the distance between pages treated as a punishment factor. Since related pages are normally linked to each other, the distance-based solution can find pages of high quality more quickly. In the PageRank algorithm, the rank of each page is defined as the weighted sum of the ranks of all pages having backlinks or incoming links to the page; a page then has a high rank if it has many backlinks or if the pages linking to it have high ranks. The analogous properties hold for DistanceRank: a page having many incoming links should have a low distance, and if the pages pointing to a page have low distances, then that page should also have a low distance. This point is made precise by the following definitions.

Fig. 8 A Sample Graph (pages m, n, o, p, q, r, s, t, u and v)

Definition 1. If page a points to page b, then the weight of the link between a and b is equal to log10 O(a), where O(a) is a's out-degree, i.e. its number of outgoing links.

Definition 2.
The distance between two pages a and b is the weight of the shortest path (the path with the minimum value)

from a to b. This is called the logarithmic distance and is denoted d_ab. For example, in Fig. 8 the weights of the outgoing links of pages m, n, o and p are log(3), log(2), log(2) and log(3) respectively, and the distance between m and t is log(3) + log(2) if the path m → o → t is the shortest path between m and t. The distance between m and v is log(3) + log(3), as shown in Fig. 8; even though t and v are at the same link level from m (two clicks), t is closer to m.

Definition 3. If d_ab denotes the distance between two pages a and b as in Definition 2, then d_b denotes the average distance of page b, defined as in (21), where V is the number of Web pages:

d_b = (Σ_{a=1}^{V} d_ab) / V   (21)

In this definition the authors use the average click instead of the classical distance definition. The weight of each link out of page a is log(O(a)). If there is no path between a and b, then d_ab is set to a big value. In this method, after the distance computation, the pages are sorted in ascending order, and pages with smaller average distances get higher ranks. The method depends on the out-degrees (outgoing links) of the nodes in the Web graph, like other algorithms. Apart from that, it also follows the random-surfer model [16] used in PageRank, in which each output link of page a is selected with probability 1/O(a). That is, the rank effect of page a on page b is the inverse product of the out-degrees of the pages on the logarithmic shortest path between a and b. For example, if the logarithmic shortest path of length 3 from a to b is a → c → d → b, then a's effect on b is (1/O(a)) · (1/O(c)) · (1/O(d)) · (1/O(b)); in other words, the probability that a random surfer starting from page a reaches page b is (1/O(a)) · (1/O(c)) · (1/O(d)) · (1/O(b)).
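Definitions 1-3 can be sketched directly with Dijkstra's algorithm, using log10 of the out-degree as the link weight. Since the full edge list of Fig. 8 is not reproduced here, the sketch reuses the four-page graph of Fig. 5, and the fallback for unreachable pages follows the "big value" convention mentioned above:

```python
import heapq
from math import log10

# DistanceRank sketch: the shortest logarithmic distance of Definitions 1-3,
# computed with Dijkstra's algorithm (link weight = log10 of the out-degree),
# then averaged as in (21). Applied to the four-page graph of Fig. 5.
out_links = {"A": ["B", "C"], "B": ["A", "C", "D"], "C": ["A", "B", "D"], "D": ["A"]}
BIG = 10.0  # stand-in "big value" for unreachable pages

def distances_from(src):
    """Shortest logarithmic distance from src to every reachable page."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, BIG):
            continue
        w = log10(len(out_links[u]))          # punishment for leaving u
        for v in out_links[u]:
            if d + w < dist.get(v, BIG):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist

# Average distance d_b of equation (21); smaller average = higher rank.
n = len(out_links)
avg = {b: sum(distances_from(a).get(b, BIG) for a in out_links) / n for b in out_links}
ranking = sorted(avg, key=avg.get)
print(ranking, {b: round(v, 3) for b, v in avg.items()})
```

On this graph, page A gets the smallest average distance and therefore the top rank, agreeing with its top PageRank, while D, reachable only through B or C, ranks last.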
According to the authors, if the distance between a and b, d_ab, is less than the distance between a and c, d_ac, then a's rank effect r_ab on b is more than its effect on c, i.e., if d_ab < d_ac then r_ab > r_ac. In other words, the probability that a random surfer reaches b from a is more than the probability of reaching c. The purpose of DistanceRank is to compute the average distance of each page, and there is a dependency between the distance of each page and its incoming links or back links. For example, if page b has only one back link and it is from page a, then to compute the average distance for page b, d_b is as follows:

d_b = d_a + log(O(a)) (22)

In general, suppose O(a) denotes the number of forwarding or outgoing links from page a and B(b) denotes the set of pages pointing to page b. The DistanceRank of page b, denoted by d_b, is given as follows:

d_b = min_{a ∈ B(b)} (d_a + log O(a)) (23)

The distance d_t in Fig. 8 is calculated as follows:

d_t = min{d_o + log 2, d_p + log 3} = min{d_m + log 3 + log 2, d_m + log 3 + log 3} = d_m + log 3 + log 2

According to the authors, DistanceRank is similar to PageRank in ranking pages. Using (23), the authors proposed the following formula, based on Q-learning, a type of reinforcement learning algorithm [24], to compute the distance of page b (a links to b):

d_b(t+1) = (1 − α) * d_b(t) + α * min_{a ∈ B(b)} (log(O(a)) + γ * d_a(t)), 0 < α ≤ 1, 0 ≤ γ ≤ 1 (24)

where α is the learning rate and log(O(a)) is the instantaneous punishment received in the transition from state a to b. d_b(t) and d_a(t) denote the distances of pages b and a at time t, and d_b(t+1) is the distance of page b at time t + 1. In other words, the distance of page b at time t + 1 depends on its previous distance, its father's distance (d_a) and log(O(a)), the instantaneous punishment from the selection of page b by the user. The discount factor γ is used to regulate the effect of the distances of the pages on the path leading to page b on the distance of page b.
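A minimal sketch of the update in (24), combined with the decaying learning rate of (25), might look as follows (the example graph, β, γ and the iteration count are illustrative assumptions, not values from the paper):

```python
from math import exp, log

def distance_rank(out_links, beta=0.5, gamma=1.0, iters=30):
    """Iterate the DistanceRank update (24): each page b is relaxed
    toward the minimum over its in-neighbours a of log(O(a)) + gamma * d_a,
    with learning rate alpha = exp(-beta * t) decaying as in (25)."""
    pages = list(out_links)
    in_links = {p: [] for p in pages}
    for a, outs in out_links.items():
        for b in outs:
            in_links[b].append(a)
    BIG = 1e9  # stands in for "no path" distances
    d = {p: 0.0 for p in pages}  # distances unknown at the start
    for t in range(iters):
        alpha = exp(-beta * t)  # equals 1 at t = 0, decays to 0
        new = {}
        for b in pages:
            cands = [log(len(out_links[a])) + gamma * d[a] for a in in_links[b]]
            target = min(cands) if cands else BIG
            new[b] = (1 - alpha) * d[b] + alpha * target
        d = new
    return d  # pages are then ranked in ascending order of d
```

Pages with smaller final distances rank higher; in a symmetric graph, symmetric pages receive identical distances.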
For example, if there is a path m → n → o → p, then the effect of the distance of m on o is regulated with a γ factor. In this fashion, the sum of received punishments decreases. Since (24) is based on a reinforcement learning algorithm, it will finally converge and reach the global optimum state [24]. The following (25) shows the learning rate α, where t is the time or iteration number and β is a static value controlling the regularity of the learning rate. If the learning rate is properly adjusted, the system will converge and reach the stable state very fast with a high throughput. In the beginning the distances of pages are not known, so initially α is set to one and then decreases exponentially to zero.

α = e^(−β*t) (25)

According to the authors, the user is an agent surfing the web randomly, and at each step it receives some punishment from the environment. The goal is to minimize the sum of punishments. In each state, the agent has some selections, the next pages to click, and the page with the minimum received punishment is selected as the next page to visit. With that, (24) can be rephrased as follows:

d_b = (1 − α) * (previous punishment of selecting b) + α * (instantaneous punishment the user receives from selecting b, plus the discounted punishment of the path leading to b)

So d_b is the total punishment an agent receives from selecting page b. This system tries to simulate a real user surfing the web. When a user starts browsing at a random page, he/she does not have any background about the web. Then, by browsing and visiting web pages, he/she clicks links based on both the current status of web pages and previous experience. As time goes on, the user gains knowledge of browsing and reaches favorite pages faster. DistanceRank uses the same kind of approach as a real user: it initially sets α = 1, and as the system gets more information after visiting more pages, α decreases and the next pages are selected effectively. DistanceRank is computed recursively like PageRank, as shown in (24). The process iterates until convergence. It is possible [23] to compute the distances with O(p * |E|) time complexity, where p << |V|, which is very close to the ideal state. For instance, p is 7 for 7 million pages, implying that 7 iterations are enough for an acceptable ranking. After convergence, the DistanceRank vector is produced. Pages with low DistanceRank have high ranking and are sorted in ascending order.

The authors used two scenarios for experimental purposes: one is crawl scheduling and the other is rank ordering. The objective of crawl scheduling is to find more important pages faster. In rank ordering, DistanceRank is compared with PageRank and Google's rank, with and without respect to a user query. Based on the experimental results obtained by the authors, the crawling algorithm used by DistanceRank outperforms [23] other algorithms such as Breadth-first, Partial PageRank, Back-Link and OPIC in terms of throughput. That is, DistanceRank finds highly important pages faster than the other algorithms. On rank ordering, too, DistanceRank was better than PageRank and Google's rank; the results of DistanceRank are closer to Google's than those of PageRank.

DistanceRank and ranking problems

One of the main problems of current search engines is rich-get-richer, which causes new high-quality pages to receive less popularity. To study this problem further, Cho et al [26] proposed two models of how users discover new pages: the Random-Surfer model and the Search-Dominant model.
In Random-Surfer, a user finds new pages by surfing the web randomly without the help of search engines, while in the Search-Dominant model a user finds new pages using search engines. The authors found that it takes 60 times longer for a new page to become popular under the Search-Dominant model than under the Random-Surfer model. If a ranking algorithm can find new high-quality pages and increase their popularity earlier [25], then that algorithm is less sensitive to the rich-get-richer problem. That is, the algorithm should predict the popularity that pages will get in the future. The DistanceRank algorithm is less sensitive to the rich-get-richer problem. It also provides good prediction of pages for future ranking. The convergence speed of the algorithm is fast, with a small number of iterations. In DistanceRank, it is not necessary to change the web graph for the computation; therefore some parameters like the damping factor can be removed, and it can work on the real graph.

F. DirichletRank Algorithm

The DirichletRank algorithm was proposed by Xuanhui Wang et al [27]. This algorithm eliminates the zero-one gap problem found in the PageRank algorithm proposed by Brin and Page [19]. The zero-one gap problem occurs due to the current ad hoc way of computing transition probabilities in the random surfing model. The authors proposed a novel DirichletRank algorithm which calculates the probabilities using Bayesian estimation with a Dirichlet prior. The zero-one gap problem can be exploited to spam PageRank results and make the state-of-the-art link-based anti-spamming techniques ineffective. DirichletRank is a form of PageRank, and the authors have shown that the DirichletRank algorithm is free from the zero-one gap problem. They have also proved that the algorithm is more robust against several common link spams and more stable under link perturbations. The authors also claim that it is as efficient as PageRank and scalable to large-scale web applications.
Search engines are getting more and more popular, and they have become the default method for information acquisition. Everybody wants their pages to be on top of the search results. This leads to web spamming. Web spamming [28] is a method to maliciously induce bias in search engines so that certain target pages are ranked much higher than they deserve. Consequently, it leads to poor quality of search results and in turn reduces trust in the search engine. Anti-spamming is now a big challenge for all search engines. Earlier, web spamming was done by adding a variety of query keywords to page contents regardless of their relevance. This type of spamming is easy to detect, and spammers are now trying to use link spamming [29] after the popularity of link-based algorithms like PageRank. In link spamming, spammers intentionally set up link structures involving a lot of interconnected pages to boost the PageRank scores of a small number of target pages. Such link spamming not only increases rank gains but is also harder for search engines to detect. Fig. 9 (b) shows a sample link spam structure. Here, leakage refers to the PageRank scores that reach the link farm from external pages. In this scheme, a web owner creates a large number of bogus web pages called B's (their sole purpose is to promote the target page's ranking score), all pointing to and pointed to by a single target page T. PageRank assigns a higher ranking score to T than it deserves (sometimes up to 10 times the original score) because it can be deceived by link spamming. Xuanhui Wang et al [27] proved that PageRank has a zero-one gap flaw which can potentially be exploited by spammers to easily spam PageRank results. The zero-one gap problem arises from the ad hoc way of computing the transition probabilities in the currently adopted random surfing model. The probability that the random surfer clicks on one link is solely given by the number of links on that page.
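The plain random-surfer computation that this transition rule leads to can be sketched as a power iteration (the example graph, damping value and iteration count are illustrative assumptions):

```python
def pagerank(out_links, d=0.85, iters=50):
    """Power-iteration PageRank: each page's score is (1 - d)/N plus
    d times the sum of the scores of its in-neighbours, each divided by
    that neighbour's out-degree; dangling mass is spread uniformly."""
    pages = list(out_links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iters):
        dangling = sum(pr[p] for p in pages if not out_links[p])
        new = {p: (1 - d) / n + d * dangling / n for p in pages}
        for a in pages:
            for b in out_links[a]:
                new[b] += d * pr[a] / len(out_links[a])
        pr = new
    return pr
```

With out_links = {"a": ["b"], "b": ["a"], "c": ["b"]}, page b, having two in-links, ends up with the largest score, and the scores sum to 1.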
This is why one page's PageRank is not completely passed on to a page it links to, but is divided by the number of links on the page. So, the probability of the random surfer reaching one page is the sum of the probabilities of the random surfer following links to that page. Now, this probability is reduced by the damping factor d. The justification within the Random Surfer Model, therefore, is that the surfer does not click on an infinite number of links, but gets bored sometimes and jumps to another page at random. The zero-one gap problem refers to the unreasonably dramatic difference between a page with no out-link and one with a single out-link in their probabilities of randomly jumping to any page. The authors provided a novel DirichletRank algorithm based on Bayesian estimation with a Dirichlet prior to solve the zero-one gap problem, especially in the transition probabilities.

Zero-one gap problem

The basic PageRank assumes each row of the matrix M has at least one non-zero entry, i.e., the corresponding node in G has at least one out-link. But in reality this is not true. Many web pages do not have any out-links, and many web applications only consider a sub-graph of the whole web. Even if a page has out-links, they might have been removed when the whole web is projected to a sub-graph. Removing all the pages without out-links is not a solution because it generates new zero-link pages. This dangling page problem has been described by Brin and Page [16], Bianchi et al [30] and Ding et al [31]. The probability of jumping to random pages is 1 for a zero-out-link page, but it drops to λ (in most cases, λ = 0.15) for a page with a single out-link. There is a big difference between 0 and 1 out-links. This problem is referred to as the zero-one gap. It is a serious flaw in PageRank because it allows spammers to manipulate the ranking results of PageRank.

DirichletRank Algorithm

The DirichletRank algorithm is based on Bayesian estimation of the transition probabilities. According to the authors, this algorithm not only solves the zero-one gap problem but also provides a way to solve the zero-out-link problem. The authors compared DirichletRank with PageRank and showed that DirichletRank is less sensitive to changes in the local structure and more robust than PageRank. In DirichletRank, a surfer is more likely to follow the links of the current page if the page has many out-links. Bayesian estimation provides a proper way of setting the transition probabilities, and the authors showed that it not only solves the zero-out-link problem but also solves the zero-one gap problem.
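The contrast between the two jumping schemes can be made concrete with a small sketch (µ = 20 and λ = 0.15 are the values quoted in the text):

```python
def jump_prob_pagerank(n, lam=0.15):
    """Standard PageRank's random-jump probability: 1 for a page with
    no out-links, a constant lam for any page that has out-links."""
    return 1.0 if n == 0 else lam

def jump_prob_dirichletrank(n, mu=20):
    """DirichletRank's smoothed jump probability (26): mu / (n + mu)."""
    return mu / (n + mu)
```

Going from 0 to 1 out-link drops the PageRank jump probability from 1 all the way to 0.15, while the DirichletRank probability only falls from 1 to 20/21 ≈ 0.95 and then decays gradually as n grows: exactly the smoothing that closes the zero-one gap.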
Fig. 9 (a) shows a structure with link spamming (only one in-link) and Fig. 9 (b) shows a typical spamming structure with all bogus pages B's having back links to the target page T. The authors denote r_o(.) as the PageRank score in Fig. 9 (a) and r_s(.) as the PageRank score in Fig. 9 (b). The authors proved that r_s(T) ≥ r_o(T) over the range of all λ values. Usually a small λ is preferred in PageRank, so the result r_s(T) is much larger than r_o(T). For example, if λ = 0.15, r_s(T) is about 3 times larger than r_o(T). In Fig. 9 (b), the addition of the bogus pages makes the PageRank score of the target page 3 times larger than before. This is because a surfer is forced to jump back to the target page with a high probability in Fig. 9 (b). With the default value of λ = 0.15, the single out-link in a bogus page forces a surfer to jump back to the target page with a probability of 1 − λ = 0.85. This zero-one-gap problem is a serious flaw of PageRank, which makes it sensitive to a local structure change and thus vulnerable to link spamming.

Fig. 9 Sample contrast structures

The random jumping probability of DirichletRank is

ω(n) = µ / (n + µ), n ≥ 0 (26)

where n is the number of out-links and µ is the Dirichlet parameter. The authors set µ = 20, plotted ω(n), and showed that the jumping probability in DirichletRank is smooth, with no gap between 0 and 1 out-links. The authors calculated the DirichletRank scores d_o(.) and d_s(.) for the structures shown in Fig. 9 (a) and (b) using the formula

d_o(T) = σ + τ/N (27)

together with a corresponding closed-form expression for d_s(T) in terms of k, µ, σ, τ and N (28), and showed that d_s(T) ≥ d_o(T) for any positive integer k. The authors claimed that they obtained scores similar to PageRank, i.e., d_s(T) is consistently larger than or equal to d_o(T), but d_s(T) is in fact close to d_o(T). This also shows that there is no significant change in T's DirichletRank scores before and after spamming. Hence DirichletRank is more stable and less sensitive to changes in the local structure. DirichletRank also does not incur extra time cost, which makes it suitable for Web-scale applications. The authors also proved that DirichletRank is more stable than PageRank during link perturbation, i.e., when a small number of links or pages is removed. Stability is an important factor for a reliable ranking algorithm. Their experimental results showed that DirichletRank is more effective than PageRank due to its more reasonable allocation of transition probabilities.

TABLE II
COMPARISON OF PAGERANK BASED ALGORITHMS

Table II shows the comparison [2] of all the algorithms discussed above. The main criteria used for comparison are the mining technique used, working method, input parameters, complexity, limitations and the search engine using the algorithm. Among all the algorithms, PageRank and HITS are the most important ones. PageRank is the only algorithm implemented in the Google search engine. HITS is used in the IBM prototype search engine called Clever. A similar algorithm was used in the Teoma search engine, which was later acquired by Ask.com. HITS cannot be implemented directly in a search engine due to its topic drift and efficiency problems. That is the reason we have taken the PageRank algorithm and implemented it as a Java program.

Criteria | PageRank | Weighted PageRank | HITS | DistanceRank | DirichletRank
Mining technique used | WSM | WSM | WSM & WCM | WSM | WSM
Working | Computes scores at index time; results are sorted on the importance of pages | Computes scores at index time; results are sorted on page importance | Computes scores of n highly relevant pages on the fly | Computes scores by calculating the minimum average distance between pages | Works the same as PageRank but computes transition probabilities using Bayesian estimation
I/P Parameters | Backlinks | Backlinks, forward links | Backlinks, forward links & content | Backlinks | Backlinks
Complexity | O(log N) | <O(log N) | <O(log N) | O(log N) | O(log N)
Limitations | Query independent | Query independent | Topic drift and efficiency problem | Needs to work along with PageRank | Needs to work along with PageRank
Search Engine | Google | Research model | Clever | Research model | Research model

IV. SIMULATION RESULTS

The program was developed for the PageRank algorithm using Java and tested on an Intel Core 2 Duo machine with 4GB RAM. The input screen is shown in Fig. 10: the user can select the input file, which contains the number of nodes and the number of incoming and outgoing links of the nodes. The output is shown in Fig. 11 (a) and (b): Fig. 11 (a) is the output of the PageRank convergence scores and Fig. 11 (b) is the XY chart of the PageRank scores.

Fig. 10 PageRank Program Input Entry Window

Fig. 11 (a) PageRank Convergence Scores


More information

An Improved Computation of the PageRank Algorithm 1

An Improved Computation of the PageRank Algorithm 1 An Improved Computation of the PageRank Algorithm Sung Jin Kim, Sang Ho Lee School of Computing, Soongsil University, Korea ace@nowuri.net, shlee@computing.ssu.ac.kr http://orion.soongsil.ac.kr/ Abstract.

More information

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING

EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING Chapter 3 EXTRACTION OF RELEVANT WEB PAGES USING DATA MINING 3.1 INTRODUCTION Generally web pages are retrieved with the help of search engines which deploy crawlers for downloading purpose. Given a query,

More information

Lec 8: Adaptive Information Retrieval 2

Lec 8: Adaptive Information Retrieval 2 Lec 8: Adaptive Information Retrieval 2 Advaith Siddharthan Introduction to Information Retrieval by Manning, Raghavan & Schütze. Website: http://nlp.stanford.edu/ir-book/ Linear Algebra Revision Vectors:

More information

How to organize the Web?

How to organize the Web? How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper

More information

15-451/651: Design & Analysis of Algorithms November 20, 2018 Lecture #23: Closest Pairs last changed: November 13, 2018

15-451/651: Design & Analysis of Algorithms November 20, 2018 Lecture #23: Closest Pairs last changed: November 13, 2018 15-451/651: Design & Analysis of Algorithms November 20, 2018 Lecture #23: Closest Pairs last changed: November 13, 2018 1 Prelimaries We ll give two algorithms for the followg closest pair problem: Given

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford

More information

Recent Researches on Web Page Ranking

Recent Researches on Web Page Ranking Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India Importance of Web Page Internet Surfers generally do not bother to go through

More information

Data Center Network Topologies using Optical Packet Switches

Data Center Network Topologies using Optical Packet Switches 1 Data Center Network Topologies usg Optical Packet Yuichi Ohsita, Masayuki Murata Graduate School of Economics, Osaka University Graduate School of Information Science and Technology, Osaka University

More information

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google Web Search Engines 1 Web Search before Google Web Search Engines (WSEs) of the first generation (up to 1998) Identified relevance with topic-relateness Based on keywords inserted by web page creators (META

More information

INTRODUCTION. Chapter GENERAL

INTRODUCTION. Chapter GENERAL Chapter 1 INTRODUCTION 1.1 GENERAL The World Wide Web (WWW) [1] is a system of interlinked hypertext documents accessed via the Internet. It is an interactive world of shared information through which

More information

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5) INTRODUCTION TO DATA SCIENCE Link Analysis (MMDS5) Introduction Motivation: accurate web search Spammers: want you to land on their pages Google s PageRank and variants TrustRank Hubs and Authorities (HITS)

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org J.

More information

Analysis of Large Graphs: TrustRank and WebSpam

Analysis of Large Graphs: TrustRank and WebSpam Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

The Anatomy of a Large-Scale Hypertextual Web Search Engine

The Anatomy of a Large-Scale Hypertextual Web Search Engine The Anatomy of a Large-Scale Hypertextual Web Search Engine Article by: Larry Page and Sergey Brin Computer Networks 30(1-7):107-117, 1998 1 1. Introduction The authors: Lawrence Page, Sergey Brin started

More information

Experimental study of Web Page Ranking Algorithms

Experimental study of Web Page Ranking Algorithms IOSR IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. II (Mar-pr. 2014), PP 100-106 Experimental study of Web Page Ranking lgorithms Rachna

More information

7.3.3 A Language With Nested Procedure Declarations

7.3.3 A Language With Nested Procedure Declarations 7.3. ACCESS TO NONLOCAL DATA ON THE STACK 443 7.3.3 A Language With Nested Procedure Declarations The C family of languages, and many other familiar languages do not support nested procedures, so we troduce

More information

Survey on Web Structure Mining

Survey on Web Structure Mining Survey on Web Structure Mining Hiep T. Nguyen Tri, Nam Hoai Nguyen Department of Electronics and Computer Engineering Chonnam National University Republic of Korea Email: tuanhiep1232@gmail.com Abstract

More information

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit Page 1 of 14 Retrieving Information from the Web Database and Information Retrieval (IR) Systems both manage data! The data of an IR system is a collection of documents (or pages) User tasks: Browsing

More information

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

More information

Weighted PageRank using the Rank Improvement

Weighted PageRank using the Rank Improvement International Journal of Scientific and Research Publications, Volume 3, Issue 7, July 2013 1 Weighted PageRank using the Rank Improvement Rashmi Rani *, Vinod Jain ** * B.S.Anangpuria. Institute of Technology

More information

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule 1 How big is the Web How big is the Web? In the past, this question

More information

Information Retrieval Spring Web retrieval

Information Retrieval Spring Web retrieval Information Retrieval Spring 2016 Web retrieval The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically infinite due to the dynamic

More information

15-451/651: Design & Analysis of Algorithms April 18, 2016 Lecture #25 Closest Pairs last changed: April 18, 2016

15-451/651: Design & Analysis of Algorithms April 18, 2016 Lecture #25 Closest Pairs last changed: April 18, 2016 15-451/651: Design & Analysis of Algorithms April 18, 2016 Lecture #25 Closest Pairs last changed: April 18, 2016 1 Prelimaries We ll give two algorithms for the followg closest pair proglem: Given n pots

More information

3. Sequential Logic 1

3. Sequential Logic 1 Chapter 3: Sequential Logic 1 3. Sequential Logic 1 It's a poor sort of memory that only works backward. Lewis Carroll (1832-1898) All the Boolean and arithmetic chips that we built previous chapters were

More information

On Secure Distributed Data Storage Under Repair Dynamics

On Secure Distributed Data Storage Under Repair Dynamics On Secure Distributed Data Storage Under Repair Dynamics Sameer Pawar Salim El Rouayheb Kannan Ramchandran Electrical Engeerg and Computer Sciences University of California at Berkeley Technical Report

More information

A New Technique for Ranking Web Pages and Adwords

A New Technique for Ranking Web Pages and Adwords A New Technique for Ranking Web Pages and Adwords K. P. Shyam Sharath Jagannathan Maheswari Rajavel, Ph.D ABSTRACT Web mining is an active research area which mainly deals with the application on data

More information

A Survey on Web Information Retrieval Technologies

A Survey on Web Information Retrieval Technologies A Survey on Web Information Retrieval Technologies Lan Huang Computer Science Department State University of New York, Stony Brook Presented by Kajal Miyan Michigan State University Overview Web Information

More information

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM

CHAPTER THREE INFORMATION RETRIEVAL SYSTEM CHAPTER THREE INFORMATION RETRIEVAL SYSTEM 3.1 INTRODUCTION Search engine is one of the most effective and prominent method to find information online. It has become an essential part of life for almost

More information

Information Retrieval May 15. Web retrieval

Information Retrieval May 15. Web retrieval Information Retrieval May 15 Web retrieval What s so special about the Web? The Web Large Changing fast Public - No control over editing or contents Spam and Advertisement How big is the Web? Practically

More information

An Adaptive Approach in Web Search Algorithm

An Adaptive Approach in Web Search Algorithm International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 15 (2014), pp. 1575-1581 International Research Publications House http://www. irphouse.com An Adaptive Approach

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank

More information

Strong Dominating Sets of Direct Product Graph of Cayley Graphs with Arithmetic Graphs

Strong Dominating Sets of Direct Product Graph of Cayley Graphs with Arithmetic Graphs International Journal of Applied Information Systems (IJAIS) ISSN : 9-088 Volume No., June 01 www.ijais.org Strong Domatg Sets of Direct Product Graph of Cayley Graphs with Arithmetic Graphs M. Manjuri

More information

TODAY S LECTURE HYPERTEXT AND

TODAY S LECTURE HYPERTEXT AND LINK ANALYSIS TODAY S LECTURE HYPERTEXT AND LINKS We look beyond the content of documents We begin to look at the hyperlinks between them Address questions like Do the links represent a conferral of authority

More information

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004

Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Web Search Engines: Solutions to Final Exam, Part I December 13, 2004 Problem 1: A. In using the vector model to compare the similarity of two documents, why is it desirable to normalize the vectors to

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Web Search Instructor: Walid Magdy 14-Nov-2017 Lecture Objectives Learn about: Working with Massive data Link analysis (PageRank) Anchor text 2 1 The Web Document

More information

Performance of Integrated Wired (Token Ring) Network and Wireless (PRMA) Network

Performance of Integrated Wired (Token Ring) Network and Wireless (PRMA) Network Performance of Integrated Wired (Token Rg etwork and Wireless (PRMA etwork Amaal Al-Amawi Al-Isra Private University, Jordan amaal_amawi@hotmailcom Khalid Khanfar The Arab Academy for Bankg and Fancial

More information

Mining Web Data. Lijun Zhang

Mining Web Data. Lijun Zhang Mining Web Data Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Web Crawling and Resource Discovery Search Engine Indexing and Query Processing Ranking Algorithms Recommender Systems

More information

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 Mutually recursive definition: A hub links to many authorities; An authority is linked to by many hubs. Authorities turn out to be places where information can

More information

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Administrative. Web crawlers. Web Crawlers and Link Analysis! Web Crawlers and Link Analysis! David Kauchak cs458 Fall 2011 adapted from: http://www.stanford.edu/class/cs276/handouts/lecture15-linkanalysis.ppt http://webcourse.cs.technion.ac.il/236522/spring2007/ho/wcfiles/tutorial05.ppt

More information

A Bayes Learning-based Anomaly Detection Approach in Large-scale Networks. Wei-song HE a*

A Bayes Learning-based Anomaly Detection Approach in Large-scale Networks. Wei-song HE a* 17 nd International Conference on Computer Science and Technology (CST 17) ISBN: 978-1-69-461- A Bayes Learng-based Anomaly Detection Approach Large-scale Networks Wei-song HE a* Department of Electronic

More information

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION International Journal of Computer Engineering and Applications, Volume IX, Issue VIII, Sep. 15 www.ijcea.com ISSN 2321-3469 COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

More information

Fully Persistent Graphs Which One To Choose?

Fully Persistent Graphs Which One To Choose? Fully Persistent Graphs Which One To Choose? Mart Erwig FernUniversität Hagen, Praktische Informatik IV D-58084 Hagen, Germany erwig@fernuni-hagen.de Abstract. Functional programs, by nature, operate on

More information

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node A Modified Algorithm to Handle Dangling Pages using Hypothetical Node Shipra Srivastava Student Department of Computer Science & Engineering Thapar University, Patiala, 147001 (India) Rinkle Rani Aggrawal

More information

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies Large-Scale Networks PageRank Dr Vincent Gramoli Lecturer School of Information Technologies Introduction Last week we talked about: - Hubs whose scores depend on the authority of the nodes they point

More information

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages An Enhanced Page Ranking Algorithm Based on eights and Third level Ranking of the ebpages Prahlad Kumar Sharma* 1, Sanjay Tiwari #2 M.Tech Scholar, Department of C.S.E, A.I.E.T Jaipur Raj.(India) Asst.

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS3245 12 Lecture 12: Crawling and Link Analysis Information Retrieval Last Time Chapter 11 1. Probabilistic Approach to Retrieval / Basic Probability Theory 2. Probability

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #10: Link Analysis-2 Seoul National University 1 In This Lecture Pagerank: Google formulation Make the solution to converge Computing Pagerank for very large graphs

More information

Improving the Ranking Capability of the Hyperlink Based Search Engines Using Heuristic Approach

Improving the Ranking Capability of the Hyperlink Based Search Engines Using Heuristic Approach Journal of Computer Science 2 (8): 638-645, 2006 ISSN 1549-3636 2006 Science Publications Improving the Ranking Capability of the Hyperlink Based Search Engines Using Heuristic Approach 1 Haider A. Ramadhan,

More information