Volume 3, Issue 9, September 2013 ISSN: 2277 128X
International Journal of Advanced Research in Computer Science and Software Engineering
Research Paper
Available online at: www.ijarcsse.com

A New-Fangled Algorithm Based on Popularity of Pages to Boost Page Ranking
Dr. Sandeep Gupta
Department of Computer Science & Engineering, NIET, Gr. Noida (Affiliated to Mahamaya Technical University, Noida), UP, INDIA

Abstract: The current generation of WWW search engines reportedly makes extensive use of evidence derived from the structure of the WWW to better match relevant documents and to identify potentially authoritative pages. Despite this reported use, however, there has been little analysis that supports the inclusion of web evidence in document ranking, or that examines precisely what its effect on search results might be. The success of document ranking in the current generation of WWW search engines is attributed to a number of web analysis techniques. Two page ranking algorithms, HITS and PageRank, are commonly used in web structure mining. Both algorithms treat all links equally when distributing rank scores. Several algorithms have been developed to improve their performance. We introduce the Credence PageRank (CPR) algorithm to improve the performance of web page ranking.

Keywords: Web Mining, Web Usage Mining, Page Rank, Web Map, CPR.

I. INTRODUCTION
During the past few years the World Wide Web has become the largest and most popular medium of communication and information dissemination. It serves as a platform for exchanging many kinds of information, ranging from research papers and educational content to multimedia content, software, and personal logs (blogs). Every day the web grows by roughly a million electronic pages, adding to the hundreds of millions of pages already online. Because of its rapid and chaotic growth, the resulting network of information lacks organization and structure, and users often feel disoriented and get lost in an information overload that continues to expand.
On the other hand, the e-business sector is rapidly evolving, and the need for web marketplaces that anticipate the needs of their customers is more evident than ever. The ultimate need nowadays is therefore to predict user needs in order to improve the usability and user retention of a web site.

II. HYPERLINK RECOMMENDATION
A. Link counting / in-degree
A page's in-degree score is a measure of its direct prestige and is obtained by counting its incoming links. It is widely believed that a web page's in-degree may give some indication of its importance or popularity [2].

B. PageRank
The PageRank algorithm, one of the most widely used page ranking algorithms, states that if a page has important links to it, its links to other pages also become important. PageRank therefore takes the backlinks into account and propagates the ranking through links: a page has a high rank if the sum of the ranks of its backlinks is high [3][4]. Figure 1 shows an example of backlinks: page A is a backlink of page B and page C, while page B and page C are backlinks of page D. A slightly simplified version of PageRank [3][4] is defined as

PR(u) = c Σ_{v ∈ B(u)} PR(v) / N_v

where u represents a web page, B(u) is the set of pages that point to u, PR(u) and PR(v) are the rank scores of pages u and v respectively, N_v denotes the number of outgoing links of page v, and c is a factor used for normalization. In PageRank, the rank score of a page p is evenly divided among its outgoing links, and the values assigned to the outgoing links of page p are in turn used to calculate the ranks of the pages to which page p points. The rank scores of the pages of a website can be calculated iteratively starting from any webpage. Within a website, two or more pages might link to each other to form a loop. If these pages did not refer to, but were referred to by, other webpages outside the loop, they would accumulate rank but never distribute any. This scenario is called a rank sink. To address the rank sink problem, we observed users' activities.
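As a concrete illustration, the simplified update above can be sketched in Python. The four-page graph mirrors the backlink example of Figure 1 (A points to B and C; B and C point to D), with an assumed extra link from D back to A so that every page has at least one outgoing link; the graph and the choice of c are illustrative assumptions, not part of the original experiments.

```python
def simplified_pagerank(outlinks, c=0.9, iterations=50):
    """outlinks maps each page to the list of pages it links to."""
    pages = list(outlinks)
    pr = {p: 1.0 / len(pages) for p in pages}          # uniform start
    for _ in range(iterations):
        nxt = {p: 0.0 for p in pages}
        for v, targets in outlinks.items():
            share = pr[v] / len(targets)               # PR(v) / N_v
            for u in targets:                          # v is a backlink of u
                nxt[u] += share
        pr = {p: c * r for p, r in nxt.items()}        # scale by the factor c
    return pr

# Illustrative graph: A -> B, C; B -> D; C -> D; assumed D -> A
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["A"]}
ranks = simplified_pagerank(graph)
```

Because B and C have the same single backlink (A), they always end up with identical scores, which is exactly the "rank divided evenly over outgoing links" behavior described above.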
We found that not all users follow the existing links [7]. For example, after viewing page a, some users may not follow the existing links but instead go directly to page b, which is not directly linked from page a.

2013, IJARCSSE All Rights Reserved Page 1088
For this purpose, the users simply type the URL of page b into the URL text field and jump to page b directly. In this case, the rank of page b should be affected by page a even though the two pages are not directly connected. Therefore, there is no absolute rank sink.

Figure 1. An example of backlinks

PR(u) = (1 − d) + d Σ_{v ∈ B(u)} PR(v) / N_v

where d is a dampening factor that is usually set to 0.85. We can also think of d as the probability of users following the links, and regard (1 − d) as the PageRank distribution from non-directly linked pages. To test the utility of the PageRank algorithm, Google applied it in the Google search engine [4]. In those experiments, the PageRank algorithm worked efficiently and effectively because the rank values converge to a reasonable tolerance in roughly log n iterations [3][4]. The rank score of a web page is divided evenly over the pages to which it links. Even though the PageRank algorithm is used successfully in Google, one problem remains: in the actual web, some links in a web page may be more important than others.

III. CREDENCE PAGERANK (CPR)
The more popular web pages are, the more links other webpages tend to have to them or to receive from them. The proposed extended PageRank algorithm assigns larger rank values to more important (popular) pages instead of dividing the rank value of a page evenly among its outlink pages: each outlink page gets a value proportional to its popularity, i.e., its numbers of inlinks and outlinks. The popularity derived from the numbers of inlinks and outlinks is recorded as C_in(v, u) and C_out(v, u), respectively. C_in(v, u) is the credit of link (v, u), calculated from the number of inlinks of page u and the numbers of inlinks of all reference pages of page v:

C_in(v, u) = I_u / Σ_{p ∈ R(v)} I_p

where I_u and I_p represent the numbers of inlinks of page u and page p, respectively, and R(v) denotes the reference page list of page v.
C_out(v, u) is the credit of link (v, u), calculated from the number of outlinks of page u and the numbers of outlinks of all reference pages of page v:

C_out(v, u) = O_u / Σ_{p ∈ R(v)} O_p

Figure 2. Links of a website
where O_u and O_p represent the numbers of outlinks of page u and page p, respectively, and R(v) denotes the reference page list of page v. Figure 2 shows an example of some links of a hypothetical website. In this example, page A has two reference pages, p1 and p2. The inlinks and outlinks of these two pages are I_p1 = 2, I_p2 = 1, O_p1 = 2, and O_p2 = 3. Therefore,

C_in(A, p1) = I_p1 / (I_p1 + I_p2) = 2/3 and C_out(A, p1) = O_p1 / (O_p1 + O_p2) = 2/5.

Considering the importance of pages, the original PageRank formula is modified as

PR(u) = (1 − d) + d Σ_{v ∈ B(u)} PR(v) C_in(v, u) C_out(v, u)

IV. EXPERIMENTAL EVALUATION
A. Methodology
To evaluate the CPR algorithm, we implemented both CPR and the standard PageRank algorithm and compared their results. Figure 6 illustrates the different components involved in the implementation and evaluation of the CPR algorithm. The simulation studies carried out in this work consist of six major activities:
1) Finding a web site: A web site with rich hyperlinks is necessary because both the standard PageRank and the CPR algorithms rely on the web structure. After comparing the structures of several web sites of various colleges, the website of Azad Institute of Pharmacy and Research (AIPR), Lucknow (www.aipr.ac.in) was chosen.
2) Building a web map: JSpider (an open-source spider) [5] and WebTracer (a software tool to visualize the structure of the web) [6] were used to generate the required web map. The link structure and web map of the AIPR website are shown in Figures 3 and 4.

Figure 3. Web Map of aipr.ac.in
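As a quick check of the Section III credit formulas, the Figure 2 worked example can be reproduced in a few lines of Python; the inlink and outlink counts are the ones stated in the text (I_p1 = 2, I_p2 = 1, O_p1 = 2, O_p2 = 3), while the function name and data layout are our own illustrative choices.

```python
def link_credits(inlinks, outlinks, reference_pages, u):
    """Credits C_in(v, u) and C_out(v, u) for a link (v, u),
    where reference_pages is R(v), the reference page list of v."""
    c_in = inlinks[u] / sum(inlinks[p] for p in reference_pages)
    c_out = outlinks[u] / sum(outlinks[p] for p in reference_pages)
    return c_in, c_out

# Counts taken from the Figure 2 example in the text
inlinks = {"p1": 2, "p2": 1}
outlinks = {"p1": 2, "p2": 3}
refs_of_A = ["p1", "p2"]                 # R(A) = {p1, p2}

c_in, c_out = link_credits(inlinks, outlinks, refs_of_A, "p1")
# C_in(A, p1) = 2 / (2 + 1) = 2/3; C_out(A, p1) = 2 / (2 + 3) = 2/5
```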
Figure 4. Link Structure output of aipr.ac.in using JSpider

3) Finding the root set: A search engine was developed to determine whether a webpage is very relevant, relevant, weakly relevant, or irrelevant with respect to a keyword (query). This search engine was developed in a .NET environment using C#; we named it the Relevancy Search Engine. A screenshot of its output is shown in Figure 5.

Figure 5(a). Main page of the Relevancy Search Engine (screenshot)
Figure 5(b). Search Result

A set of pages relevant to a given query is retrieved using this search engine; we call this set of pages the root set. We aim to embed this search engine in the web site.
4) Finding the base set: With the help of the web map, we create a base set by expanding the root set with the pages that directly point to, or are pointed to by, the pages in the root set.
5) Applying the algorithms: The standard PageRank and the CPR algorithms are applied to the base set.
6) Evaluating the results: The algorithms are evaluated by comparing their results.

Normally, websites in different domains focus on different topics. Websites usually have rich linkages describing their focused topics but do a poor job describing non-focused topics. To test the CPR algorithm for both focused and non-focused topics, we chose several queries from both categories.

Figure 6. Architectural components of the system used to implement and evaluate the CPR algorithm

V. EVALUATION
The query topics "travel agent" and "scholarship" are used in the evaluation of the CPR and standard PageRank algorithms. "Travel agent" represents a non-focused topic, whereas "scholarship" represents a focused (popular) topic in the website of AIPR, Lucknow. The results of the evaluation are summarized in the following subsections.

A. The determination of the relevancy of the pages to the given query
The standard PageRank and CPR algorithms provide important information about a given query by using the structure of the website, but some pages irrelevant to a given query are included in the results as well. For example, even though the home
page of Azad Institute of Pharmacy and Research, Lucknow (http://www.aipr.ac.in/) is not related to the given query, it still receives the highest rank because of its many inlinks and outlinks. To reduce the noise resulting from irrelevant pages, we categorized the pages in the results into four classes based on their relevancy to the given query: Very Relevant pages (VR), which contain very important information about the given query; Relevant pages (R), which have relevant but not important information about the given query; Weak-Relevant pages (WR), which do not have relevant information about the given query even though they contain its keywords; and Irrelevant pages (IR), which include neither the keywords of the given query nor relevant information about it. An objective categorization of the results (lists of pages) is achieved by integrating the responses of several people: for each page, we compared the counts of the four categories (VR, R, WR, and IR) and chose the category with the largest count as the type of that page.

B. The calculation of the relevancy of the page lists to the given query
The performances of the CPR and standard PageRank algorithms have been evaluated to identify the algorithm that produces better results (i.e., results more relevant to the given query). Both algorithms provide sorted lists (ranked pages) based on the given query; therefore, in the result list, the number of relevant pages and their order are of great importance. The following rule has been adopted to calculate the relevancy value of each page in the list of pages.

C. Relevancy Rule: The relevancy of a page to a given query depends on its category and its position in the page-list. The larger the relevancy value, the better the result.
The relevancy, κ, of a page-list is a function of the categories and positions of its pages:

κ = Σ_{i=1}^{n} (n − i) × C_i

where i denotes the ith page in the result page-list R(p), n represents the number of pages chosen from the top of the list R(p), and C_i is the weight of page i:

C_i = v1 if the ith page is VR; v2 if it is R; v3 if it is WR; v4 if it is IR

where v1 > v2 > v3 > v4. The values of C_i for an experiment can be decided through experimental studies. For our experiment, we set v1, v2, v3, and v4 to 1.0, 0.5, 0.1, and 0, respectively, based on the relevancy of each category.

The relevancy values for the query "travel agent" are shown in Table 1. In this table, relevant pages are the pages in category VR as well as in category R. Table 1 shows that CPR produces larger relevancy values, which indicates that CPR performs better than standard PageRank does; Figure 7 illustrates this performance. Moreover, two further points can be observed from Table 1. First, within the first 10 pages, one relevant page is identified by CPR whereas no relevant page is found by standard PageRank; this indicates that CPR may be able to identify more relevant pages near the top of the result list than standard PageRank can. Second, within the first 20 pages, the relevancy value obtained from CPR is larger than that obtained from standard PageRank, even though standard PageRank identifies one more relevant page; this indicates that the relevant pages determined by CPR are either more relevant or ranked higher in the list.

Table 1. The relevancy values for the query "travel agent" produced by PageRank and CPR using different page sets

Size of the page set | Relevant pages (PageRank) | Relevant pages (CPR) | κ (PageRank) | κ (CPR)
10 | 0 | 1 | 0.1 | 0.5
20 | 4 | 3 | 13.1 | 16.8
30 | 4 | 4 | 47.1 | 49.8
40 | 4 | 4 | 82.1 | 84.8
50 | 4 | 4 | 117.1 | 119.8
60 | 5 | 5 | 159.6 | 162.3
70 | 7 | 7 | 211.7 | 214.4
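The Relevancy Rule and the category weights above can be sketched as follows; the five-page result list is invented purely for illustration and is not one of the experimental page sets.

```python
WEIGHTS = {"VR": 1.0, "R": 0.5, "WR": 0.1, "IR": 0.0}   # v1 > v2 > v3 > v4

def relevancy_value(categories, n):
    """kappa = sum over the first n pages of (n - i) * C_i, with i = 1..n."""
    kappa = 0.0
    for i, cat in enumerate(categories[:n], start=1):
        kappa += (n - i) * WEIGHTS[cat]
    return kappa

# Hypothetical result list: positions 1..5 are VR, IR, R, WR, IR
pages = ["VR", "IR", "R", "WR", "IR"]
kappa = relevancy_value(pages, n=5)
# (5-1)*1.0 + (5-3)*0.5 + (5-4)*0.1 = 5.1
```

Note how the (n − i) factor rewards relevant pages that appear near the top of the list, which is why CPR can score higher in Table 1 even when it finds fewer relevant pages.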
Figure 7. The relevancy value versus the size of the page set for the query "travel agent"

Table 2. The relevancy values for the query "scholarship" produced by PageRank and CPR using different page sets

Size of the page set | Relevant pages (PageRank) | Relevant pages (CPR) | κ (PageRank) | κ (CPR)
5 | 2 | 3 | 2 | 5.5
10 | 2 | 4 | 9.5 | 22
20 | 4 | 4 | 34.5 | 57
30 | 8 | 5 | 87.5 | 99
40 | 10 | 8 | 158.5 | 159.3
80 | 16 | 15 | 624.8 | 655.3
100 | 22 | 19 | 999.2 | 1045.3
120 | 25 | 20 | 1470.4 | 1473.3

Figure 8. The relevancy value versus the size of the page set for the query "scholarship"

D. Focused topic queries
This subsection evaluates the results obtained for the query "scholarship", a focused topic within the website of Azad Institute of Pharmacy and Research, Lucknow. The relevancy values of the results are shown in Table 2. As with the query "travel agent", Figure 8 demonstrates that the CPR algorithm produces better results (larger relevancy values) for the query "scholarship". Moreover, the two points derived from the query "travel agent" appear even more clearly in this case (see Table 2).

VI. CONCLUSION
Web mining is used to extract information from users' past behavior, and web structure mining plays an important role in this approach. This paper introduces the CPR algorithm, an extension of the PageRank algorithm. CPR takes into account the importance of both the inlinks and the outlinks of pages and distributes rank scores based on the popularity of the pages.
Simulation studies using the website of AIPR, Lucknow show that CPR identifies a larger number of pages relevant to a given query than standard PageRank does.

REFERENCES
[1] Sandeep Gupta et al., "A Quantitative Approach to Perk up Navigation Efficiency", in Proceedings of the 1st International Conference on Management of Technologies & Information Security (ICMIS 2010), IIIT Allahabad, India, Jan 21-24, 2010, ISBN 978-81-8329-375-4, pp. 378-386.
[2] X. Zhu and S. Gauch, "Incorporating Quality Metrics in Centralized/Distributed Information Retrieval on the World Wide Web", Tech. Rep., Department of Electrical Engineering and Computer Science, University of Kansas, 2000.
[3] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Technical Report SIDL-WP-1999-0120, Stanford Digital Libraries, 1999.
[4] C. Ridings and M. Shishigin, "PageRank Uncovered", Technical Report, 2002.
[5] JSpider, http://www.j-spider.sourceforge.net/
[6] WebTracer2, www.nullpointer.co.uk/-/webtracer2.htm
[7] Sandeep Gupta et al., "Recuperating Website Link Structure Using Fuzzy Relations between the Content and Web Pages", International Journal of Computer Applications, Vol. 1, 2010, URL http://www.ijcaonline.org/archives/number11/245-402.