Weighted Page Content Rank for Ordering Web Search Result

Weighted Page Content Rank for Ordering Web Search Result Abstract: POOJA SHARMA B.S. Anangpuria Institute of Technology and Management Faridabad, Haryana, India DEEPAK TYAGI St. Anne Mary Education Society, Delhi PAWAN BHADANA B.S. Anangpuria Institute of Technology and Management Faridabad, Haryana, India With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for user s to utilize automated tools in order to find, extract, filter and evaluate the desired information and resources. Web structure mining and content mining plays an effective role in this approach. There are two Ranking algorithms PageRank and Weighted PageRank. PageRank is a commonly used algorithm in Web Structure Mining. Weighted Page Rank also takes the importance of the inlinks and outlinks of the pages but the rank score to all links is not equally distributed. i.e. unequal distribution is performed. In this paper we proposed a new algorithm, Weighted Page Content Rank (WPCR)based on web content mining and structure mining that shows the relevancy of the pages to a given query is better determined, as compared to the existing PageRank and Weighted PageRank algorithms. Keywords: Web Mining, outlinked inlinked, PageRank, Weighted PageRank, Weighted Page Content Rank. 1. Introduction The World Wide Web [4][2] is the collection of information resources on the Internet that are using the Hypertext Transfer Protocol. It is a repository of many interlinked hypertext documents, accessed via the Internet. Web may contain text, images, video and other multimedia data. In order to analyze such data, some techniques called web mining techniques are used by various web applications and tools. Web mining describes the use of data mining techniques to automatically discover Web documents and services, to extract information from the Web resources and to uncover general patterns on the Web. Over the years, Web mining research has been extended to cover the use of data mining and similar techniques to discover resources, patterns, and knowledge from the Web-related data (such as Web usage data or Web server logs).it is used to understand customer behavior, evaluate the effectiveness of a particular Web and help quantify the success of a marketing campaign. It is a rapidly growing research area. The rest of the paper is structured as follows. In section 2, we describes the process of web mining. In Section 3 we focus on Web mining categories. We then present the PageRanking algorithms in section 4. 2. Process Of Web Mining The complete process of extracting knowledge from Web data [1] is shown in figure1. Mining Representation Raw Pattern Data Tools & visualization Knowledge Fig.1 Web Mining Process The various steps are explained as follows. (1) Resource finding: It is the task of retrieving intended web documents. (2) Information selection and pre-processing: Automatically selecting and pre- processing specific from information retrieved Web resources. ISSN: 0975-5462 7301

(3) Generalization: Automatically discovers general patterns at individual Web site as well as multiple sites. (4) Analysis: Validation and interpretation of the mined patterns. 3. Web Mining Categories Web mining research overlaps substantially with other areas, including data mining, text mining, information retrieval, and Web retrieval.. The classification is based on two aspects: the purpose and the data sources. Retrieval research focuses on retrieving relevant, existing data or documents from a large database or document repository, while mining research focuses on discovering new information or knowledge in the data. On the basis of this, Web mining can be classified into web structure mining, web content mining, and web usage mining as shown in figure 2. a) Web Content Mining Web Content Mining [3] [8] [6] is the process of extracting useful information from the contents of Web documents. Content data corresponds to the collection of facts a Web page was designed to convey to the users. Web content mining is related but is different from data mining and text mining. It is related to data mining because many data mining techniques can be applied in web content mining. Web Mining Web content Mining Web Structure Mining Web Usage Mining Text and Multimedia Documents Hyper Link Structure Web Log Records Fig. 2Web Mining Categories and Objects It is related to text mining because much of web contents are text based. It is different from data mining because web data are mainly semi-structured and or unstructured. Web content mining is also different from text mining because of the semi-structure nature of the web, while text mining focuses on unstructured texts. The technologies that are normally used in web content mining are NLP (Natural language processing) and IR (Information retrieval). b) Web usage mining Web usage mining [3][5] is the application of data mining techniques to discover usage patterns from Web data in order to understand and better serve needs of Web based applications. It consists of three phases, namely preprocessing, pattern discovery, and pattern analysis. Web servers, proxies, and client applications can quite easily capture data about Web usage. However, one of the major challenges faced by Web usage mining applications is that Web server log data are anonymous, making it difficult to identify users and user sessions from the data. c) Web structure mining The goal of web structure mining [7] is to generate structural summary about the website and web page. The first kind of web structure mining is extracting patterns from hyperlinks in the web. A hyperlink is a structural component that connects the web page to a different location. The other kind of the web structure mining is mining the document structure. It is using the tree-like structure to analyze and describe the HTML (Hyper Text Markup Language) or XML (Extensible Markup Language). 3.1 Search Engine Architecture Search engines [10] are the key to finding specific information on the vast expanse of the World Wide Web. We use the term search engine in relation to the Web. These usually refer to the actual search forms, which searches through databases of the HTML documents. There are at least three elements which contain important: information for a search engine: discovery & the database, the user search, presentation and ranking of results. ISSN: 0975-5462 7302

Here we represent the elements of search engine architecture. a) User Search: The users do besides typing a few relevant words into the search form. Can they specify those words which must be in the title of a page? About specifying those words which must be in an URL. Most engines allow you to type in a few words, and then search for occurrences of these words in their data base. b) Crawlers: A crawler is a program that visits Web sites and reads their pages and other information in order to create entries for a search engine index. The search engines on the Web have a program, which is also known as a "spider" or a "bot." Crawlers are typically programmed to visit sites that have been submitted by their owners as new or updated. Entire sites or specific pages can be selectively visited and indexed. Crawlers apparently gained the name because they crawl through a site a page at a time, the links to other pages on the site until all pages have been read. WWW crawler Web Pages Query User Response Indexer Index Generates Search Interface Query Processor Used by Fig: 3 Search Engine Architecture c) WWW: The World Wide Web is: all the resources and users on the Internet that are using the Hypertext Transfer Protocol. The World Wide Web is a system of interlinked hypertext documents accessed via the Internet. Web that may contain text, images, video, and other multimedia and navigates between them using hyperlinks. d) Indexer: ( Internet indexing") provide a more useful vocabulary for Internet or onsite search engine. With the increase in the number of periodicals that have articles online, web indexing is also becoming important for periodical websites 4. Page Ranking Algorithms World Wide Web is large sized repository of interlinked hypertext documents accessed via the Internet. Web may contain text, images, video, and other multimedia data. The user navigates through this using hyperlink. Search Engine gives millions of results and applies Web mining techniques to order the results. The sorted order of search results is obtained by applying some special algorithms called Page ranking algorithms. There are two Page Ranking algorithms; PageRank and Weighted PageRank. They are the commonly used algorithm in Web Structure Mining. The algorithm measures the importance of the pages by analyzing the number of inlinked and outlinked pages. 4.1 PageRank PageRank [9][11][12] is a numeric value that represents how important a page is on the web. PageRank is the Google s method of measuring a page's "importance." When all other factors such as Title tag and keywords are taken into account, Google uses PageRank to adjust results so that more "important" pages move up in the results page of a user's search result display. Google figures that when a page links to another page, it is effectively casting a vote for the other page. Google calculates a page's importance from the votes cast for it. How important each vote is taken into account when a page's PageRank is calculated. It matters because it is one of the factors that determine a page s ranking in the search results. It isn't the only factor that Google uses to rank pages, but it is an important one. The order of ranking in Google works like this: (1) Find all pages matching the keywords of the search. (2) Adjust the results by PageRank scores. ISSN: 0975-5462 7303

4.1.1 Algorithm of PageRank The PageRank algorithm[14][12], one of the most widely used page ranking algorithms, states that if a page has important links to it, its links to other pages also become important. Therefore, PageRank takes the back links into account and propagates the ranking through links. A page has a higher rank, if the sum of the ranks of its backlinks is high. Figure 4 shows an example of back links wherein page A is a backlink of page B and page C while page B and page C are backlinks of page D. a) The calculation of Page Rank The original PageRank algorithm is given in eq (4.1) PR(P)=(1-d)+d(PR(T1)/C(T1)+..PR(Tn)/C(Tn)) (4.1) Where, PR(P)= PageRank of page P B A D C Fig4: An example of backlinks PR (T i ) = PageRank of page Ti which link to page C (T i ) =Number of outbound links on page T D = Damping factor which can be set between 0 and 1 First of all, the PageRank does not rank web as a whole, but is determined for each page individually. The PageRank of page P is recursively defined by the PageRank of those pages which link to page P from eq (4.1) 4.1.2 Simplified Page Rank A slightly simplified version of PageRank is defined as: PR( V ) PR(U) =d Nv (4.2) v B( u) where U represents a web page. B(u) is the set of pages that point to u. PR(u) and PR(V) are rank scores of page U and V, respectively. Nv denotes the number of outgoing links of page V. d is a factor used for normalization. d = 1.0 to simplify the calculation. In Page Rank, the rank score of a page, p, is evenly divided among its outgoing links. The values assigned to the outgoing links of page p are in turn used to calculate the ranks of the pages to which page p is pointing. The rank scores of pages of a web could be calculated iteratively starting from any webpage. Based on the consideration of the phenomenon mentioned above, the original Page Rank is PR( V ) PR(u) = (1 d) + d Nv (4.3) v B( u) Where d is a damping factor that is usually set to 1.0. We also could think of d as the probability of users following the links and could regard (1 d) as the PageRank distribution from non-directly linked pages. To test the utility of the PageRank algorithm, Google applied it to the Google search engine In the experiments, the PageRank algorithm works efficiently and effectively because the rank value converges to a reasonable tolerance in the roughly logarithmic (log n) scale. The rank score of a web page is divided evenly over the pages to which out links. Even though the PageRank algorithm is used successfully in Google, one problem still exists: in the actual web that is some links to a web page may be more important than are the others. ISSN: 0975-5462 7304

4.2 Weighted Page Rank Extended PageRank algorithm- Weighted PageRank[9] Assigns large rank value to more important pages instead of dividing the rank value of a page evenly among its outlink pages. Each outlink page gets a value proporhonal to its popularity (no. of inlinks and outlinks). The popularity from the number of inlinks and outlinks and is recorded as W in (V,U) and W out (V,U) In Weighted PageRank all links are not equally distributed i.e. unequal distribution. Original Weighted PageRank formula is PR(U)=(1-d)+d PR( V ) W in (U,V)W out (U,V) (4.4) v B( u) Where d = damping factor to set value 0 to 1 W in (U,V)=weighted of link (U,V) Calculation based on the number of inlinks of page U and the number of inlinks of page V W in (U,V)= Iu Ip P R ( V ) (4.5) Where I U = number of inlinks of page u, I p = number of inlinks of page p R(v)=Reference page text of page, W out (V,U)= weight of outlink(v,u) Calculated based on the number of outlinks of page u and no. of outlinks of page v W out (U,V)= Ou Op P R ( V ) Where O u =number of outlink of page u, (4.6) Op= number of outlink of page p In Weighted PageRank all links are not equally distributed while in PageRank these are equally distributed, to the outgoing links. 4.3 Limitations of Pagerank and Weighted Pagerank The PageRank and the Weighted PageRank is being used by Google presently. It has some limitations of both algorithm is given below. 4.3.1 PageRank (1) PageRank is equally distributed to outgoing links (figure.5). (2) It is purely based on the number of inlinks and outlinks. P1 P2 P3 15 15 P4 15 Fig. 5 Distribution of PageRank 4.3.2 Weighted PageRank (1)Weighted PageRank algorithm provides important information about a given query by using the structure of the web. While some pages may be irrelevant to a given query, it still receives the highest rank because it has many inlinks and many outlinks. (2) There is a less determination of the relevancy of the pages to a given query (3) This algorithm relies mainly on the number connected inlinks and outlinks. ISSN: 0975-5462 7305

5. Proposed Weighted Page Content Rank Although PageRank and Weighted PageRank algorithms are used by many search engines but the users may not get the required relevant documents easily on the top few pages. With a view to resolve the problems found in both algorithms, a new algorithm called Weighted Page Content Rank has been proposed which employs Web structure mining as well as Web content mining techniques. This algorithm is aimed at improving the order of the pages in the result list so that the user may get the relevant and important pages easily in the list. 5.1 Weighted Page Content Rank Weighted Page Content Rank Algorithm (WPCR) is a proposed page ranking algorithm which is used to give a sorted order to the web pages returned by a search engine in response to a user query. WPCR is a numerical value based on which the web pages are given an order. This algorithm employs web structure mining as well as web content mining techniques. Web structure mining is used to calculate the importance of the page and web content mining is used to find how much relevant a page is? Importance here means the popularity of the page i.e. how many pages are pointing to or are referred by this particular page. It can be calculated based on the number of inlinks and outlinks of the page. Relevancy means matching of the page with the fired query. If a page is maximally matched to the query, that becomes more relevant. 5.1.1 Modified Search Engine Architecture Search engines are the key to finding specific information on the vast expanse of the World Wide Web. There are at least three elements which contain important: information for a search engine: discovery of the database, the user search, presentation and ranking of results. With the proposed WPCR, the search engine architecture is modified so as to add the components for calculating importance and relevancy of pages. The modified architecture is displayed in Figure 6. User Query Results Search Interface Crawler WWW Query Processor Indexer Map Generator Error! Returned URLs Index Web Map Relevance Calculator Probability &Content Wts WPCR Calculator Weight Calculator In & out weight of all links Fig 6 Modified Search Engine Architecture The various components and search process are explained below to have an understanding of the existing as well as modified architecture. a) WWW: The World Wide Web is a system of interlinked hypertext documents accessed via the Internet. Web may contain text, images, video, and other multimedia data and user navigates between them using hyperlinks. b) Crawler: A crawler is a program that visits Web and reads their pages and other information in order to ISSN: 0975-5462 7306

create entries for a search engine index. The search engines on the Web have crawlers embedded in them. These are typically programmed to visit sites that have been submitted by their owners as new or updated. Entire sites or specific pages can be selectively visited and indexed. c) Search Interface: It is the Graphical interface of a search engine on which the user can enter his query e.g. the Google interface. d) Query Processor: It is the component used for taking the user query from the search interface and processing it word by word. e) Indexer: It provides a more useful vocabulary for Internet or onsite search engine. This component uses HTML parser to extract the terms from web pages after ignoring stop words, prepositions and word stems. The index is usually built in an alphabetical order of terms and contains extra information regarding the page such as its URL, frequency and position of terms etc. f) Map Generator: This module generates a map/graphical structure of the WWW. This map is used to further find out the inlinks and outlinks of the web pages. g) Weight Calculator: Weight calculator calculates weight of inlinks and outlinks of the pages. Weight determines how much important a page is on the web. h) Relevance Calculator: Relevance calculator determines the relevancy of the pages to the given query. The page should be matched maximum to the content of the query fired by the user. i) WPCR Calculator: It combines the output of weight calculator and relevance calculator to determine the weighted Page Content Rank of all the pages returned after matching the query with the index. 5.1.1.1 Intermediate Outputs Search engine generates some intermediate outputs, which can be stored in some appropriate data structures. These outputs can be represented in the form of tables and are used to calculate the importance and relevance of the web page. The first output Table1 is generated by the map generator component. A module in this component gives information about inlinks and outlinks of all web pages. The weight calculator preprocesses this information. Table1: Output of Web Map Generator Document ID/URL Inlinks Outlinks a) Document ID: represents the URL of the web page. This column contains the URLs of all pages on the web. b) Inlinks: of a document are the numbers of documents referring that document. c) Outlinks: of a document are the numbers of documents referred by that document. Weight calculator calculates the inweight and outweight of a page and determining its importance. Table 2: Output obtained after matching query from index Document id Query string Frequency Selection a) Document ID: represents the URL of the web page. This column contains the URLs of all pages on the web. b) Query String: Query String is a meaningful ordered sequence of words of the query fired by the user. c) Frequency: Frequency is number of occurrences of Query String in document returned by a search engine d) Selection: This column represents whether the query string is selected for consideration in the particular document or not. Its value is either 0 or 1. 5.2 WPCR Algorithm As stated earlier, WPCR is a numerical value to represent the rank of a web page. The algorithm to find WPCR of a web page is given in Figure7. ISSN: 0975-5462 7307

Algorithm: WPCR calculator Input: Page P, Inlink and Outlink Weights of all backlinks of P, Query Q, d (damping factor). Output: Rank score Step 1: Relevance calculation: a) Find all meaningful word strings of Q (say N) b) Find whether the N strings are occurring in P or not? Z= Sum of frequencies of all N strings. c) S= Set of the maximum possible strings occurring in P. d) X= Sum of frequencies of strings in S. e) Content Weight(CW)= X/Z f) C= No. of query terms in P g) D= No. of all query terms of Q while ignoring stop words. h) Probability Weight(PW)= C/D Step 2: Rank calculation: a) Find all backlinks of P (say set B). b)pr(p)= (1-d)+d[ PR ( V ) W in (P,V)W out (P,V) ](CW+PW) V B c) Output PR(P) i.e. the Rank score Fig 7: The algorithm of WCPR calculation It can be seen from Fig 6.3 that the formula to calculate the Weighted Page Content Rank of a page U is (eq 5.1): PR(U)=(1-d)+d PR( V ) W in (U,V)W out (U,V)*(Cw+Pw) ( 5.1) v B( u) where PR(U)=PageRank of page U, B(U)= Set of all pages referring to page U. D= Damping factor which can be set between 0 and 1, W in (U,V)= inweight of link (U,V) W out (U,V)= outweight of link (U,V), Cw=Content weight of page U Pw=Probability weight of page U The various steps of the algorithm are explained below in detail. 5.2.1Weight calculation The W in (v,u) and W out (v,u) are the preprocessed weights. Both are just inputted to the algorithm. W in (v,u) is the weight of link(v, u) calculated based on the number of inlinks of page u and the number of inlinks of all reference pages of page v and is given in eq (5.2). W in (U,V)= Iu Ip P R ( V ) (5.2) Where I U = number of inlinks of page u, I p = number of inlinks of page p, R(v)=Reference page text of page v The W out (v,u) is the weight of link(v, u) calculated based on the number of outlinks of page u and the number of outlinks of all reference pages of page v given in eq (5.3). Ou W out (U,V)= Op (5.3) P R ( V ) Where O u =number of outlink of page u, Op= number of outlink of page p ISSN: 0975-5462 7308

5.2.2 Relevance Calculation Relevance calculator calculates the relevance of a page on the fly in terms of two factors: one represents the probability of the query in the page and other gives the maximum matching of the query to the page. Probability Weight: It is the probability of the query terms in the web page. This factor is the ratio of the query terms present in the document and the total number of terms in the fired query (after ignoring stop words etc.). The formula is given in eq (5.4). Probability weight (PW i ) =Y i /N (5.4) where Y i = Number of query terms in ith document., N=Total number of terms in query Content Weight: It is the weight of content of the web page with respect to query terms. This factor is the ratio of the sum of frequencies of highest possible query strings in order and sum of frequencies of all query strings in order. The maximum possible strings are selected in such a way that all such strings represent a different logical combination of words. The formula for Content Weight is given in eq (5.5). Content weight (CW i ) = Xi/M (5.5) Where X i = Sum of cardinalities of highest possible query strings in order M= Sum of cardinalities of all possible meaningful query strings in order. 5.3 Comparison between PageRank/ Weighted PageRank and Proposed Weighted Page Content Rank S. No 1 2 3 4 Page Rank/Weighted Page Rank They rely only on links. PR/WPR is the structure mining only. Minimum determination of the relevancy of the pages to the given query PR/WPR algorithms provide important information about a given query Weighted Page Content Rank They rely links as well as content WPCR is the combination of structure mining and content mining Maximum determination of the relevancy of the pages to the given query WPCR algorithms provide important information and relevancy about a given query 6. Conclusion Web mining is used to extract useful information from users past behavior. In this paper we focused that PageRank and Weighted PageRank algorithms are used by many search engines but the users may not get the required relevant documents easily on the top few pages. With a view to resolve the problems found in both algorithms, a new algorithm called Weighted Page Content Rank has been proposed which employs Web structure mining as well as Web content mining techniques. This algorithm is aimed at improving the order of the pages in the result list so that the user may get the relevant and important pages easily in the list. References [1] Cooley, R., Mobasher, B., and Srivastava, J. Web mining: Information and pattern discovery on the World Wide Web. In proceedings of the 9th IEEE International conference on Tools with Artificial Intelligence (ICTAI 97), Newposrt Beach, CA, 1997. [2] A. A. Barfourosh, H.R. Motahary Nezhad, M. L. Anderson, D. Perlis, Information Retrieval on the World Wide Web and Active Logic: A Survey and Problem Definition, 2002. [3] R. Kosala and H. Blockeel. Web mining research : A survey. ACM SIGKDD Explorations, 2(1):1 15, 2000. [4] O. Etzioni. The World Wide Web: Quagmire or gold mine. Communications of the ACM 39(11):65-68, 1996. ISSN: 0975-5462 7309

[5] Jaideep Srivastava, Robert Cooley, Mukund Deshpande, Pag-Ning Tan, and Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data, ACM SIGKDD Explorations Newsletter, January 2000, Volume 1 Issue. [6] Wang Jicheng, Huang Yuan, Wu Gangshan, Zhang Fuyan. Web mining: knowledge discovery on the Web. Systems, Man, and Cybernetics, 1999.IEEE SMC '99 Conference Proceedings. 1999 IEEE International Conference. [7] S. Chakrabarti, B. E. Dom, S. R. Kumar, P. Raghavan, S. Rajagopalan, A.Tomkins, D. Gibson, and J. Kleinberg. Mining the Web s link structure. Computer, 32(8):60 67, 1999. [8] Raymond Kosala, Hendrik Blockeel, Web Mining Research : A Survey, ACM SIGKDD Explorations Newsletter, June 2000, Volume 2 Issue 1. [9] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: ringing order to the web. Technical report, Stanford Digital Libraries SIDL-WP- 1999-0120, 1999. [10] 26 http://www.google.com/. [11] C. Ridings and M. Shishigin. Pagerank uncovered. Technical report, 2002. [12] Taher H. Haveliwala. Topic-Sensitive PageRank: A Context-Sensitive Ranking Algorithm for Web Search. IEEE Transactions on Knowledge and Data Engineering, Vol. 15, No4, July/August 2003, 784-796. [13] J. Wang, Z. Chen, L. Tao, W. Ma, and W. Liu. Ranking user s relevance to a topic through link analysis on web logs. WIDM, pages 49 54, 2002. Cooper, C. 8 [14] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1 7):107 117, 1998. ISSN: 0975-5462 7310