A Novel Link and Prospective terms Based Page Ranking Technique

International Journal of Engineering Trends and Technology (IJETT), Volume 7 Number 6, September 2015

Ashlesha Gupta #1, Ashutosh Dixit *2, Taruna Kumari #3
#1 Assistant Professor, #2 Associate Professor, #3 Student
Department of Computer Engineering, YMCA University of Science & Technology, Faridabad, India

Abstract
Since the web contains more than a billion pages, finding relevant information is a tedious task, so many Internet users rely on search engines to find the desired information on the WWW. These search engines find relevant information based on important words, i.e. keywords, supplied in the form of queries. For a given query, a search engine may return a large number of web pages in the result set, which may or may not contain relevant information. Since users hardly look at results beyond the first result page, it is necessary to rank these pages in order of relevance so that the top pages contain the most relevant data. Search engines therefore employ page ranking mechanisms to rank the web pages. Present page ranking algorithms consider either the link structure of the web page or the keywords entered in the query. These algorithms, however, suffer from topic drift and lack of quality, and as a result users have to scan through large result sets, refining them manually to gather the required information. So there is a need to improve these page ranking mechanisms. This paper focuses on a prospective term based page ranking mechanism that not only considers the link structure and query keywords of the web page but also takes a prospective view by considering synonyms and related keywords, providing a better ranking solution in which the user gets the desired information with fewer clicks.
The proposed algorithm aims to improve user satisfaction by providing full information within the first few URLs, thereby improving the search experience. The results of the proposed algorithm are analyzed and compared with the existing scheme.

Keywords: Search Engine, Page Ranking, Page content, Link popularity, Prospective Keywords

I. INTRODUCTION

The WWW is a large collection of information resources that include text, image, audio, video and metadata. The explosive growth in the size of the WWW has made it very difficult to manage and access the desired information on the web. Therefore, Internet users today use tools like search engines to access the desired information on the Internet. These search engines help locate information by presenting a list of clickable URLs generated on the basis of the search terms entered by the user. The search engine maintains a huge repository of web pages in its database for search purposes. The general architecture of a web search engine is shown in Fig. 1. Basically, a web search engine has three major components: Crawler, Indexer and Query Engine. A crawler downloads the web pages while traversing the web and stores the downloaded pages in a large buffer.

Fig. 1: Architecture of Search Engine (components: World Wide Web, Crawler, Dispatcher, docurl buffer, Indexer, Repository, Query Engine, Search Interface, User)

The Indexer then processes the pages in this buffer. It first extracts the keywords from each page and maintains, in a large repository, an index that records each keyword and the URLs where it occurs. The query engine is responsible for receiving and fulfilling search requests from the user. When a user fires a query, the query engine receives it and, after matching the query keywords against the index, returns the URLs of the matching pages to the user. For a given query, the query engine may return hundreds of URLs that match the query keywords.
This returned result set, however, may contain a mixture of relevant and irrelevant information. Therefore it is necessary to arrange the web pages in order of their importance. Most search engines use page ranking mechanisms to put the important pages at the top of the result list, leaving the less important pages at the bottom.

Current page ranking algorithms use either the link structure of the web page or the content information of the web page to calculate page rank. Both techniques have shortcomings: they suffer from topic drift and lack of quality in the result sets. Moreover, users try to find the desired information on the first page of the search results only, and results after the first page are nearly invisible to the general user. If users do not get the information on the first page, they consider the search a miss and try to reformulate the query to find the desired result. In view of these problems, a prospective term based page ranking mechanism is proposed that not only considers the link structure of a web page but also combines query dependent factors, such as the occurrence frequency of keywords and synonyms, with prospective words (words having a direct or indirect relation with the keyword) to improve the overall quality of the search result.

The rest of the paper is organized as follows. Related work and background are covered in Section II. Section III discusses the architecture, components and algorithms of the proposed page ranking algorithm. Section IV discusses the implementation and results of the proposed algorithm. Section V includes the conclusion.

II. BACKGROUND & RELATED WORK

Search engines use two different kinds of ranking factors for ranking web documents: query-dependent factors (e.g. word frequency, position of query terms) and query-independent factors (e.g. link popularity, click popularity). Query-dependent factors are all ranking factors that are specific to a given query, while query-independent factors are attached to the documents regardless of the query.
Link structure based page ranking for determining the importance of web pages has become an important technique in today's search engines. Some of the common page ranking algorithms are the PageRank algorithm [2], the Weighted Page Rank (WPR) algorithm [4] and the Hyperlink-Induced Topic Search (HITS) algorithm [5]. PageRank takes the back links into account and propagates the ranking through links: the rank score of a page p is divided evenly among its outgoing links. WPR takes into account the importance of both the in-links and the out-links of the pages and distributes rank scores based on the popularity of the pages [2,3,4]. HITS first assembles pages based on their textual content for a given query; after assembling the pages it ignores textual content and focuses only on the link structure of the web [5]. These link based algorithms, which rely on a global rank, suffer from quality problems and are biased. Moreover, the importance of a page may depend on the interests and knowledge of different people, so a global rank may not reflect the actual importance of the page for a given individual user.

Conventional query dependent page ranking algorithms, such as the Page Content Rank (PCR) algorithm, use only the occurrence frequency and occurrence position of the given query keywords for computing page rank. They do not rank pages that contain a synonym of a query keyword, or pages that contain information related to the query without containing the actual query keywords. For example, the query "holiday" would not return pages that contain the term "vacation". As the two terms are synonyms, the search engine should return web pages that contain either of them. Similarly, a query about Ayurved in India should return pages containing information about Baba Ram Dev, because the two are indirectly related to each other.
Since traditional ranking is limited to keywords only, users have to scan through the result sets, refining the query multiple times to acquire all the needed information. A critical look at the available literature indicates the following deficiencies in the existing page ranking techniques:

- Some web pages may get a higher ranking because of duplicate links and self links that are meant only to increase the popularity of the web page, although the pages do not contain any relevant information.
- New web pages that contain the latest information cannot get high page rank values, because they lack the corresponding back links.
- Different people may have different preferences; therefore a global rank may not reflect the actual importance of a web page for a given individual user.
- Traditional query dependent page ranking algorithms are limited to keywords only.

Therefore, there is a need to introduce other query dependent factors to provide a better ranking solution.

III. PROPOSED WORK

Due to the prevalent deficiencies in the current page ranking algorithms, users are not able to get the desired results in the top pages and have to scan through multiple result pages to meet their needs. To overcome these shortcomings, a novel page ranking mechanism is proposed that considers the popularity of a web page (based on in-links and out-links) and the occurrence frequency of keywords, along with synonyms and prospective words (words having a direct or indirect relation with the keyword), to improve the overall quality of search results.

The prospective term based page rank mechanism uses a computed document weight to rank the web pages. The document weight of a web page is the sum of its link score and its content score. The link score is the total number of in-links and out-links of the web page, and the content score is based on the occurrence of both query keywords and prospective keywords in the web page. For each keyword available in a web page, a prospective table is constructed that contains keywords that relate to the entered keywords syntactically or have some direct or indirect relation with them. For example, the prospective table for the keyword Web-Mining would contain related keywords such as Search Engine, Architecture-of-Search-Engine, Indexing Techniques, Crawler, Page rank etc. The architecture of the proposed ranking algorithm is shown in Fig. 2.

Fig. 2: Architecture of Proposed Algorithm (components: Search Interface, Query Processor, Prospective Table, Matched Documents, Repository, Link Popularity Score, Content Based Score, PageRankScore)

The user first enters a search query in the search interface. This query is passed to the Query Processor, which processes it by parsing it, removing stop words and identifying the query terms. The prospective keywords for the query terms are then fetched from the prospective table. These are passed to the Indexer to fetch all the URLs that contain the query terms, the prospective terms, or both. For the fetched URLs, a page rank score based on the link score and the content score is calculated. The web pages are then ranked on the basis of the page rank score and passed to the Query Processor, which presents the results to the user. The proposed algorithm has three main stages: link popularity weight calculation, prospective table construction, and context weight calculation.

A. Link Popularity Calculation

The popularity of each web page is measured with the help of its in-links and out-links. Link popularity calculation is based on equation (1):

Link_Score = (No. of Inlinks) + (No. of Outlinks)    (1)

B. Prospective Table Construction

For each keyword available in a page, a list of prospective terms is created from the prospective table, which contains keywords that relate to the given query keywords syntactically or have some direct or indirect relation with them. The prospective table that suggests the prospective words for a given keyword is created at the search engine side. A user generally supplies a query with multiple keywords to the search engine. Based on this assumption, the prospective table is created according to the following rules:

Rule 1: If the query contains only a single keyword, say X, then the prospective table will contain:
- Synonyms of X and/or inferred keywords (words having a direct or indirect relation) for keyword X.
For example, the records of the prospective table for the keyword Crawler will contain the following: Automated Program, Topical, Focused, Incremental.

Rule 2: If the query contains two keywords, say X and Y, then the prospective table will contain:
- Synonyms of X and/or inferred keywords (words having a direct or indirect relation) for keyword X.
- Synonyms of Y and/or inferred keywords (words having a direct or indirect relation) for keyword Y.
- Inferred keywords (words having a direct or indirect relation) for the combination of keywords X and Y.
For example, the records of the prospective table for the query Election Delhi are shown in Table I.

TABLE I: PROSPECTIVE WORDS TABLE
Keywords          Prospective Table Record
Election          Vote, Choice, Commissioner, Election-poll
Delhi             Capital of India, Delhi-map, Delhi-Tourism, Delhi-Metro
Election Delhi    CM, Kiran Bedi, Arvind Kejriwal, 7th February

Likewise, new rules may be generated for queries containing more keywords. This table is created by the search engine at the back end by using an association rule mining algorithm such as the Apriori algorithm.
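As an illustration of Rule 1 and Rule 2, the lookup of prospective terms for a parsed query could be sketched as below. This is a minimal sketch in Python, not the authors' implementation: the in-memory dictionary `PROSPECTIVE_TABLE`, the `STOP_WORDS` list and the function names `parse_query` and `prospective_terms` are assumptions made for the example, and the rows are copied from Table I.

```python
import re

# Hypothetical in-memory prospective table; keys are sorted tuples of keywords.
# The rows mirror Table I; a real system would load them from its database.
PROSPECTIVE_TABLE = {
    ("election",): ["vote", "choice", "commissioner", "election-poll"],
    ("delhi",): ["capital of india", "delhi-map", "delhi-tourism", "delhi-metro"],
    ("delhi", "election"): ["cm", "kiran bedi", "arvind kejriwal", "7th february"],
}

# Assumed stop-word list; the paper does not specify one.
STOP_WORDS = {"a", "an", "the", "in", "of", "for", "on", "to"}


def parse_query(query):
    """Lower-case the query, split it into words and drop stop words."""
    words = re.findall(r"[a-z0-9-]+", query.lower())
    return [w for w in words if w not in STOP_WORDS]


def prospective_terms(keywords, table=PROSPECTIVE_TABLE):
    """Collect prospective terms for one keyword (Rule 1) or two (Rule 2)."""
    terms = []
    for keyword in keywords:                  # records for each single keyword
        terms.extend(table.get((keyword,), []))
    if len(keywords) == 2:                    # record for the combination of X and Y
        terms.extend(table.get(tuple(sorted(keywords)), []))
    return list(dict.fromkeys(terms))         # de-duplicate while keeping order


keywords = parse_query("Election in Delhi")
expanded = prospective_terms(keywords)
```

The expanded list, together with the original keywords, is what would be handed to the Indexer to fetch candidate URLs. Keying the combined record by a sorted tuple is only a convenience so that "Election Delhi" and "Delhi Election" resolve to the same row.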
The table may be dynamically updated from news sites to capture the latest keyword relations and current perception.

The Apriori algorithm is a classic algorithm for mining frequent item sets for boolean association rules. The algorithm attempts to find subsets which are common to at least a minimum number C of the item sets. Apriori uses a "bottom up" approach, where frequent subsets are extended one item at a time (a step known as candidate generation) and groups of candidates are tested against the data. The algorithm terminates when no further successful extensions are found. The Apriori algorithm can be used to find the terms that co-occur in the documents of the index file of a search engine; these co-occurring terms are called context terms. For example, if the keywords laptop, desktop, keyboard and mouse co-occur with the term computer in some minimum number of documents, then we can say these terms are contextually related to the term computer.

The working of the Apriori algorithm is explained below. Let D be the index of web documents. The support S of a term set is the number of documents in which all terms of the set occur. A frequent k-term set is a set of k terms which co-occur in at least some minimum number of documents (min_sup).

1. Scan the index of web documents D to get the support S of each 1-term set, compare S with min_sup, and get the set of frequent 1-term sets, L1.
2. Join L(k-1) with L(k-1) to generate the set of candidate k-term sets.
3. Scan the index database to get the support S of each candidate k-term set, compare S with min_sup, and get the set of frequent k-term sets, Lk.
4. If the candidate set is empty then stop, else go to step 2.

An example showing the working of the Apriori algorithm follows. Consider an example database consisting of 6 documents as shown in Table II. Suppose the minimum support count required is 2.
TABLE II: DATABASE COLLECTION
Document   Keywords
Doc1       Computer, computer generation, desktop, laptop, CPU
Doc2       Computer, computer generation, desktop, laptop, super computer
Doc3       Computer, desktop, keyboard, mouse, printer, monitor
Doc4       Computer, mouse, printer, keyboard, laptop
Doc5       Computer, hp, printer, laptop, desktop
Doc6       Computer, laptop, desktop, dell

Step 1 -> Generating 1-termset frequent pattern
Scan the database D for the count of each candidate to generate C1.

TABLE III: C1
Term                   Support Count
Computer               6
Computer generation    2
Desktop                5
Laptop                 5
CPU                    1
SuperComputer          1
Keyboard               2
Mouse                  2
Printer                3
Monitor                1
Hp                     1
Dell                   1

Compare each candidate's support count with the min_sup count to generate L1.

TABLE IV: L1
Term                   Support Count
Computer               6
Computer generation    2
Desktop                5
Laptop                 5
Keyboard               2
Mouse                  2
Printer                3

Step 2 -> Generating 2-termset frequent pattern
Generate candidate set C2 from L1, then scan the database D for the count of each candidate in C2.

C2
Term Set                              Support Count
{computer, computer generation}       2
{computer, desktop}                   5
{computer, laptop}                    5
{computer, keyboard}                  2
{computer, mouse}                     2
{computer, printer}                   3
{computer generation, desktop}        2
{computer generation, laptop}         2
{computer generation, keyboard}       0
{computer generation, mouse}          0
{computer generation, printer}        0
{desktop, laptop}                     4
{desktop, keyboard}                   1
{desktop, mouse}                      1
{desktop, printer}                    2
{laptop, keyboard}                    1
{laptop, mouse}                       1
{laptop, printer}                     2
{keyboard, mouse}                     2
{keyboard, printer}                   2
{mouse, printer}                      2

Compare each candidate's support count with the min_sup count to generate L2.

L2
Term Set                              Support Count
{computer, computer generation}       2
{computer, desktop}                   5
{computer, laptop}                    5
{computer, keyboard}                  2
{computer, mouse}                     2
{computer, printer}                   3
{computer generation, desktop}        2
{computer generation, laptop}         2
{desktop, laptop}                     4
{desktop, printer}                    2
{laptop, printer}                     2
{keyboard, mouse}                     2
{keyboard, printer}                   2
{mouse, printer}                      2

Step 3 -> Generating 3-termset frequent pattern
Generate candidate set C3 from L2.

Scan the database D for the count of each candidate in C3. Counting the candidates against the documents of Table II and comparing the support counts with min_sup gives L3:

L3
Term Set                                      Support Count
{computer, computer generation, desktop}      2
{computer, computer generation, laptop}       2
{computer, desktop, laptop}                   4
{computer, desktop, printer}                  2
{computer, laptop, printer}                   2
{computer, keyboard, mouse}                   2
{computer, keyboard, printer}                 2
{computer, mouse, printer}                    2
{computer generation, desktop, laptop}        2
{keyboard, mouse, printer}                    2

Step 4 -> Generating 4-termset frequent pattern
Generate C4 candidates from L3 and scan the database D for the count of each candidate. Comparing the support counts with min_sup gives L4:

L4
Term Set                                              Support Count
{computer, computer generation, desktop, laptop}      2
{computer, keyboard, mouse, printer}                  2

Now it is not possible to generate C5 from L4. In this way the prospective table is created by using the Apriori algorithm.

C. Context Weight Calculation

The context weight of a document is calculated based on the presence of the query terms and prospective terms in the document. The weight measures how many of the query terms and prospective terms are present within the document. Content score calculation is based on equation (2):

Content_Score = ntd / tnt    (2)

where ntd is the number of terms (query terms and prospective terms) present in the document and tnt is the total number of terms in the web page.

D. Page Rank Score Calculation

The final rank of a web page is based on the sum of its link score and content score. The page rank score is calculated according to equation (3):

Page Rank Score = Link_Score + Content_Score    (3)

IV. IMPLEMENTATION

To implement the proposed ranking system, core Java is used as the front end development tool and MySQL is used as the database management system. To calculate the popularity weight of web pages, link information must be extracted from the web pages. A program was developed that extracts link information from the web pages and stores it in the tables described as follows:

1. WebPages table: This table stores information about every web page.
TABLE V: WEB_PAGE TABLE STRUCTURE
Field Name    Data Type
Page_id       Number
Page_link     Varchar
Inlink        Number
Outlink       Number
Link_score    Number

The Page_id field stores the unique id given to a web page. The Page_link field stores the complete link of the web page. The numbers of in-links and out-links of the web page are stored in the Inlink and Outlink fields. Link_score stores the link score of the page. In the table of out-links, the Page field, as in the webpages_inlink table, stores the link of the web page, and the outlink field stores the out-links of that web page.

2. Term_doc table: This table acts like an index. It stores keywords and the documents containing them.

TABLE VI: TERM_DOC TABLE
Field Name    Data Type
Term          Varchar
Document      Varchar

3. Prospective table: This table stores terms that occur together in a number of documents. To create the prospective table, an index of web pages is required. The program implementing this module takes the term_doc table as input and applies the Apriori algorithm, which returns the prospective table as shown in Fig 3.
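The step from the term_doc index to co-occurring term sets can be sketched as follows. This is an illustrative Python sketch rather than the paper's Java program: the names `DOCS`, `apriori` and `context_terms` are assumptions, the index is held in memory instead of MySQL, and the documents are those of Table II with a minimum support count of 2.

```python
from itertools import combinations

# Documents of Table II, keyed by document id (terms lower-cased).
DOCS = {
    "Doc1": {"computer", "computer generation", "desktop", "laptop", "cpu"},
    "Doc2": {"computer", "computer generation", "desktop", "laptop", "supercomputer"},
    "Doc3": {"computer", "desktop", "keyboard", "mouse", "printer", "monitor"},
    "Doc4": {"computer", "mouse", "printer", "keyboard", "laptop"},
    "Doc5": {"computer", "hp", "printer", "laptop", "desktop"},
    "Doc6": {"computer", "laptop", "desktop", "dell"},
}


def apriori(docs, min_sup):
    """Return {frozenset of terms: support count} for all frequent term sets."""
    def support(term_set):
        # Number of documents that contain every term of the set.
        return sum(1 for terms in docs.values() if term_set <= terms)

    # Step 1: frequent 1-term sets (L1).
    all_terms = {t for terms in docs.values() for t in terms}
    current = {}
    for term in all_terms:
        count = support(frozenset([term]))
        if count >= min_sup:
            current[frozenset([term])] = count

    frequent = dict(current)
    k = 2
    while current:
        previous = set(current)
        # Step 2: join L(k-1) with itself to build candidate k-term sets ...
        candidates = {a | b for a in previous for b in previous if len(a | b) == k}
        # ... and drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        }
        # Step 3: count support and keep the frequent candidates (Lk).
        current = {}
        for c in candidates:
            count = support(c)
            if count >= min_sup:
                current[c] = count
        frequent.update(current)
        k += 1          # Step 4: loop ends when no frequent k-term set remains
    return frequent


def context_terms(frequent, term):
    """Terms that appear together with `term` in at least one frequent set."""
    return sorted({t for s in frequent if term in s for t in s} - {term})


frequent = apriori(DOCS, min_sup=2)
related = context_terms(frequent, "computer")
```

Here `related` holds the context terms of "computer", which is the kind of record the prospective table would store for that keyword. Scanning every document for every candidate is fine for six documents; an index of realistic size would need the counting pushed into the database or an inverted index.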

Fig 3: Prospective Table Snapshot

When the user gives a query at the search interface, the program searches for prospective terms in the prospective term table. The documents that contain the query terms, the prospective terms, or both are then sorted according to the link score stored in the web_pages table. After that, the matched documents are sorted based on content_score and the results are returned to the user.

A comparison between the results of the popular search engine Google and the proposed page rank method was also performed. The query Holidays in Delhi was fired to find information related to tourist places in Delhi. The response URLs returned by Google for the query Holidays in Delhi are shown in Table VII.

TABLE VII: SEARCH RESULTS BY GOOGLE (columns: Rank, URLs)

The effect of the proposed prospective terms based page rank mechanism on the same set of web pages is analyzed below. The combined prospective terms for the given query are shown in Table VIII.

TABLE VIII: PROSPECTIVE TABLE FOR HOLIDAYS IN DELHI
Query               Prospective Terms
Holiday in Delhi    Vacations, trip, break, tourist-guide, resort, packages, old-delhi, new-delhi, Delhi-map, Delhi-metro, Delhi-tourism, Delhi-airport, hotel

The ranking order of the URLs in response to the query Holidays in Delhi based on prospective terms is shown in Table IX.

TABLE IX: SEARCH RESULTS USING PROPOSED MECHANISM (columns: Rank, URLs)

It can be observed that the top-ranked URL has the highest PageRankScore, since this site contains links and other information related to the keywords fired in the query as well as to the keywords in the prospective table. The URL placed at second position is a guide to tourist places in Delhi and also gives details related to accommodation and travelling in Delhi. The last site is a news site and gives information about Chath being declared a public holiday. Since this URL is not related to the query, it is placed at the bottom.
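The scoring that produces such an ordering, equations (1) to (3), can be sketched as below. This is a hedged Python sketch, not the authors' Java code: the `Page` record, the function names and the three example pages (with `example.com` URLs and made-up link counts) are hypothetical, and ntd is assumed to count the distinct query and prospective terms that occur in the page, with each term treated as a single token.

```python
from dataclasses import dataclass


@dataclass
class Page:
    url: str
    inlinks: int      # number of in-links
    outlinks: int     # number of out-links
    terms: list       # all terms (tokens) of the page


def link_score(page):
    """Equation (1): in-links plus out-links."""
    return page.inlinks + page.outlinks


def content_score(page, query_terms, prospective_terms):
    """Equation (2): ntd / tnt, with ntd counted over distinct wanted terms."""
    if not page.terms:
        return 0.0
    wanted = set(query_terms) | set(prospective_terms)
    ntd = len(wanted & set(page.terms))   # wanted terms present in the page
    tnt = len(page.terms)                 # total number of terms in the page
    return ntd / tnt


def page_rank_score(page, query_terms, prospective_terms):
    """Equation (3): sum of link score and content score."""
    return link_score(page) + content_score(page, query_terms, prospective_terms)


def rank(pages, query_terms, prospective_terms):
    """Order pages by descending page rank score."""
    return sorted(
        pages,
        key=lambda p: page_rank_score(p, query_terms, prospective_terms),
        reverse=True,
    )


# Hypothetical pages, for illustration only.
query = ["holiday", "delhi"]
prospective = ["vacations", "trip", "hotel", "resort"]
pages = [
    Page("http://example.com/news", 3, 2, ["delhi", "news", "festival", "politics"]),
    Page("http://example.com/guide", 3, 2, ["delhi", "hotel", "trip", "cricket"]),
    Page("http://example.com/blog", 1, 1, ["holiday", "delhi", "trip", "resort"]),
]
ordered = [p.url for p in rank(pages, query, prospective)]
```

Because Link_Score is a raw count while Content_Score lies between 0 and 1, the content score in this sketch mostly separates pages with equal link counts, which matches the description above of sorting by link score first and by content score afterwards.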
A survey was conducted to check the relevancy of the proposed algorithm. Users' perceptions of the two systems were compared, concentrating on two aspects: user satisfaction with the search, and the time needed to get the desired information. The survey was conducted with a group of graduate students. Volunteers were asked to select the relevant URLs matching their preferences on both systems and to answer a questionnaire comparing the quality of the two systems: on which of the two they were more satisfied, i.e. able to get all the information within the first few URLs of the result set, and how much time was required to get the desired information for the given query.

Fig 3: User Satisfaction Graph

In this survey the proposed system outperformed Google in terms of user satisfaction. The advantage of the proposed mechanism is that the user is able to retrieve all the information within the first few URLs. While these preliminary results are not statistically significant given the very small user study, they are promising. The proposed system seems to provide a mechanism that can help retrieve high quality documents with improved user satisfaction.

V. CONCLUSION

Many users try to find the desired information on the first page of the search results only, and results after the first page are nearly invisible to the general user. If users do not get the information on the first page, they consider the search a miss and try to reformulate the query to find the desired result. Moreover, due to deficiencies in the page rank algorithms, important pages may lie relatively low in the results. The mechanism proposed in this paper for computing the page rank not only considers the link popularity and the keywords supplied in a query but also adopts a prospective view by considering synonyms and related query keywords, so that pages that are indirectly related to the user's query may also be considered and placed in the proper position in the results. The advantage of the proposed mechanism is that the user gets the full information within the first few URLs and does not have to go deeper into the search results returned by the search engine.

REFERENCES

[1] C. Ridings and M. Shishigin, "Pagerank Uncovered", Technical report, 2002.
[2] Neelam Duhan, A. K. Sharma and Komal Kumar Bhatia, "Page Ranking Algorithms: A Survey", in Proceedings of the IEEE International Advanced Computing Conference (IACC), 2009.
[3] Dilip Kumar Sharma et al., "A Comparative Analysis of Web Page Ranking Algorithms", (IJCSE) International Journal on Computer Science and Engineering, Vol. 02, No. 08, 2010.
[4] Ritika Wason, "Comparative Analysis of Pagerank and HITS Algorithms", International Journal of Engineering Research & Technology (IJERT), Vol. 1 Issue 8, October 2012.
[5] Mercy Paul Selvan, A. Chandra Sekar and A. Priya Dharshin, "Survey on Web Page Ranking Algorithms", International Journal of Computer Applications, Volume 41 No. 19, March 2012.
[6] Ricardo Baeza-Yates and Emilio Davis, "Web page ranking using link attributes", in Proceedings of the 13th International World Wide Web Conference on Alternate Track Papers & Posters, pp. 38-39, 2004.
[7] Allan Borodin, "Link Analysis Ranking: Algorithms, Theory, and Experiments", University of Toronto.
[8] Vineel Katipally, Leong-Chiang Tee, Yang Yang, "World Wide Web searching technique", Computer Science & Engineering Department, Arizona State University.
[9] L. Page, S. Brin, R. Motwani, and T. Winograd, "The PageRank Citation Ranking: Bringing Order to the Web", Technical Report, Stanford Digital Libraries SIDL-WP.
[10] Bashman, Sierra & Bates, Head-First Servlet & JSP, O'Reilly, 2nd Edition (2006).
[11] Ivan Bayross, SQL, PL/SQL The Programming Language of Oracle, BPB, 3rd Edition (2006).
[12] Ashutosh Dixit, A. K. Sharma, Ashlesha Gupta, "Perspective based Mathematical model for Page Ranking", ACM International Conference and Workshop on Emerging Trends in Technology, ICWET 2012, TCET Kundawali (E) Mumbai.
[13] Debajyoti Mukhopadhyay, Pradipta Biswas, Young-Chon Kim, "A Syntactic Classification based Web Page Ranking Algorithm", 6th International Workshop on MSPT Proceedings, 2006.
[14] Debajyoti Mukhopadhyay, Pradipta Biswas, "FlexiRank: An Algorithm Offering Flexibility and Accuracy for Ranking the Web Pages", Proceedings of the ICDCIT 2005, December 2005, Bhubaneswar, India; LNCS 3816, Springer-Verlag, Berlin, Germany, 2005.
[15] Charles L. A. Clarke et al., "Relevance ranking for one to three term queries", Information Processing and Management, 36, 2000.
[16] "Our Search", Google Technology.
[17] Pooja Devi, Ashlesha Gupta, Ashutosh Dixit, "Comparative Study of PageRank and HITS Link Based Ranking Algorithm", International Journal of Advance Research in Computer and Communication Engineering, February 2014.
[18] Gyanendra Kumar, Neelam Duhan, and A. K. Sharma, "Page Ranking Based on Number of Visits of Web Pages", International Conference on Computer & Communication Technology (ICCCT), 2011.
