Recent Researches on Web Page Ranking

Similar documents
Web Structure Mining using Link Analysis Algorithms

Link Analysis and Web Search

Information Retrieval. Lecture 11 - Link analysis

Searching the Web [Arasu 01]

An Enhanced Page Ranking Algorithm Based on Weights and Third level Ranking of the Webpages

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

COMP5331: Knowledge Discovery and Data Mining

Bruno Martins. 1 st Semester 2012/2013

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Analysis of Link Algorithms for Web Mining

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Survey on Web Structure Mining

Weighted PageRank Algorithm

Word Disambiguation in Web Search

Weighted PageRank using the Rank Improvement

An Improved Computation of the PageRank Algorithm 1

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Bibliometrics: Citation Analysis

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

COMP 4601 Hubs and Authorities

E-Business s Page Ranking with Ant Colony Algorithm

Computer Engineering, University of Pune, Pune, Maharashtra, India 5. Sinhgad Academy of Engineering, University of Pune, Pune, Maharashtra, India

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Mining Web Data. Lijun Zhang

Role of Page ranking algorithm in Searching the Web: A Survey

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

PageRank and related algorithms

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext

Experimental study of Web Page Ranking Algorithms

A Review Paper on Page Ranking Algorithms

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

How to organize the Web?

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

COMP Page Rank

Introduction to Data Mining

Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL

PAGE RANK ON MAP- REDUCE PARADIGM

Information Networks: PageRank

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Social Network Analysis

Brief (non-technical) history

Reading Time: A Method for Improving the Ranking Scores of Web Pages

Authoritative Sources in a Hyperlinked Environment

Part 1: Link Analysis & Page Rank

CS425: Algorithms for Web Scale Data

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

Slides based on those in:

CS6200 Information Retreival. The WebGraph. July 13, 2015

Lecture 17 November 7

Searching the Web What is this Page Known for? Luis De Alba

Big Data Analytics CSCI 4030

Learning to Rank Networked Entities

Big Data Analytics CSCI 4030

Web Mining: A Survey on Various Web Page Ranking Algorithms

Link Analysis. Hongning Wang

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Home Page. Title Page. Page 1 of 14. Go Back. Full Screen. Close. Quit

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

International Journal of Advance Engineering and Research Development. A Review Paper On Various Web Page Ranking Algorithms In Web Mining

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

Information Retrieval. Lecture 4: Search engines and linkage algorithms

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

Architecture of A Scalable Dynamic Parallel WebCrawler with High Speed Downloadable Capability for a Web Search Engine

Breadth-First Search Crawling Yields High-Quality Pages

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

An Adaptive Approach in Web Search Algorithm

Review of Various Web Page Ranking Algorithms in Web Structure Mining

SIMILARITY MEASURE USING LINK BASED APPROACH

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Lecture #3: PageRank Algorithm The Mathematics of Google Search

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Weighted Page Rank Algorithm based on In-Out Weight of Webpages

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

CS 6604: Data Mining Large Networks and Time-Series

Lecture 8: Linkage algorithms and web search

The application of Randomized HITS algorithm in the fund trading network

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM

Abstract. 1. Introduction

Information Retrieval and Web Search

A Survey on k-means Clustering Algorithm Using Different Ranking Methods in Data Mining

Link Structure Analysis

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Lec 8: Adaptive Information Retrieval 2

Link Analysis in Web Mining

Unit VIII. Chapter 9. Link Analysis

Link analysis. Query-independent ordering. Query processing. Spamming simple popularity

Transcription:

Recent Researches on Web Page Pradipta Biswas School of Information Technology Indian Institute of Technology Kharagpur, India

Importance of Web Page Internet Surfers generally do not bother to go through the first 10 to 20 pages So the ordering of pages is important to compose an effective and efficient search result. 2

Problems in Web Page Huge size of Web Exponential increase in size Unstructured nature of web pages No control on content 3

The Two Basic Web Page Algorithms Hypertext Induced Topic Search By J. Kleinberg Used at AltaVista Page Rank By L. Page, S. Brin Used at Google 4

Hypertext Induced Topic Search Ref: Kleinberg, Jon; Authoritative Sources in a Hyperlinked Environment; Proc. ACM-SIAM Symposium on Discrete Algorithms, 1998; pp. 668-677

Hubs and Authorities Authorities are pages that are recognized as providing significant, trustworthy, and useful information on a topic. 1 2 3 4 a 4 = h 1 + h 2 + h 3 h 4 = a 5 + a 6 + a 7 4 5 6 7 Hubs are index pages that provide lots of useful links to relevant content pages (topic authorities). 6

HITS Iterative Algorithm Initialize for all p S: a p = h p = 1 For i = 1 to k: For all p S: For all p S: a p = h q q: q p h p = a q q: p q (update auth. scores) (update hub scores) For all p S: a p = a p /c c: For all p S: h p = h p /c c: p S p S ( a / ) 2 p c = 1 ( h / ) 2 p c = 1 (normalize a) (normalize h) 7

Convergence Algorithm converges to a fix-point if iterated indefinitely. Define A to be the adjacency matrix for the subgraph defined by S. A ij = 1 for i S, j S iff i j Authority vector, a, converges to the principal eigenvector of A T A Hub vector, h, converges to the principal eigenvector of AA T In practice, 20 iterations produces fairly stable results. 8

Drawbacks Pure link based computation-textual content is ignored Topic Drifting-Appear when hub discusses multiple topic 9

Page Rank Ref: Brin, Sergey; Page, Lawrence; The Anatomy of a Large-Scale Hypertextual Web Search Engine; 7th Int. WWW Conf. Proceedings, Brisbane, Australia; April 1998.

PageRank Alternative link-analysis method used by Google (Brin & Page, 1998). Does not attempt to capture the distinction between hubs and authorities. Ranks pages just by authority. Applied to the entire web rather than a local neighborhood of pages surrounding the results of a query. 11

Initial PageRank Idea Just measuring in-degree (citation count) doesn t account for the authority of the source of a link. Initial page rank equation for page p: R( q) R( p) = c q: q p Nq N q is the total number of out-links from page q. A page, q, gives an equal fraction of its authority to all the pages it points to (e.g. p). c is a normalizing constant set so that the rank of all pages always sums to 1. 12

Initial PageRank Idea (cont.) Can view it as a process of PageRank flowing from pages to the pages they cite..1.09.05.05.03.03.03.08.08.03 13

Initial Algorithm Iterate rank-flowing process until convergence: Let S be the total set of pages. Initialize p S: R(p) = 1/ S Until ranks do not change (much) (convergence) For each p S: R ( p) = q: q p p S R( q) Nq c = 1/ R ( p) For each p S: R(p) = cr (p) (normalize) 14

Sample Stable Fixpoint 0.4 0.2 0.2 0.4 0.2 0.4 0.2 15

Problem with Initial Idea A group of pages that only point to themselves but are pointed to by other pages act as a rank sink and absorb all the rank in the system. Rank flows into cycle and can t get out 16

Rank Source Introduce a rank source E that continually replenishes the rank of each page, p, by a fixed amount E(p). R( p) = c q: q p R( q) N q + E( p) 17

PageRank Algorithm Let S be the total set of pages. Let p S: E(p) = α/ S (for some 0<α<1, e.g. 0.15) Initialize p S: R(p) = 1/ S Until ranks do not change (much) (convergence) For each p S: R ( p) = (1 α) q: q c = 1/ R ( p) p S p R( q) Nq + E( p) For each p S: R(p) = cr (p) (normalize) 18

Random Surfer Model PageRank can be seen as modeling a random surfer that starts on a random page and then at each point: With probability E(p) randomly jumps to page p. Otherwise, randomly follows a link on the current page. R(p) models the probability that this random surfer will be on page p at any given time. E jumps are needed to prevent the random surfer from getting trapped in web sinks with no outgoing links. 19

Speed of Convergence Early experiments on Google used 322 million links. PageRank algorithm converged (within small tolerance) in about 52 iterations. Number of iterations required for convergence is empirically O(log n) (where n is the number of links). Therefore calculation is quite efficient. 20

Comparison of HITS and PageRank HITS Page Rank Assembles different root set and prioritizes pages in the context of query Looks forward and backward direction Assigns initial ranking and retains them independently from queries (fast) In the forward direction from link to link 21

Recent Modifications

Client Side EigenVector- Enhanced Retrieval Ref: Chakrabarti, S. et. al.,; Mining the link structure of the World Wide Web; IEEE Computer, 32(8), August 1999.

CLEVER Replacing the sums of HITS with weighted sums Assign to each link a non-negative weight Weight depends on the query term and end point Extension 1: Anchor Text using text that surrounds hyperlink definitions (href s) in Web pages, often referred as anchor text boost weight enhancements of links that occur near instances of query terms Extension 2: Mini Hub Pagelets breaking large hub into smaller units treat contiguous subsets of links as mini-hubs or pagelets contiguous sets of links on a hub page are more focused on single topic than the entire page 24

CLEVER: The Process Starts by collecting a set of pages Gathers all pages of initial link, plus any pages linking to them Ranks result by counting links Links have noise, not clear which pages are best Recalculate scores Pages with most links are established as most important, links transmit more weigh Repeat calculation no. of times till scores are refined 25

CLEVER: Advantages Used to populate categories of different subjects with minimal human assistance Able to leverage links to fill category with best pages on web Can be used to compile large taxonomies of topics automatically Emerging new directions: Hypertext classification, focused crawling, mining communities 26

Companion Algorithm Ref: Madria, Sanjay Kumar; Web Mining: A Bird s Eye View; Lecture Notes

Companion Algorithm An extension to HITS algorithm Features: Exploit not only links but also their order on a page Use link weights to reduce the influence of pages that all reside on one host Merge nodes that have a large number of duplicate links The base graph is structured to exclude grandparent nodes but include nodes that share child 28

Companion Algorithm (Cont d) Four steps 1. Build a vicinity graph for u 2. Remove duplicates and near-duplicates in graph. 3. Compute link weights based on host to host connection 4. Compute a hub score and a authority score for each node in the graph, return the top ranked authority nodes. 29

Weighted Page Rank Ref: Xing, W.; Ghorbani, A.; Weighted PageRank algorithm; Proceedings of the Second Annual Conference on Communication Networks and Services Research, 19-21 May 2004; pp. 305 314.

Weighted Page Rank Rationale Assigning larger rank values to more important (popular) pages instead of dividing the rank value of a page evenly among its outlink pages. Each outlink page gets a value proportional to its popularity (its number of inlinks and outlinks). 31

Weighted Page Rank (Contd..) The inlink weight of link(v, u) is calculated based on the number of inlinks of page u and the number of inlinks of all reference pages of page v. The outlink weight of link(v, u) calculated based on the number of outlinks of page u and the number of out-links of all reference pages of page v. where Iu and Ip represent the number of inlinks of page u and page p, respectively. R(v)denotes the reference page list of page v. where Ou and Op represent the number of outlinks of page u and page p, respectively. R(v)denotes the reference page list of page v. 32

Weighted Page Rank (Contd..) Page A has two reference pages: p1and p2 Ip1 =2 Ip2 =1 Op1 =2 Op2 = 3 33

Results of WPR The algorithm is tested using the query topics travel agent and scholarship. Travel agent represents a non-focused topic whereas scholarship represents a focused (popular) topic in the website of Saint Thomas University. The results are shown below as a bar graph The relevancy value versus the size of the page set of the query travel agent for PageRank and WPR The relevancy value versus the size of the page set of the query scholarship for PageRank and WPR 34

Query based

Personalized PageRank Page Rank can be biased (personalized) by changing E to a non-uniform distribution. Restrict random jumps to a set of specified relevant pages. For example, let E(p) = 0 except for one s own home page, for which E(p) = α This results in a bias towards pages that are closer in the web graph to your own homepage. 36

QUERY-SENSITIVE SELF-ADAPTABLE WEB PAGE RANKING ALGORITHM Ref: Wen-Xue Tao; Wan-Li Zuo; Query-sensitive selfadaptable Web page ranking algorithm Machine Learning and Cybernetics, 2003 International Conference on Volume 1, 2-5 Nov. 2003 Page(s):413-418 Vol.1

QUERY-SENSITIVE SELF-ADAPTABLE WEB PAGE RANKING ALGORITHM The algorithm uses the inverse document matrix i.e. a mapping from term to documents Each document is assigned a query sensitiveness value based on the user given query At the first stage, for each list of inverse table, the top nweb pages are selected from docs to construct a voting set VSet, All of the pages in VSet are assigned the same right to vote and to be voted 38

QUERY-SENSITIVE SELF-ADAPTABLE WEB PAGE RANKING ALGORITHM (Contd..) 1. If docis has a link to docit, (ignoring the number of links); 2. If there is docit s title, or the title (or name) of the website docit belongs to in docis s content, (ignoring the number of links); 3. If docis explicitly notes that the information is referred from docit, or from the website docit belongs to. The query Sensitiveness is calculated by counting the total number of votes got by a page 39

QUERY-SENSITIVE SELF-ADAPTABLE WEB PAGE RANKING ALGORITHM (Contd..) For two random Web pages doc, and doc2, let their query sensitiveness values are defined as QSl and QS2, and their global importance values are PRI and PR2. The three concrete integration strategies are as follows: 1. Query sensitiveness first, i.e. if (QSl>QS2)ll(QSI==QS&& PRI>PR2) then docl is ordered in front of doc2; 2. Global importance first, i.e. if (PRI>PR2)11(PRI==PR2 && QSl>QS2)then docl is ordered in front of doc2; 3. Integrating the above two as the rank value, i.e. if then docl is ranking in front of doc2, where 0<=α, β<=1 40

QUERY-SENSITIVE SELF-ADAPTABLE WEB PAGE RANKING ALGORITHM Results 41

Confidence Based Page Ref: Mukhopadhyay, Debajyoti; Giri, Debasis; Singh, Sanasam Ranbir; An Approach to Confidence Based Page for User Oriented Web Search; SIGMOD Record, Vol.32, No.2, June 2003; pp. 28-33.

Confidence Based A Modification of Page Rank Algorithm The pageranks are calculated with respect to a topic. Before running the pagerank algorithm, the web graph is properly pruned to remove the pages not relevant to the required topic. A new parameter viz. confidence of a page w.r.t. a topic [C(a,p)] is defined as the probability of accessing page p for the topic a Calculation of C(a,p) depends on the recent accesses to a page The resultant page rank is calculated by multiplying the page rank and confidence factor 43

Conclusion The page ranking algorithms are tend to extract relevance of a page by analyzing the hyperlinks The algorithms are trying to extract some more information besides the keywords from user given query string The algorithms tend to use some web classification information during ranking Researches are also going on to make personalized ranking 44

References 1. Kleinberg, Jon; Authoritative Sources in a Hyperlinked Environment; Proc. ACM-SIAM Symposium on Discrete Algorithms, 1998; pp. 668-677. 2. Brin, Sergey; Page, Lawrence; The Anatomy of a Large-Scale Hypertextual Web Search Engine; 7th Int. WWW Conf. Proceedings, Brisbane, Australia; April 1998. 3. Chakrabarti, S. et. al.,; Mining the link structure of the World Wide Web; IEEE Computer, 32(8), August 1999. 4. Madria, Sanjay Kumar; Web Mining: A Bird s Eye View; http://mandolin.cais.ntu.edu.sg/wise2002/slides.shtml; WISE 2002, Singapore. 5. Xing, W.; Ghorbani, A.; Weighted PageRank algorithm; Proceedings of the Second Annual Conference on Communication Networks and Services Research, 19-21 May 2004; pp. 305 314. 6. Wen-Xue Tao; Wan-Li Zuo; Query-sensitive self-adaptable Web page ranking algorithm Machine Learning and Cybernetics, 2003 International Conference on Volume 1, 2-5 Nov. 2003 Page(s):413-418 Vol.1 7. Mukhopadhyay, Debajyoti; Giri, Debasis; Singh, Sanasam Ranbir; An Approach to Confidence Based Page for User Oriented Web Search; SIGMOD Record, Vol.32, No.2, June 2003; pp. 28-33. 45

Thanks for Your Patience