CSI 445/660 Part 10 (Link Analysis and Web Search)

Similar documents
CS 6604: Data Mining Large Networks and Time-Series

Link Analysis: Web Structure and Search

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Social Networks 2015 Lecture 10: The structure of the web and link analysis

Information Networks: Hubs and Authorities

Social Network Analysis

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

COMP 4601 Hubs and Authorities

Introduction to Data Mining

Information Networks: PageRank

How to organize the Web?

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Social and Technological Network Analysis. Lecture 5: Web Search and Random Walks. Dr. Cecilia Mascolo

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Lecture 8: Linkage algorithms and web search

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Collaborative filtering based on a random walk model on a graph

Slides based on those in:

A project report submitted to Indiana University

Social and Technological Network Data Analytics. Lecture 5: Structure of the Web, Search and Power Laws. Prof Cecilia Mascolo

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

CS425: Algorithms for Web Scale Data

Reading Time: A Method for Improving the Ranking Scores of Web Pages

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Generalized Social Networks. Social Networks and Ranking. How use links to improve information search? Hypertext

Link Structure Analysis

Information Retrieval. Lecture 11 - Link analysis

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

Authoritative Sources in a Hyperlinked Environment

COMP5331: Knowledge Discovery and Data Mining

Link Analysis and Web Search

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

CS6200 Information Retreival. The WebGraph. July 13, 2015

Information Retrieval and Web Search

Bruno Martins. 1 st Semester 2012/2013

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Brief (non-technical) history

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Bibliometrics: Citation Analysis

A project report submitted to Indiana University

Motivation. Motivation

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Experimental study of Web Page Ranking Algorithms

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

Analysis of Large Graphs: TrustRank and WebSpam

10/10/13. Traditional database system. Information Retrieval. Information Retrieval. Information retrieval system? Information Retrieval Issues

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

PageRank and related algorithms

Big Data Analytics CSCI 4030

Introduction to Information Retrieval

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Recent Researches on Web Page Ranking

PAGE RANK ON MAP- REDUCE PARADIGM

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Part 1: Link Analysis & Page Rank

Graph and Link Mining

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Lecture 17 November 7

CS224W Final Report Emergence of Global Status Hierarchy in Social Networks

Personalized Information Retrieval

Search Engines. Dr. Johan Hagelbäck.

Web Structure Mining using Link Analysis Algorithms

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press.

HW 4: PageRank & MapReduce. 1 Warmup with PageRank and stationary distributions [10 points], collaboration

Searching the Web What is this Page Known for? Luis De Alba

Chapter 5 Graph Algorithms Algorithm Theory WS 2012/13 Fabian Kuhn

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

Degree Distribution: The case of Citation Networks

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

COMP Page Rank

Big Data Analytics CSCI 4030

A novel approach of web search based on community wisdom

Project and Production Management Prof. Arun Kanda Department of Mechanical Engineering Indian Institute of Technology, Delhi

The Technology Behind. Slides organized by: Sudhanva Gurumurthi

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

On Page Rank. 1 Introduction

Network Centrality. Saptarshi Ghosh Department of CSE, IIT Kharagpur Social Computing course, CS60017

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

Google Scale Data Management

Web Search. Lecture Objectives. Text Technologies for Data Science INFR Learn about: 11/14/2017. Instructor: Walid Magdy

Section 7.12: Similarity. By: Ralucca Gera, NPS

An Improved Computation of the PageRank Algorithm 1

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Lecture 3: Descriptive Statistics

TODAY S LECTURE HYPERTEXT AND

Unit VIII. Chapter 9. Link Analysis

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

Transcription:

CSI 445/660 Part 10 (Link Analysis and Web Search) Ref: Chapter 14 of [EK] text. 10 1 / 27

Searching the Web Ranking Web Pages Suppose you type UAlbany to Google. The web page for UAlbany is among the top few results displayed. Search engines use automated methods to rank pages. These methods are generally based on link analysis. Search engines also maintain and try to get clues from a user s search history. Common difficulties: Synonymy: Multiple ways to describe the same thing (e.g. scallions vs green onions ). Polysemy: Multiple meanings for the same word (e.g. mercury may refer to the planet, a car model, the chemical element or newspaper). 10 2 / 27

Information Retrieval Then and Now Pre-web era: Problem of scarcity. Example: A lawyer searching for certain types of cases could only locate a few documents. Now: Problem of abundance. The search engine should try to produce the most relevant information (from a whole lot of information). A popular area of research. Focus: Use of link analysis in ranking. Some basic issues: Suppose a user types a one word query Cornell into a search engine. Are there clues within the web to suggest that cornell.edu is a good answer to the query? 10 3 / 27

Information Retrieval... (continued) Idea 1 Voting by in-links: "Vote" for cornell.edu cornell.edu If many other pages link to cornell.edu, one can think of that page as receiving collective endorsement. Some of those pages may actually express negative opinions about cornell.edu. Idea 2 List finding: Consider the query newspapers to a search engine. There is no single best answer to this query. 10 4 / 27

Information Retrieval... (continued) Idea 2: List finding (continued): Suppose we collect a set of web pages that have the word newspapers and then check which pages they endorse (i.e., to which pages they have in-links). The answers typically consist of the following: High scores for web pages of prominent newspapers. High scores for other web pages such as Google, Amazon, Facebook, etc. Note: Web pages for Google, Amazon, Facebook, etc. generally receive a high score no matter what the query is. 10 5 / 27

Information Retrieval... (continued) Example: 10 6 / 27

List Finding... (continued) Pages that contain lists of resources relevant to a topic are also useful. For the query newspapers, we may try to find pages that have lists of links to newspapers. We can try to compute a measure that represents the value of a page as a list. One possible measure: The list value of a page X is the sum of the votes received by the pages voted for by X. 10 7 / 27

Information Retrieval... (continued) Example (with list values): 10 8 / 27

List Finding... (continued) Idea 3 Principle of iterative improvement: Since pages with high list values are important, their votes should be weighted more heavily. (Endorsements from more important people should count more.) So, it is useful to tabulate the votes again, using the list values. After this, we can recompute the list values again; that is, repeat the vote count and list count steps. The resulting algorithm (due to Kleinberg) is called HITS (Hyperlink-Induced Topic Search). 10 9 / 27

Information Retrieval... (continued) Example (with list values and new vote counts): 10 10 / 27

A Description of the HITS Algorithm Definitions: Authorities for a query: endorsed answers. Pages that are prominent and highly Hubs for a query: Pages that have high list values. Preliminary ideas: For each page p, we maintain two numerical values denoted by auth(p) and hub(p). Initially, auth(p) = hub(p) = 1. Two update rules are used. p1 p2 pr p 1. Authority update rule (or voting step): For each page p, update auth(p) to be the sum of the hub scores for all the pages that point to p. 10 11 / 27

A Description of the HITS Algorithm (continued) Update rules (continued): p p1 p2 pr 2. Hub update rule (or list finding step): For each page p, update hub(p) to be the sum of the authority scores for all the pages to which p points. Outline of the HITS Algorithm: 1 For each page p, set auth(p) = hub(p) = 1. Choose a value for the number of steps k. 2 Repeat the following steps k times: Apply the Authority update rule. Apply the Hub update rule. 3 Normalize the scores and output pages in non-increasing order of their authority scores. 10 12 / 27

HITS Algorithm... (continued) Result produced by the HITS Algorithm: 10 13 / 27

HITS Algorithm... (continued) Final remarks: Kleinberg [1999] shows that the scores converge to appropriate limits as k (except in some degenerate cases). It is possible to express the HITS Algorithm as an iterative algorithm on matrices MM T and M T M, where M is the adjacency matrix formed by the initial pages. The authority scores and hub scores of pages converge to specific eigenvectors of MM T and M T M respectively. The resulting authority and hub scores represent a form of equilibrium (under the authority update and hub update rules). 10 14 / 27

Page Rank HITS Algorithm works well in commercial contexts where competing firms don t (generally) link to each other. In other contexts (e.g. academic pages, scientific literature), page rank algorithm generally outperforms the HITS Algorithm. Page rank computation uses ideas similar to those of HITS: A page rank update rule. Idea of iterative improvement. A physical model for page rank: Think of page rank as a fluid that circulates through the links of the web network. The fluid accumulates at nodes that are most important. 10 15 / 27

Page Rank Computation (continued) Notation: For any page u, PR(u) denotes its page rank. OD(u) denotes its outdegree. Outline of the algorithm: 1 Let n denote the number of pages. For each node u, let PR(u) = 1/n. 2 Choose a value for k (the number of iterations). 3 Repeat the following step k times: Apply the Basic Page Rank Update Rule to all the nodes in parallel. 10 16 / 27

Page Rank Computation (continued) Basic Page Rank Update Rule: For any node u, 1 (Flow generation step) If OD(u) = 0 then u sends PR(u) to itself. If OD(u) 1, then u sends PR(u)/OD(u) along each of its outgoing edges. 2 (Flow accumulation step) Suppose node u has r incoming edges and the flow along the i th edge is α i. If OD(u) = 0, then PR(u) = PR(u)+ α 1 + α 2 + + α r. If OD(u) 1, then PR(u) = α 1 + α 2 + + α r. 10 17 / 27

Page Rank Computation (continued) Examples for the flow generation step: Example 1: u x y z Suppose PR(u) = 1/2. Since OD(u) = 3, u sends 1/6 along each of the three outgoing edges. Example 2: x y z Suppose PR(u) = 1/10. Since OD(u) = 0, u sends 1/10 to itself. u 10 18 / 27

Page Rank Computation (continued) Examples for the flow accumulation step: Example 3: x y z Here, OD(u) = 0. 1/3 1/4 u 1/5 Let the current value of PR(u) be 1/10. New value of PR(u) = 1/10 + 1/3 + 1/4 + 1/5 = 53/60. Example 4: x 1/3 1/4 y z 1/5 u Let the current value of PR(u) be 1/10. Since OD(u) = 2, u has already sent 1/20 to each of a and b. a b New value of PR(u) = 1/3 + 1/4 + 1/5 = 47/60. 10 19 / 27

Page Rank Computation (continued) A more detailed example: a (1/4) Initially, PR(a) = PR(b) = PR(c) = PR(d) = 1/4. (1/4) b d (1/4) c (1/4) Each node has outdegree > 0. So, in every step, each node sends out its page rank along the outgoing edges. Step 1: Node a receives 1/8 from b and 1/4 from c. So, PR(a) = 1/8 + 1/4 = 3/8. Node b receives 1/8 each from a and d. So, PR(b) = 1/8 + 1/4 = 3/8. Node c receives 1/8 from a. So, PR(c) = 1/8. Node d receives 1/8 from b. So, PR(d) = 1/8. 10 20 / 27

Page Rank Computation (continued) A more detailed example (continued): a (3/8) At the end of Step 1, PR(a) = PR(b) = (3/8) b c (1/8) 3/8 and PR(c) = PR(d) = 1/8. d (1/8) Step 2: Node a receives 3/16 from b and 1/8 from c. So, PR(a) = 3/16 + 1/8 = 5/16. Node b receives 3/16 from a and 1/8 from d. So, PR(b) = 3/16 + 1/8 = 5/16. Node c receives 3/16 from a. So, PR(c) = 3/16. Node d receives 3/16 from b. So, PR(d) = 3/16. 10 21 / 27

Page Rank Computation (continued) Table showing successive page rank values: Remarks: Step PR(a) PR(a) PR(a) PR(a) 0 1/4 1/4 1/4 1/4 1 3/8 3/8 1/8 1/8 2 5/16 5/16 3/16 3/16 There is no normalization here; the total page rank is always 1. It can be shows that (except for degenerate cases), the page rank values converge to a limit as k. 10 22 / 27

Page Rank Computation (continued) Example for equilibrium state (or fixed point): b a c Suppose PR(a) = 0, PR(b) = 1/2 and PR(c) = 1/2. These values won t change; that is, this is an equilibrium state. Another form of equilibrium: Initially, let PR(a) = 0, PR(b) = 3/4 and PR(c) = 1/4. At the end of Step 1: PR(a) = 0, PR(b) = 1/4 and PR(c) = 3/4. At the end of Step 2: PR(a) = 0, PR(b) = 3/4 and PR(c) = 1/4 (which is the initial state). Remark: If the network is strongly connected, it can be shown that there is a unique equilibrium state. 10 23 / 27

Page Rank Computation (continued) A drawback: In some networks, the page rank update rule allows wrong nodes to end up with all the page rank. b c d e f One would expect node a to have a high page rank. a However, the current page rank update rule cause all the page rank to flow out of a. g h All the page rank accumulates at g and h; it doesn t flow back to the other nodes. Remedy: Modify the page rank update rule. 10 24 / 27

Scaled Page Rank Update Rule Steps: 1 Pick a scaling factor s, where 0 < s < 1. 2 Apply the basic page rank update rule. 3 Scale down all the page rank values by the factor s. (This step reduces the total page rank from 1 to s.) 4 Divide the residual 1 s units of page rank equally among the n nodes; that is, add (1 s)/n units of page rank to each node. (This step restores the total page rank value to 1.) Remarks: It can be shows that (except for degenerate cases), the scaled page rank values converge to a limit as k. It is believed that the value of s used by Google is in the range 0.8 to 0.9. 10 25 / 27

A Random Walk Interpretation of Page Rank Basic page rank update rule and random walks: 1 Suppose we have n web pages p 1, p 2,..., p n. 2 Choose an initial page: each page is chosen with probability = 1/n. Let p i be the chosen page. 3 Repeat k times: Suppose the OD(p i ) = r. If r = 0, stay at p i itself. If r 1, choose one of the outgoing edges of p i with probability = 1/r. Update p i to the other end point of that edge. Theorem: For each i, 1 i n, the probability that the above random walk is at node p i is equal to the page rank of p i after k applications of the basic page rank update rule. 10 26 / 27

A Random Walk Interpretation of Page Rank (continued) Notes: The random walk approach provides another way to estimate page ranks. The approach can also be extended to the scaled page rank update rule. In the body of the loop for Step 3, do the following: With probability s (the chosen scale factor) continue the random walk as before. With probability 1 s choose another node, say p j, with all nodes being equally likely and continue the random walk from p j. Search engine companies are generally very secretive about the exact methods they use for computing page ranks. 10 27 / 27