Lecture 17 November 7

Similar documents
COMP5331: Knowledge Discovery and Data Mining

Social and Technological Network Data Analytics. Lecture 5: Structure of the Web, Search and Power Laws. Prof Cecilia Mascolo

Information Retrieval. Lecture 11 - Link analysis

A FAST COMMUNITY BASED ALGORITHM FOR GENERATING WEB CRAWLER SEEDS SET

Web Structure Mining using Link Analysis Algorithms

Information Retrieval. Lecture 9 - Web search basics

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

PageRank and related algorithms

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

Authoritative Sources in a Hyperlinked Environment

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Collaborative filtering based on a random walk model on a graph

Compact Encoding of the Web Graph Exploiting Various Power Laws

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

Topology Generation for Web Communities Modeling

The application of Randomized HITS algorithm in the fund trading network

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Link Analysis. Hongning Wang

On Finding Power Method in Spreading Activation Search

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Searching the Web [Arasu 01]

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

Review: Searching the Web [Arasu 2001]

Connected Components, and Pagerank

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Information Retrieval and Web Search

.. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Social Network Analysis

Information Networks: PageRank

COMP 4601 Hubs and Authorities

Finding Top UI/UX Design Talent on Adobe Behance

Social Networks 2015 Lecture 10: The structure of the web and link analysis

E-Business s Page Ranking with Ant Colony Algorithm

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

Link Analysis and Web Search

Recent Researches on Web Page Ranking

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

PageRank. CS16: Introduction to Data Structures & Algorithms Spring 2018

Do TREC Web Collections Look Like the Web?

Computer Engineering, University of Pune, Pune, Maharashtra, India 5. Sinhgad Academy of Engineering, University of Pune, Pune, Maharashtra, India

A brief history of Google

Development of MapReduced Topic Sensitive PageRank

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

How to organize the Web?

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

An Improved Computation of the PageRank Algorithm 1

TODAY S LECTURE HYPERTEXT AND

The PageRank Citation Ranking: Bringing Order to the Web

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

2013/2/12 EVOLVING GRAPH. Bahman Bahmani(Stanford) Ravi Kumar(Google) Mohammad Mahdian(Google) Eli Upfal(Brown) Yanzhao Yang

5/30/2014. Acknowledgement. In this segment: Search Engine Architecture. Collecting Text. System Architecture. Web Information Retrieval

Finding Web Communities by Maximum Flow Algorithm using Well-Assigned Edge Capacities.

SimRank : A Measure of Structural-Context Similarity

Bibliometrics: Citation Analysis

Link Analysis in Web Information Retrieval

Mining Web Data. Lijun Zhang

LINK GRAPH ANALYSIS FOR ADULT IMAGES CLASSIFICATION

Motivation. Motivation

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Finding Neighbor Communities in the Web using Inter-Site Graph

CS-C Data Science Chapter 9: Searching for relevant pages on the Web: Random walks on the Web. Jaakko Hollmén, Department of Computer Science

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Ranking web pages using machine learning approaches

Using Non-Linear Dynamical Systems for Web Searching and Ranking

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Survey on Web Structure Mining

Searching the Web What is this Page Known for? Luis De Alba

Breadth-First Search Crawling Yields High-Quality Pages

PROS: A Personalized Ranking Platform for Web Search

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

Structural Analysis of Paper Citation and Co-Authorship Networks using Network Analysis Techniques

PAGE RANK ON MAP- REDUCE PARADIGM

COMPARATIVE ANALYSIS OF POWER METHOD AND GAUSS-SEIDEL METHOD IN PAGERANK COMPUTATION

Part 1: Link Analysis & Page Rank

HIPRank: Ranking Nodes by Influence Propagation based on authority and hub

Bow-tie Decomposition in Directed Graphs

AN EFFICIENT COLLECTION METHOD OF OFFICIAL WEBSITES BY ROBOT PROGRAM

Web Search Ranking. (COSC 488) Nazli Goharian Evaluation of Web Search Engines: High Precision Search

Information Retrieval and Web Search Engines

An Adaptive Approach in Web Search Algorithm

Degree Distribution: The case of Citation Networks

Lecture 8: Linkage algorithms and web search

A P2P-based Incremental Web Ranking Algorithm

Using Bloom Filters to Speed Up HITS-like Ranking Algorithms

Hyperlink-Induced Topic Search (HITS) over Wikipedia Articles using Apache Spark

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Integrating Content Search with Structure Analysis for Hypermedia Retrieval and Management

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

Link Structure Analysis

Extracting Information from Complex Networks

Recent Researches in Ranking Bias of Modern Search Engines

Single link clustering: 11/7: Lecture 18. Clustering Heuristics 1

Transcription:

CS 559: Algorithmic Aspects of Computer Networks Fall 2007 Lecture 17 November 7 Lecturer: John Byers BOSTON UNIVERSITY Scribe: Flavio Esposito In this lecture, the last part of the PageRank paper has been presented [3] and a discussion about the Hubs and Authorities paper [2] begun. 17.1 The Bow-tie model Definition 1. Given a directed graph G = G(V, E), the fan out or out-degree of a vertex V in G is the size of the subset F of E of edges that are going from V to all the other vertices. Definition 2. Given a directed graph G = G(V, E), the fan in or in-degree of a vertex V in G is the size of the subset H of E of edges that are going to V from all the other vertices. Note that a vertex V can be highly ranked in a graph G if it has either (both) high fan out or (and) high fan in. The insight is the fan in is more important than the fan out but both brings up in the ranking a web page. Recall that in the PageRank approach the graph G(V, E) has as vertex the web pages and as edges the hyperlinks. Core Figure 17.1. bow-tie model: If we only consider the graph bounded by the clique (core), we can prove that an equilibrium point exist: if we have also tends parts, i.e. vertex with fan in zero (left tend) or with fan out zero (right tend) then during a walk we can get stuck (dead end). Recalling the PageRank markovian approach, the idea is to find an equilibrium vector P of probabilities of a random walker being in vertex V after k steps when k. The process of finding the vector P ends when P does not change more than a given δ. So If we consider all the bow tie, we might have to wait for a bigger amount of steps to find the final P, or we can have no equilibrium point at all [1]. 17-1

We use then random jumps in the graph instead of a linear walk to visit the all nodes. A random jump approach has two properties: Order persevering respect to a vertex V as starting point. First of all, if I change a starting point the order has to be the same. Moreover, if I sort the elements of the vector P, without jumps, and then on the same graph I sort the element of the P found, I might find different values of p i but the bigger is the same, the second bigger is the same, and so on. Speed up convergence for sampling based methods, i.e. P k1 < P k2 where P k2 is the final equilibrium point after k 2 steps without random jumps and P k1 is with random jumps, given the same δ. 17.1.1 Boost the ranking The idea is that, a page is visited more often if I am being linked from more popular websites. In other words, a way to increase the probability of getting hit by a random walker surfing the web is by getting a link from a very popular website and or to create link to popular websites. (so increase the fan in and the fan out of one page boost the ranking of a page). Another approach could be to create a set of nodes (web pages) all linked one another in a fully connected graph fashion. In this manner, the probability of an user hitting the set of node (pages) connected by edges (pages) will be higher. 17.2 Hubs and Authorities Assume that we have a topic T (represented by the set of pages in the cloud), and a set of pages i.e. a region related to T. Definition 3. Given a directed graph G = G(V, E) and a topic T, a vertex V is called hub for T if there is a significant number of edges starting from V and ending into a vertex (page) of T. Definition 4. Given a directed graph G = G(V, E) and a topic T, a vertex V is called authority for T if there is a significant number of edges starting from any vertex in T and ending in V. Observation : Like is depicted in figure, hubs can have also edges that go out of T and an authority can have edges that comes from vertex outside from T but we are not interested in this edges Kleinberg idea to rank the web pages is based on two main steps: 1. Find a collection of URL s on a topic T. 2. Derive an algorithm to identify hubs and authorities from the previous step(1.). 17-2

Topic T Hub Authority Figure 17.2. Hubs and Authorities: on the left, a page (vertex) with many edges (links) pointing to other pages (vertex) of the same topic T i.e. an hub. On the right a page (vertex) with many edges (links) pointing from other pages (vertex) of the same topic T i.e. an authority. The challenging part is of course to find the collection of hubs and authorities. To accomplish point 1., Kleinberg conducts a text based search: to find some (200) results with a web search engine (HotBot, Altavista) and then, broading the set of URL s by adding one-hop neighbor downstream (the edge representing a son in the large graph), and one-hop neighbor upstream (the father). Two observations: the first one is that in order to gather information of the down and upstream neighborhood we need the knowledge of the graph. The second observation is the reason why Kleinberg uses only one hop neighborhood information: whenever the node found is an authority, the set of k ( k > 1 ) hop neighborhood can be too vast; that is why the identification algorithm has as input only with one hop information. Just looking at the degree of any node in the subgraph S of G associated with the topic T, the system is able to cluster hubs and authorities. Kleinberg formalizes an authorities and an hub weight as: Definition 5. Given a graph G = G(V, E) and a topic T of G (subgraph S), an authority of weight p for T, is the quantity {x <p> } Definition 6. Given a graph G = G(V, E) and a topic T of G (subgraph S), an hub of weight p for T, is the quantity {y <p> } The algorithm identifies hubs and authorities updating k times the weights with two operations, denoted O meaning Output degree of a vertex V of G, and I meaning Input degree. x <p> y <q> (17.1) p:(p,q) E 1 The algorithm runs with good results if k = 20. After 20 steps of the algorithm experimentally has been found that no changes of the first top 50 (up to 200) are observed (equilibrium values have been reached). 17-3

and y <p> p:(p,q) E x <q> (17.2) Moreover, a maintaining invariance conditions (normalizing the coefficient to 1 after each step) have to be satisfied, i.e. p S σ (x <p> ) 2 = 1 and p S σ (y <p> ) 2 = 1 where σ is a query string. As we can see form the formulas above, the hubs power are given by the authorities and vice-versa. 17.2.1 Pseudocode of the Iterative Algorithm Let us see the pseudocode of the algorithm to identify hubs and authorities Iterate(G, k) G is a graph collection of n linked pages, k number of rounds z is an unitary vector of R n (x 0 = z and y 0 = z i ) for i = 1,...,k do apply I operation to x i 1, y i 1 x i apply O operation to x i 1, y i 1 y i Normalize x i x i Normalize y i y i end for return (x k, y k ) 17.2.2 Sketch of the proof of convergence Given M n x n symmetric and A adjacency matrix (matrix of connectivity and so symmetric and n x n as well) and the vectors x i of authorities and y i of hubs, the proof of convergence is based on the follow logic steps: Hubs and Authorities are updated as follow: x i A t y i 1 and y i Ax i 1. Expanding until there are no more iterations, we have: x k (A t A) k 1 A t x 0 and y k (A t A) k y 0 notice that in class the updates where written in a slightly different notation from that one in the paper: but the matrix M in your notes is suppose to be A: x i (A t A)x i 2 x 2k (A t A) k x 0 and the same for y. From linear algebra is known that if x 0 is a vector not orthogonal to the principal eigenvector ω of M, then the vector x 0 in the direction of M k x 0 converge to ω and so the sequence {x k } x 0 for k. 17-4

Bibliography [1] Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. Graph structure in the web. In Proceedings of the 9th international World Wide Web conference on Computer networks : the international journal of computer and telecommunications networking, pages 309 320, Amsterdam, The Netherlands, The Netherlands, 2000. North-Holland Publishing Co. [2] J.Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46, 1999. [3] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The Pagerank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998. 5