.. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Similar documents
1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Information Retrieval. Lecture 11 - Link analysis

Link Analysis and Web Search

COMP Page Rank

A brief history of Google

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Lecture 17 November 7

The PageRank Citation Ranking: Bringing Order to the Web

PAGE RANK ON MAP- REDUCE PARADIGM

COMP5331: Knowledge Discovery and Data Mining

Link Analysis. Hongning Wang

THE KNOWLEDGE MANAGEMENT STRATEGY IN ORGANIZATIONS. Summer semester, 2016/2017

PageRank and related algorithms

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

Beyond PageRank: Machine Learning for Static Ranking

Announcements. Assignment 6 due right now. Assignment 7 (FacePamphlet) out, due next Friday, March 16 at 3:15PM

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Lecture 5: Graphs. Rajat Mittal. IIT Kanpur

Discrete Mathematics, Spring 2004 Homework 8 Sample Solutions

Discrete Mathematics for CS Spring 2008 David Wagner Note 13. An Introduction to Graphs

Discrete Mathematics and Probability Theory Fall 2009 Satish Rao,David Tse Note 8

Math/Stat 2300 Modeling using Graph Theory (March 23/25) from text A First Course in Mathematical Modeling, Giordano, Fox, Horton, Weir, 2009.

CPSC 532L Project Development and Axiomatization of a Ranking System

Roadmap. Roadmap. Ranking Web Pages. PageRank. Roadmap. Random Walks in Ranking Query Results in Semistructured Databases

The application of Randomized HITS algorithm in the fund trading network

Synonyms. Hostile. Chilly. Direct. Sharp

Graph Theory. Part of Texas Counties.

Weighted Page Rank Algorithm Based on Number of Visits of Links of Web Page

Social Network Analysis

Big Data Analytics CSCI 4030

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL

Advanced Computer Architecture: A Google Search Engine

CS2 Algorithms and Data Structures Note 9

Social Network Analysis With igraph & R. Ofrit Lesser December 11 th, 2014

A STUDY OF RANKING ALGORITHM USED BY VARIOUS SEARCH ENGINE

PageRank. CS16: Introduction to Data Structures & Algorithms Spring 2018

COMP 4601 Hubs and Authorities

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Centrality. Peter Hoff. 567 Statistical analysis of social networks. Statistics, University of Washington 1/36

1. a graph G = (V (G), E(G)) consists of a set V (G) of vertices, and a set E(G) of edges (edges are pairs of elements of V (G))

BACKGROUND: A BRIEF INTRODUCTION TO GRAPH THEORY

Anatomy of a search engine. Design criteria of a search engine Architecture Data structures

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Graph Theory. ICT Theory Excerpt from various sources by Robert Pergl

Hypercubes. (Chapter Nine)

Big Data Analytics CSCI 4030

Further Mathematics 2016 Module 2: NETWORKS AND DECISION MATHEMATICS Chapter 9 Undirected Graphs and Networks

Lecture 27: Learning from relational data

University of Maryland. Tuesday, March 2, 2010

Web Structure Mining using Link Analysis Algorithms

Elements of Graph Theory

3. GRAPHS. Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley Copyright 2013 Kevin Wayne. Last updated on Sep 8, :19 AM

2013/2/12 EVOLVING GRAPH. Bahman Bahmani(Stanford) Ravi Kumar(Google) Mohammad Mahdian(Google) Eli Upfal(Brown) Yanzhao Yang

Fast Iterative Solvers for Markov Chains, with Application to Google's PageRank. Hans De Sterck

On Finding Power Method in Spreading Activation Search

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

Graph Algorithms Using Depth First Search

CS6200 Information Retreival. The WebGraph. July 13, 2015

Lec 8: Adaptive Information Retrieval 2

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

A Modified Algorithm to Handle Dangling Pages using Hypothetical Node

Project Problems SNU , Fall 2014 Chung-Kil Hur due: 12/21(Sun), 24:00

Lecture 1: Introduction and Motivation Markus Kr otzsch Knowledge-Based Systems

Social Networks. Slides by : I. Koutsopoulos (AUEB), Source:L. Adamic, SN Analysis, Coursera course

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

KNOWLEDGE GRAPHS. Lecture 1: Introduction and Motivation. TU Dresden, 16th Oct Markus Krötzsch Knowledge-Based Systems

Link Analysis in the Cloud

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Lecture 8: Linkage algorithms and web search

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Part 1: Link Analysis & Page Rank

Goals! CSE 417: Algorithms and Computational Complexity!

Mathematics of Networks II

Module 2: NETWORKS AND DECISION MATHEMATICS

Graphs and Graph Algorithms. Slides by Larry Ruzzo

Application of PageRank Algorithm on Sorting Problem Su weijun1, a

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

COP 4531 Complexity & Analysis of Data Structures & Algorithms

EE/CSCI 451 Midterm 1

Exam IST 441 Spring 2011

SimRank : A Measure of Structural-Context Similarity

Antisymmetric Relations. Definition A relation R on A is said to be antisymmetric

Searching the Web [Arasu 01]

The PageRank Computation in Google, Randomized Algorithms and Consensus of Multi-Agent Systems

Algorithms for Finding Dominators in Directed Graphs

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

Basic Graph Algorithms (CLRS B.4-B.5, )

UCSD CSE 21, Spring 2014 [Section B00] Mathematics for Algorithm and System Analysis

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

CSE 21 Spring 2016 Homework 5. Instructions

Weighted Page Rank Algorithm based on In-Out Weight of Webpages

Graph Theory Review. January 30, Network Science Analytics Graph Theory Review 1

Introduction to Graphs

Graph Theory for Network Science

6c Lecture 3 & 4: April 8 & 10, 2014

Continuity and Tangent Lines for functions of two variables

BIL694-Lecture 1: Introduction to Graphs

Graphs: A graph is a data structure that has two types of elements, vertices and edges.

Transcription:

.. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Link Analysis in Graphs: PageRank Link Analysis Graphs Recall definitions from Discrete math and graph theory. Graph. A graph G is a structure V,E, where V = {v,,v n } is a finite set of vertices or nodes; E = {(v,w) v,w V }, is a set of pairs of vertices called edges. Undirected and directed graph. In a directed graph, an edge e = (v,w) is interpreted as a connection from v to w but not a connection from w to v. In an undirected graph, an edge e = (v,w) is interpreted as a connection between v and w. Representations. Graphs can be represented in a number of ways: Set notation. A representation of a graph that follows the definition above. Example. G = {A,B,C,D,E}, {(A,B),(A,C),(A,E),(B,C),(B,E),(C,D)}. Graphical representation. A graph can be represented as a drawing. Each node is drawn as a point or circle on a plane, and each edge is a line connecting the representations of its two vertices. To draw a directed graph, arrows are added to the edge lines to point from the first vertex in the edge to the second. Example. Figure shows the graphical representations of G in the cases when G is directed and undirected.

A B A B C D C D E E Figure : Undirected (left) and directed (right) graphs. Matrix. A graph can be represented as an adjacency matrix M G whose rows and columns are vertices. If edge (v i,v j ) E, M G [i,j] =, otherwise, M G [i,j] = 0. Undirected graphs have symmetrical adjacency matrices (or, alternatively, only uppre diagonal portions of those matrices are considered). Matrices for directed graphs need not be symmetric. Example. The adjacency matrices for graph G in undirected and directed cases: Undirected G: G A B C D E Directed G: G A B C D E A 0 A 0 B 0 B 0 0 C 0 C 0 0 0 D 0 0 0 D 0 0 0 0 E 0 0 E 0 0 0 0 Lists. A graph can be represented by an associative array of adjacency lists. The domain of the array A G is V. For v V, A G [v] lists all w V, such that (v,w) E. Example. The adjacency lists for the undirected and directed versions of graph G are shown below: Undirected G: A: B,C,E B: A,C,E C: A,B,D D C E A,B Directed G: A: B,C,E B: C,E C: D D E Labeled Graphs. A labeled graph G is a graph G = V,E, where E = {(v,w,l)}, where v,w V are vertices connected by the edge and l is a label. The domain for the set of possible labels is usually specified up-front. Egde labels can be used to specify the length of a connection, cost to traverse the edge, type on edge and many other properites. Graphs can have additional edge and vertex labels. 2

Properties of Graphs. Path. A path in a graph G = V,E is a sequence p = e,e 2,e s of edges, e = (w,w ),,e s = (w s,w s ), such that w = w 2,w 2 = w 3, w s = w s. In undirected graphs p is called a path between w and w w. In directed graphs p is called a path from w to w s. Connected Graphs. A graph G = E,G is called connected iff for any pair v i,v j V, there exists a path p between v i and v j (or, from v i to v j ). Shortest path. The length of a path p in a graph G is the number of edges in it. A shortest path between two vertices v and w is a path that starts in v and ends in w with the smallest length (number of edges in it). Complete graphs. A graph G = V,E is complete iff for all vertices v,w V, (v,w) E. Vertex degrees. Let G = V, E be an undirected graph. The degree of a node v V in G is defined as degree(v) = {(v,v ) E v V }, i.e., it is the number of edges that connect v to other vertices in the graph. Let G = V,E be a directed graph. The in-degree of a node v V in G is defined as in degree(v) = {(v,v) E v V }, i.e. the number of edges in G that end at v. The out-degree of a node v V in G is defined as out degree(v) = {(v,v ) E v V }, i.e., it is the number of edges that start in v. Graphs and Social Networks Social entity. A social entity is a community, organization or setting involving a collection of interacting actors. Actors. In different social entities actors may be: Humans. (e.g., employees in a company). Groups of humans (e.g., sports teams). Legal or political entities (e.g., companies or states). Inanimate ojects (e.g., individual computers). Virtual objects (e.g., web pages or files). 3

Social Network. A social network of a social entity is a structure documenting interactions between actors within the entity. Typically, social networks are represented as graphs. A social network graph SN = V, E is constructed as follows: V is the set of actors of the social entity. E is the set of interactions between the actors in the entity. I.e., (v,w) E iff, actors v and w have an interaction that is tracked by the social network. Interactions can be symmetric, in which case SN is an undirected graph, or assymetric, in which case SN is a directed graph. Examples. Examples of social networks: Business interactions. Email exchanges between employees of a company. Social interactions. Friendship relationship on facebook. Academic interactions. Citation of a paper by another paper. Co-authoring of papers by researchers. Relationship interactions. Kinship relationships between people. Kevin Bacon game. Actors having roles in the same movie. Web page interactions. Links from one web page to another. PageRank via Web Traversal Web Search specifics. Compared to traditional Information Retrieval, web search has the following properties: Huge document collection. (world wide web is the biggest document collection). No golden set. Web is unobservable, hence, we cannot find the sets of all relevant documents for the queries. Only few links visited. Only the top 20-40-00 links are of any importance. Users rarely venture beyond in search of relevant web pages. Web pages are linked! Can this be used to improve search? Web page owners are not trustworthy. Search engine spamming and (somewhat less horrible) search engine optimization attempt to circumvent the results of web search on certain queries. Prestige. Idea: a good web search engine must combine discovery of pages that contain all/most query terms with robust ranking, which promotes important, high-quality, reliable pages to the top. Prestige is a measure of web page importance. 4

PageRank. PageRank is a procedure for computing the prestige of each web pages in a collection. There is a number of definitions/derivations for the PageRank computation. To illustrate how it works, we will use the more simple definition. Web as a graph. We treat World Wide Web as a social network, where individual web pages (urls) are nodes, or actors, and hypertext links between them are interactions. More formally, consider the directed acyclic graph G WWW = {V,E}. The set V of vertices is the list of individual web pages (urls). An edge (v,w) E iff the web page v has in its body an anchor tag <a href="url"> where URL is the URL of the web page w. Given a web page i V, The set I(i) of in-links is the set of all edges e E, such that e = (v,i) for some v V. Given a web page i V, the set O(i) of out-links is the set of all edges e E, such that e = (i,v) for some v V. Note: often, only the in-links and out-links from web pages located on a different site are included in I(i) and O(i). Surfing the web. PageRank is a way of modeling the behavior of a web surfer in a single browser window. In particular, PageRank models the following traversal: PageRank Traversal:. The user starts surfing the web from some, randomly selected page from V a. 2. On each step, the user observes some web page i. With probability d (0, ) (s)he chooses to click on any of the links available on the page (assuming the page has at least one outlink). 3. Each link found on the page i can be selected with the same probability. 4. With probability d the user gets tired of surfing the web by following links and instead goes directly to a randomly selected web page from the collection V. 5. If a web page has no out-links, the user simply goes to a randomly selected web page from the collection V. a PageRank actually allows to relax this condition and start from some page, randomly selected from a predefined collection of pages: a (typically small) subset of the entire web page collection. PageRank defined. The PageRank of a page i V is the probability of eventually reaching page i via the traversal procedure outlined above[]. Deriving PageRank Let p(i) be the probability of reaching web page i (i.e., the PageRank of page i). Let I(i) = {j, j s } be the set of all web pages which link to i. Let the 5

probabilities of reaching each of those pages be p(j ),,p(j s ) respectively. Also, let O(j k ) be the set of all outbound edges from j k. Assumption: All web pages in V have at least one out-link. Suppose we have reached page j. From that page, with probability d, we elect to follow on of the links. j has O(j ) out-links on it, so, with probability p(i j, follow links) = O(j ) we can reach web page i. Since p(follow links) = d, we obtain: p(i j ) = d O(j ). Similar reasoning for all other j I(i) yields p(i j k ) = d O(j k ). We can reach page i in one of only two ways:. By following a link from one of j,,j s. 2. By randomly selecting i when the user chooses to jump (i.e. not follow a link from a current page). We obtain the following formula for computing the probability p(i): Thus, p(i) = ( d) V + (p(i j ) p(j ) + + p(i j s ) p(j s )). Figure 2 illustrates how these probabilities are computed. From here, substituting p(i j k ) we obtain: p(i) = ( d) s V + d O jk p(j k). k= pagerank(i) = ( d) Note, that this is a recursive definition. s V + d O jk pagerank(j k). () k= Computing PageRank From formula (), we see that in order to compute PageRank of a page, we need to know the PageRank of its ancestors. A standard way to model such computation is to perform it iteratively. 6

p(j) p(j2) p(j3) j j2 j3.. O(j3) O(j) O(j2) p(j) O(j) p(j2) O(j2) p(j3) O(j3) p(j4) O(j4) i p(j4) j4 * d O(j4) Figure 2: Computing the probability of reaching a web page. PageRank via iterative process. The traditional iterative algorithm for PageRank uses the following iterative procedure: pagerank 0 (i) = for alli V (2) V pagerank r s (i) = ( d) V + d O k= jk pagerankr (j k ). (3) ( ) Stop when : (pagerank r (i) pagerank r (i)) < ǫ (4) References i V [] Page, Lawrence; Brin, Sergey; Motwani, Rajeev and Winograd, Terry (998). The PageRank citation ranking: Bringing order to the Web. Technical Report, Department of Computer Science, Stanford University. http://ilpubs.stanford.edu:8090/422//999-66.pdf 7