CS535 Big Data Fall 2017 Colorado State University 9/5/2017. Week 3 - A. FAQs. This material is built based on,

Similar documents
CS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on,

Link Analysis. Chapter PageRank

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

Unit VIII. Chapter 9. Link Analysis

Information Networks: PageRank

Big Data Analytics CSCI 4030

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1

Information Retrieval. Lecture 11 - Link analysis

Agenda. Math Google PageRank algorithm. 2 Developing a formula for ranking web pages. 3 Interpretation. 4 Computing the score of each page

Big Data Analytics CSCI 4030

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Lecture 8: Linkage algorithms and web search

Part 1: Link Analysis & Page Rank

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

COMP 4601 Hubs and Authorities

Jeffrey D. Ullman Stanford University

How to organize the Web?

CS6200 Information Retreival. The WebGraph. July 13, 2015

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS425: Algorithms for Web Scale Data

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Link Analysis in Web Mining

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Slides based on those in:

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

COMP Page Rank

Introduction to Data Mining

Page rank computation HPC course project a.y Compute efficient and scalable Pagerank

Lecture #3: PageRank Algorithm The Mathematics of Google Search

The PageRank Computation in Google, Randomized Algorithms and Consensus of Multi-Agent Systems

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

The PageRank Computation in Google, Randomized Algorithms and Consensus of Multi-Agent Systems

Development of MapReduced Topic Sensitive PageRank

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Link Structure Analysis

Analysis of Large Graphs: TrustRank and WebSpam

PageRank and related algorithms

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Jeffrey D. Ullman Stanford University/Infolab

Introduction to Data Mining

Searching the Web [Arasu 01]

Reading. Graph Terminology. Graphs. What are graphs? Varieties. Motivation for Graphs. Reading. CSE 373 Data Structures. Graphs are composed of.

Mining Social Network Graphs

Computer Science Department Picnic. FAQs. Topics. This material is developed based on, 8/27/2018 Week 2-A Sangmi Lee Pallickara

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Lecture 8: Linkage algorithms and web search

Introduction to Information Retrieval

Graphs (Part II) Shannon Quinn

Link Analysis and Web Search

Finding Top UI/UX Design Talent on Adobe Behance

Mathematical Analysis of Google PageRank

How Google Finds Your Needle in the Web's

PageRank Algorithm Abstract: Keywords: I. Introduction II. Text Ranking Vs. Page Ranking

Collaborative filtering based on a random walk model on a graph

Adaptive methods for the computation of PageRank

Mining Web Data. Lijun Zhang

Administrative. Web crawlers. Web Crawlers and Link Analysis!

CHAPTER-2 A GRAPH BASED APPROACH TO FIND CANDIDATE KEYS IN A RELATIONAL DATABASE SCHEME **

DATA MINING - 1DL460

S Postgraduate Course on Signal Processing in Communications, FALL Topic: Iteration Bound. Harri Mäntylä

DATA MINING - 1DL460

Brief (non-technical) history

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

1.6 Case Study: Random Surfer

Social Network Analysis

Link Analysis. Hongning Wang

Link Analysis SEEM5680. Taken from Introduction to Information Retrieval by C. Manning, P. Raghavan, and H. Schutze, Cambridge University Press.

Degree Distribution: The case of Citation Networks

TODAY S LECTURE HYPERTEXT AND

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Web Structure Mining using Link Analysis Algorithms

TELCOM2125: Network Science and Analysis

4/25/12. The Problem: Distributed Methods for Finding Paths in Networks Spring 2012 Lecture #20. Forwarding. Shortest Path Routing

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Information Retrieval

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

Mining Web Data. Lijun Zhang

Project and Production Management Prof. Arun Kanda Department of Mechanical Engineering Indian Institute of Technology, Delhi

Machine Learning Lecture 16

AMS526: Numerical Analysis I (Numerical Linear Algebra)

Note. Out of town Thursday afternoon. Willing to meet before 1pm, me if you want to meet then so I know to be in my office

Algorithms. Definition: An algorithm is a set of rules followed to solve a problem. In general, an algorithm has the steps:

Topological Sort. CSE 373 Data Structures Lecture 19

Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL

An Improved Computation of the PageRank Algorithm 1

Introduction To Graphs and Networks. Fall 2013 Carola Wenk

Network Centrality. Saptarshi Ghosh Department of CSE, IIT Kharagpur Social Computing course, CS60017

Analysis of Algorithms Prof. Karen Daniels

Popularity of Twitter Accounts: PageRank on a Social Network

What are graphs? (Take 1)

Transcription:

S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--1 S535 IG T FQs Programming ssignment 1 We will discuss link analysis in week3 Installation/configuration guidelines for Hadoop and Spark are uploaded Port assignment has been posted PRT 1. TH OMPUTING MOL FOR IGTNLYTIS 2. W SL LINK NLYSIS TP teams ny change? Please let me know omputer Science, olorado State University 9/5/217 S535 ig ata - Fall 217 Week 3--2 9/5/217 S535 ig ata - Fall 217 Week 3--3 Today s topics Link analysis PageRank Link nalysis 1. PageRank 2. Hyperlink-Induced Topic Search (HITS) 3. entrality nalysis 9/5/217 S535 ig ata - Fall 217 Week 3--4 9/5/217 S535 ig ata - Fall 217 Week 3--5 This material is built based on, nand Rajaraman, Jure Leskovec, and Jeffrey Ullman, Mining of Massive atasets, ambridge University Press, 212 hapter 5 Link nalysis 1. PageRank lgorithm http://infolab.stanford.edu/~ullman/mmds.html 1

S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--6 efinition of PageRank Link analysis ata-analysis technique used to evaluate relationships (connections) between nodes function that assigns a real number to each page in the Web The higher the PageRank of a page, the more important it is There is NOT one fixed algorithm for assignment of PageRank 9/5/217 S535 ig ata - Fall 217 Week 3--7 xample [1/5] Page has links to, and Page has links to and Page has a link to Page has links to and 9/5/217 S535 ig ata - Fall 217 Week 3--8 9/5/217 S535 ig ata - Fall 217 Week 3--9 xample [2/5] Suppose that a random surfer starts at page Page, and will be the next with probability probability of being at xample [3/5] Now suppose the random surfer at has probability of ½ of being at, ½ of being at and of being at or 9/5/217 S535 ig ata - Fall 217 Week 3--1 9/5/217 S535 ig ata - Fall 217 Week 3--11 xample [4/5] Transition matrix M What happens to random surfers after one step M has n rows and columns What is the transition matrix for this example? xample [5/5] 1 M= The first column a surfer at has a probability of next being at each of the other pages The second column a surfer at has a ½ probability of being next at and the same for being at 2

S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--12 What does this matrix mean? [1/6] The probability distribution for the location of a random surfer column vector whose jth component is the probability that the surfer is at page j P(j i) Right stochastic matrix Real square matrix with each row summing to 1 1.2 1 9/5/217 S535 ig ata - Fall 217 Week 3--13 What does this matrix mean? [2/6] If we surf at any of the n pages of the Web with equal probability The initial vector v o (row vector) will have 1/n for each component If M is the transition matrix of the Web fter the first one step, the distribution of the surfer will be Mv o fter two steps, M(Mv o ) =M 2 v and so on The product of two right stochastic matrices is also right stochastic Goal: Finding a row eigenvector of the transition matrix M Mv = M 9/5/217 S535 ig ata - Fall 217 Week 3--14 9/5/217 S535 ig ata - Fall 217 Week 3--15 What does this matrix mean? [3/6] Multiplying the initial vector v by M a total of i times The distribution of the surfer after i steps The probability for the next step from the current location The probability for being in the current location What does this matrix mean? [4/6] The probability xi that a random surfer will be at node i at the next step x i = m ij v j mij is the probability that a surfer j at node j will move to node i at the next step vj is the probability that the surfer was at node j at the previous step 1 1/4 1.2 1/4 The distribution of the surfer after the first step 1/4 = - Probability for being in the next possible location 1/4 9/5/217 S535 ig ata - Fall 217 Week 3--16 What does this matrix mean? [5/6] The limit is reached when multiplying the distribution by M another time does not change the distribution The limiting v is a row eigenvector of M Since M is right stochastic (its columns each add up to 1), v is the principle eigenvector Its associated eigenvalue is the largest of all eigenvalues The principle eigenvector of M Where the surfer is most likely to be after a long time 9/5/217 S535 ig ata - Fall 217 Week 3--17 What does this matrix mean? [6/6] The distribution of the surfer approaches a limiting distribution v that satisfies v = Mv provided two conditions are met: 1. The graph is strongly connected It is possible to get from any node to any other node 2. There are no dead ends Nodes that have no arcs out For the Web, 5-75 iterations are sufficient to converge to within the error limits of double-precision arithmetic 3

S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--18 9/5/217 S535 ig ata - Fall 217 Week 3--19 xample 1 M= xample 1 M= Suppose we apply this process to the matrix M The initial vector v and v1 after multiplying M 1 ¼ ¼ v 1=Mv = ¼ =? ¼ Use your worksheet Suppose we apply this process to the matrix M The initial vector v and v1 after multiplying M 1 ¼ 9/24 ¼ 5/24 v 1 =Mv = ¼ = 5/24 ¼ 5/24 9/5/217 S535 ig ata - Fall 217 Week 3--2 9/5/217 S535 ig ata - Fall 217 Week 3--21 What is the v 2? 1 M= Suppose we apply this process to the matrix M The initial vector v and v1 after multiplying M 1 ¼ 9/24 ¼ 5/24 v 1=Mv = ¼ = 5/24 ¼ 5/24 xample continued The sequence of approximations to the limit We get by multiplying at each step by M is ¼ 9/24 15/48 12 3/9 ¼ 5/24 11/48 7/32 2/9 ¼ 5/24 11/48 7/32 2/9 ¼ 5/24 11/48 7/32 2/9 This difference in probability is not noticeable In the real Web, there are billions of nodes of greatly varying importance The probability of being at a node like www.amazon.com is orders of magnitude greater than others 9/5/217 S535 ig ata - Fall 217 Week 3--22 9/5/217 S535 ig ata - Fall 217 Week 3--23 Problems we need to avoid Link nalysis 1. PageRank lgorithm hallenges in PageRank lgorithm for the real Web ead end page that has no links out Surfers reaching such a page will disappear In the limit, no page that can reach a dead end can have any PageRank at all Spider traps Groups of pages that all have outlinks but they never link to any other pages What if we allow the ead end and Spider traps in the PageRank computation? 4

S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--24 9/5/217 S535 ig ata - Fall 217 Week 3--25 voiding ead nds xample If we allow dead ends The transition matrix of the Web is no longer stochastic Some of the columns will sum to rather than 1 Remove the arc from to à becomes a If we compute M i v for increasing powers of a substochastic matrix Some of all of the components of the vector go to substochastic matrix matrix whose column sums are at most 1 Importance drains out of the Web No information about the relative importance of pages 9/5/217 S535 ig ata - Fall 217 Week 3--26 9/5/217 S535 ig ata - Fall 217 Week 3--27 xample xample Remove the arc from to à becomes a dead end àif a random surfer reaches, they disappear at the next round Repeatedly multiplying the vector by M: ¼ 3/24 5/48 288 ¼ 5/24 7/48 388 ¼ 5/24 7/48 388 ¼ 5/24 7/48 388 M= The probability of a surfer being anywhere goes to as the number of steps increase 9/5/217 S535 ig ata - Fall 217 Week 3--28 9/5/217 S535 ig ata - Fall 217 Week 3--29 pproaches to dealing with dead ends 1. Recursive deletion Step 1. rop the dead ends from the graph Step 2. rop their incoming arcs as well Step 3. Repeat Step 1 and 2 xample of recursive deletion (1/4) Step 1. rop the dead ends from the graph Step 2. rop their incoming arcs as well Step 3. Repeat Step 1 and 2 2. Taxation Modify the process by which random surfers are assumed to move about the Web 5

S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--3 9/5/217 S535 ig ata - Fall 217 Week 3--31 xample of recursive deletion (2/4) The final matrix for the graph is ½ M = ½ 1 ½ ½ 1/6 3/12 5/24 2/9 3/6 5/12 14 4/9 2/6 4/12 8/24 3/9 xample of recursive deletion (3/4) We still need to compute deleted nodes ( and ) was the last to be deleted We know all its predecessors have PageRanks ( and ) Therefore, PageRank of = x 2/9 + ½ x 3/9 = 13/54 9/5/217 S535 ig ata - Fall 217 Week 3--32 9/5/217 S535 ig ata - Fall 217 Week 3--33 xample of recursive deletion (3/4) Now, we can compute the PageRank for Only one predecessor, The PageRank of is the same as that of (13/54) The sums of the PageRanks exceeds 1 It cannot represent the distribution of a random surfer It provides a good estimate (2) Spider Traps 9/5/217 S535 ig ata - Fall 217 Week 3--34 9/5/217 S535 ig ata - Fall 217 Week 3--35 Spider Traps and Taxation xample 1 Spider traps set of nodes with no dead ends but no arcs out (from the set of nodes) This can appear intentionally or unintentionally on the Web There is a simple spider trap of 4 nodes Spider traps causes the PageRank calculation to place all the PageRank within the spider traps M= 1 F 6

S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--36 9/5/217 S535 ig ata - Fall 217 Week 3--37 xample 2 If we perform the usual iteration to compute the PageRank of the nodes, we get The arc out of changed to point to itself 1/6 1/12.42 1/6 5/36.46 1/6 16.194 1/6 4/18.222 1/6 1/6 1/12 is a simple spider trap of one node It still has outgoing edge to itself It is not a dead end M= 1 9/5/217 S535 ig ata - Fall 217 Week 3--38 9/5/217 S535 ig ata - Fall 217 Week 3--39 If we perform the usual iteration to compute the PageRank of the nodes, we get ¼ 3/24 5/48 288 ¼ 5/24 7/48 388 ¼ 14 29/48 25/288 1 ¼ 5/24 7/48 388 ll the PageRank is at Once there, a random surfer can never leave To avoid Spider traps, we modify the calculation of PageRank llowing each random surfer a small probability of teleporting to a random page Rather than following an out-link from their current page 9/5/217 S535 ig ata - Fall 217 Week 3--4 9/5/217 S535 ig ata - Fall 217 Week 3--41 The iterative step, where we compute a new vector estimate of PageRanks v from the current PageRank estimate v and the transition matrix M is Where β is a chosen constant Usually in the range.8 to.9 v' = βmv + (1 β)e / n e is a vector for all 1 s with the appropriate number of components n is the number of nodes in the Web graph v' = βmv + (1 β)e / n The term βmv represents the case where, With probability β, the random surfer decides to follow an out-link from their present page The term (1-β)e/n is a vector ach of whose components has value (1-β)/n Represents the introduction with probability 1-β of a new random surfer at a random page 7

S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--42 9/5/217 S535 ig ata - Fall 217 Week 3--43 If the graph has no dead ends The probability of introducing a new random surfer is exactly equal to the probability that the random surfer will decide not to follow a link from their current page Surfer decides either to follow a link or teleport to a random page If the graph has dead ends The surfer goes nowhere The term (1-β)e/n does not depend on the sum of the components of the vector v, there will be some fraction of a surfer operating on Web When there are dead ends, the sum of the components of v may be less than 1 ut it will never reach 9/5/217 S535 ig ata - Fall 217 Week 3--44 9/5/217 S535 ig ata - Fall 217 Week 3--45 xample of Taxation () M= 1 v' = βmv + (1 β)e / n β =.8 2/5 4/15 2/5 v = 4/15 4/5 2/5 v + 4/15 2/5 xample of Taxation (2/2) First few iterations: ¼ 9/6 4 543/45 15/148 ¼ 13/6 53/3 77/45 19/148 ¼ 25/6 153/3 2543/45 95/148 ¼ 13/6 53/3 77/45 19/148 y being a spider trap, gets more than ½ of the PageRank for itself ffect is limited ach of the nodes gets some of the PageRank 9/5/217 S535 ig ata - Fall 217 Week 3--46 9/5/217 S535 ig ata - Fall 217 Week 3--47 xample 1 ompute the PageRank of each page assuming β=.8 Link nalysis xamples M= v = β M v +(1-β)e/n ⅓ ½ ½ ⅓ =.8 ⅓ ½ ⅓ + (1-.8)e/3 ⅓ ½ ⅓ 8

S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--48 xample 2 clique Set of nodes with all possible arcs from one to another Suppose the Web consists of a clique of n nodes and a single additional node that is the successor of each of the n nodes in the clique 9/5/217 S535 ig ata - Fall 217 Week 3--49 xample 2 etermine the PageRank of each page assuming β=.8 Is there a dead end? Is there a spider trap? 9/5/217 S535 ig ata - Fall 217 Week 3--5 9/5/217 S535 ig ata - Fall 217 Week 3--51 xample 2 etermine the PageRank of each page assuming β=.8 Is there a dead end? Yes. Is there a spider trap? No. Having clique does not mean that there is a spider trap xample 2 ¼ ¼ ¼ M= ¼ ¼ ¼ ¼ ¼ ¼ ¼ ¼ ¼ ¼ ¼ ¼ ¼ ¼ ⅕ v= ⅕ ⅕ ⅕ ⅕ β M v +(1-β)e/n ¼ ¼ ¼ ⅕ 1 ¼ ¼ ¼ ¼ ⅕ 1 =.8 ¼ ¼ ¼ ⅕ +(1-.8) 1 /5 ¼ ¼ ¼ ⅕ 1 ¼ ¼ ¼ ¼ ⅕ 1 9/5/217 S535 ig ata - Fall 217 Week 3--52 xample 3 Suppose that we recursively eliminate dead ends from the Web graph to solve the remaining graph Suppose that the graph is a chain of dead ends, headed by a node with a self-loop What would be the PageRank assigned to each of the nodes? 9/5/217 S535 ig ata - Fall 217 Week 3--53 xample 3 Remove all of the dead ends recursively Z Z 9

S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--54 9/5/217 S535 ig ata - Fall 217 Week 3--55 xample 3 Remove all of the dead ends recursively What is v and M? xample 3 What is v and M? v =[1], M = [1], PageRank of = 1 PageRank of = ½ 1 PageRank of = PageRank of = ½ PageRank of = PageRank of = ½ Z 9/5/217 S535 ig ata - Fall 217 Week 3--56 9/5/217 S535 ig ata - Fall 217 Week 3--57 xample of Taxation () M= v' = βmv + (1 β)e / n β =.8 2/5 4/15 2/5 v = 4/15 2/5 v + 4/15 2/5 xample of Taxation (2/2) For the first few iterations: ¼ 9/6 4 543/45 15/148 ¼ 13/6 53/3 77/45 19/148 ¼ 13/6 53/3 77/45 19/148 ¼ 13/6 53/3 77/45 19/148 1