PageRank computation. HPC course project, a.y. 2012-13: compute an efficient and scalable PageRank.


PageRank computation. HPC course project, a.y. 2012-13: compute an efficient and scalable PageRank.

PageRank

PageRank is a link analysis algorithm, named after Brin and Page [1] and used by the Google Internet search engine. It assigns a numerical weight to each element of a hyperlinked set of documents, such as the World Wide Web, with the purpose of "measuring" its relative importance within the set [Wikipedia].

[1] Brin, S. and Page, L. (1998) The Anatomy of a Large-Scale Hypertextual Web Search Engine. In: Seventh International World-Wide Web Conference (WWW 1998), April 14-18, 1998, Brisbane, Australia.

PageRank: the intuitive idea

PageRank relies on the democratic nature of the Web, using its vast link structure as an indicator of an individual page's value or quality. PageRank interprets a hyperlink from page x to page y as a vote, by page x, for page y. However, PageRank looks at more than the sheer number of votes; it also analyzes the page that casts the vote. Votes cast by important pages weigh more heavily and help to make other pages more "important". This is exactly the idea of rank prestige in social networks.

More specifically

A hyperlink from a page to another page is an implicit conveyance of authority to the target page. The more in-links a page i receives, the more prestige page i has. Pages that point to page i also have their own prestige scores: a page of higher prestige pointing to i counts more than a page of lower prestige pointing to i. In other words, a page is important if it is pointed to by other important pages.

PageRank algorithm

According to rank prestige, the importance of page i (i's PageRank score) is the sum of the PageRank scores of all pages that point to i. Since a page may point to many other pages, its prestige score is shared among them. Modeling the Web as a directed graph G = (V, E), the PageRank score of page i (denoted by P(i)) is defined by:

$P(i) = \sum_{(j,i) \in E} \frac{P(j)}{O_j}$

where $O_j$ is the number of out-links of j.

Matrix notation

Let n = |V| be the total number of pages. We have a system of n linear equations with n unknowns, which we can represent with a matrix. Let P be the n-dimensional column vector of PageRank values, i.e., $P = (P(1), P(2), \ldots, P(n))^T$. Let A be the adjacency matrix of our graph, with

$A_{ij} = \begin{cases} 1/O_i & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$

We can then write the n equations $P(i) = \sum_{(j,i) \in E} P(j)/O_j$ as

$P = A^T P$ (PageRank)

Solve the PageRank equation P = A^T P

This is the characteristic equation of the eigensystem: the solution P is an eigenvector of $A^T$ with corresponding eigenvalue 1. It turns out that, if some conditions are satisfied, 1 is the largest eigenvalue and the PageRank vector P is the principal eigenvector. A well-known mathematical technique called power iteration can be used to find P. Problem: the above equation does not quite suffice, because the Web graph does not meet the conditions.

Using Markov chains

To introduce these conditions and the enhanced equation, let us derive the same equation from a Markov chain. In the Markov chain, each Web page (node in the Web graph) is regarded as a state, and a hyperlink is a transition that leads from one state to another with a certain probability. This framework models Web surfing as a stochastic process: a random walk, in which a Web surfer randomly surfs the Web by making state transitions.

Random surfing

Recall that we use $O_i$ to denote the number of out-links of node i. Each transition probability is $1/O_i$ if we assume that the Web surfer clicks the hyperlinks on page i uniformly at random, that the back button on the browser is not used, and that the surfer does not type in a URL.

Transition probability matrix

Let A be the state transition probability matrix:

$A = \begin{pmatrix} A_{11} & A_{12} & \cdots & A_{1n} \\ A_{21} & A_{22} & \cdots & A_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ A_{n1} & A_{n2} & \cdots & A_{nn} \end{pmatrix}$

$A_{ij}$ represents the transition probability that the surfer in state i (page i) will move to state j (page j). Can A be the adjacency matrix previously discussed?

Let us start

Given an initial probability distribution vector over the states (pages), $p_0 = (p_0(1), p_0(2), \ldots, p_0(n))^T$ (a column vector), and an $n \times n$ transition probability matrix A, we have

$\sum_{i=1}^{n} p_0(i) = 1 \quad\text{and}\quad \sum_{j=1}^{n} A_{ij} = 1$  (1)

If the matrix A satisfies Equation (1), we say that A is the stochastic matrix of a Markov chain.

Back to the Markov chain

In a Markov chain, a question of common interest is: what is the probability that, after m steps/transitions (with m → ∞), a random walker reaches a state j, independently of the initial state of the walk? We determine the probability that the random surfer arrives at state/page j after 1 step (1 transition) by the following reasoning:

$p_1(j) = \sum_{i=1}^{n} A_{ij}(1)\, p_0(i)$

where $A_{ij}(1)$ is the probability of going from i to j in 1 step. At the beginning, $p_0(i) = 1/n$ for all i.

State transition

We can write this in matrix form: $P_1 = A^T P_0$. In general, the probability distribution after k steps/transitions is:

$P_k = A^T P_{k-1}$

Stationary probability distribution

By the ergodic theorem of Markov chains, a finite Markov chain defined by the stochastic matrix A has a unique stationary probability distribution if A is irreducible and aperiodic. The stationary probability distribution means that, after a series of transitions, $p_k$ converges to a steady-state probability vector π, i.e., $\lim_{k \to \infty} p_k = \pi$.

PageRank again

When we reach the steady state, we have $P_k = P_{k+1} = \pi$, and thus $\pi = A^T \pi$. π is the principal eigenvector of $A^T$ (the one whose eigenvalue has maximum magnitude), with eigenvalue 1. In PageRank, π is used as the PageRank vector P: $P = A^T P$.

Is P = π justified?

Using the stationary probability distribution π as the PageRank vector is reasonable and quite intuitive, because it reflects the long-run probability that a random surfer visits each page: a page has high prestige if the probability of visiting it is high.

Back to the Web graph

Now let us come back to the real Web context and see whether the above conditions are satisfied, i.e., whether A is a stochastic matrix and whether it is irreducible and aperiodic. None of these conditions is satisfied; hence, we need to extend the ideal case to produce the actual PageRank model.

A is not a stochastic matrix

A is the transition matrix of the Web graph, with

$A_{ij} = \begin{cases} 1/O_i & \text{if } (i,j) \in E \\ 0 & \text{otherwise} \end{cases}$

It does not satisfy the equation $\sum_{j=1}^{n} A_{ij} = 1$, because many Web pages have no out-links (dangling pages). This is reflected in the transition matrix A by rows of all zeros.

An example Web hyperlink graph

$A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\ 0 & 0 & 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{pmatrix}$

Fix the problem: two possible ways

1. Remove pages with no out-links during the PageRank computation: these pages do not directly affect the ranking of any other page.
2. Add a complete set of outgoing links from each such page i to all the pages on the Web.

Let us use the second method; the dangling row becomes a uniform row of 1/6:

$A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\ 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{pmatrix}$

A is not irreducible

Irreducible means that the Web graph G is strongly connected. Definition: a directed graph G = (V, E) is strongly connected if and only if, for each pair of nodes u, v ∈ V, there is a directed path from u to v. A general Web graph represented by A is not irreducible, because for some pairs of nodes u and v there is no path from u to v. In our example, there is no directed path from node 3 to node 4.

A is not aperiodic

A state i in a Markov chain being periodic means, informally, that a random walker can return from i to i only at regular intervals. Definition: a state i is periodic with period k > 1 if k is the smallest number such that all paths leading from state i back to state i have a length that is a multiple of k. A Markov chain is aperiodic if all its states are aperiodic.

An example: periodic chain

This is a periodic Markov chain with k = 3. If we begin from state 1, the only way to come back to state 1 is to follow the cycle 1-2-3-1 some number of times, say h; thus any return to state 1 takes k·h = 3h transitions.

Deal with reducibility and periodicity

It is easy to deal with the above two problems with a single strategy: add a link from each page to every page, and give each of these links a small transition probability controlled by a parameter d. The augmented transition matrix obviously becomes irreducible, because the graph is now strongly connected, and aperiodic, because there are now paths of all possible lengths from state i back to state i.

Improved PageRank

After this augmentation, at each page the random surfer has two options:
- with probability d, 0 < d < 1, she randomly chooses an out-link to follow;
- with probability 1-d, she stops clicking and jumps to a random page.

The following equation models the improved scheme:

$P = \left( (1-d)\frac{E}{n} + d A^T \right) P$

where E is an n × n square matrix of all 1's. The division by n is important, since the matrix has to remain stochastic.

Follow our example, with d = 0.9

Starting from the matrix A made stochastic, which is still periodic (see state 3) and reducible (no path from 3 to 4):

$A = \begin{pmatrix} 0 & 1/2 & 1/2 & 0 & 0 & 0 \\ 1/2 & 0 & 1/2 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 0 & 0 \\ 0 & 0 & 1/3 & 0 & 1/3 & 1/3 \\ 1/6 & 1/6 & 1/6 & 1/6 & 1/6 & 1/6 \\ 0 & 0 & 0 & 1/2 & 1/2 & 0 \end{pmatrix}$

the transposed, augmented matrix is:

$(1-d)\frac{E}{n} + d A^T = \begin{pmatrix} 1/60 & 7/15 & 1/60 & 1/60 & 1/6 & 1/60 \\ 7/15 & 1/60 & 11/12 & 1/60 & 1/6 & 1/60 \\ 7/15 & 7/15 & 1/60 & 19/60 & 1/6 & 1/60 \\ 1/60 & 1/60 & 1/60 & 1/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 7/15 \\ 1/60 & 1/60 & 1/60 & 19/60 & 1/6 & 1/60 \end{pmatrix}$

The final PageRank algorithm

$(1-d)E/n + dA^T$ is a stochastic (transposed) matrix, and it is also irreducible and aperiodic. Note that $E = e\,e^T$, where e is a column vector of all 1's, and that $e^T P = 1$, since P is the stationary probability vector π. We can therefore rewrite the equation:

$P = \left( (1-d)\frac{E}{n} + d A^T \right) P = (1-d)\frac{1}{n}\, e\, e^T P + d A^T P = (1-d)\frac{1}{n}\, e + d A^T P$

If we scale this equation by multiplying both sides by n, we have $e^T P = n$, and thus:

$P = (1-d)\, e + d A^T P$

The final PageRank algorithm (cont.)

Given $P = (1-d)e + dA^T P$, the PageRank of each page i is:

$P(i) = (1-d) + d \sum_{j=1}^{n} A_{ji}\, P(j)$

with

$A_{ji} = \begin{cases} 1/O_j & \text{if } (j,i) \in E \\ 0 & \text{otherwise} \end{cases}$

which is equivalent to

$P(i) = (1-d) + d \sum_{(j,i) \in E} \frac{P(j)}{O_j}$

i.e., the formula given in the PageRank paper [BP98]. The parameter d is called the damping factor and can be set between 0 and 1; d = 0.85 was used in the PageRank paper.

[BP98] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. WWW Int.l Conf., 1998.

Compute PageRank

Use the power iteration method:
- initialize the PageRank vector $P_0$ (e.g., uniformly);
- repeatedly compute $P_{k+1} = (1-d)e + d A^T P_k$;
- stop when the L1 norm of the change, $\|P_{k+1} - P_k\|_1$, is less than $10^{-6}$.

Again PageRank

Without scaling the equation (i.e., without multiplying by n), we have $e^T P = 1$ (the sum of all PageRanks is one), and thus:

$P(i) = \frac{1-d}{n} + d \sum_{(j,i) \in E} \frac{P(j)}{O_j}$

Important pages are cited/pointed to by other important ones. In the example, the most important page is ID=1, with P(ID=1) = 0.304. P(ID=1) distributes its rank among all its 5 outgoing links (ID = 2, 3, 4, 5, 7): each receives 0.304 / 5 ≈ 0.061.

Again PageRank

Without scaling the equation we have $e^T P = 1$, and thus $P(i) = \frac{1-d}{n} + d \sum_{(j,i) \in E} \frac{P(j)}{O_j}$. The stationary probability P(ID=1) is obtained by:

(1-d)/n + d·(0.023 + 0.166 + 0.071 + 0.045) = 0.15/7 + 0.85·(0.023 + 0.166 + 0.071 + 0.045) = 0.304

1st assignment

Write a sequential code (C or C++) that implements PageRank. Compile the code with the -O3 option, and measure the execution times (command time) for some inputs. Input graphs: http://snap.stanford.edu/data

Test example: a small 3-node graph (nodes 0, 1, 2) whose PageRank values are P[2] = 0.474412, P[1] = 0.341171, P[0] = 0.184417.

Hand in (1st assignment)

Create a tar/zip file with:
- your solution source code and the Makefile;
- a readme file;
- a brief report (PDF).

Groups of max 2 people. Send me an email (orlando@unive.it) with the composition of each group. How to present the assignment: register in moodle.unive.it (High Performance Computing [CM0227]) and submit the assignment by Nov. 25th.

2nd assignment

Given the original incidence matrix A, if we know which nodes are dangling, we can avoid filling their zero rows with values 1/n. Instead of computing $A^T p_k = p_{k+1}$ with each dangling row of A replaced by a uniform row (1/n, ..., 1/n), we keep the zero rows in A and add a correction:

$\left( A^T + \frac{1}{n}\, e\, d^T \right) p_k = p_{k+1}$

where e is the column vector of all 1's and d is the 0/1 indicator column vector of the dangling nodes, so that $e\,d^T$ is the matrix with a column of 1's at each dangling position and zeros elsewhere.

2nd assignment

$A^T p_k + \frac{1}{n} \left( \sum_{i \in \text{danglings}} p_k[i] \right) e = p_{k+1}$

where e is the column vector of all 1's: the correction adds the same scalar, $\sum_{i \in \text{danglings}} p_k[i]\,/\,n$, to every component of $A^T p_k$, redistributing the probability mass of the dangling nodes uniformly.

2nd assignment

Avoid transposing matrix A! You can still traverse A in row-major order:

    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            p_new[j] = p_new[j] + a[i][j] * p[i];

Store matrix A in sparse compressed form: Compressed Sparse Row (CSR or CRS).

2nd assignment

Compressed Sparse Row (CSR or CRS) is used for traversing the matrix in row-major order. Example (6 x 6 matrix, 19 nonzeros):

    val     = 10 -2 3 9 3 7 8 7 3 8 7 5 8 9 9 13 4 2 -1
    col_ind =  0  4 0 1 5 1 2 3 0 2 3 4 1 3 4  5 1 4  5
    row_ptr =  0  2 5 8 12 16 19

row_ptr[n] is the position in val/col_ind where the n-th row starts (there is one more entry than the number of rows, so that row n ends at row_ptr[n+1]). Note that the matrix is sparse, so a row could be completely zero: in this case row_ptr[n] = row_ptr[n+1].

2nd assignment

Store big data like A on a file. Once we map the file to a memory region, we access it via pointers, just as you would access ordinary variables and objects. You can mmap a specific section/partition of the file, and share the file between threads.

    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdlib.h>

    int main() {
        int i;
        float val;
        float *mmap_region;
        FILE *fstream;
        int fd;

2nd assignment

        /* create the file */
        fstream = fopen("./mmapped_file", "w+");
        for (i = 0; i < 10; i++) {
            val = i + 100.0;
            /* write a stream of binary floats */
            fwrite(&val, sizeof(float), 1, fstream);
        }
        fclose(fstream);

        /* map the file to the pages starting at a given address for a given
           length; the last argument (0) is the starting offset in the file */
        fd = open("./mmapped_file", O_RDONLY);
        mmap_region = (float *) mmap(0, 10*sizeof(float), PROT_READ,
                                     MAP_SHARED, fd, 0);
        if (mmap_region == MAP_FAILED) {
            close(fd);
            printf("error mmapping the file");
            exit(1);
        }
        close(fd);

2nd assignment

        /* print the mmapped data */
        for (i = 0; i < 10; i++)
            printf("%f ", mmap_region[i]);
        printf("\n");

        /* free the mmapped memory */
        if (munmap(mmap_region, 10*sizeof(float)) == -1) {
            printf("error un-mmapping the file");
            exit(1);
        }
    }

Hand in (2nd assignment)

Compile the code with the -O3 option, and measure the execution times (command time) for some (large) inputs: time as a function of the number of nodes/edges. Some example graphs are available here: http://snap.stanford.edu/data

Create a tar/zip file with:
- your solution source code and the Makefile;
- a readme file;
- a brief report (PDF).

How to present the assignment: register in moodle.unive.it (High Performance Computing [CM0227]) and submit the assignment by Dec. 9th.

3rd assignment

The goal of this assignment is to parallelize the optimized code of the 2nd assignment. You can use shared-memory or message-passing (also hybrid) parallelization.

- Measure speedup and efficiency as a function of the processors/cores exploited (for a couple of data sets).
- Point out the effects of Amdahl's law concerning the sections that remain serial, e.g., the input/output phases if you are not able to parallelize them.
- Measure how the execution time changes when the problem size increases while the number of processors/cores employed stays fixed. This requires considering subsets of the nodes and edges of a given input graph.

3rd assignment

The issues to solve concern decisions such as the right decomposition, the right granularity, and a (static/dynamic) strategy of task assignment. I would only point out that, if we don't transpose matrix A and we decompose the problem over the input, each worker computes, from its own block of rows of A, a partial copy of $p_{k+1}$:

$A_{\text{block}}\; p_k \;\rightarrow\; \text{partial } p_{k+1}$ (one per worker)

and a final reduce step sums the partial $p_{k+1}$ vectors.

Hand in (3rd assignment)

Compile the code with the -O3 option and measure the execution time, also profiling the code with specific routines such as MPI_Wtime(), or gettimeofday() if you don't use MPI (search for examples of gettimeofday() usage with a search engine).

Create a tar/zip file with:
- your solution source code and the Makefile;
- a readme file;
- a brief report (PDF).

How to present the assignment: register in moodle.unive.it (High Performance Computing [CM0227]) and submit the assignment by Jan. 13th.