Information Networks: PageRank


Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38

Repetition
Information Networks
Shape of the Web
Hubs and Authorities

PageRank

PageRank
Intuition from last week: links as votes (HITS algorithm).
A page is more important if it has many in-links.
Do you think that all in-links are equal?
No! Links from important pages are more important.

Example for PageRank Scores (figure)

PageRank vs HITS
Can you think of the major difference between PageRank and HITS?
Unlike HITS, PageRank is independent of the search query!

PageRank Algorithm in a Nutshell
Start with a set of pages.
Crawl the Web to determine the link structure.
Assign each page an initial rank of 1/N.
Successively update the rank of each page: add up the rank of every page that links to it, each divided by the number of out-links of the referring page.

PageRank: Basic Formulation
The PageRank calculation also starts with simple voting based on in-links and is then repeatedly improved.
A link's vote is proportional to the importance of its source page.
Let page $j$ have importance $r_j$ and $n$ out-links. Then each of its out-links carries $r_j / n$ votes.
The importance of page $j$ is the sum of the votes on its in-links.
Example: if page $j$ is linked from page $i$ (3 out-links) and page $k$ (4 out-links), then $r_j = r_i/3 + r_k/4$.

The Flow Model of PageRank
We learned that a page is important if it is linked from other important pages.
Plus, a vote (via an in-link) from an important page is worth more.
Based on that, we can define a rank $r_j$ for a page $j$:
$$r_j = \sum_{i \to j} \frac{r_i}{d_i} \qquad (1)$$
where $d_i$ is the out-degree of node $i$.
Intuitively, we can think of PageRank as a kind of fluid that flows through the network.
The fluid passes from node to node across links.
It pools at the nodes that are the most important.
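One application of equation (1) can be sketched in a few lines of Python. This is only an illustration: the function name `flow_update` and the dictionary representation of the graph are assumptions, and the graph is the three-page example (y, a, m) used in the flow equations that follow.

```python
from fractions import Fraction

def flow_update(out_links, rank):
    """Apply equation (1) once: each page splits its rank evenly over its out-links."""
    new_rank = {page: Fraction(0) for page in out_links}
    for i, targets in out_links.items():
        for j in targets:
            # r_i / d_i flows along the link i -> j
            new_rank[j] += rank[i] / len(targets)
    return new_rank

# Three-page example graph: y -> y, a ; a -> y, m ; m -> a
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
rank = {page: Fraction(1, 3) for page in out_links}

new_rank = flow_update(out_links, rank)
print(new_rank)
```

Exact fractions are used here so that the "fluid" is visibly conserved: the new ranks still sum to 1.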

Example
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
These are called flow equations.

Solving the equations
In the last example we have 3 equations, 3 unknowns and no constants.
This means there is no unique solution: any multiple of a solution is again a solution.
We need an additional constraint to enforce a unique solution: the ranks need to sum up to 1, i.e. $\sum_i r_i = 1$.
For our small graph this means:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
$r_y + r_a + r_m = 1$
So we have 3 unknowns and 4 equations, solvable through elimination.
Solution: $r_y = 2/5$, $r_a = 2/5$, $r_m = 1/5$
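The elimination for the small graph can be written out as follows. This is a worked sketch using only the flow equations and the sum constraint stated above.

```latex
\begin{align*}
r_y &= \tfrac{1}{2} r_y + \tfrac{1}{2} r_a
      &&\Rightarrow\quad r_y = r_a \\
r_m &= \tfrac{1}{2} r_a \\
r_y + r_a + r_m &= r_a + r_a + \tfrac{1}{2} r_a = \tfrac{5}{2} r_a = 1
      &&\Rightarrow\quad r_a = \tfrac{2}{5} \\
r_y &= \tfrac{2}{5}, \qquad r_m = \tfrac{1}{5}
\end{align*}
```

As a check, the remaining flow equation holds as well: $r_y/2 + r_m = 1/5 + 1/5 = 2/5 = r_a$.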

Matrix Formulation
Elimination does not scale to large graphs, so we need a different formulation of the problem.
Matrix formulation: stochastic adjacency matrix $M$.
Let page $i$ have $d_i$ out-links. If $i \to j$, then $M_{ji} = 1/d_i$, else $M_{ji} = 0$.
$M$ is a column stochastic matrix, i.e., its columns sum up to 1.
Rank vector $r$: a vector with one entry per page. The length of $r$ is the number of pages in our sample, and $r_i$ is the PageRank score of page $i$.
$\sum_i r_i = 1$ due to the constraint added to the flow equations.
Thus, we can write the flow equations from before as a matrix-vector product: $r = M \cdot r$
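A minimal sketch of how $M$ could be built for the three-page example graph (y, a, m). The helper names `stochastic_matrix` and `mat_vec` are assumptions made for this illustration, and the code assumes that no page links to the same target twice.

```python
from fractions import Fraction

def stochastic_matrix(pages, out_links):
    """Return M as a list of rows with M[j][i] = 1/d_i if i links to j, else 0."""
    index = {page: k for k, page in enumerate(pages)}
    n = len(pages)
    M = [[Fraction(0)] * n for _ in range(n)]
    for i, targets in out_links.items():
        for j in targets:
            M[index[j]][index[i]] = Fraction(1, len(targets))
    return M

def mat_vec(M, r):
    """Matrix-vector product M . r."""
    return [sum(m_ji * r_i for m_ji, r_i in zip(row, r)) for row in M]

pages = ["y", "a", "m"]
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
M = stochastic_matrix(pages, out_links)

n = len(pages)
column_sums = [sum(M[j][i] for j in range(n)) for i in range(n)]
r = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]

print(column_sums)         # each column of M sums to 1
print(mat_vec(M, r) == r)  # this r is a fixed point of r = M . r
```

Column $i$ of $M$ describes where the rank of page $i$ goes, which is why the columns (and not the rows) sum to 1.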

Matrix Formulation: Example
Flow equation as a sum: $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
Flow equations in matrix form as a matrix-vector product: $r = M \cdot r$
Let's assume page $i$ has links to 3 pages, i.e. $d_i = 3$, and one of the pages it links to is page $j$. Then $M_{ji} = 1/3$, so page $i$ contributes $r_i/3$ to $r_j$.

Matrix Formulation
The recursive matrix equation $r = M \cdot r$ resembles an eigenvalue problem.
Definition: a vector $x$ is an eigenvector with corresponding eigenvalue $\lambda$ if they solve $A x = \lambda x$.
Note that $A$ is given, and we aim to compute $x$ and $\lambda$.

Matrix Formulation
The equation $r = M \cdot r$ looks similar to $A x = \lambda x$.
In other words: the rank vector $r$ is an eigenvector of the stochastic web matrix $M$.
What is the value of $\lambda$?
The rank vector $r$ is not just any eigenvector of $M$ but its principal eigenvector, with corresponding eigenvalue $\lambda = 1$.
Reason: $r$ has unit length in the $L_1$ norm (its coordinates are nonnegative and sum to 1; such a vector is also called a stochastic vector), and each column of $M$ sums up to 1 ($M$ is column stochastic).
This means $\lVert M \cdot r \rVert_1 \le 1$. Hence, the largest eigenvalue of $M$ is 1.
The equation can be solved efficiently for $r$ using the power iteration method.
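One way to see why multiplying by $M$ keeps a stochastic vector stochastic, assuming only that $M$ is column stochastic and that the entries of $r$ are nonnegative and sum to 1:

```latex
\begin{align*}
\sum_j (M r)_j
  = \sum_j \sum_i M_{ji}\, r_i
  = \sum_i r_i \sum_j M_{ji}
  = \sum_i r_i \cdot 1
  = 1
\end{align*}
```

The inner sum $\sum_j M_{ji}$ is the sum of column $i$ of $M$, which is 1 by definition, so no rank is created or lost in a multiplication step.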

Example:
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$

Power Iteration
We assume $N$ web pages.
Initialize $r^{(0)} = [1/N, \dots, 1/N]^T$.
Iterate $r^{(t+1)} = M \cdot r^{(t)}$.
Stop when $\lVert r^{(t+1)} - r^{(t)} \rVert_1 < \epsilon$.
Algorithm: set $r_j = 1/N$ for every page $j$.
First step: $r'_j = \sum_{i \to j} \frac{r_i}{d_i}$
Second step: $r = r'$. Go to the first step until convergence.
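The steps above can be sketched as follows. The function name `power_iteration`, the default tolerance and the iteration cap are assumptions for this illustration; the matrix is the column-stochastic $M$ of the three-page example graph (y, a, m).

```python
def power_iteration(M, eps=1e-10, max_iter=1000):
    """Iterate r <- M . r from the uniform vector until the L1 change is below eps."""
    n = len(M)
    r = [1.0 / n] * n
    for _ in range(max_iter):
        r_next = [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]
        # Stop criterion: L1 distance between two successive rank vectors
        if sum(abs(x - y) for x, y in zip(r_next, r)) < eps:
            return r_next
        r = r_next
    return r

# Columns correspond to y, a, m; M[j][i] = 1/d_i if i links to j
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]

r = power_iteration(M)
print([round(x, 4) for x in r])
```

For this graph the result agrees, up to floating-point error, with the solution obtained by elimination.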

Power Iteration: Example
$r_y = r_y/2 + r_a/2$
$r_a = r_y/2 + r_m$
$r_m = r_a/2$
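Worked by hand, the first iterations for these three equations look as follows, assuming the uniform start $r^{(0)} = [1/3, 1/3, 1/3]^T$ from the power iteration slide:

```latex
\[
\begin{pmatrix} r_y \\ r_a \\ r_m \end{pmatrix}:\quad
\begin{pmatrix} 1/3 \\ 1/3 \\ 1/3 \end{pmatrix} \to
\begin{pmatrix} 1/3 \\ 1/2 \\ 1/6 \end{pmatrix} \to
\begin{pmatrix} 5/12 \\ 1/3 \\ 1/4 \end{pmatrix} \to
\begin{pmatrix} 3/8 \\ 11/24 \\ 1/6 \end{pmatrix} \to \cdots \to
\begin{pmatrix} 2/5 \\ 2/5 \\ 1/5 \end{pmatrix}
\]
```

Each column sums to 1, and the sequence approaches the solution of the flow equations.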

Random Walk Interpretation: An equivalent definition of PageRank
So far, we computed PageRank using flow equations and in terms of the matrix formulation.
Now, we will look at an interpretation of what PageRank scores reflect.
Random walk interpretation: PageRank scores are equivalent to the probability distribution of a random walker on the graph.

Random Walk Interpretation: An equivalent definition of PageRank
Consider someone who is randomly browsing a network of Web pages: a random web surfer.
The surfer starts at time $t$ by choosing a page $i$ at random, picking each page with equal probability.
At time $t + 1$, the surfer picks an out-going link from page $i$ uniformly at random and follows it.
The surfer ends up on some page $j$ linked from $i$.
The process repeats indefinitely.
If page $j$ has no out-going links, the surfer stays.
This is called a random walk on the network.

Random Walk Interpretation
With what probability is a random walker at a given page at time $t$?
Let $p(t)$ be a vector whose coordinate $i$ denotes the probability that the surfer is at page $i$ at time $t$.
Thus, $p(t)$ gives us a probability distribution over pages.

Random Walk Interpretation
Where is the random walker going to be at time $t + 1$?
The random walker follows an out-going link uniformly at random. Thus: $p(t + 1) = M \cdot p(t)$
Suppose the random walk reaches a state where $p(t + 1) = M \cdot p(t) = p(t)$. Then $p(t)$ is called a stationary distribution of the random walk.
Remember: the rank vector satisfies $r = M \cdot r$.
In other words: $r$ is a stationary distribution for the random walk.
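The random surfer can also be simulated directly. The sketch below is an illustration under stated assumptions: the function name `simulate_surfer`, the fixed seed and the number of steps are arbitrary choices, and the graph is the three-page example (y, a, m). For this graph, the long-run visit frequencies should come out close to the PageRank scores (0.4, 0.4, 0.2).

```python
import random

def simulate_surfer(out_links, steps, seed=0):
    """Follow random out-links for `steps` steps and return the visit frequencies."""
    rng = random.Random(seed)
    pages = sorted(out_links)
    page = rng.choice(pages)             # start on a uniformly random page
    visits = {p: 0 for p in pages}
    for _ in range(steps):
        visits[page] += 1
        if out_links[page]:              # on a dead end the surfer stays put
            page = rng.choice(out_links[page])
    return {p: count / steps for p, count in visits.items()}

out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
freq = simulate_surfer(out_links, steps=200_000)
print(freq)
```

The simulation only approximates the stationary distribution, so the printed values vary slightly with the seed and the number of steps.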

What does that mean?
PageRank scores correspond to the probability that the random walker is at a given node at a given time step.
A side note: random walks are effectively Markov processes.
Why is that important? Because for graphs that satisfy certain conditions, the stationary distribution is unique and will be reached eventually, regardless of the initial probability distribution at time $t = 0$.
This means that there are conditions under which the PageRank vector $r$ is unique and will be reached regardless of initialization.

Problems with real web graphs
Spider traps: a group of one or more pages that have no links out of the group will eventually accumulate all the importance of the Web.
What is the problem with that? Eventually, the group absorbs all the PageRank scores.
Dead ends: some pages have no out-links.
What is the problem with that? Dead ends make PageRank leak out.

Spider Traps: Example
We run the power iteration.
Remember: under certain conditions, the PageRank vector $r$ is unique and the stationary distribution is reached regardless of how we initialize.
What happens if we initialize $a = 1$ and $b = 0$?
The two scores flip at every step: the score is passed from a to b and back again.
Power iteration does not converge: this is the spider trap problem.
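The flipping behaviour is easy to reproduce. The slide figure is not part of this transcription, so the graph below is an assumption: two pages a and b that link only to each other. The helper name `step` is also chosen just for this sketch.

```python
def step(M, r):
    """One power-iteration step r <- M . r."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]

# Assumed graph: a -> b and b -> a (rows and columns ordered a, b)
M = [[0, 1],
     [1, 0]]

r = [1, 0]           # initialize a = 1, b = 0
history = [r]
for _ in range(4):
    r = step(M, r)
    history.append(r)

print(history)
```

The rank vector alternates between (1, 0) and (0, 1), so the stop criterion of the power iteration is never met for this start vector.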

Dead End: Example
We run the power iteration.
Remember: under certain conditions, the PageRank vector $r$ is unique and the stationary distribution is reached regardless of how we initialize.
What happens if we initialize $a = 1$ and $b = 0$?
In the first multiplication with matrix $M$, the scores are flipped.
In the second step, the score 1 gets lost, as b can't pass it on.
This is also called leaking out.
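The leak can be traced step by step. As the slide figure is not included in this transcription, the graph is an assumption: page a links to page b, and b has no out-links. The helper name `step` is chosen for this sketch only.

```python
def step(M, r):
    """One power-iteration step r <- M . r."""
    n = len(r)
    return [sum(M[j][i] * r[i] for i in range(n)) for j in range(n)]

# Assumed graph: a -> b, b is a dead end (rows and columns ordered a, b).
# The column of b is all zeros, so M is not column stochastic.
M = [[0, 0],
     [1, 0]]

r0 = [1, 0]          # initialize a = 1, b = 0
r1 = step(M, r0)     # the score moves from a to b
r2 = step(M, r1)     # b cannot pass it on: the score is lost

totals = [sum(r0), sum(r1), sum(r2)]
print(r0, r1, r2)
print(totals)
```

The total rank drops from 1 to 0 after two steps, which is what "leaking out" refers to.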

Random Teleports as solution for spider traps
At each step, the random surfer has 2 options:
With probability $\beta$, follow a link at random.
With probability $1 - \beta$, jump to some random page.
In practice, $\beta$ is set between 0.8 and 0.9.
This enables the random walker to teleport out of a spider trap within a few time steps.

What about dead ends?
Problem: pages with out-degree 0. Their PageRank does not get distributed (it "leaks out").
What is apparent if we look at matrix $M$? It is not stochastic anymore!
Why? Node m has out-degree 0, so its column in $M$ sums to 0 instead of 1.

Solution: Always teleport
Power iteration: with dead ends, all rank vectors converge to zero.
Solution: always teleport, i.e. follow random teleport links with probability 1 from dead ends.
Update matrix $M$: replace the all-zero column of each dead end with entries $1/N$.
Why this update? In our example graph with 3 nodes, the random walker teleports out of the dead end and lands on each node of the network with probability 1/3.
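A minimal sketch of this update, assuming a three-page graph (y, a, m) in which m is a dead end. The function name `fix_dead_ends` and the concrete matrix are assumptions for illustration, since the slide's matrix is not part of this transcription.

```python
from fractions import Fraction

def fix_dead_ends(M):
    """Return a copy of M in which every all-zero column is replaced by 1/N entries."""
    n = len(M)
    fixed = [row[:] for row in M]
    for i in range(n):
        if all(M[j][i] == 0 for j in range(n)):   # column i belongs to a dead end
            for j in range(n):
                fixed[j][i] = Fraction(1, n)
    return fixed

half = Fraction(1, 2)
# Assumed example: y -> y, a ; a -> y, m ; m has no out-links
M = [[half, half, 0],
     [half, 0,    0],
     [0,    half, 0]]

fixed = fix_dead_ends(M)
column_sums = [sum(fixed[j][i] for j in range(3)) for i in range(3)]
print(fixed)
print(column_sums)
```

After the update every column sums to 1 again, so the matrix is column stochastic and no rank leaks out.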

Why do teleports help?
Teleports help us make matrix $M$ stochastic, as we have seen for dead ends: whenever all the entries of a column in $M$ are 0, we can replace them with $1/N$, where $N$ is the number of pages.
Teleports help us make matrix $M$ aperiodic: the random walker can teleport out of loops.
Teleports help us make matrix $M$ irreducible, i.e. from any state there is a non-zero probability of reaching any other state.
Teleports help us add random jumps to matrix $M$.

Google's Solution: Random Jumps
Idea: combine all of this and make matrix $M$ stochastic, aperiodic and irreducible.
At each step, the random surfer has 2 options:
With probability $\beta$, follow a link at random.
With probability $1 - \beta$, jump to some random page.
PageRank equation [Brin-Page, 1998]:
$$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{n} \qquad (2)$$

Google Matrix
PageRank equation:
$$r_j = \sum_{i \to j} \beta \frac{r_i}{d_i} + (1 - \beta) \frac{1}{n} \qquad (3)$$
Google matrix $A$:
$$A = \beta M + (1 - \beta) \frac{1}{n} e \cdot e^T \qquad (4)$$
where $e$ is a vector of all 1s.
$A$ is stochastic, aperiodic and irreducible, so the iteration $r^{(t+1)} = A \cdot r^{(t)}$ converges.
What is a good value for $\beta$? In practice, $\beta$ is set between 0.8 and 0.9; with $\beta = 0.8$ the surfer makes about 5 steps on average and then jumps.
What would $\beta = 0$ mean? The random walker would jump all the time, and all nodes in the network would have the same PageRank.
And $\beta = 1$? No random jumps.
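Equation (4) can be tried out on a tiny graph. The sketch below assumes a two-page loop (a links to b, b links to a), on which the plain iteration with $M$ oscillates; the function names `google_matrix` and `iterate` and the choice $\beta = 0.8$ are assumptions for this illustration, and $M$ is assumed to be column stochastic already.

```python
def google_matrix(M, beta=0.8):
    """Build A = beta * M + (1 - beta) / n * (e e^T) for a column-stochastic M."""
    n = len(M)
    return [[beta * M[j][i] + (1.0 - beta) / n for i in range(n)] for j in range(n)]

def iterate(A, r, steps):
    """Apply r <- A . r a fixed number of times."""
    n = len(r)
    for _ in range(steps):
        r = [sum(A[j][i] * r[i] for i in range(n)) for j in range(n)]
    return r

# Assumed two-page loop: a -> b, b -> a (rows and columns ordered a, b)
M = [[0.0, 1.0],
     [1.0, 0.0]]

A = google_matrix(M, beta=0.8)
r = iterate(A, [1.0, 0.0], steps=100)
print(A)
print(r)
```

Even when started from (1, 0), the iteration with $A$ settles near equal scores for both pages instead of flipping forever.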

A few words on actual computation
Computing PageRank involves an expensive matrix-vector multiplication.
Matrix $A$ has no zero entries, i.e. it is a dense matrix, and it is huge for the Web.
Idea: rearrange the PageRank equation so that it features matrix $M$ (which is sparse) and not $A$.
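Since the entries of $r$ sum to 1, we have $e^T r = 1$, and the product with the Google matrix can be rearranged as $A \cdot r = \beta M \cdot r + \frac{1 - \beta}{n} e$. The sketch below uses this form and only ever touches the existing links. The function name `pagerank_sparse` and the adjacency-list representation are assumptions; spreading the "missing" rank evenly is one common way to also cover dead ends, not necessarily the one used in the lecture.

```python
def pagerank_sparse(out_links, beta=0.8, eps=1e-10, max_iter=1000):
    """PageRank with teleports, computed from adjacency lists instead of a dense A."""
    pages = list(out_links)
    n = len(pages)
    r = {p: 1.0 / n for p in pages}
    for _ in range(max_iter):
        r_next = {p: 0.0 for p in pages}
        # Sparse part: beta * M . r, touching each link exactly once
        for i, targets in out_links.items():
            for j in targets:
                r_next[j] += beta * r[i] / len(targets)
        # Rank not passed along links (teleports plus any dead-end leak)
        # is spread evenly, so the scores keep summing to 1.
        missing = 1.0 - sum(r_next.values())
        for p in pages:
            r_next[p] += missing / n
        if sum(abs(r_next[p] - r[p]) for p in pages) < eps:
            return r_next
        r = r_next
    return r

# Three-page example graph: y -> y, a ; a -> y, m ; m -> a
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
r = pagerank_sparse(out_links)
print(r)
```

The work per iteration is proportional to the number of links plus the number of pages, instead of the $n^2$ entries of the dense matrix $A$.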

Limitations of PageRank
PageRank measures the general popularity / importance of a page. Why is that a problem?
It neglects topic-specific authorities. Solution: Topic-specific PageRank.
It is susceptible to link spam, i.e. link structures created to boost PageRank. Solution: TrustRank.

Topic-Specific PageRank
Intuition: let the random surfer teleport to a random page that is chosen non-uniformly.
This helps us derive PageRank values that are tailored to specific topic needs.
E.g. a soccer fan might want pages on sports ranked higher.
So, the random surfer should teleport to a random page that is about sports.
Caveat: this requires a collection of pages that are categorized as sports.
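A minimal sketch of this idea: the only change to the random-jump model is that the surfer teleports uniformly into a set of topic pages instead of into all pages. The function name `topic_pagerank`, the fixed number of iterations and the choice of page m as the "sports" page are assumptions for illustration, and the graph is assumed to have no dead ends.

```python
def topic_pagerank(out_links, topic_pages, beta=0.8, iterations=100):
    """PageRank where teleports land uniformly on `topic_pages` only."""
    pages = list(out_links)
    teleport = {p: (1.0 / len(topic_pages) if p in topic_pages else 0.0) for p in pages}
    r = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Teleport part: (1 - beta) goes to the topic pages only
        r_next = {p: (1.0 - beta) * teleport[p] for p in pages}
        # Link-following part: beta * r_i / d_i along every link i -> j
        for i, targets in out_links.items():
            for j in targets:
                r_next[j] += beta * r[i] / len(targets)
        r = r_next
    return r

# Three-page example graph; page "m" plays the role of the only sports page.
out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
uniform = topic_pagerank(out_links, topic_pages={"y", "a", "m"})
sports = topic_pagerank(out_links, topic_pages={"m"})
print(uniform)
print(sports)
```

With teleports restricted to m, the score of m rises compared to the uniform case, while the link structure itself is unchanged.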

Some Practical Examples for PageRank
PageRank on protein interaction graphs [2]
Social Media Analysis [3]
Altmetrics and Analysis of Readership Data on Mendeley [4]
[2] http://rsos.royalsocietypublishing.org/content/2/4/140252.abstract
[3] http://www.cs.columbia.edu/~ecj2122/research/social_higgs/jubb_facheris_discovery_of_the_higgs.pdf
[4] http://arxiv.org/abs/1504.07482

Summary
We have learned about:
PageRank: the flow model and the matrix formulation
Random walks
Problems with PageRank: spider traps and dead ends
Teleportation and random jumps as a solution
Some applications beyond web search and ranking

Thanks for your attention. Questions?
elisabeth.lex@tugraz.at
Slides use figures and content from Mining of Massive Datasets by Jure Leskovec, Anand Rajaraman, Jeff Ullman. See http://www.mmds.org/