Introduction to Data Mining

Similar documents
CS425: Algorithms for Web Scale Data

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Slides based on those in:

Analysis of Large Graphs: TrustRank and WebSpam

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

How to organize the Web?

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

3 announcements: Thanks for filling out the HW1 poll HW2 is due today 5pm (scans must be readable) HW3 will be posted today

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Node Importance and Neighborhoods

Big Data Analytics CSCI 4030

Part 1: Link Analysis & Page Rank

Big Data Analytics CSCI 4030

Fast Random Walk with Restart: Algorithms and Applications U Kang Dept. of CSE Seoul National University

Information Networks: PageRank

INTRODUCTION TO DATA SCIENCE. Link Analysis (MMDS5)

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

CS6220: DATA MINING TECHNIQUES

Einführung in Web und Data Science Community Analysis. Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme

Lecture 8: Linkage algorithms and web search

CS345a: Data Mining Jure Leskovec and Anand Rajaraman Stanford University

Brief (non-technical) history

Unit VIII. Chapter 9. Link Analysis

Link Analysis in Web Mining

Web consists of web pages and hyperlinks between pages. A page receiving many links from other pages may be a hint of the authority of the page

Pagerank Scoring. Imagine a browser doing a random walk on web pages:

Link Analysis. Chapter PageRank

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Introduction to Data Mining

Link Analysis. CSE 454 Advanced Internet Systems University of Washington. 1/26/12 16:36 1 Copyright D.S.Weld

Some Interesting Applications of Theory. PageRank Minhashing Locality-Sensitive Hashing

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 21 Link analysis

Introduction to Information Retrieval

Information Retrieval. Lecture 11 - Link analysis

Mining Web Data. Lijun Zhang

PEGASUS: A peta-scale graph mining system Implementation. and observations. U. Kang, C. E. Tsourakakis, C. Faloutsos

CS6200 Information Retreival. The WebGraph. July 13, 2015

An Improved Computation of the PageRank Algorithm 1

Graph and Web Mining - Motivation, Applications and Algorithms PROF. EHUD GUDES DEPARTMENT OF COMPUTER SCIENCE BEN-GURION UNIVERSITY, ISRAEL

Administrative. Web crawlers. Web Crawlers and Link Analysis!

Lecture 8: Linkage algorithms and web search

Jeffrey D. Ullman Stanford University

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Link Analysis. Hongning Wang

Mining Web Data. Lijun Zhang

Bruno Martins. 1 st Semester 2012/2013

Information Retrieval Lecture 4: Web Search. Challenges of Web Search 2. Natural Language and Information Processing (NLIP) Group

Information Retrieval and Web Search

CS535 Big Data Fall 2017 Colorado State University 9/5/2017. Week 3 - A. FAQs. This material is built based on,

Web search before Google. (Taken from Page et al. (1999), The PageRank Citation Ranking: Bringing Order to the Web.)

Introduction to Data Mining

Supervised Random Walks

COMP 4601 Hubs and Authorities

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

Fast Nearest Neighbor Search on Large Time-Evolving Graphs

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

Searching the Web [Arasu 01]

COMP Page Rank

A brief history of Google

Link analysis. Query-independent ordering. Query processing. Spamming simple popularity

Link Analysis and Web Search

CS 6604: Data Mining Large Networks and Time-Series

Mining The Web. Anwar Alhenshiri (PhD)

Relational Retrieval Using a Combination of Path-Constrained Random Walks

Link Structure Analysis

PegasusN. User guide. August 29, Overview General Information License... 1

Web Structure Mining using Link Analysis Algorithms

WEB STRUCTURE MINING USING PAGERANK, IMPROVED PAGERANK AN OVERVIEW

Introduction In to Info Inf rmation o Ret Re r t ie i v e a v l a LINK ANALYSIS 1

Information Retrieval and Web Search Engines

Advanced Computer Architecture: A Google Search Engine

Information Retrieval (IR) Introduction to Information Retrieval. Lecture Overview. Why do we need IR? Basics of an IR system.

Lec 8: Adaptive Information Retrieval 2

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Large-Scale Networks. PageRank. Dr Vincent Gramoli Lecturer School of Information Technologies

Link Analysis in the Cloud

Big Data Analytics CSCI 4030

CC PROCESAMIENTO MASIVO DE DATOS OTOÑO Lecture 7: Information Retrieval II. Aidan Hogan

Recent Researches on Web Page Ranking

Lecture Notes: Social Networks: Models, Algorithms, and Applications Lecture 28: Apr 26, 2012 Scribes: Mauricio Monsalve and Yamini Mule

CS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on,

PageRank and related algorithms

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

PV211: Introduction to Information Retrieval

arxiv: v1 [cs.si] 18 Oct 2017

Multimedia Content Management: Link Analysis. Ralf Moeller Hamburg Univ. of Technology

Information Retrieval

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Proximity Prestige using Incremental Iteration in Page Rank Algorithm

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Lecture 27: Learning from relational data

Collaborative filtering based on a random walk model on a graph

TPA: Fast, Scalable, and Accurate Method for Approximate Random Walk with Restart on Billion Scale Graphs

~ Ian Hunneybell: WWWT Revision Notes (15/06/2006) ~

F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google

TupleRank: Ranking Relational Databases using Random Walks on Extended K-partite Graphs

Link analysis in web IR CE-324: Modern Information Retrieval Sharif University of Technology

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

Learning to Rank Networked Entities

Transcription:

Introduction to Data Mining Lecture #10: Link Analysis-2 Seoul National University 1

In This Lecture Pagerank: Google formulation Make the solution to converge Computing Pagerank for very large graphs Pagerank vector and/or stochastic matrix do not fit in the memory Topic specific Pagerank 2

Outline PageRank: Google Formulation Computing PageRank Topic-Specific PageRank Measuring Proximity in Graphs 3

PageRank: Three Questions r ( t+ 1) j = i j r ( t) i d i or equivalently r = Mr Does this converge? Does it converge to what we want? Are results reasonable? 4

Does this converge? a b r ( t+ 1) j = i j r ( t) i d i Example: = r a 1 0 1 0 r b 0 1 0 1 Iteration 0, 1, 2, 5

Does it converge to what we want? a b r ( t+ 1) j = i j r ( t) i d i Example: = r a 1 0 0 0 r b 0 1 0 0 Iteration 0, 1, 2, 6

PageRank: Problems 2 problems: (1) Some pages are dead ends (have no out-links) Random walk has nowhere to go to Such pages cause importance to leak out Dead end (2) Spider traps: (all out-links are within the group) Random walked gets stuck in a trap And eventually spider traps absorb all importance 7

Problem: Spider Traps Power Iteration: Set rr jj = 1 rr jj = ii jj rr ii dd ii And iterate y a m m is a spider trap y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 1 r y = r y /2 + r a /2 r a = r y /2 r m = r a /2 + r m Example: Iteration 0, 1, 2, r y 1/3 2/6 3/12 5/24 0 r a = 1/3 1/6 2/12 3/24 0 r m 1/3 3/6 7/12 16/24 1 All the PageRank score gets trapped in node m. 8

Solution: Teleports! The Google solution for spider traps: At each time step, the random surfer has two options With prob. β, follow a link at random With prob. 1-β, jump to some random page Common values for β are in the range 0.8 to 0.9 Surfer will teleport out of spider trap within a few time steps y y a m a m 9

Problem: Dead Ends Power Iteration: Set rr jj = 1 rr jj = ii jj rr ii dd ii And iterate Example: Iteration 0, 1, 2, r y 1/3 2/6 3/12 5/24 0 r a = 1/3 1/6 2/12 3/24 0 r m 1/3 1/6 1/12 2/24 0 a y m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 r y = r y /2 + r a /2 r a = r y /2 r m = r a /2 Here the PageRank leaks out since the matrix is not column stochastic. 10

Solution: Always Teleport! Teleports: Follow random teleport links with probability 1.0 from dead-ends Adjust matrix accordingly y y a m a m y a m y ½ ½ 0 a ½ 0 0 m 0 ½ 0 y a m y ½ ½ ⅓ a ½ 0 ⅓ m 0 ½ ⅓ 11

Why Teleports Solve the Problem? Why are dead-ends and spider traps a problem and why do teleports solve the problem? Spider-traps are not a problem, but with traps PageRank scores are not what we want Solution: Never get stuck in a spider trap by teleporting out of it in a finite number of steps Dead-ends are a problem The matrix is not column stochastic so our initial assumptions are not met Solution: Make matrix column stochastic by always teleporting when there is nowhere else to go 12

Solution: Random Teleports Google s solution: At each step, random surfer has two options: With probability β, follow a link at random With probability 1-β, jump to some random page PageRank equation [Brin-Page, 98] rr jj = ii jj ββ rr ii dd ii + (1 ββ) 1 NN This formulation assumes that MM has no dead ends. We can either preprocess matrix MM to remove all dead ends or explicitly follow random teleport links with probability 1.0 from dead-ends. d i out-degree of node i 13

The Google Matrix PageRank equation [Brin-Page, 98] rr jj = In matrix form: ii jj ββ rr ii dd ii + (1 ββ) 1 NN rr = ββmmmm + 1 ββ [ 1 NN ] NN NNrr = {ββmm + 1 ββ [ 1 NN ] NN NN}rr This is called the Google Matrix [1/N] NxN N by N matrix where all entries are 1/N 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 E.g., for N=3 14

The Google Matrix PageRank equation [Brin-Page, 98] rr jj = ii jj The Google Matrix A: ββ rr ii dd ii + (1 ββ) 1 NN AA = ββ MM + 1 ββ 1 NN NN NN [1/N] NxN N by N matrix where all entries are 1/N We have a recursive problem: rr = AA rr And the Power method still works! What is β? In practice β =0.8,0.9 (make ~5 steps on avg., jump) 15

Random Teleports (ββ = 0.8) y 7/15 M 1/2 1/2 0 0.8 1/2 0 0 + 0.2 0 1/2 1 [1/N] NxN 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 a m 13/15 y 7/15 7/15 1/15 a 7/15 1/15 1/15 m 1/15 7/15 13/15 A y a = m 1/3 1/3 1/3 0.33 0.20 0.46 0.24 0.20 0.52 0.26 0.18 0.56... 7/33 5/33 21/33 16

Outline PageRank: Google Formulation Computing PageRank Topic-Specific PageRank Measuring Proximity in Graphs 17

Computing Page Rank Key step is matrix-vector multiplication r new = A r old Easy if we have enough main memory to hold A, r old, r new Say N = 1 billion pages We need 4 bytes for each entry (say) Total 2 billion entries for 2 vectors(r old, r new ): ~ 8GB Matrix A has N 2 entries A = N 2 = 10 18 (1000 Peta) is a large number! We need to exploit sparsity of M A = β M + (1-β) [1/N] NxN ½ ½ 0 ½ 0 0 0 ½ 1 0.8 +0.2 = 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 1/3 7/15 7/15 1/15 7/15 1/15 1/15 1/15 7/15 13/15 18

Sparse Matrix Formulation rr = AA rr, where AA = ββ MM + 1 ββ 1 NN NN NN Main idea: do not construct A explicitly Specifically: rr = ββ MM rr + (11 ββ) 11 NN NN NN rr = ββ MM rr + (11 ββ) 11 NN NN Note: Here we assume M has no dead-ends. [x] N a vector of length N with all entries x 19

Sparse Matrix Formulation The PageRank equation rr = ββββ rr + 11 ββ NN where [(1-β)/N] N is a vector with all N entries (1-β)/N NN M is a sparse matrix! 10 links per node, approx 10N entries So in each iteration, we need to: Compute r new = β M r old Add a constant value (1-β)/N to each entry in r new Note if M contains dead-ends then jj rr nnnnnn jj < 11 and we also have to renormalize r new so that it sums to 1 20

PageRank: The Complete Algorithm Input: Graph GG and parameter ββ Directed graph GG (can have spider traps and dead ends) Parameter ββ Output: PageRank vector rr nnnnnn Set: rr jj oooooo = 1 NN repeat until convergence: jj rr jj nnnnnn rr jj oooooo > εε jj: rr nnnnnn jj = ii jj ββ rr ii oooooo dd ii Now re-insert the leaked PageRank: jj: rr nnnnnn jj = rr jj nnnnnn + 11 SS NN rr oooooo = rr nnnnnn where: SS = jj rrr jj nnnnnn If the graph has no dead-ends then the amount of leaked PageRank is 1-β. But since we have dead-ends the amount of leaked PageRank may be larger. We have to explicitly account for it by computing S. 21

Sparse Matrix Encoding Encode sparse matrix using only nonzero entries Space proportional roughly to number of links Assuming N = 1 billion, 10N edges would require 4*10*1 billion = 40GB Still won t fit in memory, but will fit on disk source node degree destination nodes 0 3 1, 5, 7 1 5 17, 64, 113, 117, 245 2 2 13, 23 22

Basic Algorithm: Update Step Assume enough RAM to fit r new into memory Store r old and matrix M on disk 1 step of power-iteration is: Initialize all entries of r new = (1-β) / N For each page i (of out-degree d i ): Read into memory: i, d i, dest 1,, dest d i, r old (i) For j = 1 d i r new (dest j ) += β r old (i) / d i 0 1 2 3 4 5 6 r new source degree 0 3 1, 5, 6 1 4 17, 64, 113, 117 2 2 13, 23 destination r old 0 1 2 3 4 5 6 23

Basic Algorithm: Update Step Memory Disk source destination =. x = + + +. 24

Analysis Assume enough RAM to fit r new into memory Store r old and matrix M on disk In each iteration, we have to: Read r old and M Write r new back to disk Cost (disk I/O) per iteration of Power method: = 2 r + M Question: What if we could not even fit r new in memory? 25

Block-based Update Algorithm 0 1 2 3 r new src degree destination 0 4 0, 1, 3, 5 1 2 0, 5 2 2 3, 4 M r old 0 1 2 3 4 5 4 5 Break r new into k blocks that fit in memory Scan M and r old once for each block 26

Block-based Update Algorithm Memory Disk source destination =. x = + + +. 27

Analysis of Block Update Similar to nested-loop join in databases Break r new into k blocks that fit in memory Scan M and r old once for each block Total cost: k scans of M and r old Cost per iteration of Power method: k( M + r ) + r = k M + (k+1) r Can we do better? Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration 28

Block-Stripe Update Algorithm Memory Disk destination source. = x = + + +. 29

Block-Stripe Update Algorithm 0 1 2 3 r new src degree destination 0 4 0, 1 1 3 0 2 2 1 0 4 3 2 2 3 r old 0 1 2 3 4 5 4 5 0 4 5 1 3 5 2 2 4 Break M into stripes! Each stripe contains only destination nodes in the corresponding block of r new 30

Block-Stripe Analysis Break M into stripes Each stripe contains only destination nodes in the corresponding block of r new Some additional overhead per stripe But it is usually worth it Cost per iteration of Power method: = M (1+ε) + (k+1) r 31

Limitations in Page Rank Measures generic popularity of a page Biased against topic-specific authorities Solution: Topic-Specific PageRank (next) Uses a single measure of importance Other models of importance Solution: Hubs-and-Authorities Susceptible to Link spam Artificial link topographies created in order to boost page rank Solution: TrustRank 32

Outline PageRank: Google Formulation Computing PageRank Topic-Specific PageRank Measuring Proximity in Graphs 33

Topic-Specific PageRank Instead of generic popularity, can we measure popularity within a topic? Goal: Evaluate Web pages not just according to their popularity, but by how close they are to a particular topic, e.g. sports or history Allows search queries to be answered based on interests of the user Example: Answer the query Jaguar differently depending on whether the user is interested in animal, car, or operating system 34

Topic-Specific PageRank Random walker has a small probability of teleporting at any step Teleport can go to: Standard PageRank: Any page with equal probability To avoid dead-end and spider-trap problems Topic Specific PageRank: A topic-specific set of relevant pages (teleport set) Idea: Bias the random walk When walker teleports, she picks a page from a set S S contains only pages that are relevant to the topic E.g., Yahoo or DMOZ pages for a given topic/query For each teleport set S, we get a different vector r S 35

Matrix Formulation To make this work all we need is to update the teleportation part of the PageRank formulation: AA iiii = ββ MM iiii + (11 ββ)/ SS if ii SS ββ MM iiii + 00 otherwise A is column stochastic! We weighted all pages in the teleport set S equally Could also assign different weights to pages! Compute as for regular PageRank: Multiply by M, then add a vector Maintains sparseness 36

Example: Topic-Specific PageRank 0.5 0.4 1 1 0.2 0.5 0.4 2 3 0.8 1 1 0.8 0.8 4 Suppose S = {1}, β = 0.8 Node Iteration 0 1 2 stable 1 0.25 0.4 0.28 0.294 2 0.25 0.1 0.16 0.118 3 0.25 0.3 0.32 0.327 4 0.25 0.2 0.24 0.261 S={1}, β=0.90: r=[0.17, 0.07, 0.40, 0.36] S={1}, β=0.8: r=[0.29, 0.11, 0.32, 0.26] S={1}, β=0.70: r=[0.39, 0.14, 0.27, 0.19] S={1,2,3,4}, β=0.8: r=[0.13, 0.10, 0.39, 0.36] S={1,2,3}, β=0.8: r=[0.17, 0.13, 0.38, 0.30] S={1,2}, β=0.8: r=[0.26, 0.20, 0.29, 0.23] S={1}, β=0.8: r=[0.29, 0.11, 0.32, 0.26] 37

Discovering the Topic Vector S Create different PageRanks for different topics The 16 DMOZ top-level categories: arts, business, sports, Which topic ranking to use? User can pick from a menu Classify query into a topic Can use the context of the query E.g., query is launched from a web page talking about a known topic History of queries e.g., basketball followed by Jordan User context, e.g., user s bookmarks, 38

Outline PageRank: Google Formulation Computing PageRank Topic-Specific PageRank Measuring Proximity in Graphs 39

[Tong-Faloutsos, 06] Proximity on Graphs I 1 J 1 1 A 1 H 1 B 1 1 D E 1 1 1 F G a.k.a.: Relevance, Closeness, Similarity 40

Good proximity measure? Shortest path is not good: No effect of degree-1 nodes (E, F, G)! Multi-faceted relationships 41

Good proximity measure? Network flow is not good: Does not punish long paths 42

[Tong-Faloutsos, 06] What is good notion of proximity? I 1 J 1 1 A 1 H 1 B 1 1 D E 1 1 1 F G Multiple connections Quality of connection Length, Degree, Weight 43

Random Walk with Restart: Idea RWR: Random walks from a fixed node E.g., k-partite graph Authors with k types of nodes E.g.: Authors, Conferences, Tags Topic Specific PageRank from node u: teleport set S = {u} Resulting scores measures similarity to node u Problem: Must be done once for each node u Suitable for sub-web-scale applications State-of-the-art: BePI [Jung et al. SIGMOD 17] Conferences Tags 44

RWR: Example IJCAI KDD ICDM SDM AAAI NIPS Philip S. Yu Ning Zhong R. Ramakrishnan M. Jordan Q: What is the most related conference to ICDM? A: Topic-Specific PageRank with teleport set S={ICDM} Conference Author 45

RWR: Example KDD CIKM PKDD SDM PAKDD 0.009 0.011 0.004 0.004 0.008 ICDM 0.007 0.005 0.004 0.005 0.005 ICML ICDE ECML SIGMOD DMKD 46

What You Need to Know Normal PageRank: Teleports uniformly at random to any node All nodes have the same probability of surfer landing there: S = [0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1] Topic-Specific PageRank also known as Personalized PageRank: Teleports to a topic specific set of pages Nodes can have different probabilities of surfer landing there: S = [0.1, 0, 0, 0.2, 0, 0, 0.5, 0, 0, 0.2] Random Walk with Restarts: Topic-Specific PageRank where teleport is always to the same node. S=[0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] 47

Questions? 48