CS246: Mining Massive Datasets Jure Leskovec, Stanford University


CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu

HITS (Hypertext-Induced Topic Selection) is a measure of the importance of pages or documents, similar to PageRank. It was proposed at around the same time as PageRank ('98). Goal: say we want to find good newspapers. Don't just find newspapers; find experts, i.e., people who link in a coordinated way to good newspapers. Idea: links as votes. A page is more important if it has more links. Incoming links? Outgoing links? 2/10/2015 Jure Leskovec, Stanford CS246: Mining Massive Datasets

Hubs and Authorities. Each page has two scores: quality as an expert (hub), the total sum of the votes of the authorities it points to; and quality as content (authority), the total sum of the votes coming from experts. Example scores from the figure: NYT: 10, CNN: 8, WSJ: 9, Yahoo: 3, Ebay: 3. Principle of repeated improvement.

Interesting pages fall into two classes: 1. Authorities are pages containing useful information: newspaper home pages, course home pages, home pages of auto manufacturers. 2. Hubs are pages that link to authorities: lists of newspapers, course bulletins, lists of US auto manufacturers.

Each page starts with hub score 1. Authorities collect their votes. (Note: this is an idealized example; in reality the graph is not bipartite and each page has both a hub and an authority score.)

Sum of the hub scores of the nodes pointing to NYT. Each page starts with hub score 1; authorities collect their votes. (Note: this is an idealized example; in reality the graph is not bipartite and each page has both scores.)

Sum of the authority scores of the nodes that the node points to. Hubs collect the authority scores. (Again, an idealized example: the real graph is not bipartite and each page has both scores.)

Authorities again collect the hub scores. (Again, an idealized example: the real graph is not bipartite and each page has both scores.)

A good hub links to many good authorities; a good authority is linked from many good hubs. Model this using two scores for each node, a hub score and an authority score, represented as vectors h and a.

[Kleinberg '98] Each page i has two scores, an authority score a_i and a hub score h_i. HITS algorithm (n = number of nodes in the graph):
Initialize: a_j = h_j = 1/sqrt(n) for all j.
Then keep iterating until convergence:
Authority: a_i = sum of h_j over all pages j that point to i
Hub: h_i = sum of a_j over all pages j that i points to
Normalize, such that sum_i (a_i)^2 = 1 and sum_i (h_i)^2 = 1.

Worked example (Yahoo, Amazon, M'soft):

A = 1 1 1        A^T = 1 1 0
    1 0 1              1 0 1
    0 1 0              1 1 0

Hub scores (h(Yahoo), h(Amazon), h(M'soft)) across iterations:
(.58, .58, .58) -> (.80, .53, .27) -> (.79, .57, .23) -> ... -> (.788, .577, .211)
Authority scores (a(Yahoo), a(Amazon), a(M'soft)) across iterations:
(.58, .58, .58) -> (.62, .49, .62) -> ... -> (.628, .459, .628)
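The iteration above is easy to reproduce in NumPy. A minimal sketch (our own illustration, not course code), using the slide's adjacency matrix and normalizing scores to unit Euclidean length:

```python
import numpy as np

# Adjacency matrix of the 3-page example
# (rows = Yahoo, Amazon, M'soft; A[i, j] = 1 iff page i links to page j).
A = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

n = A.shape[0]
h = np.ones(n) / np.sqrt(n)   # hub scores, start at 1/sqrt(n) = .58
a = np.ones(n) / np.sqrt(n)   # authority scores

for _ in range(100):
    a = A.T @ h               # authorities collect the votes of hubs
    a /= np.linalg.norm(a)    # keep the sum of squares equal to 1
    h = A @ a                 # hubs collect the votes of authorities
    h /= np.linalg.norm(h)
```

After a few dozen iterations h is close to (0.789, 0.577, 0.211) and a close to (0.628, 0.460, 0.628), matching the slide's limit values up to rounding.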

[Kleinberg '98] HITS converges to a single stable point. Notation: hub vector h and authority vector a, with adjacency matrix A (n x n): A_ij = 1 if i points to j, else 0. Then h_i = sum_{j: i->j} a_j can be rewritten as h_i = sum_j A_ij a_j, so: h = A a. Similarly, a_i = sum_{j: j->i} h_j can be rewritten as a_i = sum_j A_ji h_j, so: a = A^T h.

The hub score of page i is proportional to the sum of the authority scores of the pages it links to: h = λ A a, where λ is a scale factor that keeps the scores normalized. The authority score of page i is proportional to the sum of the hub scores of the pages it is linked from: a = μ A^T h, where μ is a scale factor.

HITS algorithm in vector notation: Set h_i = a_i = 1/sqrt(n). Repeat until convergence: h = A a, a = A^T h, then normalize. Thus, in two steps: a(new) = A^T (A a) and h(new) = A (A^T h). Convergence criterion: the change in h (and in a) between iterations falls below a threshold. So a is updated (in 2 steps) by the matrix A^T A, and h is updated (in 2 steps) by the matrix A A^T: repeated matrix powering.

Starting from h = a = (1/sqrt(n), ..., 1/sqrt(n)), and under reasonable assumptions about A, HITS converges to vectors h* and a*: h* is the principal eigenvector of the matrix A A^T, and a* is the principal eigenvector of the matrix A^T A.

PageRank and HITS are two solutions to the same problem: what is the value of an in-link from u to v? In the PageRank model, the value of the link depends on the links into u. In the HITS model, it depends on the value of the other links out of u. The destinies of PageRank and HITS post-1998 were very different.

We often think of networks as being organized into modules, clusters, communities:


Find micro-markets by partitioning the query-to-advertiser graph. [Andersen, Lang: Communities from seed sets, 2006]

Clusters in the Movies-to-Actors graph: [Andersen, Lang: Communities from seed sets, 2006]

Discovering social circles, circles of trust: [McAuley, Leskovec: Discovering social circles in ego networks, 2012]

The graph is large. Assume the graph fits in main memory: for example, to work with a 200M-node, 2B-edge graph one needs approx. 16 GB of RAM. But the graph is too big for running anything more than linear-time algorithms. We will cover a PageRank-based algorithm for finding dense clusters. The runtime of the algorithm will be proportional to the cluster size (not the graph size!).

Discovering clusters based on seed nodes. Given: a seed node s. Compute (approximate) Personalized PageRank (PPR) around node s (teleport set = {s}). The idea: if s belongs to a nice cluster, the random walk will get trapped inside the cluster.
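As a point of reference, exact PPR can be computed by power iteration. A sketch (hypothetical helper names, not the lecture's code; β is the teleport probability, and mass at dangling nodes is returned to the seed, a common convention the slide does not spell out):

```python
import numpy as np

def personalized_pagerank(adj, s, beta=0.15, iters=200):
    """Personalized PageRank with teleport set {s}, by power iteration.

    adj[u] lists the out-neighbors of node u. Each step teleports to the
    seed s with probability beta and otherwise follows a random out-link.
    Note that every iteration touches the whole graph.
    """
    n = len(adj)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        new = np.zeros(n)
        for u, nbrs in enumerate(adj):
            if nbrs:
                for v in nbrs:
                    new[v] += (1 - beta) * r[u] / len(nbrs)
            else:                      # dangling node: return mass to the seed
                new[s] += (1 - beta) * r[u]
        new[s] += beta                 # teleport mass all goes to the seed
        r = new
    return r

# Triangle {0, 1, 2} with a pendant node 3 hanging off node 2; seed = 0.
adj = [[1, 2], [0, 2], [0, 1, 3], [2]]
ppr = personalized_pagerank(adj, s=0)
```

The scores sum to 1 and are largest at and around the seed, which is exactly why a sweep over them can recover the seed's cluster.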

Algorithm outline: Pick a seed node s of interest. Run PPR with teleport set = {s}. Sort the nodes by decreasing PPR score. Sweep over the nodes and find good clusters: plot cluster quality (lower is better) against node rank in decreasing PPR score; the dips mark good clusters.

Undirected graph partitioning task: divide the vertices into two disjoint groups, A and B = V \ A. Question: how can we define a "good cluster" in G?

What makes a good cluster? Maximize the number of within-cluster connections, and minimize the number of between-cluster connections.

Express cluster quality as a function of the edge cut of the cluster. Cut: the set of edges with only one node in the cluster: cut(A) = sum of w_ij over edges (i, j) with i in A and j not in A. Note: this works for weighted and unweighted (set all w_ij = 1) graphs. In the example graph, cut(A) = 2.

Partition quality: cut score. The quality of a cluster is the weight of the connections pointing outside the cluster. Degenerate case: the optimal cut is then simply the minimum cut. Problem: this only considers external cluster connections and does not consider internal cluster connectivity.

[Shi-Malik] Criterion: Conductance: the connectivity of the group to the rest of the network, relative to the density of the group:
φ(A) = |{(i, j) in E; i in A, j not in A}| / min(vol(A), 2m - vol(A))
where vol(A) = sum of d_i over i in A is the total weight of the edges with at least one endpoint in A, m is the number of edges of the graph, and d_i is the degree of node i. Why use this criterion? It produces more balanced partitions.
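In code the definition is a one-liner per quantity. A small illustrative helper (ours, not from the lecture), for an unweighted undirected graph given as an edge list:

```python
def conductance(edges, A):
    """Conductance phi(A) = cut(A) / min(vol(A), 2m - vol(A)) for an
    undirected, unweighted graph given as a list of (i, j) edges."""
    A = set(A)
    m = len(edges)
    cut = sum(1 for i, j in edges if (i in A) != (j in A))   # edges crossing the cut
    vol = sum((i in A) + (j in A) for i, j in edges)         # sum of degrees inside A
    return cut / min(vol, 2 * m - vol)

# Two triangles joined by a single edge: the natural cluster {1, 2, 3}
# has one crossing edge and volume 2 + 2 + 3 = 7.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(conductance(edges, {1, 2, 3}))  # 0.14285714285714285
```

Here only the bridge edge (3, 4) crosses the cut, and vol({1, 2, 3}) = 7 = 2m - vol, giving phi = 1/7.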


Algorithm outline: pick a seed node s of interest, run PPR with teleport = {s}, sort the nodes by decreasing PPR score, and sweep over the nodes to find good clusters. Sweep: sort nodes in decreasing PPR score; for each prefix A_i = {first i nodes in the order}, compute the conductance φ(A_i). Local minima of φ(A_i) correspond to good clusters.

The whole sweep curve can be computed in linear time: loop over the nodes in PPR order, keep a hash table of the nodes currently in A_i, and update cut(A_i) and vol(A_i) incrementally to compute φ(A_i) = cut(A_i) / min(vol(A_i), 2m - vol(A_i)).
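A sketch of that linear-time sweep (illustrative code, our variable names): when node u enters the prefix set, its edges into the set stop being cut edges and its remaining edges become new ones, so cut and volume can be maintained incrementally.

```python
from collections import defaultdict

def sweep(edges, order):
    """Conductance of every prefix of `order` (nodes sorted by decreasing PPR)."""
    nbrs = defaultdict(list)
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    two_m = 2 * len(edges)
    in_set, cut, vol, phis = set(), 0, 0, []
    for u in order:
        inside = sum(1 for v in nbrs[u] if v in in_set)
        cut += len(nbrs[u]) - 2 * inside   # new cut edges minus absorbed ones
        vol += len(nbrs[u])
        in_set.add(u)
        denom = min(vol, two_m - vol)
        phis.append(cut / denom if denom else float("inf"))
    return phis

# Two triangles joined by the edge (3, 4), swept in the order 1..6.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(sweep(edges, [1, 2, 3, 4, 5, 6])[:3])  # [1.0, 0.5, 0.14285714285714285]
```

The curve dips to 1/7 exactly at the prefix {1, 2, 3}, i.e. the local minimum marks the planted cluster.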

How do we compute Personalized PageRank (PPR) without touching the whole graph? The power method won't work, since each single iteration accesses all nodes of the graph: r = (1 - β) M r + β a, where a is the teleport vector (1 at index s, 0 elsewhere), r is the personalized PageRank vector, and β here denotes the teleport probability. PageRank-Nibble [Andersen, Chung, Lang, '07] is a fast method for computing approximate Personalized PageRank (PPR) with teleport set = {s}: ApproxPageRank(s, β, ε), where s is the seed node, β the teleportation parameter, and ε the approximation-error parameter.

Overview of the approximate PPR: it is based on the lazy random walk, a variant of the random walk that stays put with probability 1/2 at each time step and walks to a random neighbor the other half of the time. Keep track of a residual PPR score q_u for each node u: the residual tells us how well the PPR score of u is approximated, i.e., the gap between the true PageRank p_u of node u and the PageRank estimate r_u at u. If the residual of node u is too big, q_u / d_u >= ε (d_u being the degree of u), then push the walk further (spread part of q_u to each neighbor v with (u, v) in E); else don't touch the node.

A different way to look at PageRank [Jeh & Widom. Scaling Personalized Web Search, 2002]: let p_β(a) be the true PageRank vector with teleport parameter β and teleportation vector (set) a, and let M be the stochastic PageRank transition matrix. Then p_β(a) = β a + (1 - β) M p_β(a). This gives an idea of how to compute PageRank: a node's view of the graph (its PPR score) is the average of its out-neighbors' views plus its own importance.

Idea: maintain an approximate PageRank vector r and its residual PageRank vector q. Start with the trivial approximation r = 0 and q = a (all mass on the seed). Iteratively push PageRank from q to r until q is small, maintaining the invariant p = r + p_β(q) [Andersen, Chung, Lang. Local graph partitioning using PageRank vectors, 2007]. Push(u, r, q) is one step of a lazy random walk from node u (using the old value of q_u throughout): keep a β fraction, r_u <- r_u + β q_u; stay at u with probability 1/2, q_u <- (1 - β) q_u / 2; and spread the remaining (1 - β) q_u / 2 as if a single step of the random walk were applied to u, i.e., for each v such that (u, v) in E: q_v <- q_v + (1 - β) q_u / (2 d_u); return (r, q).

If q_u is large, this means that we have underestimated the importance of node u. Then we want to take some of that gap and give it away, since we know we have too much of it. So we keep β q_u (moving it into r_u) and then give the rest away to our neighbors, to get rid of it. This corresponds to the spreading step q_v += (1 - β) q_u / (2 d_u) for each v such that (u, v) in E. Each node keeps giving away this excess PageRank until we have settled close to the point where all nodes have no, or only a very small, gap.

ApproxPageRank(s, β, ε):
  Set r = [0, ..., 0], q = [0, ..., 1, ..., 0] (1 at index s)
  While there is a vertex u with q_u / d_u >= ε:
    Choose any such vertex u
    Push(u, r, q):
      r_u <- r_u + β q_u
      For each v such that (u, v) in E: q_v <- q_v + (1 - β) q_u / (2 d_u)
      q_u <- (1 - β) q_u / 2
  Return r

Notation: r is the (approximate) PPR vector and r_u the PPR score of u; q is the residual PPR vector and q_u the residual of node u; d_u is the degree of u. The Push step moves a β fraction of the probability from q_u to r_u and then takes one step of a lazy random walk: stay at u with probability 1/2, and spread the remaining 1/2 fraction of q_u as if a single step of the random walk were applied to u.

Worked example on a small graph with seed s and nodes a, b, c (the first pushes are consistent with β = 0.5 and s having the three neighbors a, b, c):

Init:          r = [0, 0, 0, 0],      q = [1, 0, 0, 0]
Push(s, r, q): r = [.5, 0, 0, 0],     q = [.25, .08, .08, .08]
Push(s, r, q): r = [.62, 0, 0, 0],    q = [.06, .10, .10, .10]
Push(a, r, q): r = [.62, .05, 0, 0],  q = [.09, .03, .10, .10]
Push(b, r, q): r = [.62, .05, .05, 0], q = [.09, .05, .03, .10]
...
r = [.57, .19, .14, .09]
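The push procedure can be sketched in a few lines of Python (our own sketch after Andersen-Chung-Lang; the variable names and the work queue are our choices, not the paper's):

```python
from collections import defaultdict, deque

def approx_pagerank(nbrs, s, beta=0.5, eps=1e-4):
    """Push-based approximate PPR with teleport set {s}.

    nbrs: dict node -> list of neighbors (undirected). r holds settled
    PPR mass, q the residual; a node is pushed only while q[u]/d_u >= eps,
    so only the neighborhood of the seed is ever touched.
    """
    r = defaultdict(float)
    q = defaultdict(float)
    q[s] = 1.0
    active = deque([s])                   # nodes whose residual may be too big
    while active:
        u = active.popleft()
        d = len(nbrs[u])
        if d == 0 or q[u] / d < eps:
            continue
        qu = q[u]
        r[u] += beta * qu                 # settle a beta fraction at u
        q[u] = (1 - beta) * qu / 2        # lazy walk: half the rest stays put
        for v in nbrs[u]:                 # other half spreads over u's links
            q[v] += (1 - beta) * qu / (2 * d)
            active.append(v)
        active.append(u)
    return dict(r)

# Star graph: seed s = 0 linked to nodes 1, 2, 3.
nbrs = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
r = approx_pagerank(nbrs, 0)
```

Each push conserves total mass (q loses q_u, r gains β q_u, q regains (1 - β) q_u), so on termination r sums to 1 minus a residual of at most ε per unit of degree, and the settled mass stays concentrated at the seed.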

Runtime: PageRank-Nibble computes an approximate PPR with residual error at most ε in time on the order of 1/(εβ), independent of the graph size; the power method would instead touch every edge in each iteration. Graph-cut approximation guarantee: if there exists a cut of conductance φ (of volume k), then the method finds a cut of conductance on the order of sqrt(φ log k). Details in [Andersen, Chung, Lang. Local graph partitioning using PageRank vectors, 2007] http://www.math.ucsd.edu/~fan/wp/localpartfull.pdf

The smaller the ε, the farther the random walk will spread from the seed node!

[Andersen, Lang: Communities from seed sets, 2006]


Algorithm summary: pick a seed node s of interest; run PPR with teleport set = {s}; sort the nodes by decreasing PPR score; sweep over the nodes and find good clusters (cluster quality, lower is better, plotted against node rank in decreasing PPR score).

[Kumar et al. '99] Searching for small communities in the Web graph. What is the signature of a community/discussion in a Web graph? Use this to define "topics": what the same people on the left talk about on the right (remember HITS!). A dense 2-layer graph. Intuition: many people all talking about the same things.

A more well-defined problem: enumerate complete bipartite subgraphs K_s,t, where K_s,t has s nodes on the left that each link to the same t other nodes on the right. Example: |X| = s = 3 and |Y| = t = 4 gives K_3,4, fully connected from X to Y.

[Agrawal-Srikant '99] Market-basket analysis. Setting: a market with a universe U of n items; baskets: m subsets of U, S_1, S_2, ..., S_m, each a subset of U (S_i is the set of items one person bought); support: a frequency threshold f. Goal: find all subsets T such that T is a subset of at least f of the sets S_i (i.e., the items in T were bought together at least f times). What's the connection between the itemsets and complete bipartite graphs?

[Kumar et al. '99] Frequent itemsets = complete bipartite graphs! How? View each node i as the set S_i of nodes that i points to (e.g., i points to a, b, c, d, so S_i = {a, b, c, d}). A K_s,t is then a set Y of size t that occurs in s of the sets S_i. So to look for a K_s,t, set the frequency threshold to s and find all frequent sets of size t, with s = minimum support (|X| = s) and t = itemset size (|Y| = t).

[Kumar et al. '99] View each node i as the set S_i of nodes i points to (S_i = {a, b, c, d}). Find frequent itemsets with minimum support s and itemset size t; then we have found a K_s,t, i.e., a set Y of size t that occurs in s sets S_i. Say we find a frequent itemset Y = {a, b, c} of support s. Then there are s nodes, say x, y, z, that each link to all of {a, b, c}: X = {x, y, z} and Y = {a, b, c} form the K_s,t.

Example graph on nodes a, ..., f. Itemsets (out-link sets): a = {b,c,d}, b = {d}, c = {b,d,e,f}, d = {e,f}, e = {b,d}, f = {}. With support threshold s = 2: {b,d} has support 3 (it occurs in S_a, S_c, S_e) and {e,f} has support 2 (in S_c, S_d). And we just found 2 bipartite subgraphs: {a, c, e} x {b, d} and {c, d} x {e, f}.
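The example can be checked mechanically. A small illustrative script (our names) that counts item pairs, i.e., frequent-itemset mining with t = 2:

```python
from itertools import combinations
from collections import defaultdict

# Out-link sets from the example graph above.
S = {
    "a": {"b", "c", "d"},
    "b": {"d"},
    "c": {"b", "d", "e", "f"},
    "d": {"e", "f"},
    "e": {"b", "d"},
    "f": set(),
}

def frequent_pairs(itemsets, support):
    """All item pairs contained in at least `support` of the sets,
    together with the nodes supporting them."""
    count = defaultdict(set)
    for node, items in itemsets.items():
        for pair in combinations(sorted(items), 2):
            count[pair].add(node)
    return {pair: nodes for pair, nodes in count.items() if len(nodes) >= support}

for pair, nodes in frequent_pairs(S, 2).items():
    print(pair, sorted(nodes))
# ('b', 'd') ['a', 'c', 'e']
# ('e', 'f') ['c', 'd']
```

Each surviving pair Y with its supporting nodes X is exactly a complete bipartite subgraph K_{|X|,2}: here a K_{3,2} on {a, c, e} x {b, d} and a K_{2,2} on {c, d} x {e, f}.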

Example of a community from a Web graph: nodes on the left linking to nodes on the right. [Kumar, Raghavan, Rajagopalan, Tomkins: Trawling the Web for emerging cyber-communities, 1999]