CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu
HITS (Hypertext-Induced Topic Selection)
- A measure of importance of pages or documents, similar to PageRank
- Proposed at around the same time as PageRank ('98)
- Goal: Say we want to find good newspapers
  - Don't just find newspapers. Find "experts": people who link in a coordinated way to good newspapers
- Idea: Links as votes
  - A page is more important if it has more links
  - In-coming links? Out-going links?
2/10/2015 Jure Leskovec, Stanford C246: Mining Massive Datasets
Hubs and Authorities
Each page has 2 scores:
- Quality as an expert (hub): total sum of the votes of the authorities it points to
- Quality as content (authority): total sum of the votes coming from experts
(Figure: example authority scores: NYT: 10, CNN: 8, WSJ: 9, Ebay: 3, Yahoo: 3)
Principle of repeated improvement
Interesting pages fall into two classes:
1. Authorities are pages containing useful information: newspaper home pages, course home pages, home pages of auto manufacturers
2. Hubs are pages that link to authorities: lists of newspapers, course bulletins, lists of US auto manufacturers
Each page starts with hub score 1. Authorities collect their votes: the authority score of a page (e.g., NYT) is the sum of the hub scores of the nodes pointing to it.
(Note: this is an idealized example. In reality the graph is not bipartite and each page has both a hub and an authority score.)
Hubs then collect authority scores: the hub score of a node is the sum of the authority scores of the nodes it points to.
Authorities again collect the hub scores.
A good hub links to many good authorities. A good authority is linked from many good hubs. Model this using two scores for each node: a hub score and an authority score, represented as vectors h and a.
[Kleinberg '98] Each page i has 2 scores: an authority score a_i and a hub score h_i.
HITS algorithm (n = number of nodes in the graph):
- Initialize: a_j = h_j = 1/sqrt(n) for all j
- Then keep iterating until convergence:
  - Authority: a_i = sum of h_j over all pages j that point to i
  - Hub: h_i = sum of a_j over all pages j that i points to
  - Normalize so that sum_i (a_i)^2 = 1 and sum_i (h_i)^2 = 1
Example (pages: Yahoo, Amazon, M'soft):

    A = [1 1 1     A^T = [1 1 0
         1 0 1            1 0 1
         0 1 0]           1 1 0]

Hub scores over the iterations:
    h(yahoo)  = .58  .80  .80  .79  ...  .788
    h(amazon) = .58  .53  .53  .57  ...  .577
    h(m'soft) = .58  .27  .27  .23  ...  .211
Authority scores over the iterations:
    a(yahoo)  = .58  .62  .62  ...  .628
    a(amazon) = .58  .49  .49  ...  .459
    a(m'soft) = .58  .62  .62  ...  .628
[Kleinberg '98] HITS converges to a single stable point.
Notation:
- Vectors a = (a_1, ..., a_n), h = (h_1, ..., h_n)
- Adjacency matrix A (n x n): A_ij = 1 if i links to j, else 0
Then h_i = sum_j A_ij a_j can be rewritten as h = A a.
Similarly, a_i = sum_j A_ji h_j can be rewritten as a = A^T h.
The hub score of page i is proportional to the sum of the authority scores of the pages it links to: h = λ A a, where λ is a scale factor enforcing the normalization.
The authority score of page i is proportional to the sum of the hub scores of the pages it is linked from: a = μ A^T h, where μ is a scale factor enforcing the normalization.
HITS algorithm in vector notation:
- Set a_i = h_i = 1/sqrt(n)
- Repeat until convergence:
  - h = A a
  - a = A^T h
  - Normalize a and h
- Convergence criterion: sum_i (h_i^(t) - h_i^(t-1))^2 and sum_i (a_i^(t) - a_i^(t-1))^2 are below a small threshold ε
Thus, combining two steps:
- a is updated (in 2 steps): a_new = A^T (A a) = (A^T A) a
- h is updated (in 2 steps): h_new = A (A^T h) = (A A^T) h
This is repeated matrix powering.
Under reasonable assumptions about A, HITS converges to vectors h* and a*:
- h* is the principal eigenvector of the matrix A A^T
- a* is the principal eigenvector of the matrix A^T A
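The fixed point can be checked numerically. Below is a minimal power-iteration sketch of HITS (using numpy; the 3-page Yahoo/Amazon/M'soft adjacency matrix is the one from the example above):

```python
import numpy as np

# Adjacency matrix of the 3-page example.
# Rows/cols: Yahoo, Amazon, M'soft; A[i][j] = 1 if page i links to page j.
A = np.array([[1, 1, 1],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

def hits(A, iters=100):
    """Run the HITS updates a = A^T h, h = A a with 2-norm normalization."""
    n = A.shape[0]
    h = np.ones(n) / np.sqrt(n)
    a = np.ones(n) / np.sqrt(n)
    for _ in range(iters):
        a = A.T @ h              # authorities collect hub scores
        a /= np.linalg.norm(a)
        h = A @ a                # hubs collect authority scores
        h /= np.linalg.norm(h)
    return h, a

h, a = hits(A)
print(np.round(h, 3))  # -> approx [0.788 0.577 0.211]
print(np.round(a, 3))  # -> approx [0.628 0.459 0.628]
```

The hub vector converges to the principal eigenvector of A A^T and the authority vector to that of A^T A, matching the numbers in the example.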
PageRank and HITS are two solutions to the same problem: What is the value of an in-link from u to v?
- In the PageRank model, the value of the link depends on the links into u
- In the HITS model, it depends on the value of the other links out of u
The destinies of PageRank and HITS post-1998 were very different.
We often think of networks as being organized into modules, clusters, communities:
Find micro-markets by partitioning the query-to-advertiser graph. [Andersen, Lang: Communities from seed sets, 2006]
Clusters in the movies-to-actors graph: [Andersen, Lang: Communities from seed sets, 2006]
Discovering social circles, circles of trust: [McAuley, Leskovec: Discovering social circles in ego networks, 2012]
The graph is large. Assume the graph fits in main memory: for example, to work with a 200M-node, 2B-edge graph one needs approx. 16 GB of RAM (storing each edge as a pair of 4-byte node IDs: 2x10^9 edges x 8 bytes = 16 GB). But the graph is too big for running anything more than linear-time algorithms. We will cover a PageRank-based algorithm for finding dense clusters. The runtime of the algorithm will be proportional to the cluster size (not the graph size!).
Discovering clusters based on seed nodes. Given a seed node s, compute (approximate) Personalized PageRank (PPR) around s (teleport set = {s}). The idea: if s belongs to a nice cluster, the random walk will get trapped inside the cluster.
Algorithm outline:
- Pick a seed node s of interest
- Run PPR with teleport set = {s}
- Sort the nodes by decreasing PPR score
- Sweep over the nodes and find good clusters
(Figure: node rank by decreasing PPR score on the x-axis vs. cluster quality, lower is better, on the y-axis; local minima mark good clusters.)
Undirected graph G(V, E). Partitioning task: divide the vertices into 2 disjoint groups, A and B = V \ A. Question: How can we define a "good cluster" in G?
What makes a good cluster?
- Maximize the number of within-cluster connections
- Minimize the number of between-cluster connections
Express cluster quality as a function of the edge cut of the cluster.
Cut: the set of edges with only one endpoint in the cluster:
    cut(A) = sum of w_ij over edges (i, j) with i in A, j not in A
Note: this works for weighted and unweighted graphs (for unweighted graphs, set all w_ij = 1).
(Figure: a 6-node example with cut(A) = 2.)
Partition quality: cut score. The quality of a cluster is the weight of the connections pointing outside the cluster; we want to minimize it. Degenerate case: the minimum cut may simply chop off a tiny, poorly chosen piece, so the minimum cut need not be the "optimal" cut. Problem: the cut score only considers external cluster connections and ignores internal cluster connectivity.
[Shi-Malik] Criterion: Conductance. Connectivity of the group to the rest of the network, relative to the density of the group:

    φ(A) = cut(A) / min(vol(A), 2m - vol(A))

where cut(A) = |{(i, j) in E; i in A, j not in A}|, vol(A) = sum of the degrees d_i of the nodes i in A (the total edge weight incident to A), and m is the number of edges in the graph.
Why use this criterion? It produces more balanced partitions.
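As a concrete illustration, here is a short sketch that computes φ(A) from an edge list (the two-triangle graph below is a hypothetical example, not the exact graph from the slides):

```python
from collections import defaultdict

def conductance(edges, A):
    """phi(A) = cut(A) / min(vol(A), 2m - vol(A)) for an unweighted, undirected graph."""
    A = set(A)
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    m = len(edges)
    cut = sum(1 for u, v in edges if (u in A) != (v in A))  # edges crossing the boundary
    vol = sum(deg[u] for u in A)                            # total degree inside A
    return cut / min(vol, 2 * m - vol)

# Hypothetical 6-node graph: two triangles {1,2,3} and {4,5,6}
# joined by the single edge (3,4).
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(conductance(edges, {1, 2, 3}))  # cut=1, vol=7, 2m-vol=7 -> 1/7 ~ 0.143
```

The one bridging edge gives the natural cluster {1,2,3} a low conductance, as intended.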
Algorithm outline:
- Pick a seed node s of interest
- Run PPR with teleport set = {s}
- Sort the nodes by decreasing PPR score
- Sweep over the nodes and find good clusters
Sweep: order the nodes u_1, u_2, ..., u_n by decreasing PPR score. For each i compute φ(A_i), where A_i = {u_1, ..., u_i}. Local minima of φ(A_i) correspond to good clusters.
The whole sweep curve can be computed in linear time:
- For-loop over the nodes in decreasing PPR order
- Keep a hash table of the nodes currently in A_i
- To compute φ(A_{i+1}) from φ(A_i), update cut and vol incrementally: adding node u increases vol by d_u; each edge from u to a node already in A_i leaves the cut, while every other edge of u enters it.
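The incremental update above can be sketched in a few lines (the graph, the node order, and the function name are hypothetical illustrations):

```python
from collections import defaultdict

def sweep(edges, order):
    """Compute phi(A_i) for every proper prefix A_i of `order` in one pass.
    Adding node u changes vol by deg(u); each edge (u, v) with v already
    inside A stops being cut, and every other edge of u becomes cut."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    m = len(edges)
    in_A, vol, cut, phis = set(), 0, 0, []
    for u in order[:-1]:            # phi is undefined for A = V
        in_A.add(u)
        vol += len(adj[u])
        for v in adj[u]:
            cut += -1 if v in in_A else 1
        phis.append(cut / min(vol, 2 * m - vol))
    return phis

# Hypothetical two-triangle graph; order as if ranked by PPR from seed node 1.
edges = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (4, 6), (5, 6)]
print(sweep(edges, [1, 2, 3, 4, 5, 6]))
```

The curve dips to its minimum of 1/7 at i = 3, so the sweep recovers the triangle {1, 2, 3} as the good cluster.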
How do we compute Personalized PageRank (PPR) without touching the whole graph?
The power method won't work, since each single iteration accesses all nodes of the graph: r = β M r + (1-β) a, where a is the teleport vector (1 at the index of seed s, 0 elsewhere) and r is the personalized PageRank vector.
PageRank-Nibble [Andersen, Chung, Lang, '07] is a fast method for computing approximate Personalized PageRank (PPR) with teleport set = {s}:
    ApproxPageRank(s, β, ε)
- s ... seed node
- β ... teleportation parameter
- ε ... approximation error parameter
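For contrast, this is what the full power iteration looks like; note that every iteration multiplies by the entire n x n transition matrix, which is exactly what PageRank-Nibble avoids (the 4-node star graph and β = 0.8 are hypothetical choices):

```python
import numpy as np

def ppr_power(adj, s, beta=0.8, iters=100):
    """Personalized PageRank by power iteration: r = (1-beta)*e_s + beta*M r.
    Every iteration touches the whole graph, unlike PageRank-Nibble."""
    n = len(adj)
    # Column-stochastic transition matrix M: M[v][u] = 1/deg(u) if u -> v.
    M = np.zeros((n, n))
    for u, nbrs in enumerate(adj):
        for v in nbrs:
            M[v][u] = 1.0 / len(nbrs)
    a = np.zeros(n)
    a[s] = 1.0                       # teleport vector: all mass on the seed
    r = a.copy()
    for _ in range(iters):
        r = (1 - beta) * a + beta * (M @ r)
    return r

# Undirected 4-node star: seed s=0 joined to nodes 1, 2, 3 (hypothetical example).
adj = [[1, 2, 3], [0], [0], [0]]
r = ppr_power(adj, s=0)
print(np.round(r, 3))  # -> approx [0.556 0.148 0.148 0.148]
```

The result sums to 1 and concentrates probability mass near the seed, which is the behavior the sweep procedure exploits.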
Overview of the approximate PPR:
- Use a lazy random walk: a variant of a random walk that stays put with probability 1/2 at each time step, and walks to a random neighbor the other half of the time
- Keep track of a residual PPR score q_u for each node u. The residual measures how far the estimate r_u is from the true PageRank p_u of node u
- If the residual of node u is too big (q_u / d_u >= ε, where d_u is the degree of u), then push the walk further by distributing some of the residual to u's neighbors; otherwise, don't touch the node
A different way to look at PageRank [Jeh&Widom. Scaling Personalized Web Search, 2002]:

    p_β(a) = (1-β) a + β M p_β(a)

where p_β(a) is the true PageRank vector with teleportation parameter β and teleportation vector (set) a, and M is the stochastic PageRank transition matrix.
This gives an idea of how to compute PageRank: node u's view of the graph, p_β(χ_u), is the average of its out-neighbors' views plus u's own importance.
Idea: maintain an approximate PageRank r and its residual PageRank q.
- Start with the trivial approximation: r = 0 and q = χ_s (all residual mass on the seed)
- Iteratively push PageRank from q to r until q is small
- Maintain the invariant: p = r + sum_u q_u p_β(χ_u)
[Andersen, Chung, Lang. Local graph partitioning using PageRank vectors, 2007]
Push: 1 step of a lazy random walk from node u:
    Push(u, r, q):
      r_u = r_u + (1-β) q_u
      for each v such that u→v:
        q_v = q_v + (1/2) β q_u / d_u
      q_u = (1/2) β q_u          (using the old q_u throughout)
      return (r, q)
Update r, then do 1 step of the walk: stay at u with prob. 1/2, and spread the remaining 1/2 fraction of β q_u as if a single step of a random walk were applied at u.
If q_u is large, this means that we have underestimated the importance of node u. Then we want to take some of that gap and settle it: we keep a (1-β) fraction of q_u for ourselves (adding it to r_u) and give the rest away, since we know we have too much of it. This corresponds to the spreading term (1/2) β q_u / d_u in the push operation: half of the remaining mass stays at u (the lazy step), and the other half is spread over u's neighbors. Each node keeps giving away its excess PageRank until we have settled close to the answer and all nodes have no, or only a very small, gap.
ApproxPageRank(s, β, ε):
- r ... PPR vector (r_u = PPR score of u)
- q ... residual PPR vector (q_u = residual of node u)
- d_u ... degree of u
Set r = 0, q = χ_s  (residual 1 at the index of s)
While max_u q_u / d_u >= ε:
  Choose any vertex u where q_u / d_u >= ε
  Update r: r_u = r_u + (1-β) q_u   (move a (1-β) fraction of the prob. from q_u to r_u)
  1 step of a lazy random walk:
    For each v such that u→v: q_v = q_v + (1/2) β q_u / d_u   (spread half of the remaining mass to u's neighbors)
    q_u = (1/2) β q_u   (stay at u with prob. 1/2)
Return r
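A minimal sketch of ApproxPageRank as described above, assuming the push rule r_u += (1-β) q_u with a lazy spread of the remaining β q_u (the function names and the star example graph are hypothetical; assumes every node has degree >= 1):

```python
from collections import defaultdict

def approx_ppr(adj, s, beta=0.5, eps=1e-4):
    """PageRank-Nibble sketch: push residual mass q until q_u/d_u < eps
    for every node.  r is the PPR estimate, q the residual."""
    r = defaultdict(float)
    q = defaultdict(float)
    q[s] = 1.0
    work = [s]                         # candidate nodes that may need a push
    while work:
        u = work.pop()
        du = len(adj[u])
        if q[u] / du < eps:            # residual already small enough: skip
            continue
        qu = q[u]
        r[u] += (1 - beta) * qu        # settle a (1-beta) fraction into r
        q[u] = 0.5 * beta * qu         # lazy walk: stay put with prob. 1/2
        spread = 0.5 * beta * qu / du  # other half spreads over the neighbors
        for v in adj[u]:
            q[v] += spread
            work.append(v)
        work.append(u)                 # u itself may still be above threshold
    return dict(r), dict(q)

# Hypothetical graph: seed s=0 joined to nodes 1, 2, 3 (a star).
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
r, q = approx_ppr(adj, s=0, beta=0.5, eps=1e-4)
print({u: round(x, 2) for u, x in r.items()})
```

Each push conserves probability mass (r and q together always sum to 1), and the loop only ever touches nodes near the seed, which is the point of the method.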
Example run of ApproxPageRank (graph: seed s linked to a, b, c; here β = 0.5; vectors ordered [s, a, b, c]):
  Init:          r = [0, 0, 0, 0],       q = [1, 0, 0, 0]
  Push(s, r, q): r = [.5, 0, 0, 0],      q = [.25, .08, .08, .08]
  Push(s, r, q): r = [.62, 0, 0, 0],     q = [.06, .10, .10, .10]
  Push(a, r, q): r = [.62, .05, 0, 0],   q = [.09, .03, .10, .10]
  Push(b, r, q): r = [.62, .05, .05, 0], q = [.09, .05, .03, .10]
  ...
  r = [.57, .19, .14, .09]
Runtime: PageRank-Nibble computes PPR with residual error at most ε in time O(1/(ε(1-β))), independent of the size of the graph (each push of cost d_u settles at least (1-β) ε d_u of residual mass). The power method would take time proportional to the total number of edges per iteration.
Graph cut approximation guarantee: if there exists a cut of conductance φ and volume k, then the method finds a cut of conductance O(sqrt(φ log k)).
Details in [Andersen, Chung, Lang. Local graph partitioning using PageRank vectors, 2007] http://www.math.ucsd.edu/~fan/wp/localpartfull.pdf
The smaller the ε, the farther the random walk will spread!
[Andersen, Lang: Communities from seed sets, 2006]
Algorithm summary:
- Pick a seed node s of interest
- Run PPR with teleport set = {s}
- Sort the nodes by decreasing PPR score
- Sweep over the nodes and find good clusters
(Figure: node rank by decreasing PPR score vs. cluster quality, lower is better; local minima mark good clusters.)
[Kumar et al. '99] Searching for small communities in the Web graph. What is the signature of a community/discussion in a Web graph? Use this to define "topics": what the same people on the left talk about on the right. Remember HITS! The signature is a dense 2-layer graph. Intuition: many people all talking about the same things.
A more well-defined problem: enumerate complete bipartite subgraphs K_{s,t}, where K_{s,t} has s nodes on the left, each of which links to the same t other nodes on the right.
Example: |X| = s = 3, |Y| = t = 4 gives K_{3,4}, fully connected from X to Y.
[Agrawal-Srikant '99] Market-basket analysis. Setting:
- Market: universe U of n items
- Baskets: m subsets of U: S_1, S_2, ..., S_m ⊆ U (S_i is the set of items one person bought)
- Support: frequency threshold f
Goal: find all subsets T such that T ⊆ S_i for at least f of the sets S_i (i.e., the items in T were bought together at least f times).
What's the connection between frequent itemsets and complete bipartite graphs?
[Kumar et al. '99] Frequent itemsets = complete bipartite graphs! How?
- View each node i as the set S_i of nodes that i points to, e.g., S_i = {a, b, c, d}
- K_{s,t} = a set Y of size t that occurs in s of the sets S_i
- Looking for K_{s,t}: set the frequency threshold to s and look for all frequent itemsets of size t
  (s ... minimum support, |X| = s; t ... itemset size, |Y| = t)
[Kumar et al. '99] View each node i as the set S_i of nodes that i points to (e.g., S_i = {a, b, c, d}). Find frequent itemsets with minimum support s and itemset size t; then K_{s,t} = a set Y of size t that occurs in s of the sets S_i. Say we find a frequent itemset Y = {a, b, c} of support s. Then there are s nodes (say x, y, z) that all link to all of {a, b, c}: we have found a K_{s,t}!
Example. Itemsets (out-neighbor sets):
  S_a = {b, c, d}, S_b = {d}, S_c = {b, d, e, f}, S_d = {e, f}, S_e = {b, d}, S_f = {}
With support threshold s = 2:
  {b, d}: support 3 (occurs in S_a, S_c, S_e)
  {e, f}: support 2 (occurs in S_c, S_d)
And we just found 2 complete bipartite subgraphs: X = {a, c, e} all linking to Y = {b, d}, and X = {c, d} all linking to Y = {e, f}.
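The correspondence can be checked with a few lines of brute force (the function name is an illustration; this enumerates all size-t itemsets, which is fine for tiny examples but not what a scalable frequent-itemset algorithm like A-Priori would do):

```python
from itertools import combinations

# Itemsets from the example: node -> set of nodes it points to.
S = {'a': {'b', 'c', 'd'}, 'b': {'d'}, 'c': {'b', 'd', 'e', 'f'},
     'd': {'e', 'f'}, 'e': {'b', 'd'}, 'f': set()}

def frequent_itemsets(S, s, t):
    """Every size-t itemset contained in >= s of the sets S_i is the right
    side Y of a complete bipartite subgraph K_{s,t}; the supporting nodes
    form the left side X."""
    items = set().union(*S.values())
    found = {}
    for Y in combinations(sorted(items), t):
        X = [i for i, Si in S.items() if set(Y) <= Si]  # nodes linking to all of Y
        if len(X) >= s:
            found[Y] = X
    return found

print(frequent_itemsets(S, s=2, t=2))
# -> {('b', 'd'): ['a', 'c', 'e'], ('e', 'f'): ['c', 'd']}
```

The output matches the example: Y = {b, d} is supported by X = {a, c, e} (a K_{3,2}) and Y = {e, f} by X = {c, d} (a K_{2,2}).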
Example of a community from a Web graph (figure: nodes on the left link to nodes on the right). [Kumar, Raghavan, Rajagopalan, Tomkins: Trawling the Web for emerging cyber-communities, 1999]