CS246: Mining Massive Datasets Jure Leskovec, Stanford University

2 Can we identify node groups (communities, modules, clusters)? Nodes: football teams. Edges: games played. 2/13/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets

3 NCAA conferences. Nodes: football teams. Edges: games played.

4 Can we identify functional modules? Nodes: proteins. Edges: physical interactions.

5 Functional modules. Nodes: proteins. Edges: physical interactions.

6 Can we identify social communities? Nodes: Facebook users. Edges: friendships.

7 Social communities: high school, summer internship, Stanford (squash), Stanford (basketball). Nodes: Facebook users. Edges: friendships.

8 Non-overlapping vs. overlapping communities

9 [Figure: a network and its adjacency matrix (nodes by nodes).]

10 What is the structure of community overlaps? Edge density in the overlaps is higher! Communities as tiles.

11 Communities in a network: this is what we want!

12 Generative model for networks: 1) Given a model, we generate the network. 2) Given a network, find the best model. [Figure: an affiliation model and a network on nodes A to H.]

13 Goal: define a model that can generate networks. The model will have a set of parameters that we will later want to estimate (and so detect communities). Q: Given a set of nodes, how do communities generate edges of the network?

14 Generative model $B(V, C, M, \{p_c\})$ for graphs: nodes $V$, communities $C$, memberships $M$. Each community $c$ has a single probability $p_c$. Later we fit the model to networks to detect communities. [Figure: affiliation graph with communities $C$ (probabilities $p_A$, $p_B$) linked by memberships $M$ to nodes $V$; the model generates the network.]

15 The Community-Affiliation Graph Model (AGM) generates the links: for each pair of nodes in community $A$, we connect them with probability $p_A$. The overall edge probability is $P(u, v) = 1 - \prod_{c \in M_u \cap M_v} (1 - p_c)$, where $M_u$ is the set of communities node $u$ belongs to. If $u, v$ share no communities: $P(u, v) = \varepsilon$. Think of this as an OR function: if at least one community says YES, we create an edge.
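The edge rule on this slide can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the lecture's code: the function names, the toy `memberships` and `p` values, and the default `eps = 0.0` are all made up for the example.

```python
import random
from itertools import combinations


def agm_edge_prob(u, v, memberships, p, eps=0.0):
    """P(u, v) = 1 - prod over shared communities c of (1 - p_c).

    Returns eps when u and v share no community.
    """
    shared = memberships[u] & memberships[v]
    if not shared:
        return eps
    no_edge = 1.0
    for c in shared:
        no_edge *= 1.0 - p[c]  # every shared community fails to link u and v
    return 1.0 - no_edge


def agm_sample(memberships, p, eps=0.0, seed=0):
    """Flip one biased coin per node pair and return the sampled edge set."""
    rng = random.Random(seed)
    edges = set()
    for u, v in combinations(sorted(memberships), 2):
        if rng.random() < agm_edge_prob(u, v, memberships, p, eps):
            edges.add((u, v))
    return edges


# Hypothetical toy affiliation graph: nodes b and c sit in both communities.
memberships = {"a": {"A"}, "b": {"A", "B"}, "c": {"A", "B"}, "d": {"B"}}
p = {"A": 0.5, "B": 0.8}

print(agm_edge_prob("b", "c", memberships, p))  # 1 - 0.5 * 0.2, about 0.9
print(agm_sample(memberships, p))
```

Nodes in the overlap (here `b` and `c`) get a higher edge probability than any single community gives, which matches the earlier observation that edge density is higher in overlaps.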

16 [Figure: an AGM model and the network it generates.]

17 AGM can express a variety of community structures: non-overlapping, overlapping, nested.

18 Detecting communities with AGM: given a graph, find the model: 1) the affiliation graph $M$, 2) the number of communities $C$, 3) the parameters $p_c$.

19 Maximum Likelihood Principle (MLE). Given: data $X$. Assumption: the data is generated by some model $f(\Theta)$, where $f$ is the model and $\Theta$ are the model parameters. We want to estimate $P_f(X \mid \Theta)$: the probability that our model $f$ (with parameters $\Theta$) generated the data. Now let's find the most likely model that could have generated the data: $\arg\max_{\Theta} P_f(X \mid \Theta)$.

20 Imagine we are given a set of coin flips. Task: figure out the bias of the coin! Data: sequence of coin flips $X = [1, 0, 0, 0, 1, 0, 0, 1]$. Model: $f(\Theta)$ returns 1 with probability $\Theta$, else returns 0. What is $P_f(X \mid \Theta)$? Assuming the coin flips are independent, $P_f(X \mid \Theta) = P_f(1 \mid \Theta) \cdot P_f(0 \mid \Theta) \cdot P_f(0 \mid \Theta) \cdots P_f(1 \mid \Theta)$. What is $P_f(1 \mid \Theta)$? Simple: $P_f(1 \mid \Theta) = \Theta$, and $P_f(0 \mid \Theta) = 1 - \Theta$. Then $P_f(X \mid \Theta) = \Theta^3 (1 - \Theta)^5$. For example, $P_f(X \mid \Theta = 3/8) = (3/8)^3 (5/8)^5 \approx 0.00503$. What did we learn? Our data was most likely generated by a coin with bias $\Theta = 3/8$. [Plot: $P_f(X \mid \Theta)$ versus $\Theta$, with its maximum at $\Theta^* = 3/8$.]
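The coin example can be checked numerically. The sketch below evaluates $\Theta^3 (1 - \Theta)^5$ on a grid and picks the maximizer; the grid search and the name `coin_likelihood` are illustrative assumptions, not part of the slides.

```python
def coin_likelihood(flips, theta):
    """Likelihood of independent 0/1 flips under a coin with P(1) = theta."""
    ones = sum(flips)
    zeros = len(flips) - ones
    return theta ** ones * (1.0 - theta) ** zeros


X = [1, 0, 0, 0, 1, 0, 0, 1]

# Brute-force search over a grid of candidate biases in [0, 1].
grid = [i / 1000 for i in range(1001)]
theta_hat = max(grid, key=lambda t: coin_likelihood(X, t))

print(theta_hat)                   # grid maximizer, equal to 3/8
print(coin_likelihood(X, 3 / 8))   # about 0.00503
print(coin_likelihood(X, 0.5))     # 0.5 ** 8, smaller than the value at 3/8
```

The maximizer equals the fraction of ones in the data (3 out of 8), which is the closed-form MLE for a Bernoulli parameter.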

21 How do we do MLE for graphs? The model generates a probabilistic adjacency matrix. We then flip a biased coin for every entry of the probabilistic matrix to obtain the binary adjacency matrix: for every pair of nodes $u, v$, AGM gives the probability $p_{uv}$ of them being linked. The likelihood of AGM generating graph $G$ is $P(G \mid \Theta) = \prod_{(u,v) \in G} P(u, v) \prod_{(u,v) \notin G} (1 - P(u, v))$.

22 Given graph $G$ and $\Theta = B(V, C, M, \{p_c\})$, we calculate the likelihood that $\Theta$ generated $G$: $P(G \mid \Theta) = \prod_{(u,v) \in G} P(u, v) \prod_{(u,v) \notin G} (1 - P(u, v))$.
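As a hedged sketch of the likelihood formula, the function below sums log-probabilities over all node pairs of an undirected graph. The name `graph_log_likelihood` and the callable `edge_prob` are assumptions for illustration; the code also assumes every pairwise probability lies strictly between 0 and 1, since `math.log(0)` raises an error.

```python
import math
from itertools import combinations


def graph_log_likelihood(nodes, edges, edge_prob):
    """log P(G | Theta): log P(u, v) for edges, log(1 - P(u, v)) for non-edges."""
    edge_set = {frozenset(e) for e in edges}  # undirected: (u, v) == (v, u)
    total = 0.0
    for u, v in combinations(nodes, 2):
        p = edge_prob(u, v)  # assumed to satisfy 0 < p < 1
        if frozenset((u, v)) in edge_set:
            total += math.log(p)
        else:
            total += math.log(1.0 - p)
    return total


# Toy check: three nodes, one edge, and every pair linked with probability 0.5.
ll = graph_log_likelihood([0, 1, 2], [(0, 1)], lambda u, v: 0.5)
print(math.exp(ll))  # 0.5 ** 3 = 0.125, up to floating-point rounding
```

Working with the sum of logs rather than the raw product avoids numerical underflow once the graph has more than a handful of node pairs.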

23 Our goal: find $\Theta = B(V, C, M, \{p_c\})$ that maximizes the likelihood: $\arg\max_{\Theta} P(G \mid \Theta)$. How do we find the $B(V, C, M, \{p_c\})$ that maximizes the likelihood? There is no nice way: we have to search over all affiliation graphs. But don't despair.

24 Relaxation: memberships have strengths. $F_{uA}$ is the membership strength of node $u$ to community $A$ ($F_{uA} = 0$: no membership). Each community $A$ links nodes independently: $P_A(u, v) = 1 - \exp(-F_{uA} \cdot F_{vA})$.

25 Community membership strength matrix $F$ (rows: nodes; columns: communities). $F_{uA}$ is the strength of $u$'s membership to $A$; $F_u$ is the vector of community membership strengths of $u$. With $P_A(u, v) = 1 - \exp(-F_{uA} \cdot F_{vA})$, the probability of connection grows with the product of strengths. Probability that at least one common community $C$ links the nodes: $P(u, v) = 1 - \prod_C (1 - P_C(u, v))$.

26 Community $A$ links nodes $u, v$ independently: $P_A(u, v) = 1 - \exp(-F_{uA} \cdot F_{vA})$. Then the probability that at least one common community $C$ links them is $P(u, v) = 1 - \prod_C (1 - P_C(u, v)) = 1 - \exp(-\sum_C F_{uC} \cdot F_{vC}) = 1 - \exp(-F_u F_v^T)$. For example: [table of node community membership strengths $F_u$, $F_v$, $F_w$]. Then $F_u F_v^T = \ldots$ and $P(u, v) = 1 - \exp(-F_u F_v^T) = \ldots$, but $P(u, w) = \ldots$ and $P(v, w) = 0$.
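To make the identity concrete, here is a small Python sketch that computes the edge probability both ways: from the dot product $F_u F_v^T$ and from the per-community "at least one links them" product. The strength vectors below are hypothetical values chosen for illustration; they are not the numbers from the slide's example.

```python
import math


def edge_prob(Fu, Fv):
    """P(u, v) = 1 - exp(-F_u . F_v)."""
    dot = sum(a * b for a, b in zip(Fu, Fv))
    return 1.0 - math.exp(-dot)


def edge_prob_per_community(Fu, Fv):
    """Same probability via 1 - prod_C (1 - P_C(u, v))."""
    no_link = 1.0
    for a, b in zip(Fu, Fv):
        p_c = 1.0 - math.exp(-a * b)  # community C alone links u and v
        no_link *= 1.0 - p_c
    return 1.0 - no_link


# Hypothetical membership strengths over three communities.
F_u = [0.0, 1.0, 0.5]
F_v = [2.0, 0.0, 0.4]
F_w = [1.5, 0.0, 0.0]

print(edge_prob(F_u, F_v))                # 1 - exp(-0.2)
print(edge_prob_per_community(F_u, F_v))  # same value up to rounding
print(edge_prob(F_u, F_w))                # 0.0: u and w share no community
```

Two nodes with no community in common have a zero dot product, so their edge probability is exactly 0 under this relaxation.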

27 Another way to think about this: think of a weighted graph (but we don't see the weights). Nodes $u, v$ interact through community $A$ with strength $X_{uv}^{A} \sim \mathrm{Pois}(F_{uA} \cdot F_{vA})$. The total interaction strength is the sum $X_{uv} = \sum_C X_{uv}^{C}$, so $X_{uv} \sim \mathrm{Pois}(\sum_C F_{uC} \cdot F_{vC})$. We observe an edge if there is non-zero interaction: $P(X_{uv} > 0) = 1 - P(X_{uv} = 0) = 1 - \exp(-F_u F_v^T)$. Poisson PMF: $P(X = k) = \frac{\lambda^k}{k!} e^{-\lambda}$.
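The Poisson view can be verified without sampling: summing the PMF over $k \ge 1$ should reproduce $1 - \exp(-\lambda)$ with $\lambda = F_u F_v^T$. The sketch below does that; the truncation point `k_max = 50` and the strength vectors are assumptions chosen for the example.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) = lam^k * exp(-lam) / k!"""
    return lam ** k * math.exp(-lam) / math.factorial(k)


def prob_nonzero_interaction(Fu, Fv, k_max=50):
    """P(X_uv > 0), summing the Poisson PMF over k = 1..k_max (truncated)."""
    lam = sum(a * b for a, b in zip(Fu, Fv))  # rates of independent Poissons add
    return sum(poisson_pmf(k, lam) for k in range(1, k_max + 1))


# Hypothetical membership strengths.
F_u = [0.0, 1.0, 0.5]
F_v = [2.0, 0.0, 0.4]
lam = sum(a * b for a, b in zip(F_u, F_v))

print(prob_nonzero_interaction(F_u, F_v))  # matches the closed form below
print(1.0 - math.exp(-lam))
```

The agreement relies on the fact that a sum of independent Poisson variables is Poisson with the summed rate, which is why the per-community strengths combine into a single dot product.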

28 [WSDM '13] Task: given a network, estimate $F$. Find the $F$ that maximizes the likelihood. We usually take the logarithm of the likelihood and call it the log-likelihood: $l(F) = \log P(G \mid F)$. With $P(u, v) = 1 - \exp(-F_u F_v^T)$ this becomes $l(F) = \sum_{(u,v) \in G} \log(1 - \exp(-F_u F_v^T)) - \sum_{(u,v) \notin G} F_u F_v^T$. The objective function is biconvex. As in non-negative matrix factorization, we update $F_{uC}$ for node $u$ while fixing the memberships of all other nodes.

29 Connection to non-negative matrix factorization: we want to factor the adjacency matrix of graph $G$ as $G \approx 1 - \exp(-F F^T)$, where the right-hand side, applied entrywise, is the matrix of edge probabilities.
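A direct, quadratic-time evaluation of the log-likelihood $l(F) = \sum_{(u,v) \in G} \log(1 - \exp(-F_u F_v^T)) - \sum_{(u,v) \notin G} F_u F_v^T$ might look as follows. This is a sketch under assumptions: nodes are indexed `0..n-1`, `F` is a list of rows, the graph is undirected, and every edge joins nodes with a positive dot product (otherwise the logarithm is undefined).

```python
import math
from itertools import combinations


def bigclam_log_likelihood(F, edges):
    """l(F) for an undirected graph on nodes 0..len(F)-1."""
    edge_set = {frozenset(e) for e in edges}
    ll = 0.0
    for u, v in combinations(range(len(F)), 2):
        dot = sum(a * b for a, b in zip(F[u], F[v]))
        if frozenset((u, v)) in edge_set:
            ll += math.log(1.0 - math.exp(-dot))  # needs dot > 0
        else:
            ll -= dot  # log(1 - P(u, v)) = log(exp(-dot)) = -dot
    return ll


# Toy example: nodes 0 and 1 share community 0; node 2 is alone in community 1.
F = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
edges = [(0, 1)]
ll = bigclam_log_likelihood(F, edges)
print(ll)  # log(1 - exp(-1)), about -0.459
```

The non-edge term simplifies to a plain dot product, which is what later makes the gradient cheap to maintain.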

30 Non-negative matrix factorization: use a gradient-based approach! But updating the whole $F$ takes too long: computing the gradient takes quadratic time, and our networks are far too big to allow for that. A network with 1M nodes has on the order of $10^{12}$ node pairs that could be edges. Instead, update $F$ in small units: update $F_{uC}$ for node $u$ while fixing the memberships of all other nodes.

31 Compute the gradient of a single row $F_u$ of $F$: $\nabla l(F_u) = \sum_{v \in N(u)} F_v \frac{\exp(-F_u F_v^T)}{1 - \exp(-F_u F_v^T)} - \sum_{v \notin N(u)} F_v$, where $N(u)$ is the set of neighbors of $u$. Coordinate gradient ascent: iterate over the rows of $F$: (1) compute the gradient $\nabla l(F_u)$ of row $u$ (while keeping the others fixed); (2) update the row: $F_u \leftarrow F_u + \eta \nabla l(F_u)$; (3) project $F_u$ back to a non-negative vector: if $F_{uC} < 0$, set $F_{uC} = 0$. This is slow! Computing $\nabla l(F_u)$ takes time linear in the number of nodes.
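The three steps of coordinate gradient ascent can be sketched as below. The row gradient used here, $\sum_{v \in N(u)} F_v \frac{\exp(-F_u F_v^T)}{1 - \exp(-F_u F_v^T)} - \sum_{v \notin N(u)} F_v$, is derived from the log-likelihood of the relaxed model; the function names, the step size `eta`, and the toy data are assumptions, and the code assumes every neighbor pair has a positive dot product. This is the slow version, linear in the number of nodes per row.

```python
import math


def row_gradient(F, u, neighbors):
    """Gradient of l(F) with respect to row F[u], all other rows held fixed."""
    k = len(F[u])
    grad = [0.0] * k
    for v in range(len(F)):
        if v == u:
            continue
        if v in neighbors[u]:
            dot = sum(a * b for a, b in zip(F[u], F[v]))
            e = math.exp(-dot)
            w = e / (1.0 - e)  # needs dot > 0
            for c in range(k):
                grad[c] += F[v][c] * w
        else:
            for c in range(k):
                grad[c] -= F[v][c]
    return grad


def update_row(F, u, neighbors, eta=0.01):
    """One ascent step on row u, then projection onto non-negative values."""
    grad = row_gradient(F, u, neighbors)
    F[u] = [max(0.0, f + eta * g) for f, g in zip(F[u], grad)]


# Hypothetical path graph 0 - 1 - 2 with two communities.
F = [[0.5, 0.1], [0.4, 0.2], [0.1, 0.9]]
neighbors = {0: {1}, 1: {0, 2}, 2: {1}}

for u in range(len(F)):  # one sweep over the rows
    update_row(F, u, neighbors)
print(F)
```

Each row update only changes `F[u]`, so rows can be visited in any order; in practice the sweep is repeated until the log-likelihood stops improving.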

32 However, we notice that $\sum_{v \notin N(u)} F_v = \sum_v F_v - F_u - \sum_{v \in N(u)} F_v$. We can cache $\sum_v F_v$. So computing $\sum_{v \notin N(u)} F_v$ now takes time linear in the degree $|N(u)|$ of $u$. In networks the degree of a node is much smaller than the total number of nodes, so this is a significant speedup!
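The caching trick rests on the identity $\sum_{v \notin N(u)} F_v = \sum_v F_v - F_u - \sum_{v \in N(u)} F_v$. The sketch below compares a naive sum over all non-neighbors with the cached version; function names and the toy data are assumptions for illustration.

```python
def column_sums(F):
    """Cached total: sum of F_v over all nodes v, one entry per community."""
    return [sum(col) for col in zip(*F)]


def nonneighbor_sum_naive(F, u, neighbors):
    """Sum of F_v over v not in N(u), v != u; time linear in the node count."""
    k = len(F[u])
    out = [0.0] * k
    for v in range(len(F)):
        if v != u and v not in neighbors[u]:
            for c in range(k):
                out[c] += F[v][c]
    return out


def nonneighbor_sum_cached(F, u, neighbors, total):
    """Same sum from the cached total; time linear in the degree of u."""
    out = [t - f for t, f in zip(total, F[u])]
    for v in neighbors[u]:
        for c in range(len(out)):
            out[c] -= F[v][c]
    return out


# Hypothetical data: node 0 has a single neighbor among four nodes.
F = [[0.5, 0.1], [0.4, 0.2], [0.1, 0.9], [0.3, 0.3]]
neighbors = {0: {1}, 1: {0, 2}, 2: {1}, 3: set()}
total = column_sums(F)  # must be refreshed whenever a row of F changes

print(nonneighbor_sum_naive(F, 0, neighbors))
print(nonneighbor_sum_cached(F, 0, neighbors, total))
```

After a row update, the cache can be kept current in time proportional to the number of communities by adding the new row and subtracting the old one.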

33 BigCLAM takes 5 minutes for 300k-node networks, where other methods take 10 days. It can process networks with 100M edges!

35 Extension: make community membership edges directed. Outgoing membership: the node sends edges. Incoming membership: the node receives edges.

37 Everything is almost the same, except now we have two matrices: $F$ (out-going community memberships) and $H$ (in-coming community memberships). Edge probability: $P(u, v) = 1 - \exp(-F_u H_v^T)$. Network log-likelihood: $l(F, H) = \sum_{(u,v) \in G} \log(1 - \exp(-F_u H_v^T)) - \sum_{(u,v) \notin G} F_u H_v^T$, which we optimize the same way as before.
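A small sketch of the directed edge probability shows why two matrices are needed: the probability of an edge from $u$ to $v$ uses $u$'s outgoing strengths and $v$'s incoming strengths, so it need not equal the probability of the reverse edge. The dictionaries `F` and `H` and their values are hypothetical.

```python
import math


def directed_edge_prob(F_u, H_v):
    """P(u -> v) = 1 - exp(-F_u . H_v)."""
    dot = sum(a * b for a, b in zip(F_u, H_v))
    return 1.0 - math.exp(-dot)


# Hypothetical strengths: u only sends edges in community 0, v only receives.
F = {"u": [1.0, 0.0], "v": [0.0, 0.0]}  # outgoing memberships
H = {"u": [0.0, 0.0], "v": [1.0, 0.0]}  # incoming memberships

p_uv = directed_edge_prob(F["u"], H["v"])
p_vu = directed_edge_prob(F["v"], H["u"])
print(p_uv)  # 1 - exp(-1), about 0.632
print(p_vu)  # 0.0
```

Setting `H` equal to `F` recovers the undirected model, where the edge probability is symmetric in `u` and `v`.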

39 Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach, by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM).
Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks, by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM).
Community Detection in Networks with Node Attributes, by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference on Data Mining (ICDM), 2013.
