Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Similar documents
CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Online Social Networks and Media. Community detection

Mining Social Network Graphs

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection

CSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection

CS224W: Analysis of Networks Jure Leskovec, Stanford University

Graphs (Part II) Shannon Quinn

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Infinite data. Filtering data streams

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

Classify My Social Contacts into Circles Stanford University CS224W Fall 2014

Clustering Lecture 5: Mixture Model

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Overlapping Community Detection in Temporal Text Networks

Introduction to Machine Learning CMU-10701

Non Overlapping Communities

Social-Network Graphs

CS Introduction to Data Mining Instructor: Abdullah Mueen

Learning to Discover Social Circles in Ego Networks

CSE 258 Lecture 5. Web Mining and Recommender Systems. Dimensionality Reduction

DS504/CS586: Big Data Analytics Graph Mining Prof. Yanhua Li

Information Networks: PageRank

Stanford Infolab Technical Report. Overlapping Communities Explain Core-Periphery Organization of Networks

Community Detection. Community

Community detection algorithms survey and overlapping communities. Presented by Sai Ravi Kiran Mallampati

CSE 255 Lecture 5. Data Mining and Predictive Analytics. Dimensionality Reduction

Social Networks. Slides by : I. Koutsopoulos (AUEB), Source:L. Adamic, SN Analysis, Coursera course

Mixture Models and the EM Algorithm

Big Data Analytics CSCI 4030

Epilog: Further Topics

Lecture 8: The EM algorithm

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Community Preserving Network Embedding

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

Mining Data that Changes. 17 July 2015

MIDTERM EXAMINATION Networked Life (NETS 112) November 21, 2013 Prof. Michael Kearns

Lecture 11: E-M and MeanShift. CAP 5415 Fall 2007

K-Means and Gaussian Mixture Models

How to organize the Web?

Graph Theory for Network Science

Clusters and Communities

Overlapping Communities

Segmentation: Clustering, Graph Cut and EM

Unsupervised Learning: Clustering

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Community-Affiliation Graph Model for Overlapping Network Community Detection

Graph Mining: Overview of different graph models

Expectation Maximization. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University

CS224W: Analysis of Networks Jure Leskovec, Stanford University

Chapters 11 and 13, Graph Data Mining

DS504/CS586: Big Data Analytics Graph Mining Prof. Yanhua Li

Basics of Network Analysis

Graph Theory Review. January 30, Network Science Analytics Graph Theory Review 1

Lecture 3, Review of Algorithms. What is Algorithm?

node2vec: Scalable Feature Learning for Networks

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Computer Vision. Exercise Session 10 Image Categorization

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CSE 158. Web Mining and Recommender Systems. Midterm recap

Expectation Maximization: Inferring model parameters and class labels

Deep Boltzmann Machines

Clustering. Image segmentation, document clustering, protein class discovery, compression

Supervised Learning for Image Segmentation

Overlay (and P2P) Networks

Introduction to Machine Learning

Analysis of Large Graphs: TrustRank and WebSpam

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CSE 5243 INTRO. TO DATA MINING

Graphs. Data Structures and Algorithms CSE 373 SU 18 BEN JONES 1

Uncertainties: Representation and Propagation & Line Extraction from Range data

Clustering: Classic Methods and Modern Views

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Social Network Analysis

CS281 Section 9: Graph Models and Practical MCMC

Algorithmic and Economic Aspects of Networks. Nicole Immorlica

Inference and Representation

Topic mash II: assortativity, resilience, link prediction CS224W

Algorithms and Applications in Social Networks. 2017/2018, Semester B Slava Novgorodov

COMS 4771 Clustering. Nakul Verma

Parameter Estimation. Learning From Data: MLE. Parameter Estimation. Likelihood. Maximum Likelihood Parameter Estimation. Likelihood Function 12/1/16

Machine Learning. B. Unsupervised Learning B.1 Cluster Analysis. Lars Schmidt-Thieme, Nicolas Schilling

Instructor: Dr. Mehmet Aktaş. Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

This Talk. Map nodes to low-dimensional embeddings. 2) Graph neural networks. Deep learning architectures for graphstructured

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Learning Undirected Models with Missing Data

Spatial biosurveillance

More details on Loopy BP

Discovering Social Circles in Ego Networks. Julian McAuley and Jure Leskovec Stanford January 11, 2013

Expectation Maximization: Inferring model parameters and class labels

Sampling Large Graphs: Algorithms and Applications

Bipartite Edge Prediction via Transductive Learning over Product Graphs

Web Structure Mining Community Detection and Evaluation

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Compressed representations for web and social graphs. Cecilia Hernandez and Gonzalo Navarro Presented by Helen Xu 6.

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Computational Cognitive Science

Transcription:

Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University http://www.mmds.org

Can we identify node groups? (communities, modules, clusters) Nodes: Football Teams Edges: Games played 2

NCAA conferences Nodes: Football Teams Edges: Games played 3

Can we identify functional modules? Nodes: Proteins Edges: Physical interactions 4

Functional modules Nodes: Proteins Edges: Physical interactions 5

Can we identify social communities? Nodes: Facebook Users Edges: Friendships 6

Social communities High school Summer internship Stanford (Squash) Stanford (Basketball) Nodes: Facebook Users Edges: Friendships 7

Non-overlapping vs. overlapping communities 8

Nodes Nodes Network Adjacency matrix 9

What is the structure of community overlaps: Edge density in the overlaps is higher! Communities as tiles 10

Communities in a network This is what we want! 11

1) Given a model, we generate the network: Generative model for networks A C B D E F G 2) Given a network, find the best model H A C B D E H F G Generative model for networks 12

Goal:Define a model that can generate networks The model will have a set of parameters that we will later want to estimate (and detect communities) Generative model for networks A C B D E F G Q: Given a set of nodes, how do communities generate edges of the network? H 13

Communities, C p A p B Model Memberships, M Nodes, V Model Generative model B(V, C, M, {p c }) for graphs: Nodes V, Communities C, Memberships M Network Each community chas a single probability p c Later we fit the model to networks to detect communities 14

Communities, C Memberships, M p A p B Model Nodes, V Community Affiliations AGM generates the links: For each For each pair of nodes in community, we connect them with prob. The overall edge probability is: P ( u, v) = 1 (1 ) p c c M u M v If, share nocommunities:, Network Think of this as an OR function: If at least 1 community says YES we create an edge set of communities node belongs to 15

Model Network 16

AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested 17

Detecting communities with AGM: A C B D E F G H Given a Graph, find the Model 1) Affiliation graph M 2) Number of communities C 3) Parameters p c 19

Maximum Likelihood Principle (MLE): Given:Data Assumption:Data is generated by some model model model parameters Want to estimate : The probability that our model (with parameters ) generated the data Now let s find the most likely model that could have generated the data: arg max 20

Imagine we are given a set of coin flips Task:Figure out the bias of a coin! Data:Sequence of coin flips:,,,,,,, Model: return 1 with prob. Θ,else return 0 What is? Assuming coin flips are independent So, What is? Simple, Then, For example:... What did we learn?our data was most likely generated by coin with bias / / 21

How do we do MLE for graphs? Model generates a probabilistic adjacency matrix We then flip all the entries of the probabilistic matrix to obtain the binary adjacency matrix For every pair of nodes, AGM gives the prob. of them being linked 0 0.10 0.10 0.04 0.10 0 0.02 0.06 0.10 0.02 0 0.06 0.04 0.06 0.06 0 Flip biased coins 0 1 0 0 1 0 1 1 0 1 0 1 0 1 1 0 The likelihood of AGM generating graph G: P( G Θ) = Π ( u, v) E P( u, v) Π ( u, v) E (1 P( u, v)) 22

Given graph G(V,E)and Θ,we calculate likelihood that Θ generated G: P(G Θ) G A B 0 0.9 0.9 0 0.9 0 0.9 0 0 1 1 0 1 0 1 0 Θ=B(V, C, M, {p c }) 0.9 0.9 0 0.9 0 0 0.9 0 P(G Θ) 1 1 0 1 0 0 1 0 G P( G Θ) = Π ( u, v) E P( u, v) Π ( u, v) E (1 P( u, v)) 23

Our goal:find,,, such that: arg max Θ P( ) AGM Θ How do we find,,, that maximizes the likelihood? 24

Our goal is to find,,, such that: arg max,,,,,, Problem: Finding Bmeans finding the bipartite affiliation network. There is no nice way to do this. Fitting,,, is too hard, let s change the model (so it is easier to fit)! 25

Relaxation: Memberships have strengths u v :The membership strength of node to community ( : no membership) Each community links nodes independently:, 26

j Community membership strength matrix Nodes Communities, Probability of connection is proportional to the product of strengths Notice: If one node doesn t belong to the community ( 0) then, Prob. that at least one common community links the nodes:,, strength of s membership to vector of community membership strengths of 27

: : : Community links nodes, independently:, Then prob. at least one common links them:,, Example matrix: 0 1.2 0 0.2 0.5 0 0 0.8 0 1.8 1 0 Then:. And:,.. But:,., Node community membership strengths 28

Task: Given a network,, estimate Find that maximizes the likelihood:,,,, where:, Many times we take the logarithm of the likelihood, and call it log-likelihood: Goal: Find that maximizes : 29

Compute gradient of a single row of : Coordinate gradient ascent: Iterate over the rows of : Compute gradient of row (while keeping others fixed) Update the row :.. Set out outgoing neighbors Project back to a non-negative vector: If : This is slow!computing takes linear time! 30

However, we notice: We cache So, computing in the degree of now takes linear time In networks degree of a node is much smaller to the total number of nodes in the network, so this is a significant speedup! 31

10000 8000 Time (Sec.) 6000 4000 2000 0 Link Clustering Clique Percolation MMSB BigCLAM Parallel BigCLAM 0 100 200 300 Number of nodes ( 10 3 ) BigCLAMtakes 5 minutes for 300k node nets Other methods take 10 days Can process networks with 100M edges! 32

34

Extension: Make community membership edges directed! Outgoing membership: Nodes sends edges Incoming membership: Node receives edges 35

36

Everything is almost the same except now we have 2 matrices: and out-going community memberships in-coming community memberships Edge prob.:, Network log-likelihood: which we optimize the same way as before 37

38

Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approachby J. Yang, J. Leskovec.ACM International Conference on Web Search and Data Mining (WSDM), 2013. Detecting Cohesive and 2-mode Communities in Directed and Undirected Networksby J. Yang, J. McAuley, J. Leskovec.ACM International Conference on Web Search and Data Mining (WSDM), 2014. Community Detection in Networks with Node Attributesby J. Yang, J. McAuley, J. Leskovec.IEEE International Conference On Data Mining (ICDM), 2013. 39