CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Size: px
Start display at page:

Download "CS246: Mining Massive Datasets Jure Leskovec, Stanford University"

Transcription

1 CS246: Mining Massive Datasets Jure Leskovec, Stanford University

2 Can we identify node groups? (communities, modules, clusters) 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Football Teams Edges: Games played 2

3 NCAA conferences 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Football Teams Edges: Games played 3

4 Can we identify functional modules? 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Proteins Edges: Physical interactions 4

5 Functional modules 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Proteins Edges: Physical interactions 5

6 Can we identify social communities? 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Facebook Users Edges: Friendships 6

7 Social communities High school Summer internship Stanford (Squash) Stanford (Basketball) 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Nodes: Facebook Users Edges: Friendships 7

8 Non-overlapping vs. overlapping communities 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 8

9 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 9 Nodes Nodes Network Adjacency matrix

10 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 10 What is the structure of community overlaps: Edge density in the overlaps is higher! Communities as tiles

11 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 11 Communities in a network This is what we want!

12 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 12 1) Given a model, we generate the network: Generative model for networks A C B D E F G 2) Given a network, find the best model H A C B D E H F G Generative model for networks

13 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 13 Goal: Define a model that can generate networks The model will have a set of parameters that we will later want to estimate (and detect communities) Generative model for networks A C B D E F G Q: Given a set of nodes, how do communities generate edges of the network? H

14 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 14 Communities, C p A p B Model Memberships, M Nodes, V Model Network Generative model B(V, C, M, {p c }) for graphs: Nodes V, Communities C, Memberships M Each community c has a single probability p c Later we fit the model to networks to detect communities

15 Communities, C Memberships, M p A p B Model Nodes, V Community Affiliations AGM generates the links: For each For each pair of nodes in community AA, we connect them with prob. pp AA The overall edge probability is: P ( u, v) = 1 (1 ) p c c M u M v If uu, vv share no communities: PP uu, vv = εε Network MM uu set of communities node uu belongs to Think of this as an OR function: If at least 1 community says YES we create an edge 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 15

16 Network 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 16 Model

17 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 17 AGM can express a variety of community structures: Non-overlapping, Overlapping, Nested

18 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 18 Detecting communities with AGM: A C B D E F G H Given a Graph, find the Model 1) Affiliation graph M 2) Number of communities C 3) Parameters p c

19 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 19 Maximum Likelihood Principle (MLE): Given: Data XX Assumption: Data is generated by some model ff(θθ) ff model ΘΘ model parameters Want to estimate PP ff XX ΘΘ): The probability that our model ff (with parameters ΘΘ) generated the data Now let s find the most likely model that could have generated the data: arg max PP ff XX ΘΘ) Θ

20 Imagine we are given a set of coin flips Task: Figure the bias of a coin! Data: Sequence of coin flips: XX = [11, 00, 00, 00, 11, 00, 00, 11] Model: ff ΘΘ = return 1 with prob. Θ, else return 0 What is PP ff XX ΘΘ? Assuming coin flips are independent So, PP ff XX ΘΘ = PP ff 00 ΘΘ PP ff 11 ΘΘ PP ff 11 ΘΘ PP ff 00 ΘΘ What is PP ff 11 ΘΘ? Simple, PP ff 00 ΘΘ = ΘΘ Then, PP ff XX ΘΘ = ΘΘ ΘΘ 55 For example: PP ff XX ΘΘ = = PP ff XX ΘΘ = = What did we learn? Our data was most likely generated by coin with bias ΘΘ = 33/88 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 20 PP ff XX ΘΘ ΘΘ = 33/88 ΘΘ

21 How do we do MLE for graphs? Model generates a probabilistic adjacency matrix We then flip all the entries of the probabilistic matrix to obtain the binary adjacency matrix For every pair of nodes uu, vv AGM gives the prob. pp uuuu of them being linked Flip biased coins The likelihood of AGM generating graph G: P( G Θ) = Π ( u, v) G P( u, v) ( u, v) G (1 P( u, v)) 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 21 Π

22 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 22 Given graph G and Θ we calculate likelihood that Θ generated G: P(G Θ) G Θ=B(V, C, M, {p c }) P(G Θ) G P( G Θ) = Π ( u, v) G P( u, v) Π ( u, v) G (1 P( u, v))

23 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 23 Our goal: Find ΘΘ = BB(VV, CC, MM, pp CC ) such that: arg max Θ P( GG ) AGM Θ How do we find BB(VV, CC, MM, pp CC ) that maximizes the likelihood? No nice way, have to search over all affiliation graphs But, don t despair

24 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 24 Relaxation: Memberships have strengths FF uuuu u v FF uuuu : The membership strength of node uu to community AA (FF uuuu = 00: no membership) Each community AA links nodes independently: PP AA uu, vv = 11 eeeeee ( FF uuaa FF vvaa )

25 j Community membership strength matrix FF Nodes FF = Communities PP AA uu, vv = 11 eeeeee ( FF uuuu FF vvvv ) Probability of connection is proportional to the product of strengths Prob. that at least one common community CC links the nodes: PP uu, vv = 11 CC 11 PP CC uu, vv FF uuuu strength of uu s membership to AA FF vv vector of community membership strengths of uu 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 25

26 FF uu : FF vv : FF ww : Community AA links nodes uu, vv independently: PP AA uu, vv = 11 eeeeee ( FF uuaa FF vvaa ) Then prob. at least one common CC links them: PP uu, vv = 11 CC 11 PP CC uu, vv = 11 eeeeee ( CC FF uuuu FF vvvv ) = 11 eeeeee ( FF uu FF TT vv ) For example: Node community membership strengths 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Then: FF uu FF vv TT = And: PP uu, vv = 11 eeeeee = But: PP uu, ww = PP vv, ww = 00 26

27 Another way to think about this is: Think of a weighted graph (but we don t see the weights) uu, vv interact with strength XX AA uuuu ~PPPPPPPP(FF uuaa FF vvaa ) Total interaction strength is the sum: XX uuuu = CC CC XX uuuu Then: XX uuuu ~PPPPPPPP( CC FF uuuu FF vvvv ) We observe an edge if there is non-zero interaction: PP XX uuuu > 0 = 1 PP XX uuuu = 0 = = 11 eeeeee ( FF uu FF TT vv ) 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Poisson PMF: PP(XX = xx) = λλkk kk! ee λλ 27

28 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets [wsdm 13] 28 Task: Given a network, estimate FF Find FF that maximizes the log-likelihood Many times we take the logarithm of the likelihood and call it log-likelihood: ll FF = log PP(GG FF) Objective function is biconvex Non-negative matrix factorization: Update FF uuuu for node uu while fixing the memberships of all other nodes

29 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 29 Task: Given a network, estimate FF Find FF that maximizes the log-likelihood Many times we take the logarithm of the likelihood and call it log-likelihood: ll FF = llllll PP(GG FF) Connection: Non-negative matrix factorization We want to factor the adjacency matrix G 11 eeeeee (. ) = F F T Matrix of edge probs. Graph G

30 Non-negative matrix factorization: Use a gradient based approach! But updating the whole FF takes too long Computing the gradient takes quadratic time! Our network are far too big to allow for that: Network with 1M nodes, can have different edges Instead, update FF in small units : Update FF uuuu for node uu while fixing the memberships of all other nodes 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 30

31 This is slow! Computing FF uu takes linear time! 31 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets Compute gradient of a single row FF uu of FF: Coordinate gradient ascent: Iterate over the rows of FF: Compute gradient FF uu of row uu (while keeping others fixed) Update the row FF uu : FF uu FF uu + ηη ll(ff uu ) Project FF uu back to a non-negative vector: If FF uuuu < 00: FF uuuu = 00

32 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 32 However, we notice: We can cache vv FF vv So, computing vv NN(uu) FF vv now takes linear time in the degree NN(uu) of uu In networks degree of a node is much smaller to the total number of nodes in the network, so this is a significant speedup!

33 BigCLAM takes 5 minutes for 300k node nets Other methods take 10 days Can process networks with 100M edges! 33

34 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 34

35 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 35 Extension: Make community membership edges directed: Outgoing membership: Nodes sends edges Incoming membership: Node receives edges

36 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 36

37 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 37 Everything is almost the same except now we have 2 matrices: FF and HH FF out-going community memberships HH in-coming community memberships FF HH Edge prob.: PP uu, vv = 11 eeeeee( FF uu HH vv TT ) Network log-likelihood: which we optimize the same way as before

38 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 38

39 2/13/2014 Jure Leskovec, Stanford C246: Mining Massive Datasets 39 Overlapping Community Detection at Scale: A Nonnegative Matrix Factorization Approach by J. Yang, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), Detecting Cohesive and 2-mode Communities in Directed and Undirected Networks by J. Yang, J. McAuley, J. Leskovec. ACM International Conference on Web Search and Data Mining (WSDM), Community Detection in Networks with Node Attributes by J. Yang, J. McAuley, J. Leskovec. IEEE International Conference On Data Mining (ICDM), 2013.

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Online Social Networks and Media. Community detection

Online Social Networks and Media. Community detection Online Social Networks and Media Community detection 1 Notes on Homework 1 1. You should write your own code for generating the graphs. You may use SNAP graph primitives (e.g., add node/edge) 2. For the

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu SPAM FARMING 2/11/2013 Jure Leskovec, Stanford C246: Mining Massive Datasets 2 2/11/2013 Jure Leskovec, Stanford

More information

CS224W: Analysis of Networks Jure Leskovec, Stanford University

CS224W: Analysis of Networks Jure Leskovec, Stanford University CS224W: Analysis of Networks Jure Leskovec, Stanford University http://cs224w.stanford.edu 11/13/17 Jure Leskovec, Stanford CS224W: Analysis of Networks, http://cs224w.stanford.edu 2 Observations Models

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu Setting from the last class: AB-A : gets a AB-B : gets b AB-AB : gets max(a, b) Also: Cost

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data

More information

Gaussian Mixture Models For Clustering Data. Soft Clustering and the EM Algorithm

Gaussian Mixture Models For Clustering Data. Soft Clustering and the EM Algorithm Gaussian Mixture Models For Clustering Data Soft Clustering and the EM Algorithm K-Means Clustering Input: Observations: xx ii R dd ii {1,., NN} Number of Clusters: kk Output: Cluster Assignments. Cluster

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu Web advertising We discussed how to match advertisers to queries in real-time But we did not discuss how to estimate

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS46: Mining Massive Datasets Jure Leskovec, Stanford University http://cs46.stanford.edu /7/ Jure Leskovec, Stanford C46: Mining Massive Datasets Many real-world problems Web Search and Text Mining Billions

More information

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection

CSE 255 Lecture 6. Data Mining and Predictive Analytics. Community Detection CSE 255 Lecture 6 Data Mining and Predictive Analytics Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:

More information

Graphs (Part II) Shannon Quinn

Graphs (Part II) Shannon Quinn Graphs (Part II) Shannon Quinn (with thanks to William Cohen and Aapo Kyrola of CMU, and J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University) Parallel Graph Computation Distributed computation

More information

CS6716 Pattern Recognition

CS6716 Pattern Recognition CS6716 Pattern Recognition Prototype Methods Aaron Bobick School of Interactive Computing Administrivia Problem 2b was extended to March 25. Done? PS3 will be out this real soon (tonight) due April 10.

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/25/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 In many data mining

More information

Mining Social Network Graphs

Mining Social Network Graphs Mining Social Network Graphs Analysis of Large Graphs: Community Detection Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com Note to other teachers and users of these slides: We would be

More information

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection

CSE 258 Lecture 6. Web Mining and Recommender Systems. Community Detection CSE 258 Lecture 6 Web Mining and Recommender Systems Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:

More information

CSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection

CSE 158 Lecture 6. Web Mining and Recommender Systems. Community Detection CSE 158 Lecture 6 Web Mining and Recommender Systems Community Detection Dimensionality reduction Goal: take high-dimensional data, and describe it compactly using a small number of dimensions Assumption:

More information

Multi-view object segmentation in space and time. Abdelaziz Djelouah, Jean Sebastien Franco, Edmond Boyer

Multi-view object segmentation in space and time. Abdelaziz Djelouah, Jean Sebastien Franco, Edmond Boyer Multi-view object segmentation in space and time Abdelaziz Djelouah, Jean Sebastien Franco, Edmond Boyer Outline Addressed problem Method Results and Conclusion Outline Addressed problem Method Results

More information

1.2 Round-off Errors and Computer Arithmetic

1.2 Round-off Errors and Computer Arithmetic 1.2 Round-off Errors and Computer Arithmetic 1 In a computer model, a memory storage unit word is used to store a number. A word has only a finite number of bits. These facts imply: 1. Only a small set

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu /2/8 Jure Leskovec, Stanford CS246: Mining Massive Datasets 2 Task: Given a large number (N in the millions or

More information

The ABC s of Web Site Evaluation

The ABC s of Web Site Evaluation Aa Bb Cc Dd Ee Ff Gg Hh Ii Jj Kk Ll Mm Nn Oo Pp Qq Rr Ss Tt Uu Vv Ww Xx Yy Zz The ABC s of Web Site Evaluation by Kathy Schrock Digital Literacy by Paul Gilster Digital literacy is the ability to understand

More information

Mixture Models and the EM Algorithm

Mixture Models and the EM Algorithm Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Finite Mixture Models Say we have a data set D = {x 1,..., x N } where x i is

More information

Large Scale Data Analysis Using Deep Learning

Large Scale Data Analysis Using Deep Learning Large Scale Data Analysis Using Deep Learning Machine Learning Basics - 1 U Kang Seoul National University U Kang 1 In This Lecture Overview of Machine Learning Capacity, overfitting, and underfitting

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University We need your help with our research on human interpretable machine learning. Please complete a survey at http://stanford.io/1wpokco. It should be fun and take about 1min to complete. Thanks a lot for your

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Clustering and EM Barnabás Póczos & Aarti Singh Contents Clustering K-means Mixture of Gaussians Expectation Maximization Variational Methods 2 Clustering 3 K-

More information

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it.

Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it. Sequence Alignments Overview Sequence alignment is an essential concept for bioinformatics, as most of our data analysis and interpretation techniques make use of it. Sequence alignment means arranging

More information

Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient. Ali Mirzapour Paper Presentation - Deep Learning March 7 th

Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient. Ali Mirzapour Paper Presentation - Deep Learning March 7 th Training Restricted Boltzmann Machines using Approximations to the Likelihood Gradient Ali Mirzapour Paper Presentation - Deep Learning March 7 th 1 Outline of the Presentation Restricted Boltzmann Machine

More information

Classify My Social Contacts into Circles Stanford University CS224W Fall 2014

Classify My Social Contacts into Circles Stanford University CS224W Fall 2014 Classify My Social Contacts into Circles Stanford University CS224W Fall 2014 Amer Hammudi (SUNet ID: ahammudi) ahammudi@stanford.edu Darren Koh (SUNet: dtkoh) dtkoh@stanford.edu Jia Li (SUNet: jli14)

More information

T10/05-233r2 SAT - ATA Errors to SCSI Errors - Translation Map

T10/05-233r2 SAT - ATA Errors to SCSI Errors - Translation Map To: T10 Technical Committee From: Wayne Bellamy (wayne.bellamy@hp.com), Hewlett Packard Date: June 30, 2005 Subject: T10/05-233r2 SAT - s to Errors - Translation Map Revision History Revision 0 (June 08,

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/12/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 3/12/2014 Jure

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec Stanford University Jure Leskovec, Stanford University http://cs224w.stanford.edu [Morris 2000] Based on 2 player coordination game 2 players

More information

Wisconsin Retirement Testing Preparation

Wisconsin Retirement Testing Preparation Wisconsin Retirement Testing Preparation The Wisconsin Retirement System (WRS) is changing its reporting requirements from annual to every pay period starting January 1, 2018. With that, there are many

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu [Kumar et al. 99] 2/13/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu 10/10/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

More information

Visit MathNation.com or search "Math Nation" in your phone or tablet's app store to watch the videos that go along with this workbook!

Visit MathNation.com or search Math Nation in your phone or tablet's app store to watch the videos that go along with this workbook! Topic 1: Introduction to Angles - Part 1... 47 Topic 2: Introduction to Angles Part 2... 50 Topic 3: Angle Pairs Part 1... 53 Topic 4: Angle Pairs Part 2... 56 Topic 5: Special Types of Angle Pairs Formed

More information

CS Introduction to Data Mining Instructor: Abdullah Mueen

CS Introduction to Data Mining Instructor: Abdullah Mueen CS 591.03 Introduction to Data Mining Instructor: Abdullah Mueen LECTURE 8: ADVANCED CLUSTERING (FUZZY AND CO -CLUSTERING) Review: Basic Cluster Analysis Methods (Chap. 10) Cluster Analysis: Basic Concepts

More information

CS249: ADVANCED DATA MINING

CS249: ADVANCED DATA MINING CS249: ADVANCED DATA MINING Classification Evaluation and Practical Issues Instructor: Yizhou Sun yzsun@cs.ucla.edu April 24, 2017 Homework 2 out Announcements Due May 3 rd (11:59pm) Course project proposal

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS6: Mining Massive Datasets Jure Leskovec, Stanford University http://cs6.stanford.edu Training data 00 million ratings, 80,000 users, 7,770 movies 6 years of data: 000 00 Test data Last few ratings of

More information

node2vec: Scalable Feature Learning for Networks

node2vec: Scalable Feature Learning for Networks node2vec: Scalable Feature Learning for Networks A paper by Aditya Grover and Jure Leskovec, presented at Knowledge Discovery and Data Mining 16. 11/27/2018 Presented by: Dharvi Verma CS 848: Graph Database

More information

Visual Identity Guidelines. Abbreviated for Constituent Leagues

Visual Identity Guidelines. Abbreviated for Constituent Leagues Visual Identity Guidelines Abbreviated for Constituent Leagues 1 Constituent League Logo The logo is available in a horizontal and vertical format. Either can be used depending on the best fit for a particular

More information

K-Means and Gaussian Mixture Models

K-Means and Gaussian Mixture Models K-Means and Gaussian Mixture Models David Rosenberg New York University June 15, 2015 David Rosenberg (New York University) DS-GA 1003 June 15, 2015 1 / 43 K-Means Clustering Example: Old Faithful Geyser

More information

The vertex set is a finite nonempty set. The edge set may be empty, but otherwise its elements are two-element subsets of the vertex set.

The vertex set is a finite nonempty set. The edge set may be empty, but otherwise its elements are two-element subsets of the vertex set. Math 3336 Section 10.2 Graph terminology and Special Types of Graphs Definition: A graph is an object consisting of two sets called its vertex set and its edge set. The vertex set is a finite nonempty

More information

Clustering Lecture 5: Mixture Model

Clustering Lecture 5: Mixture Model Clustering Lecture 5: Mixture Model Jing Gao SUNY Buffalo 1 Outline Basics Motivation, definition, evaluation Methods Partitional Hierarchical Density-based Mixture model Spectral methods Advanced topics

More information

Learning Undirected Models with Missing Data

Learning Undirected Models with Missing Data Learning Undirected Models with Missing Data Sargur Srihari srihari@cedar.buffalo.edu 1 Topics Log-linear form of Markov Network The missing data parameter estimation problem Methods for missing data:

More information

Non Overlapping Communities

Non Overlapping Communities Non Overlapping Communities Davide Mottin, Konstantina Lazaridou HassoPlattner Institute Graph Mining course Winter Semester 2016 Acknowledgements Most of this lecture is taken from: http://web.stanford.edu/class/cs224w/slides

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS6: Mining Massive Datasets Jure Leskovec, Stanford University http://cs6.stanford.edu /6/01 Jure Leskovec, Stanford C6: Mining Massive Datasets Training data 100 million ratings, 80,000 users, 17,770

More information

How to organize the Web?

How to organize the Web? How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second try: Web Search Information Retrieval attempts to find relevant docs in a small and trusted set Newspaper

More information

BRAND STANDARD GUIDELINES 2014

BRAND STANDARD GUIDELINES 2014 BRAND STANDARD GUIDELINES 2014 LOGO USAGE & TYPEFACES Logo Usage The Lackawanna College School of Petroleum & Natural Gas logo utilizes typography, two simple rule lines and the Lackawanna College graphic

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Math 96--Radicals #1-- Simplify; Combine--page 1

Math 96--Radicals #1-- Simplify; Combine--page 1 Simplify; Combine--page 1 Part A Number Systems a. Whole Numbers = {0, 1, 2, 3,...} b. Integers = whole numbers and their opposites = {..., 3, 2, 1, 0, 1, 2, 3,...} c. Rational Numbers = quotient of integers

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

Lesson 12: Angles Associated with Parallel Lines

Lesson 12: Angles Associated with Parallel Lines Lesson 12 Lesson 12: Angles Associated with Parallel Lines Classwork Exploratory Challenge 1 In the figure below, LL 1 is not parallel to LL 2, and mm is a transversal. Use a protractor to measure angles

More information

CS281 Section 9: Graph Models and Practical MCMC

CS281 Section 9: Graph Models and Practical MCMC CS281 Section 9: Graph Models and Practical MCMC Scott Linderman November 11, 213 Now that we have a few MCMC inference algorithms in our toolbox, let s try them out on some random graph models. Graphs

More information

Logistic Regression and Gradient Ascent

Logistic Regression and Gradient Ascent Logistic Regression and Gradient Ascent CS 349-02 (Machine Learning) April 0, 207 The perceptron algorithm has a couple of issues: () the predictions have no probabilistic interpretation or confidence

More information

Learning to Discover Social Circles in Ego Networks

Learning to Discover Social Circles in Ego Networks Learning to Discover Social Circles in Ego Networks Julian McAuley Stanford jmcauley@cs.stanford.edu Jure Leskovec Stanford jure@cs.stanford.edu Abstract Our personal social networks are big and cluttered,

More information

Social-Network Graphs

Social-Network Graphs Social-Network Graphs Mining Social Networks Facebook, Google+, Twitter Email Networks, Collaboration Networks Identify communities Similar to clustering Communities usually overlap Identify similarities

More information

Using Machine Learning to Optimize Storage Systems

Using Machine Learning to Optimize Storage Systems Using Machine Learning to Optimize Storage Systems Dr. Kiran Gunnam 1 Outline 1. Overview 2. Building Flash Models using Logistic Regression. 3. Storage Object classification 4. Storage Allocation recommendation

More information

BRANDING AND STYLE GUIDELINES

BRANDING AND STYLE GUIDELINES BRANDING AND STYLE GUIDELINES INTRODUCTION The Dodd family brand is designed for clarity of communication and consistency within departments. Bold colors and photographs are set on simple and clean backdrops

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

recruitment Logo Typography Colourways Mechanism Usage Pip Recruitment Brand Toolkit

recruitment Logo Typography Colourways Mechanism Usage Pip Recruitment Brand Toolkit Logo Typography Colourways Mechanism Usage Primary; Secondary; Silhouette; Favicon; Additional Notes; Where possible, use the logo with the striped mechanism behind. Only when it is required to be stripped

More information

Overlapping Communities

Overlapping Communities Yangyang Hou, Mu Wang, Yongyang Yu Purdue Univiersity Department of Computer Science April 25, 2013 Overview Datasets Algorithm I Algorithm II Algorithm III Evaluation Overview Graph models of many real

More information

Introduction to Deep Learning

Introduction to Deep Learning ENEE698A : Machine Learning Seminar Introduction to Deep Learning Raviteja Vemulapalli Image credit: [LeCun 1998] Resources Unsupervised feature learning and deep learning (UFLDL) tutorial (http://ufldl.stanford.edu/wiki/index.php/ufldl_tutorial)

More information

Link Analysis. Hongning Wang

Link Analysis. Hongning Wang Link Analysis Hongning Wang CS@UVa Structured v.s. unstructured data Our claim before IR v.s. DB = unstructured data v.s. structured data As a result, we have assumed Document = a sequence of words Query

More information

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu

Natural Language Processing CS 6320 Lecture 6 Neural Language Models. Instructor: Sanda Harabagiu Natural Language Processing CS 6320 Lecture 6 Neural Language Models Instructor: Sanda Harabagiu In this lecture We shall cover: Deep Neural Models for Natural Language Processing Introduce Feed Forward

More information

Information Networks: PageRank

Information Networks: PageRank Information Networks: PageRank Web Science (VU) (706.716) Elisabeth Lex ISDS, TU Graz June 18, 2018 Elisabeth Lex (ISDS, TU Graz) Links June 18, 2018 1 / 38 Repetition Information Networks Shape of the

More information

Expectation Maximization: Inferring model parameters and class labels

Expectation Maximization: Inferring model parameters and class labels Expectation Maximization: Inferring model parameters and class labels Emily Fox University of Washington February 27, 2017 Mixture of Gaussian recap 1 2/27/2017 Jumble of unlabeled images HISTOGRAM blue

More information

Graph Mining: Overview of different graph models

Graph Mining: Overview of different graph models Graph Mining: Overview of different graph models Davide Mottin, Konstantina Lazaridou Hasso Plattner Institute Graph Mining course Winter Semester 2016 Lecture road Anomaly detection (previous lecture)

More information

Similar Polygons Date: Per:

Similar Polygons Date: Per: Math 2 Unit 6 Worksheet 1 Name: Similar Polygons Date: Per: [1-2] List the pairs of congruent angles and the extended proportion that relates the corresponding sides for the similar polygons. 1. AA BB

More information

Brand Guidelines October, 2014

Brand Guidelines October, 2014 Brand Guidelines October, 2014 Contents 1 Logo 2 Graphical Elements 3 Icons 4 Typography 5 Colors 6 Stationery 7 Social Media 8 Templates 9 Product Line logos Brand Guidelines Page 2 1) Logo Logo: Overview

More information

Section 6: Triangles Part 1

Section 6: Triangles Part 1 Section 6: Triangles Part 1 Topic 1: Introduction to Triangles Part 1... 125 Topic 2: Introduction to Triangles Part 2... 127 Topic 3: rea and Perimeter in the Coordinate Plane Part 1... 130 Topic 4: rea

More information

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others

Naïve Bayes Classification. Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Naïve Bayes Classification Material borrowed from Jonathan Huang and I. H. Witten s and E. Frank s Data Mining and Jeremy Wyatt and others Things We d Like to Do Spam Classification Given an email, predict

More information

Lecture 1.3 Basic projective geometry. Thomas Opsahl

Lecture 1.3 Basic projective geometry. Thomas Opsahl Lecture 1.3 Basic projective geometr Thomas Opsahl Motivation For the pinhole camera, the correspondence between observed 3D points in the world and D points in the captured image is given b straight lines

More information

DS504/CS586: Big Data Analytics Graph Mining Prof. Yanhua Li

DS504/CS586: Big Data Analytics Graph Mining Prof. Yanhua Li Welcome to DS504/CS586: Big Data Analytics Graph Mining Prof. Yanhua Li Time: 6:00pm 8:50pm R Location: AK232 Fall 2016 Graph Data: Social Networks Facebook social graph 4-degrees of separation [Backstrom-Boldi-Rosa-Ugander-Vigna,

More information

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas

CS839: Probabilistic Graphical Models. Lecture 10: Learning with Partially Observed Data. Theo Rekatsinas CS839: Probabilistic Graphical Models Lecture 10: Learning with Partially Observed Data Theo Rekatsinas 1 Partially Observed GMs Speech recognition 2 Partially Observed GMs Evolution 3 Partially Observed

More information

PracticeAdmin Identity Guide. Last Updated 4/27/2015 Created by Vanessa Street

PracticeAdmin Identity Guide. Last Updated 4/27/2015 Created by Vanessa Street PracticeAdmin Identity Guide Last Updated 4/27/2015 Created by Vanessa Street About PracticeAdmin Mission At PracticeAdmin, we simplify the complex process of medical billing by providing healthcare professionals

More information

Segmentation: Clustering, Graph Cut and EM

Segmentation: Clustering, Graph Cut and EM Segmentation: Clustering, Graph Cut and EM Ying Wu Electrical Engineering and Computer Science Northwestern University, Evanston, IL 60208 yingwu@northwestern.edu http://www.eecs.northwestern.edu/~yingwu

More information

Clustering: Classic Methods and Modern Views

Clustering: Classic Methods and Modern Views Clustering: Classic Methods and Modern Views Marina Meilă University of Washington mmp@stat.washington.edu June 22, 2015 Lorentz Center Workshop on Clusters, Games and Axioms Outline Paradigms for clustering

More information

Expectation Maximization. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University

Expectation Maximization. Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University Expectation Maximization Machine Learning 10701/15781 Carlos Guestrin Carnegie Mellon University April 10 th, 2006 1 Announcements Reminder: Project milestone due Wednesday beginning of class 2 Coordinate

More information

CS3600 SYSTEMS AND NETWORKS

CS3600 SYSTEMS AND NETWORKS CS3600 SYSTEMS AND NETWORKS NORTHEASTERN UNIVERSITY Lecture 15: Networking overview Prof. (amislove@ccs.neu.edu) What is a network? 2 What is a network? Wikipedia: A telecommunications network is a network

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

Databases 2 (VU) ( )

Databases 2 (VU) ( ) Databases 2 (VU) (707.030) Map-Reduce Denis Helic KMI, TU Graz Nov 4, 2013 Denis Helic (KMI, TU Graz) Map-Reduce Nov 4, 2013 1 / 90 Outline 1 Motivation 2 Large Scale Computation 3 Map-Reduce 4 Environment

More information

How to Register for Summer Camp. A Tutorial

How to Register for Summer Camp. A Tutorial How to Register for Summer Camp A Tutorial 1. Upon arriving at our website (https://flightcamp.ou.edu/), the very first step is logging in. Please click the Login link in the top left corner of the page

More information

Random Graph Model; parameterization 2

Random Graph Model; parameterization 2 Agenda Random Graphs Recap giant component and small world statistics problems: degree distribution and triangles Recall that a graph G = (V, E) consists of a set of vertices V and a set of edges E V V.

More information

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors

10/14/2017. Dejan Sarka. Anomaly Detection. Sponsors Dejan Sarka Anomaly Detection Sponsors About me SQL Server MVP (17 years) and MCT (20 years) 25 years working with SQL Server Authoring 16 th book Authoring many courses, articles Agenda Introduction Simple

More information

Multicast Tree Aggregation in Large Domains

Multicast Tree Aggregation in Large Domains Multicast Tree Aggregation in Large Domains Joanna Moulierac 1 Alexandre Guitton 2 and Miklós Molnár 1 IRISA/University of Rennes I 502 Rennes France 2 Birkbeck College University of London England IRISA/INSA

More information

Media Kit & Brand Guidelines

Media Kit & Brand Guidelines Media Kit & Brand Guidelines Generation XYZ Generation X 1960-1980 Generation Y 1977-2004 Generation Z Early 2001 - Present Named after Douglas Coupland s novel Generation X: Tales from an Accelerated

More information

COMPUTER GRAPHICS COURSE. LuxRender. Light Transport Foundations

COMPUTER GRAPHICS COURSE. LuxRender. Light Transport Foundations COMPUTER GRAPHICS COURSE LuxRender Light Transport Foundations Georgios Papaioannou - 2015 Light Transport Light is emitted at the light sources and scattered around a 3D environment in a practically infinite

More information

Eureka Math. Grade 7, Module 6. Student File_A. Contains copy-ready classwork and homework

Eureka Math. Grade 7, Module 6. Student File_A. Contains copy-ready classwork and homework A Story of Ratios Eureka Math Grade 7, Module 6 Student File_A Contains copy-ready classwork and homework Published by the non-profit Great Minds. Copyright 2015 Great Minds. No part of this work may be

More information

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm

DATA MINING LECTURE 7. Hierarchical Clustering, DBSCAN The EM Algorithm DATA MINING LECTURE 7 Hierarchical Clustering, DBSCAN The EM Algorithm CLUSTERING What is a Clustering? In general a grouping of objects such that the objects in a group (cluster) are similar (or related)

More information

Parameter Estimation. Learning From Data: MLE. Parameter Estimation. Likelihood. Maximum Likelihood Parameter Estimation. Likelihood Function 12/1/16

Parameter Estimation. Learning From Data: MLE. Parameter Estimation. Likelihood. Maximum Likelihood Parameter Estimation. Likelihood Function 12/1/16 Learning From Data: MLE Maximum Estimators Common approach in statistics: use a parametric model of data: Assume data set: Bin(n, p), Poisson( ), N(µ, exp( ) Uniform(a, b) 2 ) But parameters are unknown!!!

More information

IBL and clustering. Relationship of IBL with CBR

IBL and clustering. Relationship of IBL with CBR IBL and clustering Distance based methods IBL and knn Clustering Distance based and hierarchical Probability-based Expectation Maximization (EM) Relationship of IBL with CBR + uses previously processed

More information

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University http://cs224w.stanford.edu How to organize the Web? First try: Human curated Web directories Yahoo, DMOZ, LookSmart Second

More information

Logis&c Regression. Aar$ Singh & Barnabas Poczos. Machine Learning / Jan 28, 2014

Logis&c Regression. Aar$ Singh & Barnabas Poczos. Machine Learning / Jan 28, 2014 Logis&c Regression Aar$ Singh & Barnabas Poczos Machine Learning 10-701/15-781 Jan 28, 2014 Linear Regression & Linear Classifica&on Weight Height Linear fit Linear decision boundary 2 Naïve Bayes Recap

More information

Consistency Theory in Machine Learning

Consistency Theory in Machine Learning Consistency Theory in Machine Learning Wei Gao ( 高尉 ) Learning And Mining from DatA (LAMDA) National Key Laboratory for Novel Software Technology Nanjing University Machine learning Training data learn

More information

A Joint Congestion Control, Routing, and Scheduling Algorithm in Multihop Wireless Networks with Heterogeneous Flows

A Joint Congestion Control, Routing, and Scheduling Algorithm in Multihop Wireless Networks with Heterogeneous Flows 1 A Joint Congestion Control, Routing, and Scheduling Algorithm in Multihop Wireless Networks with Heterogeneous Flows Phuong Luu Vo, Nguyen H. Tran, Choong Seon Hong, *KiJoon Chae Kyung H University,

More information

Tranont Mission Statement. Tranont Vision Statement. Change the world s economy, one household at a time.

Tranont Mission Statement. Tranont Vision Statement. Change the world s economy, one household at a time. STYLE GUIDE Tranont Mission Statement Change the world s economy, one household at a time. Tranont Vision Statement We offer individuals world class financial education and training, financial management

More information

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li

DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Welcome to DS504/CS586: Big Data Analytics Big Data Clustering Prof. Yanhua Li Time: 6:00pm 8:50pm Thu Location: AK 232 Fall 2016 High Dimensional Data v Given a cloud of data points we want to understand

More information

Lecture 11: E-M and MeanShift. CAP 5415 Fall 2007

Lecture 11: E-M and MeanShift. CAP 5415 Fall 2007 Lecture 11: E-M and MeanShift CAP 5415 Fall 2007 Review on Segmentation by Clustering Each Pixel Data Vector Example (From Comanciu and Meer) Review of k-means Let's find three clusters in this data These

More information