CAIM: Cerca i Anàlisi d Informació Massiva

Similar documents
Social Data Management Communities

Models of Network Formation. Networked Life NETS 112 Fall 2017 Prof. Michael Kearns

Summary: What We Have Learned So Far

Lesson 4. Random graphs. Sergio Barbarossa. UPC - Barcelona - July 2008

Signal Processing for Big Data

Erdős-Rényi Model for network formation

Introduction to network metrics

CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

Machine Learning and Modeling for Social Networks

Basics of Network Analysis

Graph-theoretic Properties of Networks

Community detection. Leonid E. Zhukov


6.207/14.15: Networks Lecture 5: Generalized Random Graphs and Small-World Model

ECE 158A - Data Networks

CSCI5070 Advanced Topics in Social Computing

Non Overlapping Communities

Mining Social Network Graphs

arxiv:cond-mat/ v5 [cond-mat.dis-nn] 16 Aug 2006

Wednesday, March 8, Complex Networks. Presenter: Jirakhom Ruttanavakul. CS 790R, University of Nevada, Reno

Networks in economics and finance. Lecture 1 - Measuring networks

Community Detection. Community

TELCOM2125: Network Science and Analysis

CS-E5740. Complex Networks. Scale-free networks

Web Structure Mining Community Detection and Evaluation

Properties of Biological Networks

Extracting Information from Complex Networks

CS224W: Analysis of Networks Jure Leskovec, Stanford University

Example for calculation of clustering coefficient Node N 1 has 8 neighbors (red arrows) There are 12 connectivities among neighbors (blue arrows)

Social-Network Graphs

1 Homophily and assortative mixing

Advanced Algorithms and Models for Computational Biology -- a machine learning approach

TELCOM2125: Network Science and Analysis

Online Social Networks and Media. Community detection

(Social) Networks Analysis III. Prof. Dr. Daning Hu Department of Informatics University of Zurich

How Do Real Networks Look? Networked Life NETS 112 Fall 2014 Prof. Michael Kearns

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

1 More configuration model

An introduction to the physics of complex networks

Social Network Analysis With igraph & R. Ofrit Lesser December 11 th, 2014

Graph Theory for Network Science

CS 322: (Social and Information) Network Analysis Jure Leskovec Stanford University

arxiv:cond-mat/ v3 [cond-mat.dis-nn] 30 Jun 2005

Exercise set #2 (29 pts)

E6885 Network Science Lecture 5: Network Estimation and Modeling

Community Detection in Social Networks

Constructing a G(N, p) Network

The Mathematical Description of Networks

Expected Nodes: a quality function for the detection of link communities

Constructing a G(N, p) Network

Lecture Note: Computation problems in social. network analysis

V 2 Clusters, Dijkstra, and Graph Layout

Networks and Discrete Mathematics

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

Clusters and Communities

My favorite application using eigenvalues: partitioning and community detection in social networks

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

M.E.J. Newman: Models of the Small World

Graph Theory. Graph Theory. COURSE: Introduction to Biological Networks. Euler s Solution LECTURE 1: INTRODUCTION TO NETWORKS.

Nick Hamilton Institute for Molecular Bioscience. Essential Graph Theory for Biologists. Image: Matt Moores, The Visible Cell

A Generating Function Approach to Analyze Random Graphs

1 Degree Distributions

Critical Phenomena in Complex Networks

Small-World Models and Network Growth Models. Anastassia Semjonova Roman Tekhov

V 1 Introduction! Mon, Oct 15, 2012! Bioinformatics 3 Volkhard Helms!

V2: Measures and Metrics (II)

Hierarchical Graph Clustering: Quality Metrics & Algorithms

TELCOM2125: Network Science and Analysis

6. Overview. L3S Research Center, University of Hannover. 6.1 Section Motivation. Investigation of structural aspects of peer-to-peer networks

Oh Pott, Oh Pott! or how to detect community structure in complex networks

Complex networks Phys 682 / CIS 629: Computational Methods for Nonlinear Systems

Social Networks. Slides by : I. Koutsopoulos (AUEB), Source:L. Adamic, SN Analysis, Coursera course

Modularity CMSC 858L

Supplementary material to Epidemic spreading on complex networks with community structures

Algorithmic and Economic Aspects of Networks. Nicole Immorlica

Clustering CS 550: Machine Learning

CSE 158 Lecture 11. Web Mining and Recommender Systems. Triadic closure; strong & weak ties

Heuristics for the Critical Node Detection Problem in Large Complex Networks

Chapter 11. Network Community Detection

1 Random Graph Models for Networks

Overlay (and P2P) Networks

UNIVERSITA DEGLI STUDI DI CATANIA FACOLTA DI INGEGNERIA

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Characteristics of Preferentially Attached Network Grown from. Small World

Graph Theory. Network Science: Graph theory. Graph theory Terminology and notation. Graph theory Graph visualization

Introduction to Engineering Systems, ESD.00. Networks. Lecturers: Professor Joseph Sussman Dr. Afreen Siddiqi TA: Regina Clewlow

Master s Thesis. Title. Supervisor Professor Masayuki Murata. Author Yinan Liu. February 12th, 2016

MODELS FOR EVOLUTION AND JOINING OF SMALL WORLD NETWORKS

Case Studies in Complex Networks

Community Structure and Beyond

Community Structure Detection. Amar Chandole Ameya Kabre Atishay Aggarwal

Topology Enhancement in Wireless Multihop Networks: A Top-down Approach

- relationships (edges) among entities (nodes) - technology: Internet, World Wide Web - biology: genomics, gene expression, proteinprotein

beyond social networks

V 2 Clusters, Dijkstra, and Graph Layout"

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Cluster Analysis. Angela Montanari and Laura Anderlucci

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Social, Information, and Routing Networks: Models, Algorithms, and Strategic Behavior

Transcription:

1 / 72 CAIM: Cerca i Anàlisi d Informació Massiva FIB, Grau en Enginyeria Informàtica Slides by Marta Arias, José Balcázar, Ricard Gavaldá Department of Computer Science, UPC Fall 2016 http://www.cs.upc.edu/~caim

7. Introduction to Network Analysis

3 / 72 Network Analysis, Part I Today s contents 1. Examples of real networks 2. What do real networks look like? real networks exhibit small diameter.. and so does the Erdös-Rényi or random model real networks have high clustering coefficient.. and so does the Watts-Strogatz model real networks degree distribution follows a power-law.. and so does the Barabasi-Albert or preferential attachment model

4 / 72 Examples of real networks Social networks Information networks Technological networks Biological networks

5 / 72 Social networks Links denote social interactions friendship, collaborations, e-mail, etc.

6 / 72 Information networks Nodes store information, links associate information citation networks, the web, p2p networks, etc.

7 / 72 Technological networks Man-built for the distribution of a commodity telephone networks, power grids, transportation networks, etc.

8 / 72 Biological networks Represent biological systems protein-protein interaction networks, gene regulation networks, metabolic pathways, etc.

9 / 72 Representing networks Network Graph Networks are just collections of points joined by lines points lines vertices edges, arcs math nodes links computer science sites bonds physics actors ties, relations sociology

10 / 72 Types of networks From [Newman, 2003] (a) unweighted, undirected (b) discrete vertex and edge types, undirected (c) varying vertex and edge weights, undirected (d) directed

11 / 72 Small-world phenomenon A friend of a friend is also frequently a friend Only 6 hops separate any two people in the world

12 / 72 Measuring the small-world phenomenon, I Let d ij be the shortest-path distance between nodes i and j To check whether any two nodes are within 6 hops, we use: The diameter (longest shortest-path distance) as d = máx d ij i,j The average shortest-path length as l = 2 n (n + 1) i>j d ij The harmonic mean shortest-path length as l 1 = 2 n (n + 1) i>j d 1 ij

From [Newman, 2003] 13 / 72

14 / 72 But.. Can we mimic this phenomenon in simulated networks ( models )? The answer is YES!

15 / 72 The (basic) random graph model a.k.a. ER model Basic G n,p Erdös-Rényi random graph model: parameter n is the number of vertices parameter p is s.t. 0 p 1 Generate and edge (i, j) independently at random with probability p

16 / 72 Measuring the diameter in ER networks Want to show that the diameter in ER networks is small Let the average degree be z At distance l, can reach z l nodes At distance log n log z, reach all n nodes So, diameter is (roughly) O(log n)

ER networks have small diameter As shown by the following simulation 17 / 72

18 / 72 Measuring the small-world phenomenon, II To check whether the friend of a friend is also frequently a friend, we use: The transitivity or clustering coefficient, which basically measures the probability that two of my friends are also friends

19 / 72 Global clustering coefficient C = 3 number of triangles number of connected triples C = 3 1 8 = 0.375

20 / 72 Local clustering coefficient For each vertex i, let n i be the number of neighbors of i Let C i be the fraction of pairs of neighbors that are connected within each other C i = nr. of connections between i s neighbors 1 2 n i (n i 1) Finally, average C i over all nodes i in the network C = 1 n i C i

21 / 72 Local clustering coefficient example C 1 = C 2 = 1/1 C 3 = 1/6 C 4 = C 5 = 0 C = 1 5 (1 + 1 + 1/6) = 13/30 = 0.433

From [Newman, 2003] 22 / 72

23 / 72 ER networks do not show transitivity C = p, since edges are added independently Given a graph with n nodes and e edges, we can estimate p as e ˆp = 1/2 n (n 1) We say that clustering is high if C ˆp Hence, ER networks do not have high clustering coefficient since for them C ˆp

ER networks do not show transitivity 24 / 72

25 / 72 So ER networks do not have high clustering, but.. Can we mimic this phenomenon in simulated networks ( models ), while keeping the diameter small? The answer is YES!

26 / 72 The Watts-Strogatz model, I From [Watts and Strogatz, 1998] Reconciling two observations from real networks: High clustering: my friend s friends are also my friends small diameter

27 / 72 The Watts-Strogatz model, II Start with all n vertices arranged on a ring Each vertex has intially 4 connections to their closest nodes mimics local or geographical connectivity With probability p, rewire each local connection to a random vertex p = 0 high clustering, high diameter p = 1 low clustering, low diameter (ER model) What happens in between? As we increase p from 0 to 1 Fast decrease of mean distance Slow decrease in clustering

28 / 72 The Watts-Strogatz model, III For an appropriate value of p 0.01 (1 %), we observe that the model achieves high clustering and small diameter

29 / 72 Degree distribution Histogram of nr of nodes having a particular degree f k = fraction of nodes of degree k

30 / 72 Scale-free networks The degree distribution of most real-world networks follows a power-law distribution f k = ck α heavy-tail distribution, implies existence of hubs hubs are nodes with very high degree

31 / 72 Random networks are not scale-free! For random networks, the degree distribution follows the binomial distribution (or Poisson if n is large) f k = ( n )p k (1 p) (n k) zk e z k k! Where z = p(n 1) is the mean degree Probability of nodes with very large degree becomes exponentially small so no hubs

32 / 72 So ER networks are not scale-free, but.. Can we obtained scale-free simulated networks? The answer is YES!

33 / 72 Preferential attachment Rich get richer dynamics The more someone has, the more she is likely to have Examples the more friends you have, the easier it is to make new ones the more business a firm has, the easier it is to win more the more people there are at a restaurant, the more who want to go

34 / 72 Barabási-Albert model From [Barabási and Albert, 1999] Growth model The model controls how a network grows over time Uses preferential attachment as a guide to grow the network new nodes prefer to attach to well-connected nodes (Simplified) process: the process starts with some initial subgraph each new node comes in with m edges probability of connecting to existing node i is proportional to i s degree results in a power-law degree distribution with exponent α = 3

35 / 72 ER vs. BA Experiment with 1000 nodes, 999 edges (m 0 = 1 in BA model). random preferential attachment

In summary.. phenomenon real networks ER WS BA small diameter yes yes yes yes high clustering yes no yes yes 1 scale-free yes no no yes 1 clustering coefficient is higher than in random networks, but not as high as for example in WS networks 36 / 72

37 / 72 Network Analysis, Part II Today s contents 1. Centrality Degree centrality Closeness centrality Betweenness centrality 2. Community finding algorithms Hierarchical clustering Agglomerative Girvan-Newman Modularity maximization: Louvain method

38 / 72 Centrality in Networks Centrality is a node s measure w.r.t. others A central node is important and/or powerful A central node has an influential position in the network A central node has an advantageous position in the network

39 / 72 Degree centrality Power through connections degree_centrality(i) def = k(i)

40 / 72 Degree centrality Power through connections in_degree_centrality(i) def = k in (i)

41 / 72 Degree centrality Power through connections out_degree_centrality(i) def = k out (i)

42 / 72 Degree centrality Power through connections By the way, there is a normalized version which divides the centrality of each degree by the maximum centrality value possible, i.e. n 1 (so values are all between 0 and 1). But look at these examples, does degree centrality look OK to you?

43 / 72 Closeness centrality Power through proximity to others closeness_centrality(i) def = ( j i ) d(i, j) 1 = n 1 n 1 d(i, j) j i Here, what matters is to be close to everybody else, i.e., to be easily reachable or have the power to quickly reach others.

44 / 72 Betweenness centrality Power through brokerage A node is important if it lies in many shortest-paths so it is essential in passing information through the network

Betweenness centrality Power through brokerage Where betweenness_centrality(i) def = g jk (i) g jk j<k g jk is the number of shortest-paths between j and k, and g jk (i) is the number of shortest-paths through i Oftentimes it is normalized: norm_betweenness_centrality(i) def = betweenness_centrality(i) ) ( n 1 2 45 / 72

Betweenness centrality Examples (non-normalized) 46 / 72

What is community structure? 47 / 72

Why is community structure important? 48 / 72

.. but don t trust visual perception it is best to use objective algorithms 49 / 72

50 / 72 Main idea A community is dense in the inside but sparse w.r.t. the outside No universal definition! But some ideas are: A community should be densely connected A community should be well-separated from the rest of the network Members of a community should be more similar among themselves than with the rest Most common.. nr. of intra-cluster edges > nr. of inter-cluster edges

51 / 72 Some definitions Let G = (V, E) be a network with V = n nodes and E = m edges. Let C be a subset of nodes in the network (a cluster or community ) of size C = n c. Then intra-cluster density: δ int (C) = nr. internal edges of C n c (n c 1)/2 inter-cluster density: δ ext (C) = nr. inter-cluster edges of C n c (n n c ) A community should have δ int (C) > δ(g), where δ(g) is the average edge density of the whole graph G, i.e. δ(g) = nr. edges in G n(n 1)/2

52 / 72 Most algorithms search for tradeoffs between large δ int (C) and small δ ext (C) e.g. optimizing C δ int(c) δ ext (C) over all communities C Define further: m c = nr. edges within cluster C = {(u, v) u, v C} f c = nr. edges in the frontier of C = {(u, v) u C, v C} n c1 = 4, m c1 = 5, f c1 = 2 n c2 = 3, m c2 = 3, f c2 = 2 n c3 = 5, m c3 = 8, f c3 = 2

53 / 72 Community quality criteria conductance: fraction of edges leaving the cluster expansion: nr of edges per node leaving the cluster fc internal density: a.k.a. intra-cluster density cut ratio: a.k.a. inter-cluster density f c n c(n n c) f c 2m c+f c n c m c n c(n c 1)/2 modularity: difference between nr. of edges in C and the expected nr. of edges E[m c ] of a random graph with the same degree distribution 1 4m (m c E[m c ])

54 / 72 Methods we will cover Hierarchical clustering Agglomerative Divisive (Girvan-Newman algorithm) Modularity maximization algorithms Louvain method

Hierarchical clustering From hairball to dendogram 55 / 72

Suitable if input network has hierarchical structure 56 / 72

57 / 72 Agglomerative hierarchical clustering [Newman, 2010] Ingredients Similarity measure between nodes Similarity measure between sets of nodes Pseudocode 1. Assign each node to its own cluster 2. Find the cluster pair with highest similarity and join them together into a cluster 3. Compute new similarities between new joined cluster and others 4. Go to step 2 until all nodes form a single cluster

Example Data 20 0 20 40 60 80 0 20 40 60 80 D. Blei Clustering 02 5 / 21

Example iteration 001 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 002 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 003 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 004 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 005 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 006 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 007 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 008 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 009 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 010 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 011 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 012 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 013 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 014 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 015 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 016 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 017 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 018 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 019 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 020 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 021 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 022 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 023 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

Example iteration 024 V2 20 0 20 40 60 80 0 20 40 60 80 V1 D. Blei Clustering 02 5 / 21

58 / 72 Similarity measures w ij for nodes I Let A be the adjacency matrix of the network, i.e. A ij = 1 if (i, j) E and 0 otherwise. Jaccard index: w ij = Γ(i) Γ(j) Γ(i) Γ(j) where Γ(i) is the set of neighbors of node i Cosine similarity: 2 where: w ij = k A ika kj k A2 ik k A2 jk nij = Γ(i) Γ(j) = k A ika kj, and ki = k A ik is the degree of node i = n ij ki k j

Similarity measures w ij for nodes II Euclidean distance: (or rather Hamming distance since A is binary) d ij = k (A ik A jk ) 2 Normalized Euclidean distance: 3 d ij = k (A ik A jk ) 2 k i + k j = 1 2 n ij k i + k j Pearson correlation coefficient where µ i = 1 n r ij = cov(a i, A j ) σ i σ j = k (A ik µ i )(A jk µ j ) nσ i σ j k A 1 ik and σ i = n k (A ik µ i ) 2 2 From the equation xy = x y cos θ 3 Uses the idea that the maximum value of d ij is when there are no common neighbors and then d ij = k i + k j 59 / 72

60 / 72 Similarity measures for sets of nodes Single linkage: s XY = máx s xy x X,y Y Complete linkage: s XY = mín s xy x X,y Y x X,y Y Average linkage: s XY = s xy X Y

Agglomerative hierarchical clustering on Zachary s network Using average linkage 61 / 72

62 / 72 The Girvan-Newman algorithm A divisive hierarchical algorithm [Girvan and Newman, 2002] Edge betweenness The betweenness of an edge is the nr. of shortest-paths in the network that pass through that edge It uses the idea that bridges between communities must have high edge betweenness

63 / 72 The Girvan-Newman algorithm Pseudocode 1. Compute betweenness for all edges in the network 2. Remove the edge with highest betweenness 3. Go to step 1 until no edges left Result is a dendogram

64 / 72 Definition of modularity [Newman, 2010] Using a null model Random graphs are not expected to have community structure, so we will use them as null models. Q = (nr. of intra-cluster communities) (expected nr of edges) In particular: Q = 1 2m (A ij P ij ) δ(c i, C j ) ij where P ij is the expected number of edges between nodes i and j under the null model, C i is the community of vertex i, and δ(c i, C j ) = 1 if C i = C j and 0 otherwise.

How do we compute P ij? Using the configuration null model The configuration random graph model choses a graph with the same degree distribution as the original graph uniformly at random. Let us compute P ij There are 2m stubs or half-edges available in the configuration model Let p i be the probability of picking at random a stub incident with i p i = k i 2m The probability of connecting i to j is then p i p j = k ik j 4m 2 And so P ij = 2mp i p j = k ik j 2m 65 / 72

66 / 72 Properties of modularity Q = 1 2m ij ( A ij k ) ik j 2m δ(c i, C j ) Q depends on nodes in the same clusters only Larger modularity means better communities (better than random intra-cluster density) Q 1 2m ij A ij δ(c i, C j ) 1 2m ij A ij 1 Q may take negative values partitions with large negative Q implies existence of cluster with small internal edge density and large inter-community edges

67 / 72 The Louvain method [Blondel et al., 2008] Considered state-of-the-art Pseudocode 1. Repeat until local optimum reached 1.1 Phase 1: partition network greedily using modularity 1.2 Phase 2: agglomerate found clusters into new nodes

68 / 72 The Louvain method Phase 1: optimizing modularity Pseudocode for phase 1 1. Assign a different community to each node 2. For each node i For each neighbor j of i, consider removing i from its community and placing it to j s community Greedily chose to place i into community of neighbor that leads to highest modularity gain 3. Repeat until no improvement can be done

69 / 72 The Louvain method Phase 2: agglomerating clusters to form new network Pseudocode for phase 2 1. Let each community C i form a new node i 2. Let the edges between new nodes i and j be the sum of edges between nodes in C i and C j in the previous graph (notice there are self-loops)

70 / 72 The Louvain method Observations The output is also a hierarchy Works for weighted graphs, and so modularity has to be generalized to Q w = 1 2W ( ij W ij s is j 2W ) δ(c i, C j ) where W ij is the weight of undirected edge (i, j), W = ij W ij and s i = k W ik.

71 / 72 References I Barabási, A.-L. and Albert, R. (1999). Emergence of scaling in random networks. science, 286(5439):509 512. Blondel, V. D., Guillaume, J.-l., Lambiotte, R., and Lefebvre, E. (2008). Fast unfolding of community hierarchies in large networks. Networks, pages 1 6. Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological networks. Proceedings of the National Academy of Sciences of the United States of America, 99:7821 7826. Newman, M. (2010). Networks: An Introduction. Oxford University Press, USA, 2010 edition.

72 / 72 References II Newman, M. E. (2003). The structure and function of complex networks. SIAM review, 45(2):167 256. Watts, D. J. and Strogatz, S. H. (1998). Collective dynamics of small-world networks. nature, 393(6684):440 442.