Topic II: Graph Mining

Similar documents
CS249: SPECIAL TOPICS MINING INFORMATION/SOCIAL NETWORKS

CSCI5070 Advanced Topics in Social Computing

Lesson 4. Random graphs. Sergio Barbarossa. UPC - Barcelona - July 2008

Summary: What We Have Learned So Far

Nick Hamilton Institute for Molecular Bioscience. Essential Graph Theory for Biologists. Image: Matt Moores, The Visible Cell

(Social) Networks Analysis III. Prof. Dr. Daning Hu Department of Informatics University of Zurich

Theory of Computation and Complexity Theory. COL 705! Lecture 0!

Social Network Analysis

Course Introduction / Review of Fundamentals of Graph Theory

Graph Theory. Graph Theory. COURSE: Introduction to Biological Networks. Euler s Solution LECTURE 1: INTRODUCTION TO NETWORKS.

Introduction to network metrics

Critical Phenomena in Complex Networks

Graph Theory for Network Science

Wednesday, March 8, Complex Networks. Presenter: Jirakhom Ruttanavakul. CS 790R, University of Nevada, Reno

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

CAIM: Cerca i Anàlisi d Informació Massiva

Complex-Network Modelling and Inference

Basics of Network Analysis

PageRank and related algorithms

Centralities (4) By: Ralucca Gera, NPS. Excellence Through Knowledge

CS-E5740. Complex Networks. Scale-free networks

Introduction to Graph Theory

Graph Theory for Network Science

Overlay (and P2P) Networks

Peer-to-Peer Data Management

6. Overview. L3S Research Center, University of Hannover. 6.1 Section Motivation. Investigation of structural aspects of peer-to-peer networks

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

Network Thinking. Complexity: A Guided Tour, Chapters 15-16

Link Analysis from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer and other material.

1. a graph G = (V (G), E(G)) consists of a set V (G) of vertices, and a set E(G) of edges (edges are pairs of elements of V (G))

On Page Rank. 1 Introduction

TELCOM2125: Network Science and Analysis

Machine Learning and Modeling for Social Networks

Networks and Discrete Mathematics

CSE 158 Lecture 11. Web Mining and Recommender Systems. Triadic closure; strong & weak ties

L Modelling and Simulating Social Systems with MATLAB

Social Network Analysis

Signal Processing for Big Data

1 Starting around 1996, researchers began to work on. 2 In Feb, 1997, Yanhong Li (Scotch Plains, NJ) filed a

WLU Mathematics Department Seminar March 12, The Web Graph. Anthony Bonato. Wilfrid Laurier University. Waterloo, Canada

CENTRALITIES. Carlo PICCARDI. DEIB - Department of Electronics, Information and Bioengineering Politecnico di Milano, Italy

Introduction to Complex Networks Analysis

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

Collaborative filtering based on a random walk model on a graph

CSE 255 Lecture 13. Data Mining and Predictive Analytics. Triadic closure; strong & weak ties

Link Analysis and Web Search

Mathematics of Networks II

Phase Transitions in Random Graphs- Outbreak of Epidemics to Network Robustness and fragility

Degree Distribution: The case of Citation Networks

THE KNOWLEDGE MANAGEMENT STRATEGY IN ORGANIZATIONS. Summer semester, 2016/2017

SLANG Session 4. Jason Quinley Roland Mühlenbernd Seminar für Sprachwissenschaft University of Tübingen

Graphs. Introduction To Graphs: Exercises. Definitions:

Lecture 9: I: Web Retrieval II: Webology. Johan Bollen Old Dominion University Department of Computer Science

TELCOM2125: Network Science and Analysis

Unit 2: Graphs and Matrices. ICPSR University of Michigan, Ann Arbor Summer 2015 Instructor: Ann McCranie

CSE 158 Lecture 13. Web Mining and Recommender Systems. Triadic closure; strong & weak ties

MAE 298, Lecture 9 April 30, Web search and decentralized search on small-worlds

Graph Theory Review. January 30, Network Science Analytics Graph Theory Review 1

Properties of Biological Networks

Varying Applications (examples)

Bruno Martins. 1 st Semester 2012/2013

Graph Theory and Network Measurment

Chapter 1. Social Media and Social Computing. October 2012 Youn-Hee Han

Example for calculation of clustering coefficient Node N 1 has 8 neighbors (red arrows) There are 12 connectivities among neighbors (blue arrows)

Lecture on Game Theoretical Network Analysis

Lecture 3: Descriptive Statistics

Paths. Path is a sequence of edges that begins at a vertex of a graph and travels from vertex to vertex along edges of the graph.

Social Networks. Slides by : I. Koutsopoulos (AUEB), Source:L. Adamic, SN Analysis, Coursera course

Some Graph Theory for Network Analysis. CS 249B: Science of Networks Week 01: Thursday, 01/31/08 Daniel Bilar Wellesley College Spring 2008

Introduction to Engineering Systems, ESD.00. Networks. Lecturers: Professor Joseph Sussman Dr. Afreen Siddiqi TA: Regina Clewlow

RANDOM-REAL NETWORKS

Graph and Digraph Glossary

Advanced Algorithms and Models for Computational Biology -- a machine learning approach

Networks in economics and finance. Lecture 1 - Measuring networks

A Generating Function Approach to Analyze Random Graphs

Chapter 4. Relations & Graphs. 4.1 Relations. Exercises For each of the relations specified below:

Simplicial Complexes of Networks and Their Statistical Properties

Basic Network Concepts

How Do Real Networks Look? Networked Life NETS 112 Fall 2014 Prof. Michael Kearns

Chapter 2 Graphs. 2.1 Definition of Graphs

Distributed Data Management

Small-world networks

Algorithms and Applications in Social Networks. 2017/2018, Semester B Slava Novgorodov

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Extracting Information from Complex Networks

Graph Theory: Introduction

Network Analysis. 1. Large and Complex Networks. Scale-free Networks Albert Barabasi Society

- relationships (edges) among entities (nodes) - technology: Internet, World Wide Web - biology: genomics, gene expression, proteinprotein

Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

CS 311 Discrete Math for Computer Science Dr. William C. Bulko. Graphs

Constructing a G(N, p) Network

An introduction to the physics of complex networks

UNIVERSITA DEGLI STUDI DI CATANIA FACOLTA DI INGEGNERIA

M.E.J. Newman: Models of the Small World

Introduction to Networks and Business Intelligence

Graph Model Selection using Maximum Likelihood

CSE 258 Lecture 12. Web Mining and Recommender Systems. Social networks

Economic Networks. Theory and Empirics. Giorgio Fagiolo Laboratory of Economics and Management (LEM) Sant Anna School of Advanced Studies, Pisa, Italy

Resilient Networking. Thorsten Strufe. Module 3: Graph Analysis. Disclaimer. Dresden, SS 15

Modeling and Simulating Social Systems with MATLAB

Transcription:

Topic II: Graph Mining Discrete Topics in Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2012/13 T II.Intro-1

Topic II Intro: Graph Mining 1. Why Graphs? 2. What is Graph Mining 3. Graphs: Definitions 4. Centrality 5. Graph Properties 5.1. Small World 5.2. Scale Invariance 5.3. Clustering Coefficient 6. Random Graph Models Z&M, Ch. 4 T II.Intro-2

Why Graphs? T II.Intro-3

Why Graphs? IP Networks T II.Intro-3

Why Graphs? Social Networks T II.Intro-3

Why Graphs? World Wide Web T II.Intro-3

Why Graphs? Protein Protein Interactions T II.Intro-3

Why Graphs? Co-authorships T II.Intro-3

Why Graphs? NISZK_h QCMA SBP MA_E P^{NP[log^2]} AWPP C_=P NE NISZK MA WAPP BPE P^{NP[log]} WPP N.BPP PZK AmpP-BQP RPE BPQP UE BH TreeBQP ZPE BH_2 LWPP BPP E US NP RQP SUBEXP P^{FewP} YP compnp RBQP ZQP QP Few EP ZBQP RP EQP betap QPLIN FewP ZPP Q Complexity beta_2p Classes UP T II.Intro-3

Why Graphs? Graphs are Everywhere! T II.Intro-3

Graphs: Definitions An undirected graph G is a pair (V, E) V = {vi} is the set of vertices E = {ei = {vi, vj} : vi, vj V} is the set of edges In directed graph the edges have a direction E = {ei = (vi, vj) : vi, vj V} And edge from a vertex to itself is loop A graph that does not have loops is simple The degree of a vertex v, d(v), is the number of edges attached to it, d(v) = {{v, u} E : u V} In directed graphs vertices have in-degree id(v) and outdegree od(v) T II.Intro-4

Subgraphs A graph H = (VH, EH) is a subgraph of G = (V, E) if VH V EH E The edges in EH are between vertices in VH If V V is a set of vertices, then G = (V, E ) is the induced subgraph if For all vi, vj V such that {vi, vj} E, {vi, vj} E Subgraph K = (VK, EK) of G is a clique if For all vi, vj VK, {vi, vj} EK Cliques are also called complete subgraphs T II.Intro-5

Bipartite Graphs A graph G = (V, E) is bipartite if V can be partitioned into two sets U and W such that U W = and U W = V (a partition) For all {vi, vj} E, vi U and vj W No edges within U and no edges within W Any subgraph of a bipartite graph is also bipartite A biclique is a complete bipartite subgraph K = (U V, E) For all u U and v V, edge {u, v} E T II.Intro-6

Paths and Distances A walk in graph G between vertices x and y is an ordered sequence x = v0, v1, v2,, vt 1, vt = y {vi 1, vi} E for all i = 1,, t If x = y, the walk is closed The same vertex can re-appear in the walk many times A trail is a walk where edges are distinct {vi 1, vi} {vj 1, vj} for i j A path is a walk where vertices are distinct vi vj for i j A closed path with t 3 is a cycle The distance between x and y, d(x, y) is the length of the shortest path between them T II.Intro-7

Connectedness Two vertices x and y are connected if there is a path between them A graph is connected if all pairs of its vertices are connected A connected component of a graph is a maximal connected subgraph A directed graph is strongly connected if there is a directed path between all ordered pairs of its vertices It is weakly connected if it is connected only when considered as an undirected graph If a graph is not connected, it is disconnected T II.Intro-8

Example v 1 v 2 v 1 v 2 v 3 v 4 v 5 v 6 v 3 v 4 v 5 v 6 v 7 v 8 (a) v 7 v 8 (b) T II.Intro-9

Adjacency Matrix The adjacency matrix of an undirected graph G = (V, E) with V = n is the n-by-n symmetric binary matrix A with aij = 1 if and only if {vi, vj} E A weighted adjacency matrix has the weights of the edges For directed graphs, the adjacency matrix is not necessarily symmetric The bi-adjacency matrix of a bipartite graph G = (U V, E) with U = n and V = m is the n-by-m binary matrix B with bij = 1 if and only if {ui, vj} E T II.Intro-10

Topological Attributes The weighted degree of a vertex vi is d(vi) = j aij The average degree of a graph is the average of the degrees of its vertices, Σ i d(vi)/n Degree and average degree can be extended to directed graphs The average path length of a connected graph is the average of path lengths between all vertices  i  n d(v i,v j )/ 2 j>i = n(n 2 1)  i  d(v i,v j ) j>i T II.Intro-11

Eccentricity, Radius & Diameter The eccentricity of a vertex vi, e(vi), is its maximum distance to any other vertex, maxj{d(vi, vj)} The radius of a connected graph, r(g), is the minimum eccentricity of any vertex, mini{e(vi)} The diameter of a connected graph, d(g), is the maximum eccentricity of any vertex, maxi{e(vi)} = maxi,j{d(vi, vj)} The effective diameter of a graph is smallest number that is larger than the eccentricity of a large fraction of the vertices in the graph Large fraction e.g. 90% T II.Intro-12

Clustering Coefficient The clustering coefficient of vertex vi, C(vi), tells how clique-like the neighbourhood of vi is Let ni be the number of neighbours of vi and mi the number of edges between the neighbours of vi (vi excluded) ni C(v i )=m i / = 2m i 2 n i (n i 1) Well-defined only for vi with at least two neighbours For others, let C(vi) = 0 The clustering coefficient of the graph is the average clustering coefficient of the vertices: C(G) = n 1 ΣiC(vi) T II.Intro-13

Graph Mining Graphs can explain relations between objects Finding these relations is the task of graph mining The type of the relation depends on the task Graph mining is an umbrella term that encompasses many different techniques and problems Frequent subgraph mining Graph clustering Path analysis/building Influence propagation T II.Intro-14

Example: Tiling Databases Binary matrices define a bipartite graph A tile is a biclique of that graph Tiling is the task of finding 1 2 3 ( ) A B C 1 1 0 1 1 1 0 1 1 a minimum number of bicliques to cover all edges of a bipartite graph Or to find k bicliques to cover most of the edges 1 2 3 A B C T II.Intro-15

Example: The Characteristics of Erdős Graph Co-authorship graph of mathematicians 401K authors (vertices), 676K co-authorships (edges) Median degree = 1, mean = 3.36, standard deviation = 6.61 Large connected component of 268K vertices The radius of the component is 12 and diameter 23 Two vertices with eccentricity 12 Average distance between two vertices 7.64 (based on a sample) Eight degrees of separation The clustering coefficient is 0.14 http://www.oakland.edu/enp/ T II.Intro-16

Centrality Six degrees of Kevin Bacon Every actor is related to Kevin Bacon by no more than 6 hops Kevin Bacon has acted with many, that have acted with many others, that have acted with many others That makes Kevin Bacon a centre of the co-acting graph Although he s not the centre: the average distance to him is 2.994 but to Dennis Hopper it is only 2.802 http://oracleofbacon.org T II.Intro-17

Centrality Six degrees of Kevin Bacon Every actor is related to Kevin Bacon by no more than 6 hops Kevin Bacon has acted with many, that have acted with many others, that have acted with many others That makes Kevin Bacon a centre of the co-acting graph Although he s not the centre: the average distance to him is 2.994 but to Dennis Hopper it is only 2.802 http://oracleofbacon.org T II.Intro-17

Degree and Eccentricity Centrality Centrality is a function c: V R that induces a total order in V The higher the centrality of a vertex, the more important it is In degree centrality c(vi) = d(vi), the degree of the vertex In eccentricity centrality the least eccentric vertex is the most central one, c(vi) = 1/e(vi) The lest eccentric vertex is central The most eccentric vertex is peripheral T II.Intro-18

Closeness Centrality In closeness centrality the vertex with least distance to all other vertices is the centre! 1 c(v i )= Â j d(v i,v j ) In eccentricity centrality we aim to minimize the maximum distance In closeness centrality we aim to minimize the average distance This is the distance used to measure the centre of Hollywood T II.Intro-19

Betweenness Centrality The betweenness centrality measures the number of shortest paths that travel through vi Measures the monitoring role of the vertex All roads lead to Rome Let ηjk be the number of shortest paths between vj and vk and let ηjk(vi) be the number of those that include vi Let γjk(vi) = ηjk(vi)/ηjk Betweenness centrality is defined as c(v i )= j6=i  k6=i k> j g jk T II.Intro-20

Prestige In prestige, the vertex is more central if it has many incoming edges from other vertices of high prestige A is the adjacency matrix of the directed graph G p is n-dimensional vector giving the prestige of the vertices p = A T p Starting from an initial prestige vector p0, we get pk = A T pk 1 = A T (A T pk 2) = (A T ) 2 pk 2 = (A T ) 3 pk 3 = = (A T ) k p0 Vector p converges to the dominant eigenvector of A T Under some assumptions T II.Intro-21

PageRank PageRank uses normalized prestige to rank web pages If there is a vertex with no out-going edges, the prestige cannot be computed PageRank evades this problem by adding a small probability of a random jump to another vertex Random Surfer model Computing the PageRank is equivalent to computing the stationary distribution of a certain Markov chain Which is again equivalent to computing the dominant eigenvector T II.Intro-22

Graph Properties Several real-world graphs exhibit certain characteristics Studying what these are and explaining why they appear is an important area of network research As data miners, we need to understand the consequences of these characteristics Finding a result that can be explained merely by one of these characteristics is not interesting We also want to model graphs with these characteristics T II.Intro-23

Small-World Property A graph G is said to exhibit a small-world property if its average path length scales logarithmically, µl log n The six degrees of Kevin Bacon is based on this property Also the Erdős number How far a mathematician is from Hungarian combinatorist Paul Erdős A radius of a large, connected mathematical co-authorship network (268K authors) is 12 and diameter 23 T II.Intro-24

Scale-Free Property The degree distribution of a graph is the distribution of its vertex degrees How many vertices with degree 1, how many with degree 2, etc. f(k) is the number of edges with degree k A graph is said to exhibit scale-free property if f(k) k γ So-called power-law distribution Majority of vertices have small degrees, few have very high degrees Scale-free: f(ck) = α(ck) γ = (αc γ )k γ k γ T II.Intro-25

Example: WWW Links In-degree Out-degree Broder et al. Graph structure in the web. WWW 00 s = 2.09 s = 2.72 T II.Intro-26

Clustering Effect A graph exhibits clustering effect if the distribution of average clustering coefficient (per degree) follow the power law If C(k) is the average clustering coefficient of all vertices of degree k, then C(k) k γ The vertices with small degrees are part of highly clustered areas (high clustering coefficient) while hub vertices have smaller clustering coefficients T II.Intro-27

Random Graph Models Begin able to generate random graphs that exhibit these properties is very useful They tell us something how such graphs have come to be They let us study what we find in an average graph With some graph models, we can also make analytical studies of the properties What to expect T II.Intro-28

Erdős Rényi Graphs Two parameters: number of vertices n and number of edges m Samples uniformly from all such graphs Sample m edges u.a.r. without replacement Average degree is 2m/n Degree distribution follows Poisson, not power law Clustering coefficient is uniform Exhibits small-world property T II.Intro-29

Watts Strogatz Graphs Aims for high local clustering Starts with vertices in a ring, each connected to k neighbours left and right Adds random perturbations Edge rewiring: move the end-point of random edges to random vertices Edge shortcuts: add random edges between vertices Not scale-free High clustering coefficient for small amounts of perturbations Small diameter with some amount of perturbations T II.Intro-30

Example T II.Intro-31

Barabási Albert Graphs Mimics dynamic evolution of graphs Preferential attachment Starts with a regular graph At each time step, adds a new vertex u From u, adds q edges to other vertices Vertices are sampled proportional to their degree High degree, high probability to get more edges Degree distribution follows power law (with γ = 3) Ultra-small world behaviour Very small clustering coefficient T II.Intro-32