CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining Young-Rae Cho Associate Professor Department of Computer Science Balor Universit Graph Representation Graph An ordered pair GV,E with a set of vertices V and a set of edges E Etended Graph Representation Directed vs. undirected graph Whether each edge has a direction Weighted vs. unweighted graph Whether each edge has a weight Labeled vs. unlabeled graph Whether each verte has a label 2-D vs. 3-D graph representation Each verte has angles between two linked edges 1
Wh Graph Mining is Important? Data are often represented as a graph Biological networks Chemical compounds Internet WWW Electric circuits Workflows Social networks Graph is a general model for data mining!! Graph Data Mining Topics 1 Single Graph Mining Frequent sub-graph pattern mining Finding sub-graphs that frequentl occur in a graph Graph clustering Verte clustering Partitioning a graph into sub-graphs Verte classification Classifing a verte in a graph 2
Graph Data Mining Topics 2 Graph Dataset Mining Frequent sub-graph pattern mining Finding sub-graphs that frequentl occur among graphs Graph data clustering Grouping similar graphs Graph data classification Classifing a new graph id 1 2 graph 3 4 Applications Application of Single Graph Mining Biological network analsis Social network analsis Web communit analsis Application of Graph Dataset Mining Biochemical structure analsis Program control flow analsis XML structure analsis Challenges Finding the complete set satisfing the minimum support threshold Developing efficient and scalable algorithms Incorporating various kinds of user-specific constraints 3
CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Subgraph Pattern Mining Connectivit Degree, degv i The number of links from v i to other vertices Incoming degree and outgoing degree for directed graphs Weighted degree sum of the weights of the edges directl connected for weighted graphs A Set of eighbors, v i A set of vertices directl linked to the verte v i Also called adjacent neighbors or direct neighbors Degree Distribution, Pk Probabilit that a verte has eactl k links The number of vertices whose degree is k over the total number of vertices 4
Length & Sie Walk A sequence of vertices such that each is linked to its succeeding one Path A walk such that each verte in the walk is distinct Path Length The number of edges in path p Shortest Path between v i and v j A path with the smallest length out of all paths from v i to v j Characteristic Path Length of G Average length of the shortest paths between each pair of vertices Diameter of G Largest length of the shortest paths between each pair of vertices Densit Densit of GV,E The number of actual edges in G over the number of all possible edges 2 E D G V V 1 Clique A full connected graph also called, complete graph DG = 1 Quasi-Clique Close to clique A densel connected sub-graph DG > θ where θ is a user-specified threshold 5
Modularit Clustering Coefficient of v i The densit of a sub-graph G V,E where V is the set of neighbors of v i j v, v, v v j vk k i C vi v v 1 i i Measuring the effectiveness of v i on denseness Average Clustering Coefficient of GV,E Average of the clustering coefficients of all vertices in V Maimum is 1 Measuring the modularit of G Centralit Closeness, C c v i Detects the vertices located in the center of a graph CC vi v j V 1 p v, v s i j where p s v i,v j is the shortest path length between v i and v j Betweenness, C B v i Detects the vertices located between two clusters C v B i svi tv st vi st where σ st is the number of shortest paths between s and t, and σ st v i is the number of shortest paths between s and t, which pass through the verte v i 6
CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Subgraph Pattern Mining Graph Clustering Problem Definition Finding densel connected sub-graphs G V,E from a graph GV,E Finding sub-graphs with dense intra-connections and sparse interconnections modularit Methods Densit-based methods Hierarchical methods Partition-based methods 7
CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Densit-based Methods Partition-based Methods Hierarchical Methods Subgraph Pattern Mining Maimum Clique Find all maimum sied cliques Use antimonotonic propert If a subset of set A is not a clique, then the set A is not a clique G A B C D E F J I K sie 2 cliques: {AB}, {AC}, {AE},.. sie 3 cliques: {ABC}, {ACE},.. sie 4 cliques: {JKLM} L M 8
Clique Percolation Definitions Two k-cliques are adjacent if the share k-1 vertices where k is the number of vertices in each clique A k-clique chain is a sub-graph comprising the union of a sequence of adjacent k cliques 1 Find all k-cliques 2 Find all maimal k-clique chains b iterative merging adjacent k-cliques Reference Palla, G., et al., Uncovering the overlapping communit structure of comple networks in nature and societ, ature 2005 k-core Decomposition Definition k-core is a sub-graph b pruning all vertices whose degree is less than k Remove repeatedl all vertices whose degree is less than k A B C D E F J G I K k = 3 L Reference Wuch, S. and Almaas, E., Peeling the east protein network, Proteomics 2005 M 9
Seed Growth Main Idea Search for local optimiation Use a modularit or densit function Tpes of seeds Random seeds: selected randoml Core seeds: selected b degree or clustering coefficient 1 Select a verte seed as an initial cluster S 2 Add a neighbor of the seed to S repeatedl to find the maimum modularit or densit 3 Return S if the modularit of S > threshold 4 Repeat 1, 2 and 3 to find a set of clusters Graph Entrop 1 Definitions An eample of seed-growth algorithms General notation Inner links of v in G V,E : edges from v to the vertices in V p i v: probabilit of v having inner links Outer links of v in G V,E : edges from v to the vertices not in V p o v: probabilit of v having outer links Definitions Verte entrop: ev = - p i v log 2 p i v - p o v log 2 p o v Graph entrop : egv,e = Σ vv ev Find the minimum graph entrop during seed growth 10
Graph Entrop 2 Eample Graph Entrop 3 1 Select a seed node, and include all neighbors of the seed node into a seed cluster 2 Iterativel remove a neighbor if removal decreases graph entrop 3 Iterativel add a node on the outer boundar of a current cluster if addition decreases graph entrop 4 Output the cluster with the minimal graph entrop 5 Repeat 1, 2, 3, and 4 until no seed node remains Reference Kenle, E.C. and Cho, Y.-R., Detecting protein complees and functional modules from protein interaction networks: a graph entrop approach, Proteomics 2011 11
CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Densit-based Methods Partition-based Methods Hierarchical Methods Subgraph Pattern Mining Restricted eighborhood Search 1 Main Idea Iterative moves of vertices to find the best global modularit Tpes of moves Global move: moving a random verte to a random cluster Intensification move: moving in the restricted neighborhood vertices on the boundar of partitions 1 Randoml partition the graph into k subgraphs 2 Make an intensification move of a random verte if this move improves modularit 3 Repeat 2 until finding the best modularit 12
Restricted eighborhood Search 2 Eample Modularit function: number of interconnecting edges between clusters G B D F I A C E J K L M Reference King, A., et al., Protein comple prediction via cost-based clustering Bioinformatics 2004 CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Densit-based Methods Partition-based Methods Hierarchical Methods Subgraph Pattern Mining 13
Bottom-Up vs. Top-Down Bottom-Up Agglomerative Approaches Start with each verte as a cluster Iterativel merge the closest clusters Require a distance function between two clusters Top-Down Divisive Approaches Start with the whole graph as a cluster Recursivel divide up the clusters Require a cutting algorithm Merging b Shortest Path Length Main Idea Agglomerative approach using single-link distance 1 Select two closest vertices from different clusters based on the shortest path length between them 2 Merge two clusters that include the selected vertices 3 Repeat 1 and 2 until the shortest path length reaches a threshold 14
15 Merging b Common eighbors 1 Main Idea Agglomerative approach using the similarit based on common neighbors More common neighbors two vertices share, more similar the are 1 Find the most similar vertices from different clusters based on a similarit function 2 Merge the two clusters if the merged cluster reaches a densit threshold 3 Repeat 1 and 2 until no more clusters can be merged Similarit Functions Jaccard coefficient: Geometric coefficient:, S, 2 S Merging b Common eighbors 2 More Similarit Functions Dice coefficient: Simpson coefficient: Marland bridge coefficient: Reference Brun, C., et al., Clustering proteins from interaction networks for the prediction of cellular functions BMC Bioinformatics 2004 2, S, min, S 2 1, S
Merging b Statistical Significance Statistical Similarit Function Hper-geometric coefficient P-value: V V Z V X Z X Z Y Z P V V X Y V is the total number of vertices, X=, Y=, and Z= for vertices and Repeatedl find the vertices with the smallest P-value, and merge them Reference Samanta, M.P. and Liang, S., Predicting protein functions from redundancies in large-scale protein interaction networks PAS 2003 Minimum Cut Definitions Cut: a set of edges whose removal disconnects the graph Minimum cut: a cut with minimum number of edges Recursivel find the minimum cut B D F G I A Parameter C E J K Minimum densit threshold Minimum sie threshold L M 16
Betweenness Cut Betweenness Measurement of vertices or edges located between clusters 1 Iterativel eliminate a verte or an edge with the highest Betweenness value until the graph is separated 2 Recursivel appl 1 into each subgraph 3 Repeat 1 and 2 until all subgraphs reach a densit threshold Reference Dunn, R., et al., The use of edge-betweenness clustering to investigate biological function in protein interaction networks BMC Bioinformatics 2005 Dividing b Common eighbors Main Idea Divisive approach using the dissimilarit based on common neighbors Less common neighbors two vertices share, more dissimilar the are 1 Iterativel eliminate the edge between the most dissimilar vertices based on a similarit function, until the graph is separated 2 Recursivel appl 1 into each subgraph 3 Repeat 1 and 2 until all subgraphs reach a densit threshold Reference Radicchi, F., et al., Defining and identifing communities in networks PAS 2004 17
CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Subgraph Pattern Mining Properties of Subgraph Patterns from Graph Dataset Properties Anti-monotonic propert Apriori algorithm If a sub-graph G is not frequent, then none of the super-graphs of G are frequent id graph Eample If is infrequent, so do and 1 2 3 4 18
Properties of Subgraph Patterns from a Single Graph Properties Frequent sub-graph pattern mining in a graph ot anti-monotonic propert! Even if a sub-graph G is not frequent, some of the super-graphs of G might be frequent Eample Suppose minimum support is 10 How man? How man? FSG Algorithm FSG Frequent Sub-graph discover 1 Initiall, find all frequent sie-3 sub-graphs 2 Generate candidate sie-k+1 sub-graphs from frequent sie-k sub-graphs 3 Count support of each candidate sub-graph to select frequent sub-graphs 4 Repeat 2 and 3 until no frequent sub-graph or no candidate is found 19
Candidate Sub-graphs Selective Joining Join two sie-k sub-graphs if the share sie-k-1 sub-graphs Ma join the same sie-k sub-graph Produce multiple distinct sie-k+1 sub-graphs + + Summar of FSG Algorithm Strength Apriori pruning Weakness Generates a huge set of candidate sub-graphs Requires multiple scans of database Inefficient for mining large-sied sub-graph patterns eeds efficient finding of isomorphic graphs to count support Reference Kuramochi, M. and Karpis, G., Frequent subgraph discover. In Proceedings of ICDM 2001 20
Structural Isomorphism Definition If two graphs are isomorphic, then the are structurall identical Eamples Canonical Adjacenc Matri Adjacenc Matri 0 1 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 1 0 0 Canonical Adjacenc Matri Canonical Code 1110100110 1 1 1 1 0 1 0 0 1 0 21
FFSM Algorithm FFSM Fast frequent sub-graph mining Main Idea Use canonical adjacenc matrices for selective joining and support a 1 1 b 1 b counting a 1 b 1 1 b 1 1 0 b Reference a a + 1 b 1 b 1 0 b 1 1 b 0 1 0 b 0 1 0 b a a 1 b 1 b + 1 1 b 1 1 b 1 0 1 b 1 1 1 b Huan, J., Wang, W. and Prins, J., Efficient mining of frequent subgraph in the presence of isomorphism. In Proceedings of ICDM 2003 DFS Code DFS Tree Trace on depth-first-search DFS Tree in a Leicographic Order DFS tree constructed b the leicographic order of labels DFS Code A sequence of edges of the DFS tree in a leicographic order {,,,,,,,,,,,} 22
gspan Algorithm gspan Graph-based sub-structure pattern mining Main Idea Use DFS codes for frequent sub-graph pattern mining Efficient for pattern growth Reference Yan, X. and Han, J., gspan: Graph-based sub-structure pattern mining, In Proceedings of ICDM 2002 Yan, X. and Han, J., CloseGraph: Mining closed frequent graph patterns, In Proceedings of KDD 2003 Questions? Lecture Slides on the Course Website, www.ecs.balor.edu/facult/cho/4352 23