Chapters 11 and 13, Graph Data Mining

Similar documents
Data Mining in Bioinformatics Day 5: Graph Mining

Data Mining in Bioinformatics Day 3: Graph Mining

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining

Unsupervised Learning. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

gspan: Graph-Based Substructure Pattern Mining

Chapter 13, Sequence Data Mining

Community Detection. Community

Lecture 3, Review of Algorithms. What is Algorithm?

CS6220: DATA MINING TECHNIQUES

Data mining, 4 cu Lecture 8:

Web Structure Mining Community Detection and Evaluation

Graph mining-based Image Indexing

Clustering fundamentals

Community detection algorithms survey and overlapping communities. Presented by Sai Ravi Kiran Mallampati

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

MCL. (and other clustering algorithms) 858L

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Online Social Networks and Media. Community detection

Pattern Mining in Frequent Dynamic Subgraphs

Gene Clustering & Classification

Mining frequent Closed Graph Pattern

Clustering Part 2. A Partitional Clustering

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Graph Mining and Social Network Analysis

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Brief description of the base clustering algorithms

! Introduction. ! Partitioning methods. ! Hierarchical methods. ! Model-based methods. ! Density-based methods. ! Scalability

Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

9/17/2009. Wenyan Li (Emily Li) Sep. 15, Introduction to Clustering Analysis

Community detection. Leonid E. Zhukov

Clusters and Communities

Mining Social Network Graphs

Introduction to Graph Theory

Comparative Study of Subspace Clustering Algorithms

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Data Clustering Hierarchical Clustering, Density based clustering Grid based clustering

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Tan,Steinbach, Kumar Introduction to Data Mining 4/18/ Tan,Steinbach, Kumar Introduction to Data Mining 4/18/

Clustering in Data Mining

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Subdue: Compression-Based Frequent Pattern Discovery in Graph Data

Graph Theory. ICT Theory Excerpt from various sources by Robert Pergl

How and what do we see? Segmentation and Grouping. Fundamental Problems. Polyhedral objects. Reducing the combinatorics of pose estimation

Efficient Subgraph Matching by Postponing Cartesian Products

Clustering Part 4 DBSCAN

Supervised and Unsupervised Learning (II)

Unsupervised Learning

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Tight Clustering: a method for extracting stable and tight patterns in expression profiles

and Algorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 8/30/ Introduction to Data Mining 08/06/2006 1

Data Mining Concepts

Graphs. Tessema M. Mengistu Department of Computer Science Southern Illinois University Carbondale Room - Faner 3131

Machine Learning 15/04/2015. Supervised learning vs. unsupervised learning. What is Cluster Analysis? Applications of Cluster Analysis

Clustering. CE-717: Machine Learning Sharif University of Technology Spring Soleymani

Clustering Lecture 9: Other Topics. Jing Gao SUNY Buffalo

University of Florida CISE department Gator Engineering. Clustering Part 4

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

Survey on Graph Query Processing on Graph Database. Presented by FAN Zhe

Cross-validation for detecting and preventing overfitting

Non Overlapping Communities

Chapter 1, Introduction

Data Mining for Knowledge Management. Association Rules

CLOSET+:Searching for the Best Strategies for Mining Frequent Closed Itemsets

Frequent Pattern-Growth Approach for Document Organization

DATA MINING II - 1DL460

Strength of Co-authorship Ties in Clusters: a Comparative Analysis

Introduction to Mobile Robotics

Discovering Frequent Geometric Subgraphs

Graph Theory for Network Science

Extracting Information from Complex Networks

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 7. Introduction to Data Mining, 2 nd Edition

CAIM: Cerca i Anàlisi d Informació Massiva

Association Pattern Mining. Lijun Zhang

CSE 5243 INTRO. TO DATA MINING

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

gprune: A Constraint Pushing Framework for Graph Pattern Mining

Chapter 5, Data Cube Computation

Outline. Graphs. Divide and Conquer.

KeyLabel Algorithms for Keyword Search in Large Graphs

Social Data Management Communities

Lesson 3. Prof. Enza Messina

Chapter 4: Association analysis:

Exploratory Analysis: Clustering

CUT: Community Update and Tracking in Dynamic Social Networks

Mining Mixed-drove Co-occurrence Patterns For Large Spatio-temporal Data Sets: A Summary of Results

V10 Metabolic networks - Graph connectivity

INF4820. Clustering. Erik Velldal. Nov. 17, University of Oslo. Erik Velldal INF / 22

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Clustering CS 550: Machine Learning

CSE 5243 INTRO. TO DATA MINING

CS570: Introduction to Data Mining

Cluster analysis. Agnieszka Nowak - Brzezinska

Big Data Management and NoSQL Databases

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms

Data Mining. Cluster Analysis: Basic Concepts and Algorithms

CS224W: Analysis of Networks Jure Leskovec, Stanford University

CS145: INTRODUCTION TO DATA MINING

Graph Connectivity G G G

Discussion: Clustering Random Curves Under Spatial Dependence

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Transcription:

CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining Young-Rae Cho Associate Professor Department of Computer Science Balor Universit Graph Representation Graph An ordered pair GV,E with a set of vertices V and a set of edges E Etended Graph Representation Directed vs. undirected graph Whether each edge has a direction Weighted vs. unweighted graph Whether each edge has a weight Labeled vs. unlabeled graph Whether each verte has a label 2-D vs. 3-D graph representation Each verte has angles between two linked edges 1

Wh Graph Mining is Important? Data are often represented as a graph Biological networks Chemical compounds Internet WWW Electric circuits Workflows Social networks Graph is a general model for data mining!! Graph Data Mining Topics 1 Single Graph Mining Frequent sub-graph pattern mining Finding sub-graphs that frequentl occur in a graph Graph clustering Verte clustering Partitioning a graph into sub-graphs Verte classification Classifing a verte in a graph 2

Graph Data Mining Topics 2 Graph Dataset Mining Frequent sub-graph pattern mining Finding sub-graphs that frequentl occur among graphs Graph data clustering Grouping similar graphs Graph data classification Classifing a new graph id 1 2 graph 3 4 Applications Application of Single Graph Mining Biological network analsis Social network analsis Web communit analsis Application of Graph Dataset Mining Biochemical structure analsis Program control flow analsis XML structure analsis Challenges Finding the complete set satisfing the minimum support threshold Developing efficient and scalable algorithms Incorporating various kinds of user-specific constraints 3

CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Subgraph Pattern Mining Connectivit Degree, degv i The number of links from v i to other vertices Incoming degree and outgoing degree for directed graphs Weighted degree sum of the weights of the edges directl connected for weighted graphs A Set of eighbors, v i A set of vertices directl linked to the verte v i Also called adjacent neighbors or direct neighbors Degree Distribution, Pk Probabilit that a verte has eactl k links The number of vertices whose degree is k over the total number of vertices 4

Length & Sie Walk A sequence of vertices such that each is linked to its succeeding one Path A walk such that each verte in the walk is distinct Path Length The number of edges in path p Shortest Path between v i and v j A path with the smallest length out of all paths from v i to v j Characteristic Path Length of G Average length of the shortest paths between each pair of vertices Diameter of G Largest length of the shortest paths between each pair of vertices Densit Densit of GV,E The number of actual edges in G over the number of all possible edges 2 E D G V V 1 Clique A full connected graph also called, complete graph DG = 1 Quasi-Clique Close to clique A densel connected sub-graph DG > θ where θ is a user-specified threshold 5

Modularit Clustering Coefficient of v i The densit of a sub-graph G V,E where V is the set of neighbors of v i j v, v, v v j vk k i C vi v v 1 i i Measuring the effectiveness of v i on denseness Average Clustering Coefficient of GV,E Average of the clustering coefficients of all vertices in V Maimum is 1 Measuring the modularit of G Centralit Closeness, C c v i Detects the vertices located in the center of a graph CC vi v j V 1 p v, v s i j where p s v i,v j is the shortest path length between v i and v j Betweenness, C B v i Detects the vertices located between two clusters C v B i svi tv st vi st where σ st is the number of shortest paths between s and t, and σ st v i is the number of shortest paths between s and t, which pass through the verte v i 6

CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Subgraph Pattern Mining Graph Clustering Problem Definition Finding densel connected sub-graphs G V,E from a graph GV,E Finding sub-graphs with dense intra-connections and sparse interconnections modularit Methods Densit-based methods Hierarchical methods Partition-based methods 7

CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Densit-based Methods Partition-based Methods Hierarchical Methods Subgraph Pattern Mining Maimum Clique Find all maimum sied cliques Use antimonotonic propert If a subset of set A is not a clique, then the set A is not a clique G A B C D E F J I K sie 2 cliques: {AB}, {AC}, {AE},.. sie 3 cliques: {ABC}, {ACE},.. sie 4 cliques: {JKLM} L M 8

Clique Percolation Definitions Two k-cliques are adjacent if the share k-1 vertices where k is the number of vertices in each clique A k-clique chain is a sub-graph comprising the union of a sequence of adjacent k cliques 1 Find all k-cliques 2 Find all maimal k-clique chains b iterative merging adjacent k-cliques Reference Palla, G., et al., Uncovering the overlapping communit structure of comple networks in nature and societ, ature 2005 k-core Decomposition Definition k-core is a sub-graph b pruning all vertices whose degree is less than k Remove repeatedl all vertices whose degree is less than k A B C D E F J G I K k = 3 L Reference Wuch, S. and Almaas, E., Peeling the east protein network, Proteomics 2005 M 9

Seed Growth Main Idea Search for local optimiation Use a modularit or densit function Tpes of seeds Random seeds: selected randoml Core seeds: selected b degree or clustering coefficient 1 Select a verte seed as an initial cluster S 2 Add a neighbor of the seed to S repeatedl to find the maimum modularit or densit 3 Return S if the modularit of S > threshold 4 Repeat 1, 2 and 3 to find a set of clusters Graph Entrop 1 Definitions An eample of seed-growth algorithms General notation Inner links of v in G V,E : edges from v to the vertices in V p i v: probabilit of v having inner links Outer links of v in G V,E : edges from v to the vertices not in V p o v: probabilit of v having outer links Definitions Verte entrop: ev = - p i v log 2 p i v - p o v log 2 p o v Graph entrop : egv,e = Σ vv ev Find the minimum graph entrop during seed growth 10

Graph Entrop 2 Eample Graph Entrop 3 1 Select a seed node, and include all neighbors of the seed node into a seed cluster 2 Iterativel remove a neighbor if removal decreases graph entrop 3 Iterativel add a node on the outer boundar of a current cluster if addition decreases graph entrop 4 Output the cluster with the minimal graph entrop 5 Repeat 1, 2, 3, and 4 until no seed node remains Reference Kenle, E.C. and Cho, Y.-R., Detecting protein complees and functional modules from protein interaction networks: a graph entrop approach, Proteomics 2011 11

CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Densit-based Methods Partition-based Methods Hierarchical Methods Subgraph Pattern Mining Restricted eighborhood Search 1 Main Idea Iterative moves of vertices to find the best global modularit Tpes of moves Global move: moving a random verte to a random cluster Intensification move: moving in the restricted neighborhood vertices on the boundar of partitions 1 Randoml partition the graph into k subgraphs 2 Make an intensification move of a random verte if this move improves modularit 3 Repeat 2 until finding the best modularit 12

Restricted eighborhood Search 2 Eample Modularit function: number of interconnecting edges between clusters G B D F I A C E J K L M Reference King, A., et al., Protein comple prediction via cost-based clustering Bioinformatics 2004 CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Densit-based Methods Partition-based Methods Hierarchical Methods Subgraph Pattern Mining 13

Bottom-Up vs. Top-Down Bottom-Up Agglomerative Approaches Start with each verte as a cluster Iterativel merge the closest clusters Require a distance function between two clusters Top-Down Divisive Approaches Start with the whole graph as a cluster Recursivel divide up the clusters Require a cutting algorithm Merging b Shortest Path Length Main Idea Agglomerative approach using single-link distance 1 Select two closest vertices from different clusters based on the shortest path length between them 2 Merge two clusters that include the selected vertices 3 Repeat 1 and 2 until the shortest path length reaches a threshold 14

15 Merging b Common eighbors 1 Main Idea Agglomerative approach using the similarit based on common neighbors More common neighbors two vertices share, more similar the are 1 Find the most similar vertices from different clusters based on a similarit function 2 Merge the two clusters if the merged cluster reaches a densit threshold 3 Repeat 1 and 2 until no more clusters can be merged Similarit Functions Jaccard coefficient: Geometric coefficient:, S, 2 S Merging b Common eighbors 2 More Similarit Functions Dice coefficient: Simpson coefficient: Marland bridge coefficient: Reference Brun, C., et al., Clustering proteins from interaction networks for the prediction of cellular functions BMC Bioinformatics 2004 2, S, min, S 2 1, S

Merging b Statistical Significance Statistical Similarit Function Hper-geometric coefficient P-value: V V Z V X Z X Z Y Z P V V X Y V is the total number of vertices, X=, Y=, and Z= for vertices and Repeatedl find the vertices with the smallest P-value, and merge them Reference Samanta, M.P. and Liang, S., Predicting protein functions from redundancies in large-scale protein interaction networks PAS 2003 Minimum Cut Definitions Cut: a set of edges whose removal disconnects the graph Minimum cut: a cut with minimum number of edges Recursivel find the minimum cut B D F G I A Parameter C E J K Minimum densit threshold Minimum sie threshold L M 16

Betweenness Cut Betweenness Measurement of vertices or edges located between clusters 1 Iterativel eliminate a verte or an edge with the highest Betweenness value until the graph is separated 2 Recursivel appl 1 into each subgraph 3 Repeat 1 and 2 until all subgraphs reach a densit threshold Reference Dunn, R., et al., The use of edge-betweenness clustering to investigate biological function in protein interaction networks BMC Bioinformatics 2005 Dividing b Common eighbors Main Idea Divisive approach using the dissimilarit based on common neighbors Less common neighbors two vertices share, more dissimilar the are 1 Iterativel eliminate the edge between the most dissimilar vertices based on a similarit function, until the graph is separated 2 Recursivel appl 1 into each subgraph 3 Repeat 1 and 2 until all subgraphs reach a densit threshold Reference Radicchi, F., et al., Defining and identifing communities in networks PAS 2004 17

CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining General Definitions Graph Clustering Subgraph Pattern Mining Properties of Subgraph Patterns from Graph Dataset Properties Anti-monotonic propert Apriori algorithm If a sub-graph G is not frequent, then none of the super-graphs of G are frequent id graph Eample If is infrequent, so do and 1 2 3 4 18

Properties of Subgraph Patterns from a Single Graph Properties Frequent sub-graph pattern mining in a graph ot anti-monotonic propert! Even if a sub-graph G is not frequent, some of the super-graphs of G might be frequent Eample Suppose minimum support is 10 How man? How man? FSG Algorithm FSG Frequent Sub-graph discover 1 Initiall, find all frequent sie-3 sub-graphs 2 Generate candidate sie-k+1 sub-graphs from frequent sie-k sub-graphs 3 Count support of each candidate sub-graph to select frequent sub-graphs 4 Repeat 2 and 3 until no frequent sub-graph or no candidate is found 19

Candidate Sub-graphs Selective Joining Join two sie-k sub-graphs if the share sie-k-1 sub-graphs Ma join the same sie-k sub-graph Produce multiple distinct sie-k+1 sub-graphs + + Summar of FSG Algorithm Strength Apriori pruning Weakness Generates a huge set of candidate sub-graphs Requires multiple scans of database Inefficient for mining large-sied sub-graph patterns eeds efficient finding of isomorphic graphs to count support Reference Kuramochi, M. and Karpis, G., Frequent subgraph discover. In Proceedings of ICDM 2001 20

Structural Isomorphism Definition If two graphs are isomorphic, then the are structurall identical Eamples Canonical Adjacenc Matri Adjacenc Matri 0 1 1 1 0 1 0 1 0 0 1 1 0 1 1 1 0 1 0 0 0 0 1 0 0 Canonical Adjacenc Matri Canonical Code 1110100110 1 1 1 1 0 1 0 0 1 0 21

FFSM Algorithm FFSM Fast frequent sub-graph mining Main Idea Use canonical adjacenc matrices for selective joining and support a 1 1 b 1 b counting a 1 b 1 1 b 1 1 0 b Reference a a + 1 b 1 b 1 0 b 1 1 b 0 1 0 b 0 1 0 b a a 1 b 1 b + 1 1 b 1 1 b 1 0 1 b 1 1 1 b Huan, J., Wang, W. and Prins, J., Efficient mining of frequent subgraph in the presence of isomorphism. In Proceedings of ICDM 2003 DFS Code DFS Tree Trace on depth-first-search DFS Tree in a Leicographic Order DFS tree constructed b the leicographic order of labels DFS Code A sequence of edges of the DFS tree in a leicographic order {,,,,,,,,,,,} 22

gspan Algorithm gspan Graph-based sub-structure pattern mining Main Idea Use DFS codes for frequent sub-graph pattern mining Efficient for pattern growth Reference Yan, X. and Han, J., gspan: Graph-based sub-structure pattern mining, In Proceedings of ICDM 2002 Yan, X. and Han, J., CloseGraph: Mining closed frequent graph patterns, In Proceedings of KDD 2003 Questions? Lecture Slides on the Course Website, www.ecs.balor.edu/facult/cho/4352 23