Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining


1 Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining
  Chloé-Agathe Azencott & Karsten Borgwardt
  February 18 to March 1, 2013
  Machine Learning & Computational Biology Research Group
  Max Planck Institutes Tübingen and Eberhard Karls Universität Tübingen
  Chloé-Agathe Azencott: Data Mining in Bioinformatics, Page 1

2 Graphs are everywhere
  [Figure panels: coexpression network, social network, protein structure, program flow, chemical compound]

3 Mining graph data
  - Graph comparison. E.g. compare protein-protein interaction networks (PPIN) between species.
  - Graph classification / regression: predict properties of objects represented as graphs. E.g. predict the toxicity of a molecular compound or the functionality of a protein.
  - Graph node classification / regression: predict properties of objects connected on a graph. E.g. predict the functionality of a protein, classify pixels in remote sensing images.

4 Mining graph data
  - Graph compression: representing graphs compactly. E.g. store and mine web data.
  - Graph clustering: finding dense subnetworks of graphs. E.g. find groups in social networks.
  - Link prediction: predicting relationships between nodes of the graph. E.g. predict who should be added to your social network, predict interactions between proteins.

5 Graph pattern mining
  Graph pattern mining
  - Find frequent / informative graph patterns
  - Summarize patterns
  - Approximate patterns
  Applications
  - Finding biologically conserved subnetworks
  - Finding functional modules
  - Program control flow analysis
  - Intrusion detection
  - Building blocks for graph classification, clustering, compression, comparison

6 Frequent Pattern Mining

7 Frequent pattern mining
  Frequent item set mining
  - Market basket analysis: find items that are frequently purchased together
  Given
  - a set $B = \{i_1, i_2, \dots, i_n\}$ of items,
  - a list $T = \{t_1, t_2, \dots, t_m\}$ of transactions $t_j \subseteq B$,
  - a minimum number of occurrences $s_{\min} \in \mathbb{N}$,
  find the set of frequent item sets, i.e.
  $F(s_{\min}) = \{ I \subseteq B : |\{k : I \subseteq t_k\}| \ge s_{\min} \}$
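The definition of $F(s_{\min})$ can be evaluated literally on small inputs. The following is a minimal sketch, not an efficient miner: the function names and the toy baskets are hypothetical, and the empty item set (trivially contained in every transaction) is left out.

```python
from itertools import combinations

def support(itemset, transactions):
    """Number of transactions t_k with itemset ⊆ t_k."""
    return sum(1 for t in transactions if itemset <= t)

def frequent_itemsets_bruteforce(items, transactions, s_min):
    """Return {itemset: support} for every non-empty I ⊆ items with support >= s_min."""
    frequent = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(sorted(items), k):
            candidate = frozenset(combo)
            s = support(candidate, transactions)
            if s >= s_min:
                frequent[candidate] = s
    return frequent

# Hypothetical toy market-basket data (not taken from the slides).
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]
result = frequent_itemsets_bruteforce({"bread", "butter", "milk"}, baskets, s_min=2)
```

With `s_min=2`, every single item and every pair is frequent, while the full triple (support 1) is not.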

8 A Priori [Agrawal et al., 1994]
  Brute force approach
  - Enumerate all $2^n$ subsets of $B$
  - Count how many of the transactions $t_1, \dots, t_m$ include each of them
  - Generally infeasible
  The a-priori property
  - No superset of an infrequent item set can be frequent
  - All subsets of a frequent item set are frequent

9 A Priori
  The a-priori algorithm
  - List all singletons, discard the infrequent ones
  - Form pairs of frequent elements, discard infrequent ones
  - ...
  - Augment the frequent sets of size $k - 1$ to form all sets of size $k$, discard infrequent ones
  Alternate between candidate generation and pruning.
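The alternation between candidate generation and pruning can be sketched as follows. This is an illustrative level-wise implementation under simplifying assumptions (candidates are formed by joining any two frequent sets of size $k-1$, with no canonical-form bookkeeping); the name `apriori` and the example data are hypothetical.

```python
from itertools import combinations

def apriori(transactions, s_min):
    """Level-wise search: returns {frozenset: support} of all frequent item sets."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Level 1: frequent singletons.
    items = {i for t in transactions for i in t}
    level = {c: s for c, s in count({frozenset([i]) for i in items}).items()
             if s >= s_min}
    frequent = dict(level)
    k = 2
    while level:
        # Candidate generation: join frequent (k-1)-sets whose union has size k.
        candidates = {a | b for a, b in combinations(list(level), 2)
                      if len(a | b) == k}
        # A-priori pruning: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level
                             for sub in combinations(c, k - 1))}
        # Support counting, then discard the infrequent candidates.
        level = {c: s for c, s in count(candidates).items() if s >= s_min}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical example data.
data = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
freq3 = apriori(data, s_min=3)
freq2 = apriori(data, s_min=2)
```

At `s_min=3` the triple `{a, b, c}` (support 2) is counted as a candidate and discarded; at `s_min=2` it is kept.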

10 A Priori
  Generating unique candidates
  - There are $k!$ ways of generating a single set of $k$ items
  - Ensure we do it only once
  - Idea: assign a unique parent set to each set
  Canonical form
  - The set of possible parents of an item set $I$ is the set of its maximal proper subsets: $\{J \subset I : \nexists K : J \subset K \subset I\}$
  - Put an ordering on $B$: $i_1 < i_2 < \dots < i_n$
  - Define the canonical parent of $I$ as $p_c(I) = I \setminus \{\max_{a \in I} a\}$

11 A Priori
  Canonical code words
  - Code word for $I \subseteq B$: any word $w$ on the alphabet $B$ whose letters are exactly the items of $I$
  - Canonical code word $w_c(I)$ of $I$: the smallest of these words, in lexicographic order. E.g. $\{a, c, b, e\} \to abce$
  - The canonical parent $p_c(I)$ of $I$ is described by the longest proper prefix of $w_c(I)$
  - Prefix property: the longest proper prefix of a canonical code word is a canonical code word itself. Equivalently, any prefix of a canonical code word is a canonical code word itself.

12 A Priori
  Candidate set generation
  - From frequent item sets of size $k - 1$, construct item sets of size $k$ by appending (frequent) items to their canonical code words
  - Only do so for items greater than the last letter of the canonical code word
  - E.g. $abe \to abef, abeg$, but not $abec$ (since $c < e$)
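The extension rule is enough to enumerate every item set exactly once. A minimal sketch, assuming items are single characters so that a code word is a plain string and the item order is the character order (the function names are hypothetical):

```python
def extensions(code_word, alphabet):
    """Children of a canonical code word: append only items greater than its last letter."""
    last = code_word[-1] if code_word else ""
    return [code_word + item for item in alphabet if item > last]

def prefix_tree(alphabet):
    """Enumerate every non-empty item set exactly once, level by level."""
    levels, current = [], [""]
    while True:
        current = [child for word in current
                   for child in extensions(word, alphabet)]
        if not current:
            return levels
        levels.append(current)

children_of_abe = extensions("abe", "abcdefg")
levels = prefix_tree("abcd")
```

For `B = {a, b, c, d}` this produces 4 + 6 + 4 + 1 = 15 code words, one per non-empty subset, with no duplicates to filter out. A real miner would additionally drop infrequent words at each level before extending them.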

13 A Priori
  Prefix tree: full prefix tree for $B = \{a, b, c, d\}$
  Level 1: a, b, c, d
  Level 2: ab, ac, ad, bc, bd, cd
  Level 3: abc, abd, acd, bcd
  Level 4: abcd

14 A Priori
  Pruning the prefix tree
  - Only generate unique item sets
  - A-priori property: prune branches at infrequent items
  - Size-based pruning
  T = {{a, b}, {a, b, c}, {b, c}, {b}, {b, d}, {d}, {a, c}, {b, c}, {d}, {a, c}, {b, c}, {b, c, d}, {d}, {b}, {b, c, d}, {b, c, d}}
  [Figure: prefix tree over {a, b, c, d} (a, b, c, d; ab, ac, ad, bc, bd, cd; abc, abd, acd, bcd; abcd), annotated with supports]
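The supports that annotate the prefix tree can be recomputed from the transaction list $T$. The sketch below counts the support of every node; the threshold `s_min = 3` is an assumption chosen for illustration, since the value used on the slide is not recoverable from the text.

```python
from itertools import combinations

# Transaction list T from the slide, written as strings of items.
T = [set(t) for t in ("ab", "abc", "bc", "b", "bd", "d", "ac", "bc",
                      "d", "ac", "bc", "bcd", "d", "b", "bcd", "bcd")]

# Support of every node of the full prefix tree over {a, b, c, d}.
supports = {
    "".join(c): sum(1 for t in T if set(c) <= t)
    for k in range(1, 5)
    for c in combinations("abcd", k)
}

s_min = 3  # assumed threshold, for illustration only
frequent = sorted((w for w, s in supports.items() if s >= s_min),
                  key=lambda w: (len(w), w))
```

With this threshold, `ab` (support 2) and `ad` (support 0) are infrequent, so `abc`, `abd`, `acd` and `abcd` can be pruned without counting them.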

15 Frequent pattern mining
  Exploring the search tree
  - Breadth-First Search: find all frequent sets of size $k$ before moving on to size $k + 1$. Example: A-priori.
  - Depth-First Search: find all frequent sets containing element $a$ before moving on to those that contain $b$ but do not contain $a$. Advantage: divide-and-conquer strategy, requires less memory. Examples: Eclat, FP-growth...
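A depth-first exploration can be sketched in the style of Eclat, where each item set carries the set of transaction ids that contain it. This is a simplified illustration, not the published algorithm in full; the name `eclat_sketch` is hypothetical, and the data is the transaction list from slide 14 with an assumed threshold of 3.

```python
def eclat_sketch(transactions, s_min):
    """Depth-first search over the prefix tree using transaction-id sets."""
    tidsets = {}
    for tid, t in enumerate(transactions):
        for item in t:
            tidsets.setdefault(item, set()).add(tid)
    frequent = {}

    def grow(prefix, candidates):
        # candidates: (item, tidset of prefix + item), items in increasing order
        for idx, (item, tids) in enumerate(candidates):
            if len(tids) < s_min:
                continue  # a-priori property: no superset can be frequent
            itemset = prefix + (item,)
            frequent[itemset] = len(tids)
            # Extend only with larger items so each set is visited once.
            extensions = [(other, tids & other_tids)
                          for other, other_tids in candidates[idx + 1:]]
            grow(itemset, extensions)

    grow((), sorted(tidsets.items(), key=lambda kv: kv[0]))
    return frequent

T = [set(t) for t in ("ab", "abc", "bc", "b", "bd", "d", "ac", "bc",
                      "d", "ac", "bc", "bcd", "d", "b", "bcd", "bcd")]
found = eclat_sketch(T, s_min=3)
visit_order = ["".join(itemset) for itemset in found]
```

The visiting order shows the divide-and-conquer structure: everything containing `a` is finished before any set starting with `b` is considered.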

16 Frequent Subgraph Mining

17 Graphs
  A graph is an ordered pair $G = (V, E)$
  - $V$ is a set of vertices (or nodes)
  - $E \subseteq V \times V$ is a set of edges (or links)
  - If the edges are ordered pairs, $G$ is directed; if not, $G$ is undirected
  A labeled graph is an ordered triplet $G = (V, E, l)$
  - $V$ is a set of vertices (or nodes)
  - $E \subseteq V \times V$ is a set of edges (or links)
  - $l : V \cup E \to A$ assigns labels to vertices and edges

18 Frequent subgraph mining
  Frequent subgraphs
  Given
  - a set $D = \{G_1, G_2, \dots, G_N\}$ of graphs,
  - a minimum frequency $\theta_{\min} \in [0, 1]$,
  find the set of frequent subgraphs, i.e.
  $F(\theta_{\min}) = \{ H : |\{i : H \text{ subgraph of } G_i\}| \ge N \theta_{\min} \}$
  - The frequency of subgraph $H$ is called the support of $H$: $\mathrm{supp}(H) = |\{i : H \text{ subgraph of } G_i\}|$
  - $\theta_{\min}$ is called the minimum support
  - Often focus on connected subgraphs

19 Frequent subgraph mining
  Example: call graphs
  [Figure: call graphs and their frequent subgraphs]

20 Frequent subgraph mining
  Example: chemical compounds
  [Figure: caffeine, theobromine, sildenafil, adenine]
  Frequent subgraphs: imidazole, purine

21 Frequent subgraph mining
  Subgraph isomorphism
  Let $G = (V_G, E_G, l_G)$ and $H = (V_H, E_H, l_H)$ be two labeled graphs. A subgraph isomorphism from $H$ to $G$ (or an occurrence of $H$ in $G$) is an injective function $f : V_H \to V_G$ such that:
  - $\forall v \in V_H : l_H(v) = l_G(f(v))$
  - $\forall (u, v) \in E_H : (f(u), f(v)) \in E_G$ and $l_H(u, v) = l_G(f(u), f(v))$
  There may be several (many) ways to map $H$ to $G$.
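The two conditions translate directly into a backtracking search that assigns the vertices of $H$ one at a time. The sketch below is a naive illustration for small undirected graphs and runs in exponential time in the worst case. The graph representation (a dict of vertex labels plus a dict of edge labels keyed by unordered vertex pairs), the helper names and the toy molecules are all assumptions; edge labels are assumed never to be `None`.

```python
def make_graph(node_labels, edge_labels):
    """Undirected labeled graph: vertex -> label, {u, v} -> label."""
    return {"nodes": dict(node_labels),
            "edges": {frozenset(e): lbl for e, lbl in edge_labels.items()}}

def edge_label(graph, u, v):
    """Label of the edge {u, v}, or None if the edge is absent."""
    return graph["edges"].get(frozenset((u, v)))

def occurrences(H, G):
    """Yield every injective, label-preserving map f: V_H -> V_G."""
    h_nodes = list(H["nodes"])

    def extend(f):
        if len(f) == len(h_nodes):
            yield dict(f)
            return
        u = h_nodes[len(f)]  # next unmapped vertex of H
        for x in G["nodes"]:
            if x in f.values() or H["nodes"][u] != G["nodes"][x]:
                continue  # injectivity or vertex-label condition fails
            # Each H-edge from u to a mapped vertex must exist in G with the same label.
            if all(edge_label(H, u, v) is None
                   or edge_label(G, x, f[v]) == edge_label(H, u, v)
                   for v in f):
                f[u] = x
                yield from extend(f)
                del f[u]

    yield from extend({})

def support(H, database):
    """Number of graphs in the database with at least one occurrence of H."""
    return sum(1 for G in database if next(occurrences(H, G), None) is not None)

# Hypothetical toy data: a C-N-C chain, a C-O pair, and the pattern C-N.
G1 = make_graph({1: "C", 2: "N", 3: "C"}, {(1, 2): "single", (2, 3): "single"})
G2 = make_graph({1: "C", 2: "O"}, {(1, 2): "single"})
H = make_graph({"u": "C", "v": "N"}, {("u", "v"): "single"})
maps_into_G1 = list(occurrences(H, G1))
```

The pattern occurs twice in `G1` (once per carbon) but still contributes only 1 to the support, since support counts graphs rather than occurrences.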

22 Frequent subgraph mining
  Graph isomorphism
  $G$ and $H$ are isomorphic if there exists a subgraph isomorphism from $G$ to $H$ and from $H$ to $G$.
  [Figure: two isomorphic graphs, one with vertices 1 to 6 and one with vertices A to F]
  f(1) = A, f(2) = C, f(3) = D, f(4) = B, f(5) = F, f(6) = E

23 Frequent subgraph mining
  Subgraph isomorphism
  - Testing whether there is a subgraph isomorphism between two graphs is generally NP-complete
  - Special cases: linear complexity for planar graphs (e.g. paths, trees, grids)
  Therefore:
  - Testing whether a subgraph occurs in the database is NP-complete
  - Testing whether a subgraph is isomorphic to an already identified subgraph is costly as well (no polynomial-time algorithm is known for graph isomorphism)

24 Frequent subgraph mining
  The a-priori property
  - No supergraph of an infrequent graph can be frequent
  - All subgraphs of a frequent graph are frequent
  - AGM [Inokuchi et al., 2000], FSG [Kuramochi and Karypis, 2001]
  Growing from $k$ to $k + 1$ isn't trivial
  - Eliminating non-frequent subgraphs of size $k + 1$ involves costly subgraph isomorphism tests
  Canonical representations of graphs
  - More difficult than with item sets
  - Spanning trees
  - Adjacency matrices

25 gSpan [Yan and Han, 2002]
  Spanning tree
  - A graph $G$ is called a tree if, for any pair of vertices of $G$, there exists one and only one path connecting them in $G$
  - A spanning tree of $G$ is a subgraph $S$ of $G$ that is a tree whose vertices are the vertices of $G$, i.e. $V_S = V_G$
  [Figure: a graph G and two spanning trees of G]

26 gSpan
  DFS trees
  - Explore $G$ in DFS order; one graph can have several DFS trees
  - Order vertices in discovery order $<_V$
  - $v_0$ is called the root
  - $v_n$ is called the right-most vertex
  - Right-most path: straight path from $v_0$ to $v_n$
  - Forward edges: edges in the DFS tree, $(i, j) : v_i <_V v_j$
  - Backward edges: edges not in the DFS tree, $(i, j) : v_j <_V v_i$
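These notions can be made concrete by running one DFS and recording what it finds. The sketch below assumes an undirected, connected graph given as adjacency lists. The function name and the example graph are hypothetical, and the DFS tree it returns is only one of several possible for this graph (it depends on the neighbor order).

```python
def dfs_tree(adj, root):
    """One DFS of an undirected graph.

    Returns (vertices in discovery order, forward edges, backward edges,
    right-most path), with edges and path given as discovery indices.
    """
    index = {root: 0}      # vertex -> discovery index
    parent = {root: None}
    forward, backward = [], []
    seen_edges = set()

    def visit(u):
        for w in adj[u]:
            e = frozenset((u, w))
            if e in seen_edges:
                continue
            seen_edges.add(e)
            if w not in index:   # tree edge: discovers a new vertex
                index[w] = len(index)
                parent[w] = u
                forward.append((index[u], index[w]))
                visit(w)
            else:                # non-tree edge: points back to an ancestor
                backward.append((index[u], index[w]))

    visit(root)
    order = sorted(index, key=index.get)
    # Right-most path: walk from the last discovered vertex up to the root.
    path, v = [], order[-1]
    while v is not None:
        path.append(index[v])
        v = parent[v]
    return order, forward, backward, path[::-1]

# Hypothetical example: triangle A-B-C with pendant vertices D (on C) and E (on B).
adj = {
    "A": ["B", "C"],
    "B": ["A", "C", "E"],
    "C": ["A", "B", "D"],
    "D": ["C"],
    "E": ["B"],
}
order, forward, backward, rightmost_path = dfs_tree(adj, "A")
```

Here the right-most vertex is `E` (index 4), so the right-most path is 0, 1, 4 and does not pass through `C` or `D`.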

27 gSpan
  Ordering edges
  $(i_1, j_1) <_E (i_2, j_2)$ if:
  - $(i_1, j_1)$ and $(i_2, j_2)$ forward: $j_1 < j_2$, or $j_1 = j_2 \wedge i_1 > i_2$
  - $(i_1, j_1)$ and $(i_2, j_2)$ backward: $i_1 < i_2$, or $i_1 = i_2 \wedge j_1 < j_2$
  - $(i_1, j_1)$ backward and $(i_2, j_2)$ forward: $i_1 < j_2$
  - $(i_1, j_1)$ forward and $(i_2, j_2)$ backward: $j_1 \le i_2$
  Examples: $(0, 1) <_E (0, 4)$, $(2, 0) <_E (3, 0)$, $(2, 0) <_E (2, 3)$
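A small comparator makes the four cases concrete. This is a sketch under stated assumptions: edges are pairs of discovery indices, forward when $i < j$ and backward when $i > j$, and the tie-breaking follows the convention of the gSpan literature (among forward edges with the same target, the one with the larger source comes first; a forward edge precedes a backward edge when $j_1 \le i_2$). The function names are hypothetical.

```python
from functools import cmp_to_key

def is_forward(edge):
    i, j = edge
    return i < j

def edge_less(e1, e2):
    """True if e1 <_E e2 under the four-case edge order."""
    (i1, j1), (i2, j2) = e1, e2
    f1, f2 = is_forward(e1), is_forward(e2)
    if f1 and f2:                  # both forward
        return j1 < j2 or (j1 == j2 and i1 > i2)
    if not f1 and not f2:          # both backward
        return i1 < i2 or (i1 == i2 and j1 < j2)
    if not f1 and f2:              # backward vs. forward
        return i1 < j2
    return j1 <= i2                # forward vs. backward

def edge_cmp(e1, e2):
    return -1 if edge_less(e1, e2) else (1 if edge_less(e2, e1) else 0)

# Edges of a hypothetical DFS tree: four forward edges and one backward edge.
edges = [(0, 1), (1, 2), (2, 3), (1, 4), (2, 0)]
code_order = sorted(edges, key=cmp_to_key(edge_cmp))
```

Sorting places the backward edge `(2, 0)` right after the forward edge `(1, 2)` that discovers vertex 2, and before any edge that discovers a later vertex.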

28 gSpan
  DFS lexicographic order
  - $\mathrm{code}(G, T) = (e_k)_{k=1,\dots,m}$ s.t. $e_k <_E e_{k+1}$ is the DFS code of the DFS tree $T$
  - If $<_L$ is a linear order on the labels, the lexicographic combination of $<_E$ and $<_L$ is a linear order $<_T$ over $E \times L \times L \times L$
  - Let $\alpha = (a_1, a_2, \dots, a_{m_\alpha})$ and $\beta = (b_1, b_2, \dots, b_{m_\beta})$ be 2 DFS codes. $\alpha \le \beta$ iff
    $\exists t,\ 1 \le t \le \min(m_\alpha, m_\beta)$ s.t. $a_k = b_k\ \forall k < t$ and $a_t <_T b_t$,
    or $a_k = b_k\ \forall k \le m_\alpha$ and $m_\alpha \le m_\beta$
  Minimum DFS code
  - The minimum DFS code is a canonical label of $G$: $\min\{\mathrm{code}(G, T) : T \text{ spanning tree of } G\}$

29 gSpan
  Valid minimum DFS codes
  - $(e_1, \dots, e_m, e)$ is a child of $(e_1, \dots, e_m)$
  - $(e_1, \dots, e_m, e)$ can be a minimum DFS code only if $(e_1, \dots, e_m)$ is a minimum DFS code and $e_m <_T e$, i.e. $e$ must grow from a vertex on the rightmost path of the tree coded by $(e_1, \dots, e_m)$
  - Backward edges can only grow from the rightmost vertex

30 gSpan
  Extending subgraphs
  - If the extension edge is not a rightmost path extension, then the resulting code word is certainly not canonical
  - If the extension edge is a rightmost path extension, then the resulting code word may or may not be canonical
  DFS code tree
  - Analogous to the prefix tree
  - Each node is a DFS code
  - As above, $(e_1, \dots, e_m, e)$ is a child of $(e_1, \dots, e_m)$
  - A DFS traversal of the DFS code tree follows the DFS lexicographic order

31 gSpan
  gSpan idea
  - From the set of vertices and edge labels, build the DFS tree of frequent subgraphs
  - If vertices are labeled by {A, B, C, ...} and edges by {a, b, c, ...}:
    the 1st iteration looks for all frequent subgraphs containing AaA,
    the 2nd iteration looks for all frequent subgraphs containing AaB,
    ...
  - At each iteration, subgraph_mining is called to grow subgraphs
  - Growing stops when (a) the frequency drops below $\theta_{\min}$ or (b) a non-minimal code is created

32 gSpan
  subgraph_mining
  subgraph_mining(D = {G_1, G_2, ..., G_N}, S, s):
      if s is not minimal:
          return
      S ← S ∪ {s}
      for each G ∈ D:
          for each instance of s in G:
              for each child c of this instance of s:
                  supp(c)++
      for each child c:
          if supp(c) ≥ min_supp:
              s ← c
              subgraph_mining(D, S, s)

33 gSpan
  Runtime comparison of FSG and gSpan
  [Figure: runtime comparison plot]
  - N: number of labels
  - I: average size of potentially frequent subgraphs
  - T: average number of edges per frequent subgraph
  - 200 potentially frequent subgraphs
  - $10^4$ graphs, $\theta_{\min} = 0.01$

34 Enumerating subgraphs
  Canonical form
  - Adjacency matrix: AGM, FSG, FFSM [Huan et al., 2003]
  - Spanning tree: gSpan
  Graph exploration
  - BFS (level-wise search): MoSS/MoFa [Borgelt and Berthold, 2002], AGM
  - DFS: gSpan
  - Easy subgraphs (paths, trees) first: GASTON [Nijssen and Kok, 2005]
  Avoiding redundancy
  - Canonical form pruning
  - Repository of processed subgraphs: MoSS/MoFa, GASTON

35 Enumerating subgraphs
  [Figure: runtime per pattern (ms) vs. minimum support (%)] [Wörlein et al., 2005]

36 Enumerating subgraphs
  [Figure: memory usage (GB) vs. minimum support (%)] [Wörlein et al., 2005]

37 Pattern summarization
  Large number of frequent patterns
  - Remember: all subgraphs of a frequent subgraph are frequent
  - AIDS antiviral screen dataset, 400 compounds, support 5%: $> 10^6$ frequent subgraphs
  Problems:
  - Interpreting frequent patterns
  - Reducing the number of frequent patterns
  - Setting the minimum support

38 Pattern summarization
  Representative patterns
  - Top-k patterns [Xin et al., 2006]
  - Cluster centroids [Chen et al., 2008]
    Cluster based on pattern similarity
    Cluster based on data similarity

39 Closed and maximal subgraphs
  Closed graph
  - A frequent graph $G$ is closed if there exists no supergraph of $G$ that carries the same support as $G$
  - If some of $G$'s subgraphs have the same support, it is unnecessary to output these subgraphs (non-closed graphs)
  - Lossless compression: still ensures that the mining result is complete
  Maximal frequent graph
  - A frequent graph $G$ is maximal if there exists no supergraph of $G$ that is frequent
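Both definitions only need the support of each frequent pattern and a way to test "is a proper sub-pattern of". The sketch below uses item sets as a stand-in for graphs, so that the subset test replaces the subgraph test; this substitution is an assumption made to keep the example short. The supports are those of the transaction list on slide 14 with an assumed threshold of 3, and the function name is hypothetical.

```python
def closed_and_maximal(frequent):
    """Split frequent patterns {pattern: support} into closed and maximal ones."""
    closed, maximal = [], []
    for pattern, supp in frequent.items():
        # Frequent proper super-patterns (a super-pattern with equal
        # support is necessarily frequent, so checking these suffices).
        supers = [q for q in frequent if pattern < q]
        if all(frequent[q] < supp for q in supers):
            closed.append(pattern)
        if not supers:
            maximal.append(pattern)
    return closed, maximal

frequent = {frozenset(w): s for w, s in {
    "a": 4, "b": 11, "c": 9, "d": 7,
    "ac": 3, "bc": 7, "bd": 4, "cd": 3,
    "bcd": 3,
}.items()}
closed, maximal = closed_and_maximal(frequent)
```

Only `cd` is not closed, because its superset `bcd` has the same support (3). Its support can be recovered from `bcd`, which is why keeping only closed patterns loses no information, whereas the two maximal patterns `ac` and `bcd` alone do not determine the supports of their subsets.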

40 Closed and maximal subgraphs
  [Figure: graphs (A), (B), (C) and patterns (D), (E), (F)]
  - D is a subgraph of A, B, C, but so is E. D and E have the same support (3). D is not closed.
  - No supergraph of E is a subgraph of all 3 graphs, therefore E is closed.
  - F is a subgraph of A and B. F is closed, as none of its supergraphs has support 2.
  - If $\theta_{\min} = 70\%$, E is maximal: it is frequent and none of its supergraphs is frequent.

41 CloseGraph [Yan and Han, 2003]
  Extension of gSpan that avoids growing subgraphs guaranteed to have only non-closed descendants
  Early termination
  - If, wherever graph $H_1$ occurs in the data, graph $H_2 = H_1 \diamond e$ occurs as well, then for any graph $H$, if $H_1$ is a subgraph of $H$ and $H_2$ is not, then $H$ is not closed
  [Figure: patterns (1) to (4)]
  - (1) and (2) systematically co-occur in $D$. Therefore (3) cannot be closed: indeed, (4) is a supergraph of (3) with identical support.
  - We need to grow from (2) and not from (1)

42 CloseGraph
  Failure of early termination
  [Figure: graphs (1) and (2), pattern (3)]
  - x -a- y and y -b- x co-occur in (1) and (2)
  - If we only extend from x -a- y -b- x, then we miss pattern (3), which also co-occurs in (1) and (2)
  - Need to distinguish between $H \diamond_e e$ (creates a new vertex) and $H \diamond_b e$ (does not create a new vertex)

43 References and further reading
  [Agrawal et al., 1994] Agrawal, R., Srikant, R. et al. (1994). Fast algorithms for mining association rules. In VLDB, vol. 1215.
  [Borgelt and Berthold, 2002] Borgelt, C. and Berthold, M. R. (2002). Mining molecular fragments: Finding relevant substructures of molecules. In ICDM.
  [Chen et al., 2008] Chen, C., Lin, C. X., Yan, X. and Han, J. (2008). On effective presentation of graph patterns: a structural representative approach. In CIKM.
  [Huan et al., 2003] Huan, J., Wang, W. and Prins, J. (2003). Efficient mining of frequent subgraphs in the presence of isomorphism. In ICDM.
  [Inokuchi et al., 2000] Inokuchi, A., Washio, T. and Motoda, H. (2000). An Apriori-Based Algorithm for Mining Frequent Substructures from Graph Data. In Principles of Data Mining and Knowledge Discovery, vol. 1910 of LNCS. Springer.
  [Kuramochi and Karypis, 2001] Kuramochi, M. and Karypis, G. (2001). Frequent subgraph discovery. In ICDM.
  [Nijssen and Kok, 2005] Nijssen, S. and Kok, J. N. (2005). Frequent graph mining and its application to molecular databases. Electronic Notes in Theoretical Computer Science.
  [Wörlein et al., 2005] Wörlein, M., Meinl, T., Fischer, I. and Philippsen, M. (2005). A quantitative comparison of the subgraph miners MoFa, gSpan, FFSM, and Gaston. In PKDD. Springer.
  [Xin et al., 2006] Xin, D., Cheng, H., Yan, X. and Han, J. (2006). Extracting redundancy-aware top-k patterns. In SIGKDD.
  [Yan and Han, 2002] Yan, X. and Han, J. (2002). gSpan: Graph-based substructure pattern mining. In ICDM.
  [Yan and Han, 2003] Yan, X. and Han, J. (2003). CloseGraph: mining closed frequent graph patterns. In SIGKDD.

44 The end
  Next topic (Monday, Dominik Grimm): Classification in Bioinformatics
