Extraction of Frequent Subgraph from Graph Database

Sakshi S. Mandke, Sheetal S. Sonawane
Department of Computer Engineering, Pune Institute of Computer Engineering, Pune, India.

Abstract

Graphs are a promising abstraction of complex structured and semi-structured data. Graph mining techniques extract, analyze and summarize significant and useful information from graph databases. Finding frequent subgraphs in a graph database is the essence of graph mining. The mined subgraphs are often so numerous that selecting the significant ones is difficult, and not every frequent subgraph is significant from the application perspective. This paper proposes a two-stage method to extract significant subgraphs. In the first stage, frequent subgraphs are identified using a frequency threshold ($\theta$), which is an input parameter. In the second stage, feature vectors of the subgraphs are generated and used to calculate their statistical significance. The p-value is the measure of statistical significance.

Key terms: Frequent subgraph mining, feature selection, random walk on graph, statistical significance.

I. Introduction

Complex data can be effectively represented as graphs. Many application areas, such as social networking, web links, bioinformatics and chemistry, use graphs to represent complex data. A graph consists of a set of vertices and the edges connecting them. For example, in chemistry, a molecule consists of atoms and bonds, which are represented in a graph as vertices and edges respectively. In a web link graph, users are represented as nodes and the communication links between them as edges.

Graph mining can be applied to a single graph or to a series of graphs. A graph database is a collection of many graphs. Let $G_B = \{G_1, G_2, \ldots, G_n\}$ be a graph dataset. Each graph $G_i = (V_i, E_i)$ consists of a set of vertices and a set of edges connecting them, with $V = \{v_1, v_2, v_3, \ldots, v_k\}$ and $E \subseteq \{(u, v) \mid u, v \in V\}$.
A graph $g$ is a subgraph of $G$ if there is a subgraph isomorphism from $g$ to $G$, that is, if $g$ is isomorphic to some subgraph of $G$. The support of $g$ is the number of graphs in $G_B$ that contain $g$ as a subgraph. A subgraph is said to be frequent if its support is greater than or equal to the user-defined frequency threshold $\theta$.

Extraction of frequent substructures from a graph database is required in many applications. For example, in chemistry, frequent subgraph mining is used to analyze large collections of molecules to find regularities among molecules of a specific class. Another application is web log analysis: web log files are analyzed to find sets of activities carried out by users, such as frequently accessed URLs, common group interactions, and so on.

Numerous methods have been developed to mine frequent subgraphs from a graph database. However, frequent subgraph mining faces a few challenges. The mined patterns may be very numerous, and not every subgraph is significant. Frequency alone is not always sufficient to categorize graphs effectively; other graph properties may also help. For example, the benzene ring is a very common frequent subgraph in molecular datasets, but it is not informative because it does not by itself indicate any biological or chemical activity. The significance of a graph depends on the characteristics of the graph data. Domain-specific or topological features are therefore used as a reference point for judging the significance of graphs. Feature analysis helps to reduce the answer set and to find significant subgraphs. Extracting feature-based frequent subgraphs thus addresses the problem of selecting high-quality frequent subgraphs.
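The support and frequency-threshold definitions given at the start of this section can be illustrated with a minimal Python sketch. It is not the mining algorithm used in the paper: the dictionary-based graph representation, the helper names (`make_graph`, `is_subgraph`, `support`, `is_frequent`) and the toy molecules are assumptions made for illustration. The containment test is a brute-force search over vertex mappings that ignores edge labels and only requires every pattern edge to exist in the larger graph, which is practical only for very small graphs.

```python
from itertools import permutations

def make_graph(labels, edges):
    """Build a small undirected, vertex-labeled graph (hypothetical representation)."""
    return {"labels": dict(labels), "edges": {frozenset(e) for e in edges}}

def is_subgraph(pattern, graph):
    """Brute force: try every injective, label-preserving vertex mapping and
    accept if each pattern edge lands on an edge of the larger graph."""
    p_nodes = list(pattern["labels"])
    g_nodes = list(graph["labels"])
    for image in permutations(g_nodes, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        if any(pattern["labels"][u] != graph["labels"][mapping[u]] for u in p_nodes):
            continue
        if all(frozenset(mapping[u] for u in edge) in graph["edges"]
               for edge in pattern["edges"]):
            return True
    return False

def support(pattern, database):
    """Number of graphs in the database that contain the pattern."""
    return sum(is_subgraph(pattern, g) for g in database)

def is_frequent(pattern, database, theta):
    """A pattern is frequent if its support reaches the threshold theta."""
    return support(pattern, database) >= theta

# Toy database of three tiny "molecules" (made-up data).
database = [
    make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)]),
    make_graph({0: "C", 1: "O"}, [(0, 1)]),
    make_graph({0: "C", 1: "N"}, [(0, 1)]),
]
pattern = make_graph({0: "C", 1: "O"}, [(0, 1)])
print(support(pattern, database), is_frequent(pattern, database, theta=2))
```

Real frequent subgraph miners avoid this exhaustive test by generating candidates incrementally and pruning with the threshold; the sketch only makes the two definitions concrete.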

Figure 1: Overall Approach

Our work filters the answer set of frequent subgraphs by calculating their statistical significance. As shown in Figure 1, frequent subgraphs are first extracted and then analyzed in the feature domain. Statistical significance indicates whether a difference observed between samples is real or exists just by chance. The p-value is the measure of statistical significance: it is the probability of obtaining a result at least as extreme as the observed one by chance alone. In a graph database, the p-value is defined as follows: given a graph $g$ with observed frequency $\mu_0$, its p-value is the probability that $g$ occurs in a random database with frequency $\mu \ge \mu_0$ [15]. The lower this probability, the more statistically significant $g$ is.

The remainder of the paper is organized as follows. Section II describes related work. Section III presents the design of the proposed work. Datasets are discussed in Section IV and expected results in Section V. Section VI concludes the paper.

II. Related Work

Many frequent subgraph mining algorithms have been developed, such as AGM [6], FSG [10], gSpan [21], SUBDUE [4], FFSM [7], MoFa [1] and Gaston [13]. These algorithms are broadly classified into apriori-based algorithms and pattern-growth algorithms. In the apriori-based approach, the set of subgraphs of size $k$ at one level is considered first, before the subgraphs of size $k+1$ at the next level are generated; the graphs of the next level are explored breadth first. The pattern-growth approach uses depth-first search to generate subgraph candidates: each subgraph $g$ is extended recursively to find all frequent subgraphs that contain it.

Various frequent subgraph mining (FSM) algorithms have been developed over the past decades. In recent years, research has focused on optimizing the result set of FSM to improve its quality. In a survey on graph mining, C. Jiang [2] noted some issues related to FSM that are still open. The first is the need to reduce the size of the answer set generated by FSM algorithms.
In many cases the number of subgraphs in the result set is too large to analyze them individually, and large result sets often contain redundant subgraphs. Approaches such as approximate frequent subgraphs, closed frequent subgraphs, maximal frequent subgraphs and discriminative subgraphs help to reduce the size of the result set. Defining a compact set of subgraphs without losing what matters for a specific application remains difficult.

As a second issue, he noted that feature selection can be incorporated into the frequent subgraph mining process. This is useful for achieving better classification with a classifier based on frequent subgraphs. Frequent subgraph mining can be made application specific by applying domain knowledge; in this case, features are used as mining parameters. Because many different features are available, selecting suitable parameters for a given application is difficult.

As a third issue, he suggested that different isomorphism tests can be applied when finding subgraphs. For example, approximate matching can be used instead of exact matching. The SUBDUE [4] algorithm uses a heuristic beam search with domain knowledge to reduce the search space. GREW [9], gApprox [3] and RAM [22] are algorithms that use approximate measures to generate the result set.

The first two issues can be addressed by applying feature analysis to graphs. However, selecting the parameter for significance mining is difficult, since the significance parameter may change with the application. Page ranking, graph classification and frequent subgraph mining are areas in which feature-based analysis is being researched. Yan and Han [17] presented pattern-based indexing in gIndex to achieve fast graph search. He and Singh proposed GraphRank [5], which calculates the statistical significance of subgraphs: subgraphs are converted into feature vectors, and their statistical significance is calculated using the p-value. Gang Li [11] proposed a graph classification method based on topological and label attributes.
X. Yan [18] proposed using cluster components as a discriminative property for graph classification. CORK [12] uses the gSpan frequent subgraph mining algorithm to generate binary feature vectors for classification.

Few algorithms exist that mine significant subgraphs. Milto et al. [20] proposed an algorithm that mines motifs as graph patterns in randomized networks; they use a p-value calculation to decide the significance of a pattern. Yan et al. [19] developed a framework for mining significant patterns using structural leap search and frequency-descending mining. The GraphSig [15, 16] method mines statistically significant subgraphs from the subgraphs found at a low frequency threshold. Using random walks, graphs are converted into feature vectors, and the p-value of each subgraph is calculated to determine its statistical significance in the feature space.

In our work we combine the techniques of GraphRank [5] and GraphSig [15]. To preserve more structural information in the subgraph feature vector, we apply the random walk technique to the mined subgraphs. The statistical significance of each subgraph feature vector is then calculated using the p-value.

III. Design of proposed work

Figure 2 outlines the proposed idea for finding significant frequent subgraphs. An existing algorithm, such as the Fast Frequent Subgraph Mining (FFSM) algorithm [7], is applied to extract frequent subgraphs.

Figure 2: Block Diagram

Figure 3a: Sample graphs

Figure 3b: Frequent subgraphs with frequency threshold 3

A sample graph database and its frequent subgraphs are illustrated in Figures 3a and 3b respectively. A random walk is applied to each mined subgraph to convert it into a feature vector. The statistical significance of these feature vectors is then calculated using the p-value. Feature vector generation and significance calculation are described in the following subsections.

A. Feature Vector Generation

To find the feature vectors of the mined subgraphs, a random walk is applied to each of them. A random walk starts from one node and keeps jumping to other nodes within the graph; each neighbour has an equal probability of being chosen as the next node. Formally, a random walk of length $L$ on a graph is a sequence of random variables $X_1, X_2, X_3, \ldots, X_L$, where $X_1$ is the root vertex and $X_{i+1}$ is a neighbouring vertex of $X_i$ chosen uniformly at random.
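The random walk definition above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' Java implementation: the adjacency-list representation, the function name `random_walk`, the seed and the toy path graph C - S - N - O are all hypothetical.

```python
import random

def random_walk(adjacency, root, length, rng):
    """Return the visited vertices X_1..X_length: X_1 is the root and each
    next vertex is drawn uniformly from the current vertex's neighbours."""
    walk = [root]
    while len(walk) < length:
        neighbours = adjacency[walk[-1]]
        if not neighbours:  # isolated vertex: the walk cannot continue
            break
        walk.append(rng.choice(neighbours))
    return walk

# Toy graph C - S - N - O stored as an adjacency list (made-up example).
adjacency = {"C": ["S"], "S": ["C", "N"], "N": ["S", "O"], "O": ["N"]}
walk = random_walk(adjacency, "C", 6, random.Random(0))
print(walk)
```

Passing an explicit `random.Random` instance keeps the walk reproducible for a given seed, which is convenient when the counts collected along the walk are later compared across runs.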
During the random walk, features are captured while traversing from one node to a neighbouring node. Features may consist of nodes, edges, or small subgraphs; even pharmacophoric features can be considered. Here, the edge type (NNP, node-to-node pair) is used as the feature [1]. For a subgraph with $n$ nodes, $n$ vectors are generated, one per starting node. Every edge type is a column of the feature vector. If a specific edge type is not present, 0 is inserted in the row. After all NNP types have been counted during the random walk, the frequency of each NNP is calculated. The value of each NNP is recorded in the feature vector as:

Value of NNP = [...]

The value of the NNP is truncated to make it more tractable.

Figure 4: Sample subgraph

Table I: Random walk on the graph shown in Figure 4. Columns: Starting Node, C-1-S, S-1-N, N-1-O. Rows: C, S, N, O.

The feature vectors extracted from the subgraphs are analyzed further. Each subgraph is represented by a single feature vector, obtained by taking the floor of the values stored in the feature vector matrix of the subgraph. In the resulting vector, each column represents the frequency count of one NNP type.

Floor of a matrix:

$$\mathrm{Floor}([x_1, x_2, \ldots, x_n], [y_1, y_2, \ldots, y_n], [z_1, z_2, \ldots, z_n], \ldots) = [\min(x_i, y_i, z_i, \ldots)] \quad \text{for all } i = 1, \ldots, n$$

Example: Floor([2,4,2], [2,3,3], [2,2,4]) = [2,2,2]

Table II: Subgraph feature vector dataset. Columns: C-1-B, C-1-A, B-1-A, A-1-B. Rows: one feature vector per subgraph g.

B. Calculating significance of feature vector

In this section, we explain the p-value calculation on the feature vector of a subgraph. The occurrences of each feature vector in a random graph database are calculated. First, the probability density function of the vector, PDF(x), is computed using the prior probabilities of the features. In the prior-probability matrix, each row represents one feature component (in our case, an NNP type). Element $X_{ij}$ of the matrix is the probability that feature $i$ occurs at least $j$ times in a feature vector of the subgraph feature vector dataset.

Table III: Prior-probability matrix. Rows (NNPs): C-1-B, C-1-A, B-1-A, A-1-B.

The probability of a feature vector $x = [y_1, y_2, \ldots, y_n]$ in a random vector database can be expressed using the joint probability:

$$P(x) = \prod_{i=1}^{n} P(x_i \ge y_i)$$

where $P(x_i \ge y_i)$ is the probability that element $i$ occurs at least $y_i$ times.

Example: $P(7, 7, 6, 0) = P(\text{C-1-B} \ge 7) \times P(\text{C-1-A} \ge 7) \times P(\text{B-1-A} \ge 6) \times P(\text{A-1-B} \ge 0)$

The binomial distribution is used to measure the frequency of a feature vector in the database. A random histogram can be viewed as a trial, and $x$ occurring in the histogram as a success. The number of trials $m$ for vector $x$ depends on the number of histograms in the database.

$$\text{p-value}(x, \mu_0) = \sum_{i=\mu_0}^{m} \mathrm{binomial}(i;\, m,\, P(x))$$ [1]

where $\mathrm{binomial}(i; m, P(x)) = \binom{m}{i} P(x)^i (1 - P(x))^{m-i}$. The lower the p-value, the higher the significance.

Algorithm 1: CalSignificance(G, maxPvalue)
Input: G is a subgraph database with the support of each subgraph; maxPvalue is the p-value threshold.
Output: O is the answer set of all significant subgraphs.

    D <- empty set
    O <- empty set
    for each g in G do
        for each node in g do
            D_g <- D_g + RWR(g)
        X <- X + Vector(floor(D_g))
    for each NNP type nnp in G do
        for i = 1 to |G| do
            for k = 1 to m do
                P_nnp(k) <- probability that the count of nnp satisfies Value(G_ik) >= k
    for each g in G do
        pvalue <- CalculatePvalue(x_g, g_support)
        if pvalue <= maxPvalue then
            O <- O + g

IV. Datasets

In chemistry, molecules are represented as graphs and analyzed using graph mining techniques. Extraction of frequent substructures from chemical databases is required in many applications in the chemistry domain, such as drug discovery.

Figure 5: Cyclohexene (C6H10) compound as a graph. Hydrogens are implicit in the graph.

We test our approach on chemical graph datasets. Three different datasets are used. The first is the DTP-AIDS Antiviral Screen chemical compound dataset (footnote 1) from the National Cancer Institute (NCI/NIH). Compounds are divided into three categories on the basis of their antiviral activity: compounds that provide at least 50% protection are classified as CM (confirmed moderately active), those that provide 100% protection are listed as CA (confirmed active), and all other compounds are listed as CI (confirmed inactive). The second dataset is an anticancer compound dataset from PubChem (footnote 2), classified into two classes, active and inactive. The third dataset is the PTE (footnote 3) Predictive Toxicology Evaluation compound dataset by NIEHS, which contains 340 chemical compounds in total.

Chemical data is represented in various special formats such as .sdf, .mol, .cml and .smile. Tools like JoeLib [8] and OpenBabel [14] are useful for converting between these file formats.

Footnote 1: http://dtp.nci.nih.gov/docs/aids/aids_data.html

V. Discussion about expected result

We are implementing our algorithm in Java. The experiments will be performed on a 3.2 GHz PC with 8 GB of memory running Fedora Linux 17. We use the FFSM algorithm [7] to generate frequent subgraphs from the graph database.

The p-value thresholds of interest range from 0.01 to 0.1. A subgraph with a p-value below 0.01 is very strongly significant.
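The p-value referred to here is the binomial tail of Section III-B. As a hedged illustration, the floor operation, the prior probabilities, the joint probability and the p-value can be sketched in Python as follows. The function names, the toy dataset and the use of empirical frequencies as prior probabilities are assumptions made for this sketch; the paper's own implementation is in Java and may differ in detail.

```python
from math import comb, prod

def floor_vector(vectors):
    """Column-wise minimum of the per-node vectors of one subgraph."""
    return [min(column) for column in zip(*vectors)]

def prior_at_least(dataset, feature, count):
    """Empirical P(feature value >= count) over the subgraph vector dataset."""
    return sum(v[feature] >= count for v in dataset) / len(dataset)

def joint_probability(x, dataset):
    """P(x) as the product of the per-feature prior probabilities."""
    return prod(prior_at_least(dataset, i, y) for i, y in enumerate(x))

def p_value(x, mu0, dataset):
    """Binomial tail: probability of mu0 or more successes in m trials,
    where m is taken to be the number of vectors in the dataset."""
    m = len(dataset)
    p = joint_probability(x, dataset)
    return sum(comb(m, i) * p**i * (1 - p)**(m - i) for i in range(mu0, m + 1))

# The floor example from Section III-A.
print(floor_vector([[2, 4, 2], [2, 3, 3], [2, 2, 4]]))  # [2, 2, 2]

# Made-up dataset of four subgraph feature vectors over two NNP types.
dataset = [[2, 0], [2, 1], [1, 1], [0, 0]]
x = [2, 1]
print(joint_probability(x, dataset), p_value(x, 2, dataset))
```

A subgraph would then be kept when `p_value(...)` does not exceed the chosen threshold, mirroring the last loop of Algorithm 1.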
A subgraph with a p-value between 0.01 and 0.05 is strongly significant. A subgraph is also considered significant if its p-value is at most 0.1.

The statistical significance calculation will improve the result set: all insignificant subgraphs will be filtered out by calculating the p-value. This filtering is most effective when the number of frequent subgraphs is large. For example, as shown in Figure 6, if the number of frequent subgraphs is [...], then the number of significant subgraphs will not be more than [...]. The result set will be reduced by 10%. Thus, subgraphs that exist just by chance will be filtered out. The running time is also expected to increase linearly with the number of frequent subgraphs.

Figure 6: Frequent subgraphs vs. significant frequent subgraphs (axes: frequent subgraphs, significant frequent subgraphs; series: MaxPvalue = 0.01 and MaxPvalue = 0.1)

Footnote 3: [...]/PTE

VI. Conclusion

Not all frequent subgraphs are significant, so one more filtering step is needed. Feature analysis using random walks on graphs preserves more structural information. The p-value calculation provides the statistical significance of the feature vector of a graph. Applying the above approach will improve the quality of the result set and reduce its size. Significant feature vectors can further be given as input to a classifier.

References

[1] Borgelt, C., & Berthold, M. (November 2002). Mining Molecular Fragments: Finding Relevant Substructures of Molecules. IEEE International Conference on Data Mining. Maebashi City, Japan.
[2] C. Jiang, F. C. (2004). A Survey of Frequent Subgraph Mining Algorithms. The Knowledge Engineering Review, Cambridge University Press.
[3] C., C., Yan, X. Z., & Han, J. (2007). gApprox: Mining Frequent Approximate Patterns from a Massive Network. 7th IEEE International Conference on Data Mining.
[4] Cook, D. J., & Holder, L. B. (1994). Substructure Discovery Using Minimum Description Length and Background Knowledge. Journal of Artificial Intelligence Research, 1.
[5] He, H., & Singh, A. (2006). "GraphRank: Statistical Modeling and Mining of Significant Subgraphs in the Feature Space". 6th International Conference on Data Mining, IEEE Computer Society. Washington, DC, US.
[6] Inokuchi, A., Washio, T., & Motoda, H. (2000). An Apriori-based Algorithm for Mining Frequent Substructures from Graph Data. PKDD'00.
[7] J. Huan, Wang, W., & Prins. (2003). Efficient Mining of Frequent Subgraph in Presence of Isomorphism. International Conference on Data Mining.
[8] JoeLib: A Java Based Computational Chemistry Package. (2009). Wilhelm-Schickard-Institute for Computer Science. Tübingen, Germany.
[9] Kuramochi, M. a. (2004). GREW: Scalable Frequent Subgraph Discovery Algorithm. 4th IEEE International Conference on Data Mining.
[10] Kuramochi, M., & Karypis, G. (2001). Frequent Subgraph Discovery. ICDM.
[11] Li, G., Semerci, M., Yener, B., & Zaki, M. J. (2011, August). "Graph Classification via Topological and Label Attributes". 9th Workshop on Mining and Learning with Graphs. SIGKDD.
[12] M. Thoma, H. C.-P. (October 2010). "Discriminative Frequent Subgraph Mining with Optimality Guarantees". Statistical Analysis and Data Mining, 3(5).
[13] Nijssen, S., & Kok, J. N. (2004). The Gaston Tool for Frequent Subgraph Mining. International Workshop on Graph-Based Tools. Amsterdam, the Netherlands: Elsevier.
[14] OpenBabel: An open chemical toolbox.
[15] Ranu, S., & Singh, A. (April 2009). "GraphSig: A Scalable Approach to Mining Significant Subgraphs in Large Databases". 25th IEEE International Conference on Data Engineering (ICDE).
[16] Ranu, S., & Singh, K. (2009). "Mining Statistically Significant Molecular Substructures for Efficient Molecular Classification". Journal of Chemical Information and Modeling, 49.
[17] X. Yan, P. Y. (2004). "Graph Indexing: A Frequent Structure-based Approach". ACM SIGMOD.
[18] Xifeng Yan, F. Z. (2006). "Feature-based Similarity Search in Graph Structures". ACM Transactions on Database Systems, 31(4).
[19] Xifeng Yan, H. C. (2008). Mining Significant Graph Patterns by Leap Search. SIGMOD.
[20] Y. Chi, Y. Y. (2003). Indexing and Mining Free Trees. ICDM.
[21] Yan, X., & Han, J. (2002). "gSpan: Graph-Based Substructure Pattern Mining". IEEE Computer Society. Washington, DC, USA: ICDM'02.
[22] Zhang, S., & Yang, J. (2008). RAM: Randomized Approximate Graph Mining. 20th International Conference on Scientific and Statistical Database Management.
