gapprox: Mining Frequent Approximate Patterns from a Massive Network


Chen Chen, Xifeng Yan, Feida Zhu, Jiawei Han
University of Illinois at Urbana-Champaign / IBM T. J. Watson Research Center
{cchen3, feidazhu, hanj}@cs.uiuc.edu, xifengyan@us.ibm.com

Abstract

Recently, a large number of graphs with massive sizes and complex structures have arisen in many new applications, such as biological networks, social networks, and the Web, demanding powerful data mining methods. Due to inherent noise and data diversity, it is crucial to address the issue of approximation if one wants to mine patterns that are potentially interesting with tolerable variations. In this paper, we investigate the problem of mining frequent approximate patterns from a massive network and propose a method called gapprox. gapprox not only finds approximate network patterns, which is the key for many knowledge discovery applications on structural data, but also enriches the library of graph mining methodologies by introducing several novel techniques, such as: (1) a complete and redundancy-free strategy to explore the new pattern space faced by gapprox; and (2) a way to transform "frequent in an approximate sense" into an anti-monotonic constraint so that it can be pushed deep into the mining process. Systematic empirical studies on both real and synthetic data sets show that frequent approximate patterns mined from the worm protein-protein interaction network are biologically interesting and that gapprox is both effective and efficient.

1 Introduction

In the past, a series of interesting algorithms [4, 10, 6] have been proposed to mine frequent patterns in a set of graphs. Recently, a large number of graphs with massive sizes and complex structures have arisen in many new applications, such as biological networks, social networks, and the Web, demanding powerful data mining methods. Because of their characteristics, we are now interested in patterns that frequently appear at many different places of a single network.
Example 1. Let us consider a Protein-Protein Interaction (PPI) network in Biology. A PPI network is a huge graph whose vertices are individual proteins, and an edge exists between two vertices if and only if there is a significant protein-protein interaction. Due to some underlying biological process, occasionally we may observe two subnets P_a and P_b which are quite similar in the sense that, after proper correspondence, discernable resemblance exists between individual proteins, e.g., with regard to their amino acids, secondary structures, etc., and the interactions within P_a and P_b are nearly identical to each other.

Figure 1. Two subnets extracted from the worm PPI network, where proteins at the corresponding positions of (a) and (b) are biologically quite similar, and 2 PPI deletions plus 3 PPI insertions transform (a) into (b).

There are in general two major complications in mining such massive and highly complex networks. First, compared to algorithms targeting a set of graphs, mining frequent patterns in a single network requires partitioning the network into regions, where each region contains one occurrence of the pattern. This partition changes from one pattern to another, and for any given partition, regions may overlap with each other as well. These problems are not solved by existing technologies for mining a set of graphs. Second, due to inherent noise and data diversity, it is crucial to account for approximations so that all potentially interesting patterns can be captured. Cast to the PPI network we described in Example 1 (see Fig. 1), as long as their similarity is above some threshold, it is ideal to detect P_b as a place where P_a approximately appears. In retrospect, compared to the rich literature on mining frequent patterns in a set of graphs, single-network based algorithms have been examined to a much lesser extent.
(In Biology, this might represent a mechanism to back up a set of proteins whose mutual interactions support a vital function of the network, so that in case of any unexpected events, the copy can switch in.)

A few studies [5, 1, 7] took an initial step toward this direction; however, they only considered the first issue and did not pay enough attention to the second, i.e., none of them mined approximate patterns. As will become clear later, the above two issues are actually intertwined when approximation comes into play; thus, our major challenge is to lay out a new mining paradigm that does consider approximate matching in its search space. To summarize, we make the following contributions:

1. We investigate the problem of mining frequent approximate patterns from a massive network: We give an approximation measure and show its impact on mining, i.e., how a pattern's support should be counted based on its approximate occurrences in the network.

2. We propose a graph mining method called gapprox to tackle this problem: We design a novel strategy that is both complete and redundancy-free to explore the new pattern space faced by gapprox, and transform "frequent in an approximate sense" into an anti-monotonic constraint so that it can be pushed deep into the mining process.

3. We present systematic empirical studies on both real and synthetic data sets: The results show that frequent approximate patterns mined from the worm protein-protein interaction network are biologically interesting and that gapprox is both effective and efficient.

4. The techniques we developed for gapprox are general and can be applied to networks from other domains, e.g., social networks.

The rest of this paper is organized as follows. Section 2 presents the general model of mining frequent approximate patterns from a massive network. The mining algorithm is developed in Section 3. We report empirical studies, and give related work and discussions, in Sections 4 and 5, respectively. Section 6 concludes the paper.
2 Problem Formulation

Definition 1 (Network) A network G is an edge-weighted graph, i.e., G is a 3-tuple (V_G, E_G, w_G), where w_G : E_G -> R+ is a weighting function mapping each edge e_uv = (u, v) ∈ E_G to a real number w_G[u, v] > 0.

This setting is well-defined for many real applications, where vertices represent different entities, edges denote mutual relationships between entities, and weights indicate the tightness of such relationships (the tighter the relationship, the closer the two entities, and thus the smaller the weight).

Definition 2 (Pattern) A pattern P is a connected and induced subgraph of the network G, which can be represented by a connected vertex set V_P ⊆ V_G.

2.1 Approximate Pattern Occurrences

The scenario we are interested in is: P, as a fragment extracted from a particular region of G, may also appear in some other regions approximately. This is associated with an injective function m : V_P -> V_G mapping each vertex v ∈ V_P to m(v) ∈ V_G. Now, to quantify the degree of approximation m incurs, we take into account: (1) approximation on vertices, and (2) approximation on edges.

Vertex Penalties: For a vertex v ∈ V_P, a list of matchable vertices matchable(v) ⊆ V_G is provided. Let

    dis_sim[v, m(v)] = { a value < 1, if m(v) ∈ matchable(v);  +∞, otherwise }

i.e., approximations can only happen within the matchable list. Usually, this is a reasonable assumption in real applications. For example, in a PPI network, though we do not want to require that two vertices are only matchable if they represent the same protein, it is also aimless to match two proteins that are very dissimilar or even irrelevant to each other. Finally, we can obtain the matchable lists by assuming a similarity cut-off δ among all pairs of vertices.

Edge Penalties: For a relationship (v_i, v_j) of P and its image (m(v_i), m(v_j)) under mapping m, an intuitive way to define a measure penalizing approximation is to compare the relationship tightness associated with each of them.
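To make the vertex-penalty setup concrete, the matchable lists can be precomputed by thresholding pairwise similarity at the cut-off δ. The sketch below is illustrative only: the `similarity` callback, the `penalty` constant (any value below 1 satisfies the definition), and all names are our own assumptions, not part of the original formulation.

```python
def matchable_lists(vertices, similarity, delta):
    """For each vertex, collect every other vertex whose pairwise
    similarity reaches the cut-off delta; approximate matching is
    only allowed within this list."""
    return {v: {u for u in vertices if u != v and similarity(u, v) >= delta}
            for v in vertices}

def dis_sim(v, u, matchable, penalty=0.5):
    """Vertex mismatch penalty: 0 for an exact match, a value below 1
    inside the matchable list, and +infinity outside it."""
    if u == v:
        return 0.0
    return penalty if u in matchable[v] else float("inf")
```

With a BLAST-style similarity score, for instance, two proteins would then be mutually matchable whenever their score clears δ.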
One way of quantifying the tightness here is to plug in a shortest path based distance, which in essence treats the shortest path u ~> v as a pseudo edge between u and v, with its weight equalling the total weight summed along u ~> v. Now, having a (pseudo) weighted edge between each pair of vertices, we can simply take an absolute difference and present the penalty function as:

    | dist[v_i, v_j] - dist[m(v_i), m(v_j)] |    (1)

There could be other alternatives (e.g., max-flow based) to define a distance between two vertices. Or, if the specific application provides us with full tightness information among all pairs of vertices, we can directly take it as input. For instance, in the case of Example 1, in order to reflect the number of PPIs that are different (i.e., present in one but missing in the other) after the pattern is embedded, we may adopt the following dist function for any two proteins pr_i and pr_j in the PPI network:

    dist[pr_i, pr_j] = { 1, if pr_i and pr_j interact;  2, otherwise }    (2)

since |2 - 1| = 1, each differing PPI then contributes exactly 1 to the penalty. In this perspective, Eq. 1 and our discussions below are built on an abstract symbol dist.

Definition 3 (Degree of Approximation) Given a pattern P and a network G, an injective mapping m embeds P in some region of G by matching each vertex v ∈ V_P to a vertex m(v) ∈ V_G. This embedding is associated with a degree of approximation approx(P ->m G), which is defined as:

    approx(P ->m G) = Σ_{v ∈ V_P} dis_sim[v, m(v)] + Σ_{v_i, v_j ∈ V_P} | dist[v_i, v_j] - dist[m(v_i), m(v_j)] |    (3)
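Eq. 3 can be transcribed directly into code. The minimal sketch below assumes `dist` and `dis_sim` are supplied as callables (the names are ours); in the usage test, `dist` follows Eq. 2, so every PPI present in one subnet but missing in the other contributes exactly 1.

```python
from itertools import combinations

def approx_degree(pattern, mapping, dist, dis_sim):
    """Degree of approximation approx(P ->m G) of Eq. 3: the summed
    vertex penalties plus, over all vertex pairs of the pattern, the
    absolute tightness difference of Eq. 1."""
    vertex_part = sum(dis_sim(v, mapping[v]) for v in pattern)
    edge_part = sum(abs(dist(vi, vj) - dist(mapping[vi], mapping[vj]))
                    for vi, vj in combinations(sorted(pattern), 2))
    return vertex_part + edge_part
```

For example, embedding a 3-vertex pattern with interactions {a-b, b-c} onto an image with only {a'-b'} yields degree 1: the pairs (a, b) and (a, c) keep their tightness, while (b, c) differs by |1 - 2| = 1.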

Definition 4 (Approximate Occurrence) Given error tolerance Δ, the vertex set {m(v) | v ∈ V_P} for an embedding m is said to be an approximate occurrence of P if and only if approx(P ->m G) ≤ Δ.

2.2 Pattern Support with Approximation

Downward closure is an important requirement for efficient mining [9], i.e., if P is a sub-pattern of P' (V_P ⊆ V_P'), then sup(P') ≤ sup(P) must hold. If this requirement is violated, then for a task that asks for all patterns with support higher than min_sup, there is no way we can stop searching for even bigger patterns at any stage during the mining process, because it is always possible for a pattern to qualify for min_sup after growing. This would make any algorithm suffer from uncontrollable explosions. Looking at Fig. 2, since sup(P_123) = 2 > sup(P_12) = 1 if overlapping occurrences are counted, we have to follow a support definition in which overlaps are prohibited.

Definition 5 (Pattern Support with Approximation) Two occurrences of P are said to be disjoint if and only if they do not share any vertices in common. Then, P's pattern support with approximation is defined to be the maximal number of disjoint ones that can be chosen from the set of its approximate occurrences.

Lemma 1 sup(P') ≤ sup(P) if P is a sub-pattern of P', i.e., pattern support with approximation is an anti-monotonic mining constraint.

Definition 6 (Frequent Approximate Pattern Mining) Given a network G and two thresholds: (1) maximal degree of approximation Δ, and (2) minimum support min_sup, find the pattern set P such that for each P ∈ P, sup(P) ≥ min_sup. Here, any P ∈ P is called a frequent approximate pattern.

Figure 2. Embedding patterns P_123 and P_12 of (a) in (b), where 1 can match 1', 2 can match 2', and 3 can match 3'/3''.

3 Algorithm

For a mining problem targeting all frequent approximate patterns in a network, there are two major issues we need to tackle, each of which is discussed below.
1. Pattern Space Exploration: What is the problem's full pattern space, and how can we search through it? As we indicated in the introduction, this search space must take approximation into account.

2. Support Counting: For patterns in the search space, how can we count their support and report the frequent ones with regard to a predetermined threshold min_sup? Based on Section 2, for each pattern, we should first enumerate its approximate occurrences in the network, and then decide the maximal number of disjoint occurrences.

3.1 Pattern Space Exploration

As influenced by approximation, the pattern space we face here is different from that in an exact mining paradigm. Previous algorithms like [5] and [1] assume a network with vertex/edge labels, while a pattern is nothing but a labeled subgraph. By exact matching, in order to embed a pattern in some region of the network, vertices/edges at corresponding positions must have identical labels. In this sense, the pattern space there consists of all labeled graphs, which can be organized on a lattice and explored by search strategies such as breadth-first search [4] and depth-first search with right-most extensions [10]. In our setting, the network is not a labeled one; to accommodate approximations, what we have is a matchable list for each vertex indicating all vertices that are highly similar, treated as its copies. Keeping this in mind, it seems necessary to treat every vertex as unique, which means that each induced subgraph of the network may potentially be a different pattern, i.e., all connected vertex sets in a given network form our new pattern space. In the following presentation, we introduce a strategy that is both complete and redundancy-free to traverse this search space. Fig. 3 is taken as a running example. To begin with, we losslessly decompose the pattern space as follows:

1. Find all connected vertex sets in G that contain vertex 1.
2. Remove vertex 1 from G, and find all connected vertex sets in the resulting graph that contain vertex 2.
3. And so on and so forth...

where step i explores all patterns that contain vertex i but do not contain vertices 1, 2, ..., i-1. Now, we discuss step 1, i.e., generating all connected vertex sets starting from vertex 1; all other steps follow the same routine.

Stage 1: After the decomposition shown in Fig. 3(b), start from vertex 1 and mark 1.

Stage 2: Expand from 1 to reach 2, 5, 6. Mark 2, 5, 6. There are in total seven connected vertex sets in this stage: {1,2}, {1,5}, {1,6}, {1,2,5}, {1,2,6}, {1,5,6}, {1,2,5,6}. Indeed, as 2, 5, 6 are all adjacent to 1, {1} union any combination of 2, 5, 6 must be connected. Here, we can assume an order of 6 > 5 > 2, traverse the powerset of {2,5,6} without repetition (as we do for itemsets), and finally union each subset with {1}. The whole procedure of stage 2 is depicted in Fig. 3(c).

Stage 3: Taking each of the seven connected vertex sets in stage 2 as a starting point, continue expansion. In Fig. 3(d), we pick {1,5,6} as an example and reach 4, 7. Mark 4, 7. Explore {1,5,6} union anything in the powerset of {4,7} in the same manner as in stage 2. Note that, though there is an edge between 5 and 2, we prohibit the expansion to 2, because 2 has already been marked in previous stages, in which case {1,5,6} ∪ {2} = {1,2,5,6} is just another starting point among the seven in stage 2. More

generally, only those vertices that are both adjacent and unmarked can be absorbed during expansion.

Stage 4: Finally, in Fig. 3(e), we pick {1,5,6,4,7} as an example from the three connected vertex sets expanded in stage 3, for which only an unmarked 3 can be absorbed. Generate {1,5,6,4,7,3} and stop expansion, because there are no more unmarked vertices now.

Figure 3. Exploring the Pattern Space: (a) the original graph, (b) decomposition of the problem and stage 1, (c) stage 2, (d) stage 3, (e) stage 4, where each stage is enclosed respectively within a loop. We darken marked vertices along the way, while edges are correspondingly changed from dotted to thickened as new vertices are absorbed.

Algorithm 1 gives pseudo-code for the above process.

Algorithm 1 Complete and Non-redundant Exploration
Function: Explore(G)
Input: a network G.
Output: all connected vertex sets in G.
1: for each v ∈ G do mark(v) = false;
2: pick the vertex u with the smallest ID, set mark(u) = true, output {u};
3: DFS_vertical(G, {u});
4: remove u from G, and let the resulting graph be G';
5: Explore(G');

Theorem 1 Explore() in Algorithm 1 is both complete and redundancy-free, i.e., given a network G, (1) it only generates connected vertex sets in G; (2) it generates all connected vertex sets in G; and (3) it does not generate the same connected vertex set more than once.

A pattern P's support is defined to be the maximal number of disjoint ones that can be chosen from P's approximate occurrences in the network. However, deciding this maximum turns out to be a very hard problem, which is related to the NP-complete maximum independent set problem, as some previous works have shown [5]. If it is crucial to report the accurate frequency of patterns, we can always calculate it by brute force; otherwise, an upper bound, like the one developed in Algorithm 2, is enough, which will be used to stop growing patterns based on the downward-closure property.
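The Explore / DFS_vertical / DFS_horizontal scheme of Algorithm 1 can be sketched as runnable code. The adjacency-dict representation and all function names below are our own assumptions; the mark/unmark discipline mirrors the pseudo-code.

```python
def connected_vertex_sets(adj):
    """Enumerate every connected vertex set of an undirected graph
    exactly once: step i roots the search at the i-th smallest vertex
    and forbids all earlier roots, as in Algorithm 1."""
    out = []
    gone = set()                        # roots removed in earlier steps
    for u in sorted(adj):
        marked = gone | {u}
        out.append(frozenset([u]))
        _dfs_vertical(adj, {u}, marked, out)
        gone.add(u)
    return out

def _dfs_vertical(adj, pattern, marked, out):
    # Expansion frontier: unmarked vertices adjacent to the current set.
    frontier = sorted({w for v in pattern for w in adj[v]} - marked)
    marked |= set(frontier)
    _dfs_horizontal(adj, pattern, frontier, marked, out)
    marked -= set(frontier)

def _dfs_horizontal(adj, pattern, frontier, marked, out):
    for i, v in enumerate(frontier):
        bigger = pattern | {v}
        out.append(frozenset(bigger))
        _dfs_horizontal(adj, bigger, frontier[i + 1:], marked, out)
        _dfs_vertical(adj, bigger, marked, out)
```

On a triangle, for instance, this yields all 7 nonempty vertex subsets, each exactly once, since every subset of a triangle is connected.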
3.2 Support Counting

Now we are ready to look at support counting, where our first task is to enumerate the approximate occurrences of any pattern encountered during Section 3.1's pattern space exploration. Since Explore() follows a depth-first fashion, which continuously absorbs new vertices and expands the current pattern P to P' until backtracking, we can incrementally obtain the occurrences of P' based on those of P, i.e., whenever we expand P to P' = P ∪ {v}, the occurrences of P are also expanded by adding another vertex that is matchable with v. Occurrences with degree of approximation more than Δ and patterns with support less than min_sup are dropped immediately.

Function: DFS_horizontal(G, P, V_expand)
Input: a network G, a set P of connected vertices in G, a vertex set V_expand whose powerset is to be unioned with P.
Output: all connected vertex sets in G that consist of P, a subset of V_expand, and some currently unmarked vertices.
1: Order the vertices in V_expand as v_1 < ... < v_k;
2: for i = 1 to k do
3:   P' = P ∪ {v_i}, output P';
4:   V_expand' = {v_{i+1}, ..., v_k};
5:   DFS_horizontal(G, P', V_expand');
6:   DFS_vertical(G, P');

Function: DFS_vertical(G, P)
Input: a network G, a set P of connected vertices in G.
Output: all connected vertex sets in G that consist of P and some currently unmarked vertices.
1: V_expand = {v | v ∈ G, v ∉ P, mark(v) = false, and P ∪ {v} is connected in G};
2: for each v ∈ V_expand do mark(v) = true;
3: DFS_horizontal(G, P, V_expand);
4: for each v ∈ V_expand do mark(v) = false;

The idea behind Algorithm 2, which provides an upper bound on the maximal number of disjoint occurrences that can be chosen, is explained by the following example. Think of each occurrence as a vertex set and assume there are 4 of them: {v_1, v_2}, {v_1, v_3}, {v_1, v_4}, {v_2, v_5}; then v_1 acts like a bottleneck, because it is contained in 3 sets: {v_1, v_2}, {v_1, v_3}, {v_1, v_4}, which is the most among all 5 vertices. Obviously, at most 1 set can be chosen from these

3 in order to ensure disjointness. Iterating the same procedure, we go on to consider the remaining set {v_2, v_5} after all sets containing v_1 are removed: as we can get only 1 disjoint set here, the total number of disjoint occurrences that can be chosen is at most 1 + 1 = 2.

Algorithm 2 Providing a Support Upperbound
Function: Upperbound(M_P)
Input: the set of occurrences M_P for a pattern P.
Output: an upper bound on the maximal number of disjoint vertex sets that can be chosen from M_P.
1: sup_bound = 0;
2: while M_P ≠ ∅ do
3:   let v be the vertex that appears in the greatest number of vertex sets in M_P;
4:   sup_bound = sup_bound + 1;
5:   for each m ∈ M_P do
6:     if m contains v then remove m from M_P;

Lemma 2 Given a pattern P, its support must be less than or equal to Upperbound(M_P) as computed by Algorithm 2.

Now we are ready to combine all of the above and deliver gapprox. The main skeleton of gapprox is Algorithm 1's pattern space exploration: when examining each pattern encountered, i.e., on line 3 of DFS_horizontal(), we expand the occurrences of P (i.e., M_P) to obtain those of P' (call them M_P'), and then, based on M_P', Upperbound() is called to decide whether P' should be grown further: if not, we terminate early. In summary, gapprox is formed by inserting occurrence enumeration, support upper bound calculation, and a conditional branch on line 3 of Algorithm 1's DFS_horizontal() function.

4 Experiments

We performed empirical studies on both real and synthetic graph data sets. The real graph data set is a worm PPI network, obtained from the DIP database (Database of Interacting Proteins). The synthetic data generator is provided by Kuramochi et al. [4]. All experiments are done on a Windows machine with a 3GHz Pentium IV CPU and 1GB memory; programs are written in C++.

4.1 Worm PPI Network

The worm PPI network contains 2,637 proteins and 4,030 PPIs. There are 2,902 pairs of matchable proteins having BLAST similarity score higher than the similarity cut-off δ.
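The greedy bound of Algorithm 2 above is only a few lines of code. The sketch below is a minimal transcription under our own naming, with occurrences represented as plain vertex sets.

```python
from collections import Counter

def support_upperbound(occurrences):
    """Algorithm 2: repeatedly find the vertex shared by the largest
    number of remaining occurrences (the bottleneck), grant one unit
    of support for it, and discard every occurrence containing it."""
    remaining = [frozenset(o) for o in occurrences]
    bound = 0
    while remaining:
        counts = Counter(v for occ in remaining for v in occ)
        bottleneck = counts.most_common(1)[0][0]
        bound += 1
        remaining = [occ for occ in remaining if bottleneck not in occ]
    return bound
```

On the running example {v_1, v_2}, {v_1, v_3}, {v_1, v_4}, {v_2, v_5}, removing the bottleneck v_1 leaves only {v_2, v_5}, so the bound is 1 + 1 = 2, matching the hand calculation.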
The most frequent protein has 4 copies, while on average a protein is similar to only a few counterparts, which suggests that we must set min_sup low in order to capture frequent patterns. We want to discover protein subnet patterns that approximately occur in more than min_sup locations of the PPI network. As introduced in Section 2.1, Eq. 1 is used to quantify the distance between any two vertices in the network. In order to focus solely on the interactions of proteins, within δ, we treat the vertex (protein) mismatch penalty dis_sim as 0. Fig. 1(a) is one of the patterns discovered, while Fig. 1(b) gives an occurrence of it whose degree of approximation is 5 (2 edge deletions plus 3 edge insertions). In addition to similar interconnecting topologies, the two subnetworks are composed of proteins with similar functions, often from the same protein family. Pairs of functionally similar proteins are located at similar positions in the network, indicating that these two protein subnetworks may be responsible for similar biological mechanisms. For example, proteins pqn-54 and pqn-5 have the highest degree in each subnetwork and are both Prion-like-(Q/N-rich)-domain-bearing proteins. The proteins adjacent to these two also share similar functions, such as lys-1 and lys-2, which are both lysozyme proteins.

Apart from pattern interestingness, we further examine the computational characteristics of gapprox. Fig. 4 and Fig. 5 illustrate the number of frequent approximate patterns and the performance (i.e., running time) with varying minimum support (min_sup) and maximal approximation (Δ). It can be seen that when Δ increases, both the running time and the number of frequent approximate patterns increase, while the reverse holds for min_sup. These phenomena are well expected.
Note that, in Fig. 4 and Fig. 5, Δ = 4 is the maximal approximation marked on the plots; although 4 seems to be a relatively small number, it is not small indeed: from Fig. 1, we can see that the patterns here are in general not very dense networks, so an approximation of 4 PPIs means 50% error tolerance for a pattern with 8 PPIs, which is already quite significant. Even so, our algorithm can still finish in about 6 minutes, which clearly demonstrates the efficiency of gapprox.

4.2 Synthetic Data

We then test gapprox on a series of synthetic data sets, showing its efficiency. The data set we take is D1T3kL200I10V60E1 [4], i.e., 1 (D) network with 3,000 (T) vertices (and 4,96 edges), which is generated by 200 (L) seed patterns of size 10 (I); the numbers of possible vertex and edge labels are set to 60 (V) and 1 (E), respectively. Here, two vertices are matchable (with dis_sim = 0, since they are identical) if the same label is assigned to them. Furthermore, we randomly pick a real number from [0.5, 1] to be the weight on each edge: since we do not want two separate vertices to have a distance close to 0, 0.5 is fixed as a lower bound. We adopt the shortest path based distance and make a connectivity cut-off based on it: if the shortest path distance is less than 1.5, then the two vertices are considered to be connected, i.e., during pattern exploration they are adjacent, and pattern expansion from one vertex to the other is enabled.
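The shortest path based dist used in this setup can be computed with plain Dijkstra. Below is a sketch under an assumed adjacency-list representation (lists of (neighbour, weight) pairs) with the 1.5 connectivity cut-off applied afterwards; the function names are ours.

```python
import heapq

def shortest_path_dist(adj, src):
    """Dijkstra from src: the shortest path u ~> v is treated as a
    pseudo edge whose weight is the summed weight along the path."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                    # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def pseudo_adjacent(adj, u, v, cutoff=1.5):
    """During pattern exploration, two vertices count as connected
    when their pseudo-edge weight stays below the cut-off."""
    return shortest_path_dist(adj, u).get(v, float("inf")) < cutoff
```

Since every edge weight lies in [0.5, 1], a cut-off of 1.5 lets a two-edge path of total weight at most 1.5 act as a pseudo edge, which is why vertices two cheap hops apart become adjacent for expansion.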

Figure 4. Pattern Number w.r.t. min_sup and Maximal Approximation Δ, the worm PPI network.
Figure 5. Running Time w.r.t. min_sup and Maximal Approximation Δ, the worm PPI network.
Figure 6. Running Time w.r.t. min_sup and Maximal Approximation Δ, the synthetic data set.
Figure 7. Running Time w.r.t. the Number of Vertices in the Network, the synthetic data set.

Fig. 6 shows the algorithm's performance (i.e., running time) w.r.t. min_sup and maximal approximation Δ; the trends are similar to those depicted in Fig. 5. We then change the size of the generated network and show how gapprox reacts (Fig. 7 varies the number of vertices (T) in a network from 1,000 to 5,000).

5 Related Work and Discussions

Quite a few algorithms have been developed to mine frequent patterns over graph data [4, 10, 6]. Most of them work on a set of graphs, and thus do not apply to the single-graph mining scenario considered here. Only a few [5, 1, 7] studied the pattern mining problem in a single network. Some studies formulate the problem as a search procedure: given a query Q, searching asks for those places where Q exactly/approximately appears in the network. Exact search is often referred to as graph matching, which has been actively pursued for decades [2]. Recently, a few approximate search methods have also been developed to align a query path/substructure with the subject network [3, 8], with the help of pre-built indices. However, mining is quite different from searching in that we never know what the patterns are before discovering them; thus, there are no pre-defined queries that can be leveraged as axles for the algorithms to search around.
6 Conclusions

In this paper, we investigate the problem of mining frequent approximate patterns from a massive network: we give an approximation measure and show its impact on mining, i.e., how a pattern's support should be counted based on its approximate occurrences in the network. An algorithm called gapprox is presented. Empirical studies show that patterns mined from real protein-protein interaction networks are biologically interesting and that gapprox is both effective and efficient. The techniques we developed for gapprox are general and can be applied to networks from other domains as well. As a promising topic, it would be interesting to systematically study how gapprox can be modified to reach bigger, and thus more interesting, patterns even faster, with some sacrifice on the completeness of the mining results.

Acknowledgements. The work was supported in part by U.S. National Science Foundation grants (NSF IIS and NSF BDI). The authors thank Xianghong Jasmine Zhou and Michael R. Mehan for providing the cleaned PPI data and their interpretation of the patterns discovered by gapprox.

References

[1] J. Chen, W. Hsu, M.-L. Lee, and S.-K. Ng. NeMoFinder: dissecting genome-wide protein-protein interactions with meso-scale network motifs. In KDD, pages 106-115, 2006.
[2] D. Conte, P. Foggia, C. Sansone, and M. Vento. Thirty years of graph matching in pattern recognition. IJPRAI, 18(3):265-298, 2004.
[3] M. Koyutürk, A. Grama, and W. Szpankowski. Pairwise local alignment of protein interaction networks guided by models of evolution. In RECOMB, pages 48-65, 2005.
[4] M. Kuramochi and G. Karypis. Frequent subgraph discovery. In ICDM, 2001.
[5] M. Kuramochi and G. Karypis. Finding frequent patterns in a large sparse graph. Data Min. Knowl. Discov., 11(3):243-271, 2005.
[6] S. Nijssen and J. N. Kok. A quickstart in frequent structure mining can make a difference. In KDD, 2004.
[7] F. Schreiber and H. Schwöbbermeyer. Frequency concepts and pattern detection for the analysis of motifs in networks. Transactions on Computational Systems Biology III (LNBI 3737):89-104, 2005.
[8] Y. Tian, R. C. McEachin, C. Santos, D. J. States, and J. M. Patel. SAGA: a subgraph matching tool for biological graphs. Bioinformatics, 2007.
[9] N. Vanetik, S. E. Shimony, and E. Gudes. Support measures for graph data. Data Min. Knowl. Discov., 13(2), 2006.
[10] X. Yan and J. Han. gSpan: graph-based substructure pattern mining. In ICDM, pages 721-724, 2002.


Mining frequent Closed Graph Pattern Mining frequent Closed Graph Pattern Seminar aus maschninellem Lernen Referent: Yingting Fan 5.November Fachbereich 21 Institut Knowledge Engineering Prof. Fürnkranz 1 Outline Motivation and introduction

More information

p v P r(v V opt ) = Algorithm 1 The PROMO algorithm for module identification.

p v P r(v V opt ) = Algorithm 1 The PROMO algorithm for module identification. BIOINFORMATICS Vol. no. 6 Pages 1 PROMO : A Method for identifying modules in protein interaction networks Omer Tamuz, Yaron Singer, Roded Sharan School of Computer Science, Tel Aviv University, Tel Aviv,

More information

Clustering Using Graph Connectivity

Clustering Using Graph Connectivity Clustering Using Graph Connectivity Patrick Williams June 3, 010 1 Introduction It is often desirable to group elements of a set into disjoint subsets, based on the similarity between the elements in the

More information

Practice Problems for the Final

Practice Problems for the Final ECE-250 Algorithms and Data Structures (Winter 2012) Practice Problems for the Final Disclaimer: Please do keep in mind that this problem set does not reflect the exact topics or the fractions of each

More information

Optimization I : Brute force and Greedy strategy

Optimization I : Brute force and Greedy strategy Chapter 3 Optimization I : Brute force and Greedy strategy A generic definition of an optimization problem involves a set of constraints that defines a subset in some underlying space (like the Euclidean

More information

FP-Growth algorithm in Data Compression frequent patterns

FP-Growth algorithm in Data Compression frequent patterns FP-Growth algorithm in Data Compression frequent patterns Mr. Nagesh V Lecturer, Dept. of CSE Atria Institute of Technology,AIKBS Hebbal, Bangalore,Karnataka Email : nagesh.v@gmail.com Abstract-The transmission

More information

Survey on Graph Query Processing on Graph Database. Presented by FAN Zhe

Survey on Graph Query Processing on Graph Database. Presented by FAN Zhe Survey on Graph Query Processing on Graph Database Presented by FA Zhe utline Introduction of Graph and Graph Database. Background of Subgraph Isomorphism. Background of Subgraph Query Processing. Background

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

A Multiple-Line Fitting Algorithm Without Initialization Yan Guo

A Multiple-Line Fitting Algorithm Without Initialization Yan Guo A Multiple-Line Fitting Algorithm Without Initialization Yan Guo Abstract: The commonest way to fit multiple lines is to use methods incorporate the EM algorithm. However, the EM algorithm dose not guarantee

More information

SeqIndex: Indexing Sequences by Sequential Pattern Analysis

SeqIndex: Indexing Sequences by Sequential Pattern Analysis SeqIndex: Indexing Sequences by Sequential Pattern Analysis Hong Cheng Xifeng Yan Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign {hcheng3, xyan, hanj}@cs.uiuc.edu

More information

Computational Molecular Biology

Computational Molecular Biology Computational Molecular Biology Erwin M. Bakker Lecture 3, mainly from material by R. Shamir [2] and H.J. Hoogeboom [4]. 1 Pairwise Sequence Alignment Biological Motivation Algorithmic Aspect Recursive

More information

A Model of Machine Learning Based on User Preference of Attributes

A Model of Machine Learning Based on User Preference of Attributes 1 A Model of Machine Learning Based on User Preference of Attributes Yiyu Yao 1, Yan Zhao 1, Jue Wang 2 and Suqing Han 2 1 Department of Computer Science, University of Regina, Regina, Saskatchewan, Canada

More information

Faster parameterized algorithms for Minimum Fill-In

Faster parameterized algorithms for Minimum Fill-In Faster parameterized algorithms for Minimum Fill-In Hans L. Bodlaender Pinar Heggernes Yngve Villanger Technical Report UU-CS-2008-042 December 2008 Department of Information and Computing Sciences Utrecht

More information

Value Added Association Rules

Value Added Association Rules Value Added Association Rules T.Y. Lin San Jose State University drlin@sjsu.edu Glossary Association Rule Mining A Association Rule Mining is an exploratory learning task to discover some hidden, dependency

More information

Fundamental Properties of Graphs

Fundamental Properties of Graphs Chapter three In many real-life situations we need to know how robust a graph that represents a certain network is, how edges or vertices can be removed without completely destroying the overall connectivity,

More information

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1

Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanford.edu) January 11, 2018 Lecture 2 - Graph Theory Fundamentals - Reachability and Exploration 1 In this lecture

More information

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data

Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Outlier Detection Using Unsupervised and Semi-Supervised Technique on High Dimensional Data Ms. Gayatri Attarde 1, Prof. Aarti Deshpande 2 M. E Student, Department of Computer Engineering, GHRCCEM, University

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret

Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Advanced Algorithms Class Notes for Monday, October 23, 2012 Min Ye, Mingfu Shao, and Bernard Moret Greedy Algorithms (continued) The best known application where the greedy algorithm is optimal is surely

More information

Special course in Computer Science: Advanced Text Algorithms

Special course in Computer Science: Advanced Text Algorithms Special course in Computer Science: Advanced Text Algorithms Lecture 8: Multiple alignments Elena Czeizler and Ion Petre Department of IT, Abo Akademi Computational Biomodelling Laboratory http://www.users.abo.fi/ipetre/textalg

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Mining Graph/Network Data: Part I Instructor: Yizhou Sun yzsun@ccs.neu.edu November 12, 2013 Announcement Homework 4 will be out tonight Due on 12/2 Next class will be canceled

More information

Association Pattern Mining. Lijun Zhang

Association Pattern Mining. Lijun Zhang Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

More information

Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References q Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

More information

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology 9/9/ I9 Introduction to Bioinformatics, Clustering algorithms Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Outline Data mining tasks Predictive tasks vs descriptive tasks Example

More information

Mining Edge-disjoint Patterns in Graph-relational Data

Mining Edge-disjoint Patterns in Graph-relational Data Mining Edge-disjoint Patterns in Graph-relational Data Christopher Besemann and Anne Denton North Dakota State University Department of Computer Science Fargo ND, USA christopher.besemann@ndsu.edu, anne.denton@ndsu.edu

More information

On the Space-Time Trade-off in Solving Constraint Satisfaction Problems*

On the Space-Time Trade-off in Solving Constraint Satisfaction Problems* Appeared in Proc of the 14th Int l Joint Conf on Artificial Intelligence, 558-56, 1995 On the Space-Time Trade-off in Solving Constraint Satisfaction Problems* Roberto J Bayardo Jr and Daniel P Miranker

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences

BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences BIOL591: Introduction to Bioinformatics Alignment of pairs of sequences Reading in text (Mount Bioinformatics): I must confess that the treatment in Mount of sequence alignment does not seem to me a model

More information

Data Mining Technologies for Bioinformatics Sequences

Data Mining Technologies for Bioinformatics Sequences Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment

More information

A Graph-Based Approach for Mining Closed Large Itemsets

A Graph-Based Approach for Mining Closed Large Itemsets A Graph-Based Approach for Mining Closed Large Itemsets Lee-Wen Huang Dept. of Computer Science and Engineering National Sun Yat-Sen University huanglw@gmail.com Ye-In Chang Dept. of Computer Science and

More information

Sub-Graph Finding Information over Nebula Networks

Sub-Graph Finding Information over Nebula Networks ISSN (e): 2250 3005 Volume, 05 Issue, 10 October 2015 International Journal of Computational Engineering Research (IJCER) Sub-Graph Finding Information over Nebula Networks K.Eswara Rao $1, A.NagaBhushana

More information

Monotone Constraints in Frequent Tree Mining

Monotone Constraints in Frequent Tree Mining Monotone Constraints in Frequent Tree Mining Jeroen De Knijf Ad Feelders Abstract Recent studies show that using constraints that can be pushed into the mining process, substantially improves the performance

More information

Mining Closed Relational Graphs with Connectivity Constraints

Mining Closed Relational Graphs with Connectivity Constraints Mining Closed Relational Graphs with Connectivity Constraints Xifeng Yan Computer Science Univ. of Illinois at Urbana-Champaign xyan@uiuc.edu X. Jasmine Zhou Molecular and Computational Biology Univ. of

More information

Measuring and Evaluating Dissimilarity in Data and Pattern Spaces

Measuring and Evaluating Dissimilarity in Data and Pattern Spaces Measuring and Evaluating Dissimilarity in Data and Pattern Spaces Irene Ntoutsi, Yannis Theodoridis Database Group, Information Systems Laboratory Department of Informatics, University of Piraeus, Greece

More information

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search

Clustering. Informal goal. General types of clustering. Applications: Clustering in information search and analysis. Example applications in search Informal goal Clustering Given set of objects and measure of similarity between them, group similar objects together What mean by similar? What is good grouping? Computation time / quality tradeoff 1 2

More information

FastA & the chaining problem

FastA & the chaining problem FastA & the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 1 Sources for this lecture: Lectures by Volker Heun, Daniel Huson and Knut Reinert,

More information

Scan Scheduling Specification and Analysis

Scan Scheduling Specification and Analysis Scan Scheduling Specification and Analysis Bruno Dutertre System Design Laboratory SRI International Menlo Park, CA 94025 May 24, 2000 This work was partially funded by DARPA/AFRL under BAE System subcontract

More information

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms

Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Lecture 16, Spring 2014 Instructor: 罗国杰 gluo@pku.edu.cn In This Lecture Parallel formulations of some important and fundamental

More information

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:

FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10: FastA and the chaining problem, Gunnar Klau, December 1, 2005, 10:56 4001 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem

More information

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures

Enhancing Clustering Results In Hierarchical Approach By Mvs Measures International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.25-30 Enhancing Clustering Results In Hierarchical Approach

More information

Local Algorithms for Sparse Spanning Graphs

Local Algorithms for Sparse Spanning Graphs Local Algorithms for Sparse Spanning Graphs Reut Levi Dana Ron Ronitt Rubinfeld Intro slides based on a talk given by Reut Levi Minimum Spanning Graph (Spanning Tree) Local Access to a Minimum Spanning

More information

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining D.Kavinya 1 Student, Department of CSE, K.S.Rangasamy College of Technology, Tiruchengode, Tamil Nadu, India 1

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Graphs and Network Flows IE411. Lecture 21. Dr. Ted Ralphs

Graphs and Network Flows IE411. Lecture 21. Dr. Ted Ralphs Graphs and Network Flows IE411 Lecture 21 Dr. Ted Ralphs IE411 Lecture 21 1 Combinatorial Optimization and Network Flows In general, most combinatorial optimization and integer programming problems are

More information

MARGIN: Maximal Frequent Subgraph Mining Λ

MARGIN: Maximal Frequent Subgraph Mining Λ MARGIN: Maximal Frequent Subgraph Mining Λ Lini T Thomas Satyanarayana R Valluri Kamalakar Karlapalem enter For Data Engineering, IIIT, Hyderabad flini,satyag@research.iiit.ac.in, kamal@iiit.ac.in Abstract

More information

FUTURE communication networks are expected to support

FUTURE communication networks are expected to support 1146 IEEE/ACM TRANSACTIONS ON NETWORKING, VOL 13, NO 5, OCTOBER 2005 A Scalable Approach to the Partition of QoS Requirements in Unicast and Multicast Ariel Orda, Senior Member, IEEE, and Alexander Sprintson,

More information

A Framework for Clustering Massive Text and Categorical Data Streams

A Framework for Clustering Massive Text and Categorical Data Streams A Framework for Clustering Massive Text and Categorical Data Streams Charu C. Aggarwal IBM T. J. Watson Research Center charu@us.ibm.com Philip S. Yu IBM T. J.Watson Research Center psyu@us.ibm.com Abstract

More information

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006

2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 2386 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 6, JUNE 2006 The Encoding Complexity of Network Coding Michael Langberg, Member, IEEE, Alexander Sprintson, Member, IEEE, and Jehoshua Bruck,

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank

More information

Chapters 11 and 13, Graph Data Mining

Chapters 11 and 13, Graph Data Mining CSI 4352, Introduction to Data Mining Chapters 11 and 13, Graph Data Mining Young-Rae Cho Associate Professor Department of Computer Science Balor Universit Graph Representation Graph An ordered pair GV,E

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

Discovering Frequent Geometric Subgraphs

Discovering Frequent Geometric Subgraphs Discovering Frequent Geometric Subgraphs Michihiro Kuramochi and George Karypis Department of Computer Science/Army HPC Research Center University of Minnesota 4-192 EE/CS Building, 200 Union St SE Minneapolis,

More information

Towards Graph Containment Search and Indexing

Towards Graph Containment Search and Indexing Towards Graph Containment Search and Indexing Chen Chen 1 Xifeng Yan 2 Philip S. Yu 2 Jiawei Han 1 Dong-Qing Zhang 3 Xiaohui Gu 2 1 University of Illinois at Urbana-Champaign {cchen37, hanj}@cs.uiuc.edu

More information

A Partition Method for Graph Isomorphism

A Partition Method for Graph Isomorphism Available online at www.sciencedirect.com Physics Procedia ( ) 6 68 International Conference on Solid State Devices and Materials Science A Partition Method for Graph Isomorphism Lijun Tian, Chaoqun Liu

More information

A Connection between Network Coding and. Convolutional Codes

A Connection between Network Coding and. Convolutional Codes A Connection between Network Coding and 1 Convolutional Codes Christina Fragouli, Emina Soljanin christina.fragouli@epfl.ch, emina@lucent.com Abstract The min-cut, max-flow theorem states that a source

More information

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT

USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT IADIS International Conference Applied Computing 2006 USING AN EXTENDED SUFFIX TREE TO SPEED-UP SEQUENCE ALIGNMENT Divya R. Singh Software Engineer Microsoft Corporation, Redmond, WA 98052, USA Abdullah

More information

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA

UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA UAPRIORI: AN ALGORITHM FOR FINDING SEQUENTIAL PATTERNS IN PROBABILISTIC DATA METANAT HOOSHSADAT, SAMANEH BAYAT, PARISA NAEIMI, MAHDIEH S. MIRIAN, OSMAR R. ZAÏANE Computing Science Department, University

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

DATA MINING II - 1DL460. Spring 2014"

DATA MINING II - 1DL460. Spring 2014 DATA MINING II - 1DL460 Spring 2014" A second course in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt14 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

Locality-sensitive hashing and biological network alignment

Locality-sensitive hashing and biological network alignment Locality-sensitive hashing and biological network alignment Laura LeGault - University of Wisconsin, Madison 12 May 2008 Abstract Large biological networks contain much information about the functionality

More information

CMTreeMiner: Mining Both Closed and Maximal Frequent Subtrees

CMTreeMiner: Mining Both Closed and Maximal Frequent Subtrees MTreeMiner: Mining oth losed and Maximal Frequent Subtrees Yun hi, Yirong Yang, Yi Xia, and Richard R. Muntz University of alifornia, Los ngeles, 90095, US {ychi,yyr,xiayi,muntz}@cs.ucla.edu bstract. Tree

More information

C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking

C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking C-Cubing: Efficient Computation of Closed Cubes by Aggregation-Based Checking Dong Xin Zheng Shao Jiawei Han Hongyan Liu University of Illinois at Urbana-Champaign, Urbana, IL 6, USA Tsinghua University,

More information

Pincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set

Pincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set Pincer-Search: An Efficient Algorithm for Discovering the Maximum Frequent Set Dao-I Lin Telcordia Technologies, Inc. Zvi M. Kedem New York University July 15, 1999 Abstract Discovering frequent itemsets

More information

Mining Top K Large Structural Patterns in a Massive Network

Mining Top K Large Structural Patterns in a Massive Network Mining Top K Large Structural Patterns in a Massive Network Feida Zhu Singapore Management University fdzhu@smu.edu.sg Xifeng Yan University of California at Santa Barbara xyan@cs.ucsb.edu Qiang Qu Peking

More information

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010

Principles of Bioinformatics. BIO540/STA569/CSI660 Fall 2010 Principles of Bioinformatics BIO540/STA569/CSI660 Fall 2010 Lecture 11 Multiple Sequence Alignment I Administrivia Administrivia The midterm examination will be Monday, October 18 th, in class. Closed

More information

BLAST & Genome assembly

BLAST & Genome assembly BLAST & Genome assembly Solon P. Pissis Tomáš Flouri Heidelberg Institute for Theoretical Studies May 15, 2014 1 BLAST What is BLAST? The algorithm 2 Genome assembly De novo assembly Mapping assembly 3

More information

Efficient homomorphism-free enumeration of conjunctive queries

Efficient homomorphism-free enumeration of conjunctive queries Efficient homomorphism-free enumeration of conjunctive queries Jan Ramon 1, Samrat Roy 1, and Jonny Daenen 2 1 K.U.Leuven, Belgium, Jan.Ramon@cs.kuleuven.be, Samrat.Roy@cs.kuleuven.be 2 University of Hasselt,

More information

Chapter 2 Basic Structure of High-Dimensional Spaces

Chapter 2 Basic Structure of High-Dimensional Spaces Chapter 2 Basic Structure of High-Dimensional Spaces Data is naturally represented geometrically by associating each record with a point in the space spanned by the attributes. This idea, although simple,

More information

Mining Frequent Patterns without Candidate Generation

Mining Frequent Patterns without Candidate Generation Mining Frequent Patterns without Candidate Generation Outline of the Presentation Outline Frequent Pattern Mining: Problem statement and an example Review of Apriori like Approaches FP Growth: Overview

More information

CS 341: Algorithms. Douglas R. Stinson. David R. Cheriton School of Computer Science University of Waterloo. February 26, 2019

CS 341: Algorithms. Douglas R. Stinson. David R. Cheriton School of Computer Science University of Waterloo. February 26, 2019 CS 341: Algorithms Douglas R. Stinson David R. Cheriton School of Computer Science University of Waterloo February 26, 2019 D.R. Stinson (SCS) CS 341 February 26, 2019 1 / 296 1 Course Information 2 Introduction

More information

On Dense Pattern Mining in Graph Streams

On Dense Pattern Mining in Graph Streams On Dense Pattern Mining in Graph Streams [Extended Abstract] Charu C. Aggarwal IBM T. J. Watson Research Ctr Hawthorne, NY charu@us.ibm.com Yao Li, Philip S. Yu University of Illinois at Chicago Chicago,

More information

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures

Lectures by Volker Heun, Daniel Huson and Knut Reinert, in particular last years lectures 4 FastA and the chaining problem We will discuss: Heuristics used by the FastA program for sequence alignment Chaining problem 4.1 Sources for this lecture Lectures by Volker Heun, Daniel Huson and Knut

More information

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS

PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PACKING DIGRAPHS WITH DIRECTED CLOSED TRAILS PAUL BALISTER Abstract It has been shown [Balister, 2001] that if n is odd and m 1,, m t are integers with m i 3 and t i=1 m i = E(K n) then K n can be decomposed

More information

Introduction III. Graphs. Motivations I. Introduction IV

Introduction III. Graphs. Motivations I. Introduction IV Introduction I Graphs Computer Science & Engineering 235: Discrete Mathematics Christopher M. Bourke cbourke@cse.unl.edu Graph theory was introduced in the 18th century by Leonhard Euler via the Königsberg

More information

Finding frequent closed itemsets with an extended version of the Eclat algorithm

Finding frequent closed itemsets with an extended version of the Eclat algorithm Annales Mathematicae et Informaticae 48 (2018) pp. 75 82 http://ami.uni-eszterhazy.hu Finding frequent closed itemsets with an extended version of the Eclat algorithm Laszlo Szathmary University of Debrecen,

More information

Mining Interesting Itemsets in Graph Datasets

Mining Interesting Itemsets in Graph Datasets Mining Interesting Itemsets in Graph Datasets Boris Cule Bart Goethals Tayena Hendrickx Department of Mathematics and Computer Science University of Antwerp firstname.lastname@ua.ac.be Abstract. Traditionally,

More information

Faster parameterized algorithms for Minimum Fill-In

Faster parameterized algorithms for Minimum Fill-In Faster parameterized algorithms for Minimum Fill-In Hans L. Bodlaender Pinar Heggernes Yngve Villanger Abstract We present two parameterized algorithms for the Minimum Fill-In problem, also known as Chordal

More information

Discovering Frequent Topological Structures from Graph Datasets

Discovering Frequent Topological Structures from Graph Datasets Discovering Frequent Topological Structures from Graph Datasets R. Jin C. Wang D. Polshakov S. Parthasarathy G. Agrawal Department of Computer Science and Engineering Ohio State University, Columbus OH

More information

Introduction to Computer Science and Programming for Astronomers

Introduction to Computer Science and Programming for Astronomers Introduction to Computer Science and Programming for Astronomers Lecture 7. István Szapudi Institute for Astronomy University of Hawaii February 21, 2018 Outline 1 Reminder 2 Reminder We have seen that

More information

EFFICIENT TRANSACTION REDUCTION IN ACTIONABLE PATTERN MINING FOR HIGH VOLUMINOUS DATASETS BASED ON BITMAP AND CLASS LABELS

EFFICIENT TRANSACTION REDUCTION IN ACTIONABLE PATTERN MINING FOR HIGH VOLUMINOUS DATASETS BASED ON BITMAP AND CLASS LABELS EFFICIENT TRANSACTION REDUCTION IN ACTIONABLE PATTERN MINING FOR HIGH VOLUMINOUS DATASETS BASED ON BITMAP AND CLASS LABELS K. Kavitha 1, Dr.E. Ramaraj 2 1 Assistant Professor, Department of Computer Science,

More information

Chapter 4: Association analysis:

Chapter 4: Association analysis: Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily

More information

Efficient Incremental Mining of Top-K Frequent Closed Itemsets

Efficient Incremental Mining of Top-K Frequent Closed Itemsets Efficient Incremental Mining of Top- Frequent Closed Itemsets Andrea Pietracaprina and Fabio Vandin Dipartimento di Ingegneria dell Informazione, Università di Padova, Via Gradenigo 6/B, 35131, Padova,

More information

Character Recognition

Character Recognition Character Recognition 5.1 INTRODUCTION Recognition is one of the important steps in image processing. There are different methods such as Histogram method, Hough transformation, Neural computing approaches

More information