Mining more complex patterns: frequent subsequences and subgraphs. Department of Computers, Czech Technical University in Prague

Size: px
Start display at page:

Download "Mining more complex patterns: frequent subsequences and subgraphs. Department of Computers, Czech Technical University in Prague"

Transcription

1 Mining more complex patterns: frequent subsequences and subgraphs Jiří Kléma Department of Computers, Czech Technical University in Prague

2 poutline Motivation for frequent subsequence/subgraph search applications, variance in tasks, what do we already know? connection to itemsets, what changed? larger state spaces, special approaches to special tasks make sense, complexity of canonical code words, algorithms and canonical forms GSP algorithm (Agrawal s APRIORI generalization), FreeSpan for depth-first search, graph canonical forms based on spanning trees, summary the issues covered and not covered. A4M33SAD

3 pfrequent subsequences DNA example motif discovery searches for short sequential patterns in a file of unaligned DNA or protein sequences, searches for discriminative patterns (characteristics) typical for one sequence class, unusual in the other classes, this pattern could relate with the biological/regulation function of the (protein) class, transcription factor interacts with DNA through a particular motif, frequent subsequence search is a subtask, event = nucleotide, string (no time), undirected DNA. A4M33SAD

4 pfrequent subgraphs molecular fragments example acceleration of drug development, ex.: protection of human CEM cells against an HIV infection (public data), high-throughput screening of chemical compounds (37,171 substances tested) 325 confirmed active (100% protection against infection), 877 moderately active (50-99% protection against infection), others confirmed inactive (<50% protection against infection), task: why some compounds active and others not?, where to aim future screening? Borgelt: Frequent Pattern Mining. A4M33SAD

5 pfrequent subgraphs molecular fragments example search for fragments common for the active substances find molecular substructures that frequently appear in active substances, frequent active patterns = subgraphs, search for discriminative patterns we add the requirement that patterns appear only rarely in the inactive molecules, where to aim future tests? what is the most promising pharmacophore, i.e., drug candidate? Borgelt: Frequent Pattern Mining. A4M33SAD

6 pmore complex patterns the global picture The course of presentation directed sequences undirected sequences generalized sequences connected graphs. A4M33SAD

7 pfrequent subsequences similarity to frequent itemsets first of all, similarity in task representation, the process can be intrinsically identical, but we ask different questions itemsets: which insurance contracts people arrange concurrently, sequences: how people arrange insurance contracts in their course of life, transaction representation still formally possible and helpful (universal) more factors must be concerned and stored. Transaction Items (insurance type) t 1 home, life t 2 car, home t 3 pension, life t 4 travel t 5 pension, life Customer Date (time) Items (insurance type) c home, life c travel c car, pension c car, home c pension A4M33SAD

8 pfrequent subsequences similarity to frequent itemsets secondly, similarity in terms of task solution, APRIORI property can easily be generalized for sequences: Each subsequence of a frequent sequence is frequent. the anti-monotone property can also be transformed to monotone one: No supersequence of an infrequent sequence can be frequent. this generalization further extends to general graphs, the model APRIORI-like algorithm for sequential data a direct analogy of the APRIORI algorithm for itemsets, the basic operations (informal a reminder only): 1. search for trivial frequent sequences (typically of the length 0 or 1), 2. generate candidate sequences with the length incremented by 1, 3. check for their actual support in the transaction database, 4. reduce the candidate sequence set a subset of frequent sequences of the given length is created, 5. until the frequent sequence set non-empty go to the step 2. A4M33SAD

9 pfrequent substrings a trivial APRIORI application a string a directed sequence, an equidistant step, exactly one item per transaction, events given by a symbol alphabet, a pattern is an ordered list of neighboring events, a 1... a m is a subsequence of b 1... b n iff i a 1 = b i a m = b i+m. Ex.: DNA sequence (n = 20, A = {a, g, t}) i C i L i 1 {a}, {g}, {t} {a}, {g}, {t} 2 (9 patterns) {aa}, {ga}, {gg}, {gt}, {tg}, {tt} 3 {aaa}, {gaa}, {gga},... (12 patterns) {gaa}, {ggg}, {gtt}, {tga}, {ttg} 4 {gggg}, {gttg}, {tgaa}, {ttga} {gggg}, {tgaa}, {ttga} 5 {ggggg}, {ttgaa} {ttgaa} how to check for support quickly, i.e. how to find all subsequence occurrences in a sequence? algorithms Knuth-Morris-Pratt or Boyer-Moore. A4M33SAD

10 pcanonical form for sequences a canonical (standard) code word a unique sequence representation, based on the symbol alphabet ordering, a usual (not necessary) choice: the lexicographical symbol alphabet ordering a < b < c <..., the lexicographically smallest (smaller) code word taken as canonical (bac < cab), a directed sequence the only interpretation (way of reading), each (sub)sequence is a canonical code word, an undirected sequence two possible ways of reading = two alternative code words, the routine application of lexicographical ordering is not possible, prefix property in a space of canonical code words does not hold: every prefix of a canonical word is a canonical word itself, sequence canonical form prefix canonical form bab bab ba ab cabd cabd cab bac we have to find a different way of forming code words. A4M33SAD

11 pcanonical form for undirected sequences The canonical code words with the prefix property will be formed as follows even and odd length words will be handled separately, code words are started in the middle of sequence, even length odd length sequence a m a m 1... a 2 a 1 b 1 b 2... b m 1 b m a m a m 1... a 2 a 1 a 0 b 1 b 2... b m 1 b m code word a 1 b 1 a 2 b 2... a m 1 b m 1 a m b m a 0 a 1 b 1 a 2 b 2... a m 1 b m 1 a m b m code word b 1 a 1 b 2 a 2... b m 1 a m 1 b m a m a 0 b 1 a 1 b 2 a 2... b m 1 a m 1 b m a m canonical is the lexicographically smaller code word in the table, the sequence is extended by adding a pair a m+1 b m+1 or b m+1 a m+1, one item at the front and one item at the end. an example even length odd length sequence code words sequence code words at at ta ule lue leu data atda taad rules luers leusr A4M33SAD

12 pcanonical form for undirected sequences efficiency two possible code words can be created and compared in O(m), an additional symmetry flag introduced for each sequence enables the same in O(1) m s m = (a i = b i ) i=1 the symmetry flag is maintained in constant time with s m+1 = s m (a m+1 = b m+1 ) sequence extension is permissible when the flag: if s m = true, it must be a m+1 b m+1, if s m = false, any relation between a m+1 and b m+1 is possible. sequences and symmetry flags at the beginning even length: an empty sequence, s 0 = 1, odd length: all frequent alphabet symbols, s 1 = 1, the procedure guarantees exclusively the canonical sequence extensions. A4M33SAD

13 pfrequent subsequences APRIORI application to undirected sequences consider undirected sequences, otherwise the formalization as yet a 1... a m is a subsequence of b 1... b n if: i a 1 = b i a m = b i+m, or i a 1 = b i+m a m = b i, ex.: DNA sequence (n = 20, A = {a, g, t}) i C i L i 0 {} {} 1 {a}, {g}, {t} {a}, {g}, {t} 2 {aa}, {ag}, {at}, {gg}, {gt}, {tt} {aa}, {gg}, {gt}, {tt} 3 {aaa}, {aag}, {aat}, {gag}, {gat}, {tat}, {aga}, {agg}, {agt}, {ggg}, {gtt} {ggg}, {ggt}, {tgt}, {ata}, {atg}, {att}, {gtg}, {gtt}, {ttt} 4 {aaaa}, {aaag}, {aaat}, {gaag}, {gaat}, {taat}, {gggg} {agta}, {agtg}, {ggta}, {agtt}, {tgta}, {ggtg}, {ggtt}, {tgtg},... in total 27 (1) patterns A4M33SAD

14 pa generalized subsequence definition in transactional representation Items: I = {i 1, i 2,..., i m }, itemsets: (x 1, x 2,..., x k ) I, k 1, x i I, sequences: s 1,..., s n, s i = (x 1, x 2,..., x k ) I, s i, x 1 < x 2 <... < x k, an ordered list of elements, elements = itemsets, the canonical representation: lexicographical ordering of items in each itemset, ex.: a(abc)(ac)d(cf), a simplification of the form: (x i ) x i, the sequence length l (l-sequence) given by the number of item instances (occurrences) in sequence, ex.: a(abc)(ac)d(cf) is a 9-sequence, α is a subsequence of β, β is a supersequence of α: α β α = a 1,..., a n, β = b 1,..., b m, 1 j 1 <... < j n m, i = 1... n : a i b ji, ex.: a(bc)df a(abc)(ac)d(cf), d(ab) a(abc)(ac)d(cf), non-contiguous sequential patterns, a sequence database: S = { sid 1, s 1..., sid k, s k } a set of ordered pairs a sequence identifier and a sequence. A4M33SAD

15 psubsequence search in transaction representation Support of α sequence in the database S the number of sequences s S satisfying: α s, Subsequence search in transaction representation, task definition input: S a s min minimum support, output: the complete set of frequent sequential patterns all the subsequences with or above the threshold frequency. Id Sequence 10 a(abc)(ac)d(cf) 20 (ad)c(bc)(ae) 30 (ef)(ab)(df)cb 40 eg(af)cbc Id Time Items 10 t 1 a 10 t 2 a, b, c 10 t 3 a, c 10 t 4 d 10 t 5 c, f l sequential pattern (s min =2) 3 a(bc), aba, abc, (ab)c, (ab)d, (ab)f, aca, acb, acc, adc,... 4 a(bc)a, (ab)dc,... Pei, Han et al.: PrefixSpan: Mining Sequential Patterns by Prefix-Projected Growth. A4M33SAD

16 pgsp: Generalized Sequential Patterns [Agrawal, Srikant, 1996] applies the core idea of APRIORI to sequential data, the key issue is generation of the candidate sequential patterns divided into two steps 1. join l-sequence is created by joining of two (l-1)-sequences, (l-1)-sequences can be joined when identical after removal of the first item in one and the last one in second, 2. prune skip each l-sequence which contains an infrequent (l-1)-subsequence. C L 4 3 after join after prune (ab)c, (ab)d, (ab)(cd) (ab)(cd) a(cd), (ac)e, (ab)ce b(cd), bce Agrawal, Srikant: Mining Sequential Patterns: Generalizations and Performance. A4M33SAD

17 pexample: GSP, s min =2 Id Sequence 10 (bd)cb(ac) 20 (bf)(ce)b(fg) 30 (ah)(bf)abf 40 (be)(ce)d 50 a(bd)bcb(ade) s( g ) = s( h ) = 1 < s min (skips a large portion of 92 available 2-candidates), (bd)cba s 10 (bd)cba s 50 (created from (bd)cb a dcba ), (the patterns (bd)ba, (bd)ca and bcba must also be frequent). A4M33SAD

18 papriori algorithm for sequences disadvantages the generate (join step) and test (prune step) method, the problems discussed in terms of frequent itemsets persist and intensify 1. generates a large amount of candidate patterns obvious even for 2-sequences: m m + m(m 1) 2 O(m 2 ) (for itemsets it was just the second fraction, one third or so of candidates), 2. requires a lot of database scans one scan per sequence length, the number of scans given by the max pattern length the length max( s, s S) (typically >> m), (max itemset length is m and thus m scans at most), 3. search for long sequential patterns is difficult the total amount of candidate patterns is exponential with the pattern length, (the same growth as for itemsets, however the problem max( s, s S) >> m). the disadvantages addressed by alternative methods FreeSpan and PrefixSpan algorithms as examples. A4M33SAD

19 pepisode rules association rule analogy for sequences, predict the further development of sequence with the aid of patterns, S is a sequence database, β a frequent subsequence and α is its prefix, episode rule is a probabilistic implication α postfix(β, α), if conf(α postfix(β, α)) = s(β,s) s(α,s) conf min. Id Sequence 10 (bd)cb(ac) 20 (bf)(ce)b(fg) 30 (ah)(bf)abf 40 (be)(ce)d 50 a(bd)bcb(ade) let s min = 2, conf min = 0.7 β = (bd)cba, α = (bd) conf( (bd) cba ) = 1 conf min, β = bcb, α = bc conf( bc b ) = 3 4 conf min, when dealing with a single sequence S the support definition works with constraints (adjoining) sequence elements must not be too far (MaxGap), only the minimal occurences of a sequence are considered. A4M33SAD

20 pfrequent subgraph mining: definition given: graphs G = {G 1,..., G n } with labels A = {a 1,..., a m } and minimum support s min output: the set of frequent (sub)graphs with support meeting the minimum threshold F G (s min ) = {S s G (S) s min }, common constraint connected subgraphs only, main problem to avoid redundancy when searching canonical representation of (sub)graphs, partial order of (sub)graph space, efficient pruning of the searched subgraph space, fragment repository for processed graphs. subgraph S G informally: omit some vertices and (their incident) edges, proper subgraph S G, cannot be identical with the original graph, NP-complete subgraph isomorphism problem needs to be solved to count support! A4M33SAD

21 pfrequent subgraphs: example G contains three molecules, minimum support s min = 2, 15 frequent subgraphs exist, empty graph is properly contained in all graphs by definition. Borgelt: Frequent Pattern Mining. The same for the rest of the presentation. A4M33SAD

22 ptypes of frequent subgraphs closed and maximal maximal subgraph is frequent but none of its proper supergraphs is frequent, the set of maximal (sub)graphs: M G (s min ) = {S s G (S) s min R S : s G (R) < s min }, every frequent (sub)graph has a maximal supergraph, no supergraph of a maximal (sub)graph is frequent, M G (s min ) does not preserve knowledge of support values of all frequent subgraphs, closed subgraph is frequent but none of its proper supergraphs has the same support, the set of closed (sub)graphs: C G (s min ) = {S s G (S) s min R S : s G (R) < s G (S)}, every frequent (sub)graph has a closed supergraph (with the identical support), C G (s min ) preserves knowledge of support values of all frequent subgraphs, every maximal subgraph is also closed. A4M33SAD

23 pclosed and maximal subgraphs: example G contains three molecules, s min = 2, 4 closed subgraphs exist, 2 of them are maximal too. A4M33SAD

24 ppartially ordered set of subgraphs and its search subgraph (isomorphism) relationship defines a partial order on subgraphs Hasse diagram exists, the empty graph makes its infimum, no natural supremum exists, diagram can be completely searched top-down from the empty graph, branching factor is large, the depth-first search is usually preferable. the main problem a (sub)graph can be grown in several different ways, diagram must be turned into a tree each subgraph has a unique parent. A4M33SAD

25 ppartially ordered set of subgraphs and its search Searching for frequent (sub)graphs in the subgraph tree, base loop traverse all possible vertex attributes (their unique parent is the empty graph), recursively process all vertex attributes that are frequent, Recursive processing for a given frequent (sub)graph S generate all extensions R of S by an edge or by an edge and a vertex edge addition (u, v) E S, u V S v V S, if u V S v V S, the missing node is added too, S must be the unique parent of R, if R is frequent, further extend it, otherwise STOP. A4M33SAD

26 passigning unique parents How can we formally define the set of parents of subgraph S? subgraphs that contain exactly one edge less than the subgraph S, in other words, all the maximal proper subgraphs, canonical (unique) parent p c (S) of subgraph S an order on the edges of the (sub)graph S must be given before, let e be the last edge in the order whose removal does not diconnect S then p c (S) is the graph S without the edge e, if e leads to a leaf, we can remove it along with the created isolated node, if e is the only edge of S, we also need an order of the nodes, in order to define an order of the edges we will rely on a canonical form of (sub)graphs each (sub)graph is described by a code word, it unambiguously identifies the (sub)graph (up to automorphism = symmetries), having multiple code words per graph one of them is (lexicographically) singled out as the canonical code word. A4M33SAD

27 pcanonical graph representation Basic idea the characters of the code word describe the edges of the graph, vertex labels need not be unique, they must be endowed with unique labels (numbers), usual requirement on canonical form prefix property, which means that... when the last edge e is removed, the canonical word of the canonical parent originates, assuming the prefix property holds, search algorithm takes the canonical word of a parent and generates all possible extensions by an edge (and maybe a vertex), checks whether the extended code words are the canonical code words, consequence: easy and non-redundant access to children, the most common canonical forms spanning tree, adjacency matrix. A4M33SAD

28 pcanonical forms based on spanning trees Graph code word is created when constructing a spanning tree of the graph numbering the vertices in the order in which they are visited, describing each edge by the numbers of incident vertices, the edge and vertex labels, listing the edge descriptions in the order in which the edges are visited (edges closing cycles may need special treatment), the most common ways of constructing a spanning tree are search: depth-first breadth-first, both approaches ask for their own way of code word construction, one graph may be described by a large number of code words a graph has multiple spanning trees and its traversals (initial vertex, branching options), the main problem is to find the lexicographically smallest = canonical word quickly. A4M33SAD

29 pcanonical forms based on spanning trees A precedence order of labels is introduced due to efficiency, frequency of labels shall be concerned, vertex labels are recommended to be in ascending order, Regular expressions for code words depth-first: a (i d i s b a) m, (exception: indices in decreasing order) meaning of symbols: n the number of vertices of the graph, m the number of edges of the graph, i s index of the source vertex of an edge, i s {0,..., n 1}, i d index of the destination vertex of an edge, i d {0,..., n 1}, a the attribute of a vertex, b the attribute of an edge. A4M33SAD

30 pcanonical spanning tree: example Order of labels elements (vertices): S N O C, bonds (edges): =, A code word A: S 10-N 21-O 31-C 43-C 54-O 64=O 73-C 87-C 80-C A4M33SAD

31 precursive checking for canonical form traverse all vertices with a label no less than the current spanning tree root vertex, recursively add edges, compare the code word with the checked one (potentially canonical) if the new edge description is larger, the edge can be skipped (backtrack), if the new edge description is smaller, the checked code word is not canonical, if the new edge description is equal, the rest of the code word is processed recursively. A: S 10-N 21-O 31-C 43-C 54-O 64=O 73-C 87-C 80-C A4M33SAD

32 prestricted extensions of canonical words Principle of recursive search of subgraph tree generate all possible extensions of a given canonical code word (of a frequent parent), extension = add the description of an edge at the end of the code word, canonical form is checked, if met then proceed recursively, otherwise the word is discarded, how to verify efficiently whether a word is canonical? in general, a lex. smaller word with the same root vertex needs to be found, simple local rules exist that reject extensions locally = immediately only certain vertices are extendable, certain cycles cannot be closed, they represent necessary canonicity conditions, not sufficient. A4M33SAD

33 prestricted extensions of canonical words Depth-first: rightmost path extension extendable vertices must be on the rightmost path of the spanning tree (other vertices cannot be extended in the given search-tree branch), if the source vertex of the new edge is not a leaf, the edge description must not precede the description of the downward edge on the path (the edge attribute must be no less than the edge attribute of the downward edge, if it is equal, the attribute of its destination vertex must be no less than the attribute of the downward edge s destination vertex), edges closing cycles must start at an extendable vertex, must lead to the rightmost leaf (a subgraph has only one vertex meeting the condition), the index of the source vertex must precede the index of the source vertex of any edge already incident to the rightmost leaf. A4M33SAD

34 prestricted extensions: examples Extendability vertices: 0, 1, 3, 7, 8, edges closing cycles: none, Extension: attach a single bond carbon atom at the leftmost oxygen atom A: S 10-N 21-O 31-C 43-C 54-O 64=O 73-C 87-C 80-C 92-C S 10-N 21-O 32-C A4M33SAD

35 pfrequent subgraphs: use of restricted extensions Crossed-arc extension gets immediately rejected, bottom-right graph is visited from its upper-right canonical parent (prefix) only, see the canonical code words: upper left: S 10-N 21-O 31-C 43-C 54-O 64=O 73-C 87-C 80-C bottom left: S 10-N 21-O 31-C 43-C 54-O 64=O 73-C 87-C 80-C 90-O upper right: S 10-N 21-O 32-C 41-C 54-C 65-O 75=O 84-C 98-C bottom right: S 10-N 21-O 32-C 41-C 54-C 65-O 75=O 84-C 98-C 90-C A4M33SAD

36 pfrequent subgraphs search with canonical form Start with a single seed vertex, add an edge (and maybe a vertex) in each step (restricted extensions), determine the support and prune infrequent (sub)graphs (outside the code word space), check for canonical form and prune (sub)graphs with non-canonical code words. A4M33SAD

37 padditional issues Fragment repository canonical code words represent the dominant approach to redundancy reduction, an alternative is to store already processed subgraphs, they are not processed again, key efficiency issues: memory, fast access (hash), extensions for molecules frequent molecular fragments processed en bloc, ring mining, carbon chains and wildcard vertices, single graph only distinct definition of support (more complex), trees ordered unordered, rooted unrooted, in general easier than unrestricted graphs. A4M33SAD

38 pfrequent subsequence/subgraph mining summary Problem closely related to frequent itemset mining APRIORI property, however, to avoid redundancy during search gets more difficult larger branching factor, unlike complex patterns, itemsets have no internal structure, non-trivial canonical sequence/graph representation guarantees that support is counted at most once (additional necessary condition is parental support), choice of representation related with choice of searching algorithm, prefix property allows for early rejection of non-canonical candidates, the canonical form based on spanning trees introduced, support checking gets more time consuming e.g., subgraph isomorphism test for two graphs is NP-complete, special pattern types get dedicated treatment e.g., generalized sequences with gap constraints. A4M33SAD

39 precommended reading, lecture resources :: Reading Agrawal, Srikant: Mining Sequential Patterns. Agrawal, Srikant: MSPs: Generalizations and Performance. from APRIORI towards its sequential versions AprioriAll and GSP, Mannila et al.: Discovery of Frequent Episodes in Event Sequences. episodal rules, Borgelt: Frequent Pattern Mining. both frequent subsequence and subgraph mining, detailed slides, A4M33SAD

Frequent Pattern Mining

Frequent Pattern Mining Frequent Pattern Mining...3 Frequent Pattern Mining Frequent Patterns The Apriori Algorithm The FP-growth Algorithm Sequential Pattern Mining Summary 44 / 193 Netflix Prize Frequent Pattern Mining Frequent

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING Sequence Data: Sequential Pattern Mining Instructor: Yizhou Sun yzsun@cs.ucla.edu November 27, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification

More information

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler BBS654 Data Mining Pinar Duygulu Slides are adapted from Nazli Ikizler 1 Sequence Data Sequence Database: Timeline 10 15 20 25 30 35 Object Timestamp Events A 10 2, 3, 5 A 20 6, 1 A 23 1 B 11 4, 5, 6 B

More information

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining Chloé-Agathe Azencott & Karsten Borgwardt February 18 to March 1, 2013 Machine Learning & Computational Biology Research Group Max Planck Institutes

More information

Data Mining: Concepts and Techniques. Chapter Mining sequence patterns in transactional databases

Data Mining: Concepts and Techniques. Chapter Mining sequence patterns in transactional databases Data Mining: Concepts and Techniques Chapter 8 8.3 Mining sequence patterns in transactional databases Jiawei Han and Micheline Kamber Department of Computer Science University of Illinois at Urbana-Champaign

More information

Advance Association Analysis

Advance Association Analysis Advance Association Analysis 1 Minimum Support Threshold 3 Effect of Support Distribution Many real data sets have skewed support distribution Support distribution of a retail data set 4 Effect of Support

More information

CS6220: DATA MINING TECHNIQUES

CS6220: DATA MINING TECHNIQUES CS6220: DATA MINING TECHNIQUES Sequence Data Instructor: Yizhou Sun yzsun@ccs.neu.edu November 22, 2015 Announcement TRACE faculty survey myneu->self service tab Homeworks HW5 will be the last homework

More information

Lecture 10 Sequential Pattern Mining

Lecture 10 Sequential Pattern Mining Lecture 10 Sequential Pattern Mining Zhou Shuigeng June 3, 2007 Outline Sequence data Sequential patterns Basic algorithm for sequential pattern mining Advanced algorithms for sequential pattern mining

More information

Mining Frequent Patterns without Candidate Generation

Mining Frequent Patterns without Candidate Generation Mining Frequent Patterns without Candidate Generation Outline of the Presentation Outline Frequent Pattern Mining: Problem statement and an example Review of Apriori like Approaches FP Growth: Overview

More information

Chapter 13, Sequence Data Mining

Chapter 13, Sequence Data Mining CSI 4352, Introduction to Data Mining Chapter 13, Sequence Data Mining Young-Rae Cho Associate Professor Department of Computer Science Baylor University Topics Single Sequence Mining Frequent sequence

More information

On Canonical Forms for Frequent Graph Mining

On Canonical Forms for Frequent Graph Mining n anonical Forms for Frequent Graph Mining hristian Borgelt School of omputer Science tto-von-guericke-university of Magdeburg Universitätsplatz 2, D-39106 Magdeburg, Germany Email: borgelt@iws.cs.uni-magdeburg.de

More information

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42 Pattern Mining Knowledge Discovery and Data Mining 1 Roman Kern KTI, TU Graz 2016-01-14 Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 1 / 42 Outline 1 Introduction 2 Apriori Algorithm 3 FP-Growth

More information

Data Mining Part 3. Associations Rules

Data Mining Part 3. Associations Rules Data Mining Part 3. Associations Rules 3.2 Efficient Frequent Itemset Mining Methods Fall 2009 Instructor: Dr. Masoud Yaghini Outline Apriori Algorithm Generating Association Rules from Frequent Itemsets

More information

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey

Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey Knowledge Discovery from Web Usage Data: Research and Development of Web Access Pattern Tree Based Sequential Pattern Mining Techniques: A Survey G. Shivaprasad, N. V. Subbareddy and U. Dinesh Acharya

More information

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R, statistics foundations 5 Introduction to D3, visual analytics

More information

DATA MINING II - 1DL460

DATA MINING II - 1DL460 DATA MINING II - 1DL460 Spring 2013 " An second class in data mining http://www.it.uu.se/edu/course/homepage/infoutv2/vt13 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,

More information

CS570 Introduction to Data Mining

CS570 Introduction to Data Mining CS570 Introduction to Data Mining Frequent Pattern Mining and Association Analysis Cengiz Gunay Partial slide credits: Li Xiong, Jiawei Han and Micheline Kamber George Kollios 1 Mining Frequent Patterns,

More information

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke Apriori Algorithm For a given set of transactions, the main aim of Association Rule Mining is to find rules that will predict the occurrence of an item based on the occurrences of the other items in the

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 16: Association Rules Jan-Willem van de Meent (credit: Yijun Zhao, Yi Wang, Tan et al., Leskovec et al.) Apriori: Summary All items Count

More information

Association Pattern Mining. Lijun Zhang

Association Pattern Mining. Lijun Zhang Association Pattern Mining Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction The Frequent Pattern Mining Model Association Rule Generation Framework Frequent Itemset Mining Algorithms

More information

gspan: Graph-Based Substructure Pattern Mining

gspan: Graph-Based Substructure Pattern Mining University of Illinois at Urbana-Champaign February 3, 2017 Agenda What motivated the development of gspan? Technical Preliminaries Exploring the gspan algorithm Experimental Performance Evaluation Introduction

More information

Data mining, 4 cu Lecture 8:

Data mining, 4 cu Lecture 8: 582364 Data mining, 4 cu Lecture 8: Graph mining Spring 2010 Lecturer: Juho Rousu Teaching assistant: Taru Itäpelto Frequent Subgraph Mining Extend association rule mining to finding frequent subgraphs

More information

Chapter 4: Association analysis:

Chapter 4: Association analysis: Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily

More information

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner

Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Genome Reconstruction: A Puzzle with a Billion Pieces Phillip E. C. Compeau and Pavel A. Pevzner Outline I. Problem II. Two Historical Detours III.Example IV.The Mathematics of DNA Sequencing V.Complications

More information

Data Mining for Knowledge Management. Association Rules

Data Mining for Knowledge Management. Association Rules 1 Data Mining for Knowledge Management Association Rules Themis Palpanas University of Trento http://disi.unitn.eu/~themis 1 Thanks for slides to: Jiawei Han George Kollios Zhenyu Lu Osmar R. Zaïane Mohammad

More information

Supplementary Table 1. Data collection and refinement statistics

Supplementary Table 1. Data collection and refinement statistics Supplementary Table 1. Data collection and refinement statistics APY-EphA4 APY-βAla8.am-EphA4 Crystal Space group P2 1 P2 1 Cell dimensions a, b, c (Å) 36.27, 127.7, 84.57 37.22, 127.2, 84.6 α, β, γ (

More information

Chapter 6: Association Rules

Chapter 6: Association Rules Chapter 6: Association Rules Association rule mining Proposed by Agrawal et al in 1993. It is an important data mining model. Transaction data (no time-dependent) Assume all data are categorical. No good

More information

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar Frequent Pattern Mining Based on: Introduction to Data Mining by Tan, Steinbach, Kumar Item sets A New Type of Data Some notation: All possible items: Database: T is a bag of transactions Transaction transaction

More information

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the Chapter 6: What Is Frequent ent Pattern Analysis? Frequent pattern: a pattern (a set of items, subsequences, substructures, etc) that occurs frequently in a data set frequent itemsets and association rule

More information

An Algorithm for Mining Large Sequences in Databases

An Algorithm for Mining Large Sequences in Databases 149 An Algorithm for Mining Large Sequences in Databases Bharat Bhasker, Indian Institute of Management, Lucknow, India, bhasker@iiml.ac.in ABSTRACT Frequent sequence mining is a fundamental and essential

More information

A Graph-Based Approach for Mining Closed Large Itemsets

A Graph-Based Approach for Mining Closed Large Itemsets A Graph-Based Approach for Mining Closed Large Itemsets Lee-Wen Huang Dept. of Computer Science and Engineering National Sun Yat-Sen University huanglw@gmail.com Ye-In Chang Dept. of Computer Science and

More information

Induction of Association Rules: Apriori Implementation

Induction of Association Rules: Apriori Implementation 1 Induction of Association Rules: Apriori Implementation Christian Borgelt and Rudolf Kruse Department of Knowledge Processing and Language Engineering School of Computer Science Otto-von-Guericke-University

More information

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Chapter 4: Mining Frequent Patterns, Associations and Correlations Chapter 4: Mining Frequent Patterns, Associations and Correlations 4.1 Basic Concepts 4.2 Frequent Itemset Mining Methods 4.3 Which Patterns Are Interesting? Pattern Evaluation Methods 4.4 Summary Frequent

More information

Mining Association Rules in Large Databases

Mining Association Rules in Large Databases Mining Association Rules in Large Databases Association rules Given a set of transactions D, find rules that will predict the occurrence of an item (or a set of items) based on the occurrences of other

More information

Canonical Forms for Frequent Graph Mining

Canonical Forms for Frequent Graph Mining Canonical Forms for Frequent Graph Mining Christian Borgelt Dept. of Knowledge Processing and Language Engineering Otto-von-Guericke-University of Magdeburg borgelt@iws.cs.uni-magdeburg.de Summary. A core

More information

Part 2. Mining Patterns in Sequential Data

Part 2. Mining Patterns in Sequential Data Part 2 Mining Patterns in Sequential Data Sequential Pattern Mining: Definition Given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items,

More information

MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS

MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS by Ramin Afshar B.Sc., University of Alberta, Alberta, 2000 THESIS SUBMITTED IN PARTIAL FULFILMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE

More information

Chapter 7: Frequent Itemsets and Association Rules

Chapter 7: Frequent Itemsets and Association Rules Chapter 7: Frequent Itemsets and Association Rules Information Retrieval & Data Mining Universität des Saarlandes, Saarbrücken Winter Semester 2013/14 VII.1&2 1 Motivational Example Assume you run an on-line

More information

Frequent Pattern Mining

Frequent Pattern Mining Frequent Pattern Mining How Many Words Is a Picture Worth? E. Aiden and J-B Michel: Uncharted. Reverhead Books, 2013 Jian Pei: CMPT 741/459 Frequent Pattern Mining (1) 2 Burnt or Burned? E. Aiden and J-B

More information

Knowledge Discovery in Databases II Winter Term 2015/2016. Optional Lecture: Pattern Mining & High-D Data Mining

Knowledge Discovery in Databases II Winter Term 2015/2016. Optional Lecture: Pattern Mining & High-D Data Mining Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases II Winter Term 2015/2016 Optional Lecture: Pattern Mining

More information

Data Mining in Bioinformatics Day 3: Graph Mining

Data Mining in Bioinformatics Day 3: Graph Mining Graph Mining and Graph Kernels Data Mining in Bioinformatics Day 3: Graph Mining Karsten Borgwardt & Chloé-Agathe Azencott February 6 to February 17, 2012 Machine Learning and Computational Biology Research

More information

Combining Ring Extensions and Canonical Form Pruning

Combining Ring Extensions and Canonical Form Pruning Combining Ring Extensions and Canonical Form Pruning Christian Borgelt European Center for Soft Computing c/ Gonzalo Gutiérrez Quirós s/n, 00 Mieres, Spain christian.borgelt@softcomputing.es Abstract.

More information

Sequential Pattern Mining: Concepts and Primitives. Mining Sequence Patterns in Transactional Databases

Sequential Pattern Mining: Concepts and Primitives. Mining Sequence Patterns in Transactional Databases 32 Chapter 8 Mining Stream, Time-Series, and Sequence Data 8.3 Mining Sequence Patterns in Transactional Databases A sequence database consists of sequences of ordered elements or events, recorded with

More information

Data warehouse and Data Mining

Data warehouse and Data Mining Data warehouse and Data Mining Lecture No. 14 Data Mining and its techniques Naeem A. Mahoto Email: naeemmahoto@gmail.com Department of Software Engineering Mehran Univeristy of Engineering and Technology

More information

Association Rule Mining. Introduction 46. Study core 46

Association Rule Mining. Introduction 46. Study core 46 Learning Unit 7 Association Rule Mining Introduction 46 Study core 46 1 Association Rule Mining: Motivation and Main Concepts 46 2 Apriori Algorithm 47 3 FP-Growth Algorithm 47 4 Assignment Bundle: Frequent

More information

by the Genevestigator program (www.genevestigator.com). Darker blue color indicates higher gene expression.

by the Genevestigator program (www.genevestigator.com). Darker blue color indicates higher gene expression. Figure S1. Tissue-specific expression profile of the genes that were screened through the RHEPatmatch and root-specific microarray filters. The gene expression profile (heat map) was drawn by the Genevestigator

More information

Performance and Scalability: Apriori Implementa6on

Performance and Scalability: Apriori Implementa6on Performance and Scalability: Apriori Implementa6on Apriori R. Agrawal and R. Srikant. Fast algorithms for mining associa6on rules. VLDB, 487 499, 1994 Reducing Number of Comparisons Candidate coun6ng:

More information

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining Gyozo Gidofalvi Uppsala Database Laboratory Announcements Updated material for assignment 3 on the lab course home

More information

Association rule mining

Association rule mining Association rule mining Association rule induction: Originally designed for market basket analysis. Aims at finding patterns in the shopping behavior of customers of supermarkets, mail-order companies,

More information

BCB 713 Module Spring 2011

BCB 713 Module Spring 2011 Association Rule Mining COMP 790-90 Seminar BCB 713 Module Spring 2011 The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL Outline What is association rule mining? Methods for association rule mining Extensions

More information

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS

A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS A NOVEL ALGORITHM FOR MINING CLOSED SEQUENTIAL PATTERNS ABSTRACT V. Purushothama Raju 1 and G.P. Saradhi Varma 2 1 Research Scholar, Dept. of CSE, Acharya Nagarjuna University, Guntur, A.P., India 2 Department

More information

Data Warehousing and Data Mining

Data Warehousing and Data Mining Data Warehousing and Data Mining Lecture 3 Efficient Cube Computation CITS3401 CITS5504 Wei Liu School of Computer Science and Software Engineering Faculty of Engineering, Computing and Mathematics Acknowledgement:

More information

Graph and Digraph Glossary

Graph and Digraph Glossary 1 of 15 31.1.2004 14:45 Graph and Digraph Glossary A B C D E F G H I-J K L M N O P-Q R S T U V W-Z Acyclic Graph A graph is acyclic if it contains no cycles. Adjacency Matrix A 0-1 square matrix whose

More information

Association Rules. A. Bellaachia Page: 1

Association Rules. A. Bellaachia Page: 1 Association Rules 1. Objectives... 2 2. Definitions... 2 3. Type of Association Rules... 7 4. Frequent Itemset generation... 9 5. Apriori Algorithm: Mining Single-Dimension Boolean AR 13 5.1. Join Step:...

More information

Graph Mining: Repository vs. Canonical Form

Graph Mining: Repository vs. Canonical Form Graph Mining: Repository vs. Canonical Form Christian Borgelt and Mathias Fiedler European Center for Soft Computing c/ Gonzalo Gutiérrez Quirós s/n, 336 Mieres, Spain christian.borgelt@softcomputing.es,

More information

Association Rule Mining

Association Rule Mining Association Rule Mining Generating assoc. rules from frequent itemsets Assume that we have discovered the frequent itemsets and their support How do we generate association rules? Frequent itemsets: {1}

More information

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

ANU MLSS 2010: Data Mining. Part 2: Association rule mining ANU MLSS 2010: Data Mining Part 2: Association rule mining Lecture outline What is association mining? Market basket analysis and association rule examples Basic concepts and formalism Basic rule measurements

More information

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB) Association rules Marco Saerens (UCL), with Christine Decaestecker (ULB) 1 Slides references Many slides and figures have been adapted from the slides associated to the following books: Alpaydin (2004),

More information

Sequential Pattern Mining A Study

Sequential Pattern Mining A Study Sequential Pattern Mining A Study S.Vijayarani Assistant professor Department of computer science Bharathiar University S.Deepa M.Phil Research Scholar Department of Computer Science Bharathiar University

More information

CSE 5243 INTRO. TO DATA MINING

CSE 5243 INTRO. TO DATA MINING CSE 543 INTRO. TO DATA MINING Advanced Frequent Pattern Mining & Locality Sensitive Hashing Huan Sun, CSE@The Ohio State University Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan Parthasarathy

More information

Fundamental Data Mining Algorithms

Fundamental Data Mining Algorithms 2018 EE448, Big Data Mining, Lecture 3 Fundamental Data Mining Algorithms Weinan Zhang Shanghai Jiao Tong University http://wnzhang.net http://wnzhang.net/teaching/ee448/index.html REVIEW What is Data

More information

Monotone Constraints in Frequent Tree Mining

Monotone Constraints in Frequent Tree Mining Monotone Constraints in Frequent Tree Mining Jeroen De Knijf Ad Feelders Abstract Recent studies show that using constraints that can be pushed into the mining process, substantially improves the performance

More information

Data Structure. IBPS SO (IT- Officer) Exam 2017

Data Structure. IBPS SO (IT- Officer) Exam 2017 Data Structure IBPS SO (IT- Officer) Exam 2017 Data Structure: In computer science, a data structure is a way of storing and organizing data in a computer s memory so that it can be used efficiently. Data

More information

Mining Minimal Contrast Subgraph Patterns

Mining Minimal Contrast Subgraph Patterns Mining Minimal Contrast Subgraph Patterns Roger Ming Hieng Ting James Bailey Abstract In this paper, we introduce a new type of contrast pattern, the minimal contrast subgraph. It is able to capture structural

More information

Lecture 2 Wednesday, August 22, 2007

Lecture 2 Wednesday, August 22, 2007 CS 6604: Data Mining Fall 2007 Lecture 2 Wednesday, August 22, 2007 Lecture: Naren Ramakrishnan Scribe: Clifford Owens 1 Searching for Sets The canonical data mining problem is to search for frequent subsets

More information

Roadmap. PCY Algorithm

Roadmap. PCY Algorithm 1 Roadmap Frequent Patterns A-Priori Algorithm Improvements to A-Priori Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results Data Mining for Knowledge Management 50 PCY

More information

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application Data Structures Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali 2009-2010 Association Rules: Basic Concepts and Application 1. Association rules: Given a set of transactions, find

More information

Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach Data Mining and Knowledge Discovery, 8, 53 87, 2004 c 2004 Kluwer Academic Publishers. Manufactured in The Netherlands. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach

More information

Finding Frequent Patterns Using Length-Decreasing Support Constraints

Finding Frequent Patterns Using Length-Decreasing Support Constraints Finding Frequent Patterns Using Length-Decreasing Support Constraints Masakazu Seno and George Karypis Department of Computer Science and Engineering University of Minnesota, Minneapolis, MN 55455 Technical

More information

2. Discovery of Association Rules

2. Discovery of Association Rules 2. Discovery of Association Rules Part I Motivation: market basket data Basic notions: association rule, frequency and confidence Problem of association rule mining (Sub)problem of frequent set mining

More information

Subgroup Mining (with a brief intro to Data Mining)

Subgroup Mining (with a brief intro to Data Mining) Subgroup Mining (with a brief intro to Data Mining) Paulo J Azevedo INESC-Tec, HASLab Departamento de Informática 1 Expectations 2 Knowledge Discovery (non trivial relation between data elements) in DataBases

More information

Effectiveness of Freq Pat Mining

Effectiveness of Freq Pat Mining Effectiveness of Freq Pat Mining Too many patterns! A pattern a 1 a 2 a n contains 2 n -1 subpatterns Understanding many patterns is difficult or even impossible for human users Non-focused mining A manager

More information

Graph Algorithms in Bioinformatics

Graph Algorithms in Bioinformatics Graph Algorithms in Bioinformatics Computational Biology IST Ana Teresa Freitas 2015/2016 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics

More information

Association Analysis: Basic Concepts and Algorithms

Association Analysis: Basic Concepts and Algorithms 5 Association Analysis: Basic Concepts and Algorithms Many business enterprises accumulate large quantities of data from their dayto-day operations. For example, huge amounts of customer purchase data

More information

Association Rule Mining. Entscheidungsunterstützungssysteme

Association Rule Mining. Entscheidungsunterstützungssysteme Association Rule Mining Entscheidungsunterstützungssysteme Frequent Pattern Analysis Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set

More information

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition S.Vigneswaran 1, M.Yashothai 2 1 Research Scholar (SRF), Anna University, Chennai.

More information

Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds

Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds Pyramidal and Chiral Groupings of Gold Nanocrystals Assembled Using DNA Scaffolds February 27, 2009 Alexander Mastroianni, Shelley Claridge, A. Paul Alivisatos Department of Chemistry, University of California,

More information

The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-based FIM algorithms (Extended version)

The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-based FIM algorithms (Extended version) The Relation of Closed Itemset Mining, Complete Pruning Strategies and Item Ordering in Apriori-based FIM algorithms (Extended version) Ferenc Bodon 1 and Lars Schmidt-Thieme 2 1 Department of Computer

More information

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS

USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS INFORMATION SYSTEMS IN MANAGEMENT Information Systems in Management (2017) Vol. 6 (3) 213 222 USING FREQUENT PATTERN MINING ALGORITHMS IN TEXT ANALYSIS PIOTR OŻDŻYŃSKI, DANUTA ZAKRZEWSKA Institute of Information

More information

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1 Data Mining: Concepts and Techniques Chapter 5 SS Chung April 5, 2013 Data Mining: Concepts and Techniques 1 Chapter 5: Mining Frequent Patterns, Association and Correlations Basic concepts and a road

More information

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE

AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE AN IMPROVISED FREQUENT PATTERN TREE BASED ASSOCIATION RULE MINING TECHNIQUE WITH MINING FREQUENT ITEM SETS ALGORITHM AND A MODIFIED HEADER TABLE Vandit Agarwal 1, Mandhani Kushal 2 and Preetham Kumar 3

More information

CSE 417 Dynamic Programming (pt 5) Multiple Inputs

CSE 417 Dynamic Programming (pt 5) Multiple Inputs CSE 417 Dynamic Programming (pt 5) Multiple Inputs Reminders > HW5 due Wednesday Dynamic Programming Review > Apply the steps... optimal substructure: (small) set of solutions, constructed from solutions

More information

MULTI-DIMENSIONAL SEQUENTIAL PATTERN MINING

MULTI-DIMENSIONAL SEQUENTIAL PATTERN MINING MULTI-DIMENSIONAL SEQUENTIAL PATTERN MINING by Helen Pinto B.Sc. B.Mgmt., University of Lethbridge, Alberta, 1998 THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF

More information

Data Mining in Bioinformatics Day 5: Graph Mining

Data Mining in Bioinformatics Day 5: Graph Mining Data Mining in Bioinformatics Day 5: Graph Mining Karsten Borgwardt February 25 to March 10 Bioinformatics Group MPIs Tübingen from Borgwardt and Yan, KDD 2008 tutorial Graph Mining and Graph Kernels,

More information

Chapter 2. Related Work

Chapter 2. Related Work Chapter 2 Related Work There are three areas of research highly related to our exploration in this dissertation, namely sequential pattern mining, multiple alignment, and approximate frequent pattern mining.

More information

TCGR: A Novel DNA/RNA Visualization Technique

TCGR: A Novel DNA/RNA Visualization Technique TCGR: A Novel DNA/RNA Visualization Technique Donya Quick and Margaret H. Dunham Department of Computer Science and Engineering Southern Methodist University Dallas, Texas 75275 dquick@mail.smu.edu, mhd@engr.smu.edu

More information

Sequential PAttern Mining using A Bitmap Representation

Sequential PAttern Mining using A Bitmap Representation Sequential PAttern Mining using A Bitmap Representation Jay Ayres, Jason Flannick, Johannes Gehrke, and Tomi Yiu Dept. of Computer Science Cornell University ABSTRACT We introduce a new algorithm for mining

More information

Association Rule Mining: FP-Growth

Association Rule Mining: FP-Growth Yufei Tao Department of Computer Science and Engineering Chinese University of Hong Kong We have already learned the Apriori algorithm for association rule mining. In this lecture, we will discuss a faster

More information

Fast incremental mining of web sequential patterns with PLWAP tree

Fast incremental mining of web sequential patterns with PLWAP tree Data Min Knowl Disc DOI 10.1007/s10618-009-0133-6 Fast incremental mining of web sequential patterns with PLWAP tree C. I. Ezeife Yi Liu Received: 1 September 2007 / Accepted: 13 May 2009 Springer Science+Business

More information

Sequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics

Sequencing. Computational Biology IST Ana Teresa Freitas 2011/2012. (BACs) Whole-genome shotgun sequencing Celera Genomics Computational Biology IST Ana Teresa Freitas 2011/2012 Sequencing Clone-by-clone shotgun sequencing Human Genome Project Whole-genome shotgun sequencing Celera Genomics (BACs) 1 Must take the fragments

More information

CSE 417 Branch & Bound (pt 4) Branch & Bound

CSE 417 Branch & Bound (pt 4) Branch & Bound CSE 417 Branch & Bound (pt 4) Branch & Bound Reminders > HW8 due today > HW9 will be posted tomorrow start early program will be slow, so debugging will be slow... Review of previous lectures > Complexity

More information

CMTreeMiner: Mining Both Closed and Maximal Frequent Subtrees

CMTreeMiner: Mining Both Closed and Maximal Frequent Subtrees MTreeMiner: Mining oth losed and Maximal Frequent Subtrees Yun hi, Yirong Yang, Yi Xia, and Richard R. Muntz University of alifornia, Los ngeles, 90095, US {ychi,yyr,xiayi,muntz}@cs.ucla.edu bstract. Tree

More information

Discrete mathematics

Discrete mathematics Discrete mathematics Petr Kovář petr.kovar@vsb.cz VŠB Technical University of Ostrava DiM 470-2301/02, Winter term 2018/2019 About this file This file is meant to be a guideline for the lecturer. Many

More information

Association Rule Discovery

Association Rule Discovery Association Rule Discovery Association Rules describe frequent co-occurences in sets an itemset is a subset A of all possible items I Example Problems: Which products are frequently bought together by

More information

Trees. Arash Rafiey. 20 October, 2015

Trees. Arash Rafiey. 20 October, 2015 20 October, 2015 Definition Let G = (V, E) be a loop-free undirected graph. G is called a tree if G is connected and contains no cycle. Definition Let G = (V, E) be a loop-free undirected graph. G is called

More information

International Journal of Scientific Research and Reviews

International Journal of Scientific Research and Reviews Research article Available online www.ijsrr.org ISSN: 2279 0543 International Journal of Scientific Research and Reviews A Survey of Sequential Rule Mining Algorithms Sachdev Neetu and Tapaswi Namrata

More information

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS APPLYIG BIT-VECTOR PROJECTIO APPROACH FOR EFFICIET MIIG OF -MOST ITERESTIG FREQUET ITEMSETS Zahoor Jan, Shariq Bashir, A. Rauf Baig FAST-ational University of Computer and Emerging Sciences, Islamabad

More information

Interestingness Measurements

Interestingness Measurements Interestingness Measurements Objective measures Two popular measurements: support and confidence Subjective measures [Silberschatz & Tuzhilin, KDD95] A rule (pattern) is interesting if it is unexpected

More information

Association Rule Discovery

Association Rule Discovery Association Rule Discovery Association Rules describe frequent co-occurences in sets an item set is a subset A of all possible items I Example Problems: Which products are frequently bought together by

More information

CSE 373 MAY 10 TH SPANNING TREES AND UNION FIND

CSE 373 MAY 10 TH SPANNING TREES AND UNION FIND CSE 373 MAY 0 TH SPANNING TREES AND UNION FIND COURSE LOGISTICS HW4 due tonight, if you want feedback by the weekend COURSE LOGISTICS HW4 due tonight, if you want feedback by the weekend HW5 out tomorrow

More information