Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Similar documents
Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Data Mining Part 3. Associations Rules

Association Rule Mining. Entscheidungsunterstützungssysteme

BCB 713 Module Spring 2011

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Mining Frequent Patterns without Candidate Generation

Frequent Pattern Mining

Association Rules. Berlin Chen References:

Tutorial on Association Rule Mining

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Improved Frequent Pattern Mining Algorithm with Indexing

Chapter 4: Association analysis:

Chapter 4: Mining Frequent Patterns, Associations and Correlations

Association Rules. A. Bellaachia Page: 1

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Induction of Association Rules: Apriori Implementation

An Improved Apriori Algorithm for Association Rules

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

CS145: INTRODUCTION TO DATA MINING

Association Rule Mining. Introduction 46. Study core 46

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Nesnelerin İnternetinde Veri Analizi

Product presentations can be more intelligently planned

Association Rule Discovery

Data Mining Algorithms

Association Rule Discovery

An Evolutionary Algorithm for Mining Association Rules Using Boolean Approach

CS570 Introduction to Data Mining

Thanks to the advances of data processing technologies, a lot of data can be collected and stored in databases efficiently New challenges: with a

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

INFREQUENT WEIGHTED ITEM SET MINING USING NODE SET BASED ALGORITHM

2 CONTENTS

Data Mining for Knowledge Management. Association Rules

Appropriate Item Partition for Improving the Mining Performance

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

Machine Learning: Symbolische Ansätze

Frequent Pattern Mining

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems

Association mining rules

Keshavamurthy B.N., Mitesh Sharma and Durga Toshniwal

A Data Mining Framework for Extracting Product Sales Patterns in Retail Store Transactions Using Association Rules: A Case Study

Advance Association Analysis

Data warehouse and Data Mining

Association rule mining

2. Discovery of Association Rules

A Study on Mining of Frequent Subsequences and Sequential Pattern Search- Searching Sequence Pattern by Subset Partition

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Data Mining Clustering

BINARY DECISION TREE FOR ASSOCIATION RULES MINING IN INCREMENTAL DATABASES

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Association Rule Mining

Generation of Potential High Utility Itemsets from Transactional Databases

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Chapter 4 Data Mining A Short Introduction

Mining Association Rules in Large Databases

Chapter 7: Frequent Itemsets and Association Rules

FP-Growth algorithm in Data Compression frequent patterns

ALGORITHM FOR MINING TIME VARYING FREQUENT ITEMSETS

Roadmap DB Sys. Design & Impl. Association rules - outline. Citations. Association rules - idea. Association rules - idea.

CSE 5243 INTRO. TO DATA MINING

Frequent Pattern Mining S L I D E S B Y : S H R E E J A S W A L

Predicting Missing Items in Shopping Carts

APPLYING BIT-VECTOR PROJECTION APPROACH FOR EFFICIENT MINING OF N-MOST INTERESTING FREQUENT ITEMSETS

Association Pattern Mining. Lijun Zhang

Chapter 6: Association Rules

Association Rules Apriori Algorithm

Chapter 7: Frequent Itemsets and Association Rules

High dim. data. Graph data. Infinite data. Machine learning. Apps. Locality sensitive hashing. Filtering data streams.

Association Rule Learning

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies

Survey: Efficent tree based structure for mining frequent pattern from transactional databases

ETP-Mine: An Efficient Method for Mining Transitional Patterns

Fast Discovery of Sequential Patterns Using Materialized Data Mining Views

Tutorial on Assignment 3 in Data Mining 2009 Frequent Itemset and Association Rule Mining. Gyozo Gidofalvi Uppsala Database Laboratory

Mining of Web Server Logs using Extended Apriori Algorithm

Association Rule Mining

Association Rules Outline

Interestingness Measurements

Fundamental Data Mining Algorithms

CLOSET+:Searching for the Best Strategies for Mining Frequent Closed Itemsets

A Graph-Based Approach for Mining Closed Large Itemsets

ISSN Vol.03,Issue.09 May-2014, Pages:

MINING FREQUENT MAX AND CLOSED SEQUENTIAL PATTERNS

DESIGN AND CONSTRUCTION OF A FREQUENT-PATTERN TREE

Pamba Pravallika 1, K. Narendra 2

Supervised and Unsupervised Learning (II)

Association Rule Mining: FP-Growth

Interestingness Measurements

To Enhance Projection Scalability of Item Transactions by Parallel and Partition Projection using Dynamic Data Set

Fuzzy Cognitive Maps application for Webmining

A Survey on Moving Towards Frequent Pattern Growth for Infrequent Weighted Itemset Mining

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

PTclose: A novel algorithm for generation of closed frequent itemsets from dense and sparse datasets

Data Mining in Bioinformatics Day 5: Frequent Subgraph Mining

Incremental Mining of Frequent Patterns Without Candidate Generation or Support Constraint

Discovering interesting rules from financial data

I. INTRODUCTION. Keywords : Spatial Data Mining, Association Mining, FP-Growth Algorithm, Frequent Data Sets

Materialized Data Mining Views *

Frequent Item Set using Apriori and Map Reduce algorithm: An Application in Inventory Management

Transcription:

Pattern Mining Knowledge Discovery and Data Mining 1 Roman Kern KTI, TU Graz 2016-01-14 Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 1 / 42

Outline 1 Introduction 2 Apriori Algorithm 3 FP-Growth Algorithm 4 Sequence Pattern Mining 5 Alternatives & Remarks Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 2 / 42

Introduction Introduction to Pattern Mining What & Why Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 3 / 42

Introduction U. Fayyad, G. Piatetsky-Shapiro, and P. Smyth, From data mining to knowledge discovery in databases, AI Mag., vol. 17, no. 3, p. 37, 1996. Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 4 / 42 Pattern Mining Intro Probability Theory Linear Algebra Map-Reduce Information Theory Statistical Inference Mathematical Tools Infrastructure Preprocessing Transformation Knowledge Discovery Process

Introduction Pattern Mining Intro Definition of KDD KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data But, until now we did not directly investigate into mining of these patterns. Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 5 / 42

Introduction Pattern Mining Example Famous example for pattern mining Many supermarkets/grocery (chains) do have loyalty cards gives the store the ability to analyse the purchase behaviour (and increase sales, for marketing, etc.) Benefits Which items have been bought together Hidden relationships: probability of a purchase an item given another item... or temporal patterns Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 6 / 42

Introduction Pattern Mining Example Transactions in a grocery All purchases are organised in transactions Each transaction contains a list of purchased items For example: Id Items 0 Bread, Milk 1 Bread, Diapers, Beer, Eggs 2 Milk, Diapers, Beer, Coke 3 Bread, Milk, Diapers, Beer 4 Bread, Milk, Diapers, Coke Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 7 / 42

Introduction Pattern Mining Example Ways of analysing the transaction data Frequent itemset (which items are often bought together) association analysis Frequent associations (relations between items (if/then)) association rule mining Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 8 / 42

Introduction Pattern Mining Example Result A grocery store chain in the Midwest of the US found that on Thursdays men tend to buy beer if they bought diapers as well. Association Rule {diapers beer} Note #1: The grocery store could make use of this knowledge to place diapers and beer close to each other and to make sure none of these product has a discount on Thursdays. Note #2: Not clear whether this is actually true. Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 9 / 42

Introduction Pattern Mining Basics Formal description: items and transactions I - the set of items (e.g. the grocery goods - inventory) D = {t 1, t 2,..., t n } - the transactions, where each transaction consists of a set of items (t x I) Note #1: One could see each item as feature and thus each transaction as feature vector with zeros for items not contained in the transaction. Note #2: This organisation of transaction is sometimes referred to as horizontal data format Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 10 / 42

Introduction Pattern Mining Basics Formal description: itemset and association rule X = {I 1, I 2,..., I n } - the itemset, consisting of a set of items Frequent itemset is an itemset that found often in the transactions X Y - an association rule, where X and Y are itemssets (premise conclusion) Frequent association rule is an association rule that is true for many transactions Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 11 / 42

Introduction Pattern Mining Basics Formal description: support (itemset) How frequent is a frequent itemset? What is the probability of finding the itemset in the transactions, which is called the support Support(X ) = P(X ) = P(I 1, I 2,..., I n ) Support(X ) = {t D X t} D The relative frequency of the itemset (relative number of transactions containing all items from the itemset) Note: Some algorithms take the total number of transactions instead of the relative frequency Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 12 / 42

Introduction Pattern Mining Basics Formal description: support (association rule) How frequent is a frequent association rule? What is the probability of finding the association rule to be true in the transactions, again called the support Support(X Y ) = P(X Y ) = P(I X 1, I X 2,..., I X n, I Y 1, I Y 2,..., I Y m ) Support(X ) = {t D X Y t} D... equals to the joint itemset Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 13 / 42

Introduction Pattern Mining Basics Formal description: confidence How relevant is a (frequent) association rule? What is the probability of the conclusion given the premise, which is called the confidence Conf(X Y ) = P(I Y 1, I Y 2,..., I Y m I X 1, I X 2,..., I X n ) Conf(X Y ) = Conf(X Y ) = Support(X Y ) Support(X ) {t D X Y t} {t D X t} The confidence reflects how often one finds Y in transactions, which contain X Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 14 / 42

Introduction Pattern Mining Basics Association Rule Mining Goal Find all frequent, relevant association rules, i.e. all rules with a sufficient support and sufficient confidence What is it useful for? Market basket analysis, query completion, click stream analysis, genome analysis,... Preprocessing for other data mining tasks Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 15 / 42

Introduction Pattern Mining Basics Simple Solution (for association analysis) 1 Pick out any combination of items 2 Test for sufficient support 3 Goto 1 (until all combinations have been tried) Not the optimal solution two improvements are presented next, the Aprioi Algorithm and the FP-Growth Algorithm Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 16 / 42

Apriori Algorithm The Apriori Algorithm... and the apriori principle Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 17 / 42

Apriori Algorithm Apriori Algorithm Apriori algorithm Input Simple and efficient algorithm to find frequent itemsets and (optionally) find association rules The transactions The (minimal) support threshold The (minimal) confidence threshold (just needed for association rules) R. Agrawal, H. Road, and S. Jose, Mining Association Rules between Sets of Items in Large Databases, ACM SIGMOD Record. Vol. 22. No. 2. ACM, 1993. Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 18 / 42

Apriori Algorithm Apriori Algorithm Apriori principle If an itemset is frequent, then all of the subsets are frequent as well If an itemset is infrequent, then all of the supersets are infrequent as well Note: The actual goal is to reduce the number of support calculations Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 19 / 42

Apriori Algorithm Apriori Algorithm Apriori basic approach for frequent itemsets 1 Create frequent itemsets of size k = 1 (where k is the size of the itemset) 2 Extends the itemset for size k + 1, following the apriori principle (generate superset candidates only for frequent itemsets) 3 Only test those itemsets for sufficient support that are theoretically possible 4 Goto 2 (until there are no more candidates) Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 20 / 42

Apriori Algorithm Apriori Algorithm Apriori basic approach for frequent association rules 1 Take the frequent itemsets (of size k 2) 2 Create frequent association rules of size k = 1 (where k is the size of the conclusion set), by picking each I i X and test for sufficient confidence (i.e. split the itemset into premise and conclusion) 3 Extends the conclusion for size k + 1, following the apriori principle (and create new association rule candidates) 4 Only test those association rules for sufficient confidence that are theoretically possible 5 Goto 2 (until there are no more candidates) Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 21 / 42

FP-Growth Algorithm The FP-Growth Algorithm... more efficient than Apriori Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 22 / 42

FP-Growth Algorithm FP-Growth Algorithm Overview (Usually) faster alternative to Apriori for finding frequent itemsets Needs two scans through the transaction database Main idea: transform the data set into an efficient data structure, the FP Tree 1 Build the FP-Tree 2 Mine the FP-Tree for frequent itemsets... thereby creating (conditional) FP-Tree on-the-fly Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 23 / 42

FP-Growth Algorithm FP-Growth Algorithm The FP-Tree Contains all information from the transactions Allows efficient retrieval of frequent itemsets Consists of a tree Each node represents an item Each node has a count, # of transactions from the root to the node Consists of a linked list For each frequent item there is a head with count Links to all nodes with the same item Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 24 / 42

FP-Growth Algorithm FP-Growth Algorithm Building the FP-Tree 1 Scan through all transactions and count each item, I x count x 2 Scan through all transactions and 1 Sort the items (descending) by count 2 Throw away infrequent items, count x < minsupport 3 Iterate over the items and match against tree (starting with the root node) if the current item is found as child to the current node, increase its count if the current item is not found, create a new child (branch) Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 25 / 42

FP-Growth Algorithm FP-Growth Algorithm Given a set of transactions (1), sort the items by frequency in the data set and prune infrequent items (4), create the FP-Tree with the tree itself (root on the left) a linked list for each of the items (head on the top) Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 26 / 42

FP-Growth Algorithm FP-Growth Algorithm Mine the FP-Tree for frequent itemsets 1 Start with single item itemsets 2 Recursively mine the FP-Tree Starting by the items and traverse to the root Test each parent items for minimal support Build a new (conditional) FP-Tree (new recursion) Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 27 / 42

FP-Growth Algorithm FP-Growth Algorithm Comparison of Apriori, FP-Growth and others on four data sets with varying support counts (x-axis), measured in runtime (y-axis). Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 28 / 42

Sequence Pattern Mining Sequence Pattern Mining Mining patterns in temporal data Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 29 / 42

Sequence Pattern Mining Sequence Pattern Mining Initial definition Given a set of sequences, where each sequence consists of a list of elements and each element consists of a set of items, and given a user-specified min support threshold, sequential pattern mining is to find all frequent subsequences, i.e., the subsequences whose occurrence frequency in the set of sequences is no less than min support. - R. Agrawal and R. Srikant, Mining Sequential Patterns, Proc. 1995 Int l Conf. Data Eng. (ICDE 95), pp. 3-14, Mar. 1995. Note #1: This definition can then be expanded to include all sorts of constraints. Note #2: Each element in a sequence is an itemset. Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 30 / 42

Sequence Pattern Mining Sequence Pattern Mining Applications & Use-Cases Shopping cart analysis e.g. common shopping sequences DNA sequence analysis (Web)Log-file analysis Common access patterns Derive features for subsequent analysis Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 31 / 42

Sequence Pattern Mining Sequence Pattern Mining Transaction database vs. sequence database ID Itemset 1 a, b, c 2 b, c 3 a, e, f, g 4 c, f, g ID Sequence 1 <a(abc)(ac)d(cf)> 2 <(ad)c(bc)(ae)> 3 <(ef)(ab)(df)cb> 4 <eg(af)cbc> Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 32 / 42

Sequence Pattern Mining Sequence Pattern Mining Subsequence & Super-sequence A sequence is an ordered list of elements, denoted < x 1 x 2... x k > Given two sequences α =< a 1 a 2... a n > and β =< b 1 b 2... b m > α is called a subsequence of β, denoted as α β if there exist integers 1 j 1 < j 2 < < j n m such that a 1 b j1, a 2 b j2,..., a n b jn β is a super-sequence of α e.g. α =< (ab), d > and β =< (abc), (de) > Note #1: A database transaction (single sequence) is defined to contain a sequence, if that sequence is a subsequence. Note #2: Frequently found subsequences (within a transaction database) are called sequential patterns. Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 33 / 42

Sequence Pattern Mining Sequence Pattern Mining Initial approaches to sequence pattern mining relied on the a-priory principle If a sequence S is not frequent, then none of the super-sequences of S is frequent Generalized Sequential Pattern Mining (GSP) is one example of such algorithms Disadvantages: Requires multiple scans within the database Potentially many candidates Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 34 / 42

Sequence Pattern Mining FreeSpan - Algorithm FreeSpan - Frequent Pattern-Projected Sequential Pattern Mining Input: minimal support, sequence database Output: all sequences with a frequency equal or higher than the minimal support Main idea: Recursively reduce (project) the database instead of scanning the whole database multiple times Projection Reduced version of a database consisting just a set of sequences Thus each sequence in the projected database contains the sequence Keep the sequences sorted All sequences are kept in an fixed order Starting with the most frequent to the least frequent Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 35 / 42

Sequence Pattern Mining FreeSpan - Algorithm FreeSpan - Algorithm 1 Compute all items that pass the minimal support test, sort them by frequency {x 1,..., x n } 2 For each x i of the items build a set of items {x 1,..., x i } Project the database given the set of items Mine the projected database containing x i for frequent sequences of length 2 Recursively apply the algorithm on the projected database to find frequent sequences of length + 1 Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 36 / 42

Alternatives & Remarks Alternatives & Remarks... and remarks Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 37 / 42

Alternatives & Remarks Remarks Problems of frequent association rules Rules with high confidence, but also high support e.g. use significance test Confidence is relative to the frequency of the premise Hard to compare multiple association rules Redundant association rules Many frequent itemsets (often even more than transactions) Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 38 / 42

Alternatives & Remarks Remarks How to deal with numeric attributes? Binning (discretization) on predefined ranges Distribution-based binning Extend algorithm to allow numeric attributes, e.g. Approximate Association Rule Mining Frequent closed pattern mining A pattern is called closed, if there is no superset that satisfies the minimal support Produces a smaller number of patterns Similar maximum pattern mining Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 39 / 42

Alternatives & Remarks Further Algorithms Frequent tree mining Frequent graph mining Constraint based mining Approximate pattern mining e.g. picking a single representative for a group of patterns... by invoking a clustering algorithm Mining cyclic or periodic patterns J. Han, H. Cheng, D. Xin, and X. Yan, Frequent pattern mining: current status and future directions, Data Min. Knowl. Discov., vol. 15, no. 1, pp. 55 86, Jan. 2007. Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 40 / 42

Alternatives & Remarks Connection to ML Connection to other tasks in machine learning Frequent pattern based classification Mine the relation between features/items and the label... useful for feature selection/engineering Frequent pattern based clustering Clustering in high dimensions is a challenge, e.g. text... use frequent patterns as a dimensionality reduction Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 41 / 42

Alternatives & Remarks The End Christian Borgelt Slides: http://www.borgelt.net/teach/fpm/slides.html (516 slides) http://www.is.informatik.uni-duisburg.de/courses/im_ss09/ folien/miningsequentialpatterns.ppt Roman Kern (KTI, TU Graz) Pattern Mining 2016-01-14 42 / 42