ANU MLSS 2010: Data Mining. Part 2: Association rule mining


Lecture outline

- What is association mining?
- Market basket analysis and association rule examples
- Basic concepts and formalism
- Basic rule measurements
- The Apriori algorithm
- Performance bottlenecks in Apriori
- Multi-level and multi-dimensional association mining
- Quantitative association mining
- Constraint-based mining
- Visualising association rules

What is association mining?

- Association mining is the task of finding frequent rules / associations / patterns / correlations / causal structures within (large) sets of items in transactional (relational) databases
- It is an unsupervised learning technique (descriptive data mining, not predictive data mining)
- The main applications are:
  - Market basket analysis (customers who buy X also buy Y)
  - Web log analysis (click streams)
  - Cross marketing
  - Sales campaign analysis
  - DNA sequence analysis

Market basket analysis

(Figure: market basket analysis illustration. Source: Han and Kamber, DM Book, 2nd Ed., Copyright 2006 Elsevier Inc.)

Association rule examples

- Rule form: body → head [support, confidence]
- Market basket: buys(X, `beer') → buys(X, `snacks') [1%, 60%]
  - If a customer X purchased `beer', then in 60% of cases she or he also purchased `snacks'
  - 1% of all transactions contain both items `beer' and `snacks'
- Student grades: major(X, `MComp') and takes(X, `COMP8400') → grade(X, `D') [3%, 60%]
  - If a student X, whose degree is `MComp', took the course `COMP8400', then in 60% of cases she or he achieved a grade `D'
  - The combination `MComp', `COMP8400' and `D' appears in 3% of all transactions (records) in the database

Basic concepts

- Given: a (large) database of transactions; each transaction contains a list of one or more items (e.g. purchased by a customer in a visit)
- Find the rules that correlate the presence of one set of items with that of another set of items
- Normally one is only interested in rules that are frequent
  - For example, 70% of customers who buy tires and car accessories also get their car service done
- Question: How can this be improved to 80%?
  - Possibly offer special deals, like a 15% reduction of tire costs when the service is done

Formalism

- Set of items X = {x1, x2, ..., xk}
- Database D containing transactions
  - Each transaction T is a set of items, such that T is a subset of X
  - Each transaction is associated with a unique identifier, called TID (for example, a unique number)
- Let A be a set of items (a subset of X)
- An association rule is an implication of the form A → B, where A is a subset of X, B is a subset of X, and the intersection of A and B is empty
  - No item in A can be in B, and vice versa
  - No rule of the form: {`beer', `chips'} → {`chips', `peanuts'}

Basic rule measurements

- A rule A → B holds in a database D with support s, where s is the percentage of transactions in D that contain both A and B
  - support(A → B) = P(A ∪ B)
- The rule A → B has a confidence c in a database D if c is the percentage of transactions in D containing A that also contain B
  - confidence(A → B) = P(B | A) = P(A ∪ B) / P(A)
  - confidence(A → B) = support(A → B) / support(A)
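The two measures above translate directly into code. A minimal Python sketch (the helper names `support` and `confidence` are mine, not from any library; transactions are represented as sets of items):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)

def confidence(A, B, transactions):
    """confidence(A -> B) = support(A and B together) / support(A)."""
    return support(set(A) | set(B), transactions) / support(A, transactions)
```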

Rule measurements example

(Venn diagram: customers who buy beer, customers who buy diapers, and customers who buy both)

TID    Items bought
2000   a, b, c
1000   a, c
4000   a, d
5000   b, e, f

- Find all the rules {X, Y} → Z with minimum confidence and support
- Support, s, is the probability that a transaction contains {X, Y, Z}
- Confidence, c, is the conditional probability that a transaction having {X, Y} also contains Z
- Let minimum support = 50% and minimum confidence = 50%; then we have ([s, c]):
  - a → c [50%, 66.67%]
  - c → a [50%, 100%]

Source: Han and Kamber, DM Book, 1st Ed.

Rule measurements example (2)

TID    Items bought
2000   a, b, c
1000   a, c
4000   a, d
5000   b, e, f

Itemset    Support
a          75.00%
b          50.00%
c          50.00%
a, c       50.00%

- Minimum support = 50% and minimum confidence = 50%
- Rule a → c:
  - support(a → c) = 50%
  - confidence(a → c) = support(a → c) / support(a) = 50% / 75% = 66.67%

Mining frequent item sets

- Key step: find the frequent sets of items that have minimum support (appear in at least xx% of all transactions in a database)
- Basic principle (Apriori principle): a subset of a frequent item set must also be a frequent item set
  - For example, if {a, b} is frequent, both {a} and {b} have to be frequent (if `beer' and `chips' are purchased frequently together, then `beer' is purchased frequently and `chips' are also purchased frequently)
- Basic approach:
  - Iteratively find frequent item sets with cardinality from 1 to k (k-item sets), k > 1
  - Use the frequent item sets to generate association rules
  - For example, the frequent 3-item set {a, b, c} contains the rules: a → c, b → c, a → b, {a, b} → c, {a, c} → b, {b, c} → a, etc.
  - We are normally only interested in longer rules (with all except one element on the left-hand side)

The Apriori algorithm (Agrawal & Srikant, VLDB'94)

- Ck: candidate item sets of size k
- Lk: frequent item sets of size k

Pseudo-code:

  L1 = {frequent 1-item sets};
  for (k = 1; Lk != {}; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return union over k of Lk;
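The pseudo-code above can be sketched as a plain, unoptimised Python function (an illustrative sketch, not the authors' original implementation; transactions are sets of items, min_support is a fraction):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return a dict mapping each frequent itemset (frozenset) to its count."""
    n = len(transactions)
    min_count = min_support * n
    # L1: frequent 1-item sets
    items = {i for t in transactions for i in t}
    counts = {frozenset([i]): sum(1 for t in transactions if i in t)
              for i in items}
    Lk = {s: c for s, c in counts.items() if c >= min_count}
    frequent = dict(Lk)
    k = 1
    while Lk:
        # Generate (k+1)-candidates by joining Lk with itself, then prune
        # any candidate that has an infrequent k-subset (Apriori principle)
        candidates = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in Lk for s in combinations(c, k))}
        # One database scan counts all surviving candidates
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        Lk = {s: cnt for s, cnt in counts.items() if cnt >= min_count}
        frequent.update(Lk)
        k += 1
    return frequent
```

On the four-transaction example database from the next slide, this yields the same L1, L2, and L3 as the trace shown there.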

The Apriori algorithm: an example (minimum support = 50%)

Database D:
TID    Items
100    a, c, d
200    b, c, e
300    a, b, c, e
400    b, e

Scan D → C1:
itemset    sup.
{a}        2
{b}        3
{c}        3
{d}        1
{e}        3

L1:
itemset    sup.
{a}        2
{b}        3
{c}        3
{e}        3

C2 (from L1): {a, b}, {a, c}, {a, e}, {b, c}, {b, e}, {c, e}

Scan D → C2 counts:
itemset    sup
{a, b}     1
{a, c}     2
{a, e}     1
{b, c}     2
{b, e}     3
{c, e}     2

L2:
itemset    sup
{a, c}     2
{b, c}     2
{b, e}     3
{c, e}     2

C3: {b, c, e}

Scan D → C3 counts:
itemset      sup
{b, c, e}    2

L3:
itemset      sup
{b, c, e}    2

The Apriori algorithm: an example (2)

Database D:
TID    Items
100    a, c, d
200    b, c, e
300    a, b, c, e
400    b, e

L3:
itemset      sup
{b, c, e}    2

Minimum support = 50% and minimum confidence = 50%

Rules:
- b → c [50%, 66.67%]
- b → e [75%, 100%]
- c → e [50%, 66.67%]
- {b, c} → e [50%, 100%]
- {b, e} → c [50%, 66.67%]
- {c, e} → b [50%, 100%]
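The single-consequent rules above can be generated mechanically from a frequent itemset. A hypothetical helper illustrating this (the function name is mine; it reproduces the last three rules in the list when applied to {b, c, e}):

```python
def rules_from_itemset(itemset, transactions, min_conf):
    """Generate rules A -> b (single-item consequent) from one frequent
    itemset, keeping those whose confidence meets min_conf.
    Returns (A, b, support, confidence) tuples."""
    n = len(transactions)

    def count(s):
        # Number of transactions containing every item of s
        return sum(1 for t in transactions if s <= t)

    full = frozenset(itemset)
    sup_full = count(full)
    out = []
    for b in full:
        A = full - {b}
        conf = sup_full / count(A)
        if conf >= min_conf:
            out.append((A, b, sup_full / n, conf))
    return out
```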

Important details of the Apriori algorithm

- How to generate candidate sets?
  - Step 1: self-joining (Ck is generated by joining Lk-1 with itself)
  - Step 2: pruning (any (k-1)-item set that is not frequent cannot be a subset of a frequent k-item set)
- Example of candidate generation:
  - L3 = {{a,b,c}, {a,b,d}, {a,c,d}, {a,c,e}, {b,c,d}}
  - Self-joining: L3 * L3 gives {a,b,c,d} (from {a,b,c} and {a,b,d}) and {a,c,d,e} (from {a,c,d} and {a,c,e})
  - Pruning: {a,c,d,e} is removed because {a,d,e} is not in L3
  - C4 = {{a,b,c,d}}
- How to count supports for candidates?

How to generate candidate item sets?

Suppose the items in Lk-1 are listed in an order (e.g. a < b)

Step 1: Self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: Pruning

  forall item sets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck
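The self-join and pruning steps can be sketched in Python, representing each (k-1)-item set as a sorted tuple so the prefix-join condition above applies directly (the function name is illustrative):

```python
from itertools import combinations

def generate_candidates(L_prev, k):
    """L_prev: frequent (k-1)-item sets as sorted tuples. Returns C_k."""
    # Step 1: self-join -- merge two (k-1)-sets that agree on their first
    # k-2 items (p.item1..p.itemk-2 = q.item1..q.itemk-2, p.itemk-1 < q.itemk-1)
    joined = set()
    for p in L_prev:
        for q in L_prev:
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    # Step 2: prune -- drop any candidate with an infrequent (k-1)-subset
    L_set = set(L_prev)
    return {c for c in joined
            if all(s in L_set for s in combinations(c, k - 1))}
```

On the slide's example (L3 with five 3-item sets), the join produces {a,b,c,d} and {a,c,d,e}, and pruning removes {a,c,d,e}.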

Apriori performance bottlenecks

- The core of the Apriori algorithm is to:
  - Use frequent (k-1)-item sets to generate candidate frequent k-item sets
  - Use database scans and pattern matching to collect counts for candidate item sets
- Candidate generation is the main bottleneck
  - 10^4 frequent 1-item sets (sets of length 1) will generate around 10^7 candidate 2-item sets!
  - To discover a frequent pattern of size 100 (for example {a1, a2, ..., a100}) one needs to generate 2^100 ≈ 10^30 candidates
- Multiple scans of the database are needed (n+1 scans if the longest pattern is n items long)

Methods to improve Apriori's efficiency

- Reduce the number of scans of the database
  - Any item set that is potentially frequent in the database must be frequent in at least one of the partitions of the database
  - Scan 1: partition the database and find local frequent patterns
  - Scan 2: consolidate global frequent patterns
- Shrink the number of candidates
  - Select a sample of the database and mine frequent patterns within the sample using Apriori
  - Scan the database once to verify the frequent item sets found in the sample
  - Scan the database again to find missed frequent patterns
- Facilitate the support counting of candidates
  - For example, use special data structures like the Frequent Pattern tree (FP-tree)
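The first idea (two-scan partitioning) can be sketched as follows; `local_frequent` is a hypothetical helper standing in for any frequent-itemset miner, such as Apriori run on one partition:

```python
def partitioned_frequent(transactions, min_support, n_parts, local_frequent):
    """Two-scan sketch: every globally frequent itemset must be locally
    frequent in at least one partition, so the union of local results is a
    complete global candidate set."""
    n = len(transactions)
    size = (n + n_parts - 1) // n_parts
    # Scan 1: collect local frequent itemsets from each partition
    candidates = set()
    for i in range(0, n, size):
        candidates |= set(local_frequent(transactions[i:i + size], min_support))
    # Scan 2: count every candidate once over the full database
    return {c for c in candidates
            if sum(1 for t in transactions if c <= t) >= min_support * n}
```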

Multi-level association mining

- Items often form hierarchies
- Items at lower levels are expected to have lower support
- Flexible support settings (uniform, reduced, or group-based (user-specific))

(Figure: item hierarchy. Source: Han and Kamber, DM Book, 2nd Ed., Copyright 2006 Elsevier Inc.)

Multi-level association mining (2)

- Some rules may be redundant due to ancestor relationships between items
- For example:
  - buys(X, `milk') → buys(X, `bread') [8%, 70%]
  - buys(X, `skim milk') → buys(X, `bread') [2%, 72%]
- The first rule is said to be an ancestor of the second rule
- A rule is redundant if its support is close to the expected value, based on the rule's ancestor
  - For example, if around 25% of all milk purchased is `skim milk', then the second rule above is redundant, as it has a quarter of the support of the first, more general rule (and similar confidence)
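The redundancy check above amounts to comparing a rule's support against the value expected from its ancestor. A tiny numeric sketch with the slide's figures (the 10% tolerance and function name are my assumptions, not from the slides):

```python
def is_redundant(child_support, ancestor_support, share, tolerance=0.1):
    """A specialised rule is redundant if its support is close to the
    ancestor rule's support times the specialised item's share."""
    expected = ancestor_support * share
    return abs(child_support - expected) <= tolerance * expected

# milk -> bread has 8% support; skim milk is ~25% of milk purchases, so
# the expected support of `skim milk' -> bread is 8% * 0.25 = 2%, which
# matches the observed 2% -- the specialised rule is redundant
assert is_redundant(child_support=0.02, ancestor_support=0.08, share=0.25)
```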

Multi-dimensional association mining

- Single-dimensional rules: buys(X, `milk') → buys(X, `bread')
- Multi-dimensional rules: two or more dimensions or predicates (or attributes)
  - Inter-dimension association rules (no repeated predicates): age(X, `19-25') and occupation(X, `student') → buys(X, `coke')
  - Hybrid-dimension association rules (repeated predicates): age(X, `19-25') and buys(X, `popcorn') → buys(X, `coke')
- Categorical attributes: finite number of possible values, no ordering among values (data cube approach)
- Quantitative attributes: numeric, implicit ordering among values (discretisation, clustering, etc.)

Quantitative association mining

- Techniques can be categorised by how numerical attributes, such as age or income, are treated
  - Static discretisation based on predefined concept hierarchies
  - Dynamic discretisation based on the data distribution
- Rules of the form: Aquan1 and Aquan2 → Acat
  - Example: age(X, `19-25') and income(X, `40K-60K') → buys(X, `HDTV')
- For quantitative rules, perform discretisation such that (for example) the confidence of the mined rules is maximised
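Static discretisation can be sketched as mapping each numeric attribute to an interval "item" before ordinary association mining is applied; the bins, labels, and helper below are illustrative assumptions:

```python
def discretise(value, bins, labels):
    """Map a numeric value to the label of the first interval [lo, hi]
    containing it, or None if no bin matches."""
    for (lo, hi), label in zip(bins, labels):
        if lo <= value <= hi:
            return label
    return None

# A record with quantitative attributes becomes a set of interval items,
# after which itemset mining proceeds as in the market basket case
record = {"age": 22, "income": 50_000}
items = {
    "age=" + discretise(record["age"], [(19, 25), (26, 40)], ["19-25", "26-40"]),
    "income=" + discretise(record["income"], [(40_000, 60_000)], ["40K-60K"]),
}
# items == {"age=19-25", "income=40K-60K"}
```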

Mining interesting correlation patterns

- Flexible support
  - Some items might be very rare but valuable (like diamonds)
  - Customise the minimum support specification and application
- Top-k frequent patterns
  - It can be hard to specify a minimum support, but the top-k rules with a minimum length are more desirable
  - Achievable using special data structures, like the Frequent Pattern (FP) tree
  - Dynamically raise the minimum support during the FP-tree construction phase, and select the most promising patterns to mine

Constraint-based data mining

- Finding all the frequent rules or patterns in a database autonomously is unrealistic
  - The rules / patterns could be too many and not focussed
- Data mining should be an interactive process
  - The user directs what should be mined, using a data mining query language or a graphical user interface
- Constraint-based mining
  - User flexibility: the user provides constraints on what is to be mined (and what is not)
  - System optimisation: the system exploits such constraints for efficient mining

Constraints in data mining

- Knowledge type constraints: correlation, association, etc.
- Data constraints (using SQL-like queries)
  - For example: find product pairs sold frequently in both stores in Sydney and Melbourne
- Dimension / level constraints: relevance to region, price, brand, customer category, etc.
- Rule or pattern constraints: small sales (price < $10) trigger big sales (sum > $200)
- Interestingness constraints: strong rules only (minimum support > 3%, minimum confidence > 75%)

Visualisation of association rules (1)

Visualisation of association rules (2)

Visualisation of association rules (3)