Association Rules. Comp 135 Machine Learning, Computer Science, Tufts University.

Similar documents
2. Discovery of Association Rules

Apriori Algorithm. 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke

Association Rule Discovery

Association Pattern Mining. Lijun Zhang

Machine Learning: Symbolische Ansätze

Association Rule Discovery

Frequent Pattern Mining. Based on: Introduction to Data Mining by Tan, Steinbach, Kumar

Data Mining and Knowledge Discovery: Practice Notes

Market baskets Frequent itemsets FP growth. Data mining. Frequent itemset Association&decision rule mining. University of Szeged.

Chapter 4: Association analysis:

Data Mining and Knowledge Discovery: Practice Notes

CS570 Introduction to Data Mining

Survey on Frequent Pattern Mining

Association rule mining

Data Mining and Knowledge Discovery Practice notes Numeric prediction and descriptive DM

Lecture Topic Projects 1 Intro, schedule, and logistics 2 Data Science components and tasks 3 Data types Project #1 out 4 Introduction to R,

Supervised and Unsupervised Learning (II)

ANU MLSS 2010: Data Mining. Part 2: Association rule mining

Chapter 4: Mining Frequent Patterns, Associations and Correlations

BCB 713 Module Spring 2011

Effectiveness of Freq Pat Mining

Association Rule Learning

Data Mining and Knowledge Discovery: Practice Notes

Association Rules. A. Bellaachia Page: 1

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules

Chapter 7: Frequent Itemsets and Association Rules

DATA MINING II - 1DL460

What Is Data Mining? CMPT 354: Database I -- Data Mining 2

Data Mining Techniques

Association Rules Outline

Chapter 7: Frequent Itemsets and Association Rules

Chapter 6: Basic Concepts: Association Rules. Basic Concepts: Frequent Patterns. (absolute) support, or, support. (relative) support, s, is the

Interestingness Measurements

Frequent Pattern Mining

Association Rules. Berlin Chen References:

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules

OPTIMISING ASSOCIATION RULE ALGORITHMS USING ITEMSET ORDERING

Mining Association Rules in Large Databases

Data Mining Algorithms

Knowledge Discovery in Databases

Association Rule Mining. Entscheidungsunterstützungssysteme

Data Warehousing & Mining. Data integration. OLTP versus OLAP. CPS 116 Introduction to Database Systems

Frequent Pattern Mining

Mining Association Rules in Large Databases

Chapter 6: Association Rules

Data Mining: Concepts and Techniques. Chapter 5. SS Chung. April 5, 2013 Data Mining: Concepts and Techniques 1

Lecture notes for April 6, 2005

Association Rule Mining

Data Mining Concepts

Data Warehousing and Data Mining. Announcements (December 1) Data integration. CPS 116 Introduction to Database Systems

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining

Association Rules. Charles Sutton Data Mining and Exploration Spring Based on slides by Chris Williams and Amos Storkey. Thursday, 8 March 12

Chapter 13, Sequence Data Mining

Dynamic Itemset Counting and Implication Rules For Market Basket Data

Data Mining Part 3. Associations Rules

Performance and Scalability: Apriori Implementation

Lecture 2 Wednesday, August 22, 2007

CHAPTER 3 ASSOCIATION RULE MINING WITH LEVELWISE AUTOMATIC SUPPORT THRESHOLDS

Association mining rules

Data Structures. Notes for Lecture 14 Techniques of Data Mining By Samaher Hussein Ali Association Rules: Basic Concepts and Application

Pincer-Search: An Efficient Algorithm. for Discovering the Maximum Frequent Set

Closed Non-Derivable Itemsets

Memory issues in frequent itemset mining

Roadmap. PCY Algorithm

Hash-Based Improvements to A-Priori. Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms

Mining Frequent Patterns without Candidate Generation

Production rule is an important element in the expert system. By interview with

We will be releasing HW1 today It is due in 2 weeks (1/25 at 23:59pm) The homework is long

Chapter 28. Outline. Definitions of Data Mining. Data Mining Concepts

Association rules. Marco Saerens (UCL), with Christine Decaestecker (ULB)

Improvements to A-Priori. Bloom Filters Park-Chen-Yu Algorithm Multistage Algorithm Approximate Algorithms Compacting Results

Contents. Preface to the Second Edition

Association Rule Mining: FP-Growth

PFPM: Discovering Periodic Frequent Patterns with Novel Periodicity Measures

Chapter 4 Data Mining A Short Introduction

Basic Concepts: Association Rules. What Is Frequent Pattern Analysis? COMP 465: Data Mining Mining Frequent Patterns, Associations and Correlations

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Data warehouse and Data Mining

CompSci 516 Data Intensive Computing Systems

CS 6093 Lecture 7 Spring 2011 Basic Data Mining. Cong Yu 03/21/2011

CSE 634/590 Data mining Extra Credit: Classification by Association rules: Example Problem. Muhammad Asiful Islam, SBID:

PESIT Bangalore South Campus

CHAPTER 8. ITEMSET MINING 226

Improved Frequent Pattern Mining Algorithm with Indexing

Association Analysis: Basic Concepts and Algorithms

Interestingness Measurements

Introduction to Data Mining

Pattern Mining. Knowledge Discovery and Data Mining 1. Roman Kern KTI, TU Graz. Roman Kern (KTI, TU Graz) Pattern Mining / 42

Building Intelligent Learning Database Systems

H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases. Paper s goals. H-mine characteristics. Why a new algorithm?

Tutorial on Association Rule Mining

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM

Infrequent Weighted Itemset Mining Using SVM Classifier in Transaction Dataset

Quick Inclusion-Exclusion

CMPUT 391 Database Management Systems. Data Mining. Textbook: Chapter (without 17.10)

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 6

Association Rules Apriori Algorithm

gspan: Graph-Based Substructure Pattern Mining

Knowledge Discovery in Databases II Winter Term 2015/2016. Optional Lecture: Pattern Mining & High-D Data Mining

Materialized Data Mining Views *

Transcription:

Comp 135 Machine Learning, Computer Science, Tufts University. Fall 2017, Roni Khardon.

Association rule discovery is unsupervised learning, complementary to the data exploration we did with clustering. The goal is to find weak implications in the data that have non-negligible coverage. This is useful in marketing, in understanding application data, and as a feature generator for supervised learning.

[Text from the paper by [AIS93] that introduced the topic.]

Data Model. Following the market-basket application, we assume a table where each row is a transaction, each column is an item, and table entries are in {0,1}, i.e., discrete. A transaction can be seen as representing the corresponding set of items.

What are useful rules? Rules with at least some minimum coverage (support) and at least some minimum predictive power (confidence), where

support(X) = #transactions including X
frequency(X) = support(X) / #transactions
confidence(X → Y) = support(X ∪ Y) / support(X)
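To make these definitions concrete, here is a minimal Python sketch over a hypothetical toy set of transactions; the item names and helper names are illustrative, not from the lecture:

transactions = [
    {"bread", "milk"},
    {"bread", "diaper", "beer", "eggs"},
    {"milk", "diaper", "beer", "coke"},
    {"bread", "milk", "diaper", "beer"},
    {"bread", "milk", "diaper", "coke"},
]

def support(X):
    # number of transactions that include every item of X
    return sum(1 for t in transactions if X <= t)

def frequency(X):
    return support(X) / len(transactions)

def confidence(X, Y):
    # confidence(X -> Y) = support(X u Y) / support(X)
    return support(X | Y) / support(X)

print(support({"bread", "milk"}))        # 3
print(frequency({"bread", "milk"}))      # 0.6
print(confidence({"diaper"}, {"beer"}))  # 0.75

Itemsets are Python sets, so the subset test X <= t implements "transaction includes X" directly.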

Applications and data characteristics from some early papers:

Data            #transactions   #items   Transaction size   Avg. size
Market Basket   50K             13K      1-100              10
Web Clicks      50K             500      1-267              2.5
Census          30K             2000     70                 70

In more demanding applications the data does not fit in memory.

More data characteristics. [Figure from [ZKM01]: frequency (%) of each transaction size, 0-30, for the IBM-Artificial, BMS-WebView-1, BMS-WebView-2, and BMS-POS datasets.]

Too many sets and rules. [Figure from [ZKM01]: counts of frequent itemsets, association rules, and closed frequent itemsets (from 10 up to 10,000,000, log scale) as minimum support (%) varies from 0.00 to 1.00.]

Huge data poses a technology challenge: making use of the memory hierarchy. It also poses an algorithmic challenge: processing the data efficiently. Huge output poses a conceptual challenge: identifying the most interesting rules.

A concrete task (1): find all rules with frequency at least f and confidence at least c. How can we do this? If (X → Y) satisfies the conditions, then X ∪ Y must also have frequency at least f. This motivates a concrete task (2): find all sets Z with frequency at least f.

From frequent sets to rules: given a frequent set Z, for example {A,B,C,D,E}, remove a potential conclusion W, for example {D}, and check the confidence of (Z\W → W), here (ABCE → D); a sketch of this step follows.
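A minimal sketch of this frequent-set-to-rules step, assuming the support() helper from the earlier sketch (the function name and min_conf parameter are illustrative):

from itertools import combinations

def rules_from_frequent_set(Z, min_conf):
    # enumerate every nonempty proper conclusion W of Z and keep (Z\W -> W)
    # whose confidence, support(Z) / support(Z\W), is at least min_conf
    rules = []
    for k in range(1, len(Z)):
        for W in map(set, combinations(sorted(Z), k)):
            body = Z - W
            conf = support(Z) / support(body)
            if conf >= min_conf:
                rules.append((body, W, conf))
    return rules

For example, rules_from_frequent_set({"milk", "diaper"}, 0.7) checks both (milk → diaper) and (diaper → milk) on the toy data, and both pass with confidence 0.75.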

Frequent Set Mining: find all sets Z with frequency at least f. How can we do this? The main insight is anti-monotonicity: if a set Z is frequent, then all its subsets are also frequent.

Lattice Structure of Frequent Sets. [Figure from [Goethals 2003]: the lattice of itemsets over {beer, chips, pizza, wine}, from the single items up to {beer, chips, pizza, wine}, with its positive and negative borders marked.] Notice the notions of positive border and negative border that are implicit in the monotonicity property and in the lattice view. The borders capture all the frequent sets, and some algorithms attempt to find them directly.

Level-wise (Apriori) Algorithm:
level = 1
cands[1] = sets with single items
while cands[level] not empty:
  1. calculate support for cands[level]
  2. freq[level] = cands[level] with high support
  3. cands[level+1] = generate from freq[level]
  4. level = level + 1

Calculating support for candidates: the basic implementation of step 1 is, for each row R and each candidate X, if X is a subset of R then count[X] += 1. This takes one pass over the database; run time can be improved via a trie data structure that captures the set of candidates.

Generating candidates: in step 3, monotonicity can be used to generate and prune potential candidates. Use a trie or lexicographic ordering to identify potential candidates, prune via the subset relation, and prune via upper/lower bounds on frequency (we skip the details of this idea).
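A minimal level-wise sketch following this pseudocode, over the toy transactions from the earlier sketch; it uses plain subset tests and a simple join rather than the trie and frequency-bound pruning mentioned above:

from itertools import combinations

def apriori(transactions, min_support):
    # returns {frequent itemset: support count}
    items = {i for t in transactions for i in t}
    cands = {frozenset([i]) for i in items}          # level-1 candidates
    frequent = {}
    while cands:
        # steps 1-2: count support in one pass over the database, then filter
        counts = {X: sum(1 for t in transactions if X <= t) for X in cands}
        freq = {X: c for X, c in counts.items() if c >= min_support}
        frequent.update(freq)
        # step 3: join frequent k-sets that differ in one item, and prune any
        # candidate with an infrequent k-subset (anti-monotonicity)
        cands = set()
        for A in freq:
            for B in freq:
                C = A | B
                if len(C) == len(A) + 1 and all(
                        frozenset(S) in freq
                        for S in combinations(C, len(C) - 1)):
                    cands.add(C)
    return frequent

On the toy data, apriori(transactions, 3) finds, e.g., {bread, milk}, {milk, diaper}, and {diaper, beer}, and stops at level 3 because no triple reaches support 3.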

Vertical View. View each item as the set of transactions containing it. This directly captures support: the support of a set is the intersection of the supports of its parents. It incurs a large space cost for storing supports; a DFS recursive exploration avoids this and leads to an efficient algorithm (we skip the details).

Confidence can often be misleading: if p(B) is large and p(B|A) = p(B), i.e., A and B are independent, then confidence(A → B) = support(A ∪ B) / support(A) is still large. Two measures address this.

Lift measures distance from independence:
lift(X → Y) = freq(X ∪ Y) / (freq(X) · freq(Y))

Conviction aims at implication:
conviction(X → Y) = freq(X) (1 − freq(Y)) / (freq(X) − freq(X ∪ Y))
and can be interpreted as the inverse of lift(X → not Y). [Table from [BMUT97].]
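A small sketch of these two measures in terms of the frequency() helper from the earlier sketch; the infinity guard is our implementation choice for rules that hold with confidence 1:

def lift(X, Y):
    # 1.0 means independence; above 1.0 means positive association
    return frequency(X | Y) / (frequency(X) * frequency(Y))

def conviction(X, Y):
    denom = frequency(X) - frequency(X | Y)   # freq(X) * (1 - confidence)
    if denom == 0:
        return float("inf")
    return frequency(X) * (1 - frequency(Y)) / denom

On the toy data, lift({"diaper"}, {"beer"}) is 1.25 and conviction({"diaper"}, {"beer"}) is 1.6.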

Positive and Negative Borders. AMSS Alg: find one maximal set.

Dualize and Advance Algorithm. Suppose ABC and BD are the maximal frequent sets found so far, so D and AC are their complements. Computing the transversals of (D and AC), by applying the distributive law, yields (AD and CD), which is the negative border of this collection; any undiscovered frequent set must lie above the negative border. (A small sketch of the transversal computation appears after the summary.)

Frequent Sets as Features. One way to generate enriched features for supervised learning is to generate frequent sets (because they occur, and thus have a chance of making a difference). This has been very successful in frequent sub-graph mining, which extends the topic of this lecture to graphs, and in its application to classifying molecules.

Summary. Association rules are a form of data exploration with different goals from the supervised and unsupervised learning seen previously, and they pose algorithmic/computational challenges. Frequent set mining is an important subtask; anti-monotonicity is the key property, the level-wise algorithm is the basic method, and many alternative algorithms exist.
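As promised above, a brute-force sketch of the minimal-transversal step in Dualize and Advance (exponential, for illustration only; the function name is ours):

from itertools import combinations

def minimal_transversals(sets):
    # minimal sets that intersect every set in the given family
    universe = sorted(set().union(*sets))
    hitting = [frozenset(c)
               for k in range(1, len(universe) + 1)
               for c in combinations(universe, k)
               if all(set(c) & S for S in sets)]
    return [T for T in hitting if not any(U < T for U in hitting)]

# complements of the maximal frequent sets ABC and BD are D and AC:
print(minimal_transversals([{"D"}, {"A", "C"}]))
# -> [frozenset({'A', 'D'}), frozenset({'C', 'D'})], i.e., the negative border AD, CD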