Interestingness Measurements


Interestingness Measurements

Objective measures: two popular measurements are support and confidence.

Subjective measures [Silberschatz & Tuzhilin, KDD95]: a rule (pattern) is interesting if it is
- unexpected (surprising to the user), and/or
- actionable (the user can do something with it).

Criticism of Support and Confidence

Example 1 [Aggarwal & Yu, PODS98]: Among 5000 students,
- 3000 play basketball (60%)
- 3750 eat cereal (75%)
- 2000 both play basketball and eat cereal (40%)

The rule "play basketball => eat cereal" [support 40%, confidence 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.

The rule "play basketball => not eat cereal" [20%, 33.3%] is far more accurate, although it has lower support and confidence.

Observation: "play basketball" and "eat cereal" are negatively correlated.

Not all strong association rules are interesting, and some can be misleading. Therefore, augment the support and confidence values with interestingness measures such as the correlation (lift).
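As a quick illustration (an addition, not part of the original slide), the following minimal Python sketch recomputes the numbers of Example 1 and shows why the confidence of 66.7% is misleading once it is compared against the overall probability of eating cereal:

```python
# A minimal sketch recomputing the basketball/cereal example from the slide
# to show why confidence alone can mislead.

n_students = 5000
n_basketball = 3000          # play basketball
n_cereal = 3750              # eat cereal
n_both = 2000                # play basketball AND eat cereal

support = n_both / n_students          # 0.40
confidence = n_both / n_basketball     # ~0.667
p_cereal = n_cereal / n_students       # 0.75

print(f"support(basketball => cereal)    = {support:.1%}")
print(f"confidence(basketball => cereal) = {confidence:.1%}")
print(f"P(cereal)                        = {p_cereal:.1%}")
# confidence (66.7%) < P(cereal) (75%): knowing that a student plays
# basketball actually *lowers* the chance of cereal, i.e. the two items
# are negatively correlated.
```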

Other Interestingness Measures: Correlation

Lift is a simple correlation measure between two items A and B:

    lift(A, B) = P(A ∪ B) / (P(A) · P(B)) = conf(A => B) / P(B)

The two rules A => B and B => A have the same correlation coefficient, since lift takes both P(A) and P(B) into consideration.

- lift(A, B) > 1: the two items A and B are positively correlated
- lift(A, B) = 1: there is no correlation between the two items A and B (they are independent)
- lift(A, B) < 1: the two items A and B are negatively correlated

Other Interestingness Measures: Correlation

Example 2:

    X: 1 1 1 1 0 0 0 0
    Y: 1 1 0 0 0 0 0 0
    Z: 0 1 1 1 1 1 1 1

X and Y are positively correlated, X and Z are negatively correlated: support and confidence of X => Z dominate, but the items X and Z are negatively correlated, whereas the items X and Y are positively correlated.

    rule      support   confidence   correlation (lift)
    X => Y    25%       50%          2
    X => Z    37.5%     75%          0.86
    Y => Z    12.5%     50%          0.57
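The following sketch (an illustrative addition in plain Python) recomputes the table above directly from the 0/1 vectors; the rule labels X => Y, X => Z and Y => Z are the pairs implied by the data:

```python
# Compute support, confidence and lift (correlation) for the X/Y/Z example
# from the raw 0/1 vectors.

X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]
n = len(X)

def measures(a, b):
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(1 for i in range(n) if a[i] and b[i]) / n
    support = p_ab
    confidence = p_ab / p_a
    lift = p_ab / (p_a * p_b)
    return support, confidence, lift

for name, (a, b) in {"X=>Y": (X, Y), "X=>Z": (X, Z), "Y=>Z": (Y, Z)}.items():
    s, c, l = measures(a, b)
    print(f"{name}: support={s:.1%} confidence={c:.0%} lift={l:.2f}")
# X=>Y: support=25.0% confidence=50% lift=2.00   (positively correlated)
# X=>Z: support=37.5% confidence=75% lift=0.86   (negatively correlated)
# Y=>Z: support=12.5% confidence=50% lift=0.57   (negatively correlated)
```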

Chapter 3: Frequent Itemset Mining

1) Introduction: transaction databases, market basket data analysis
2) Mining Frequent Itemsets: Apriori algorithm, hash trees, FP-tree
3) Simple Association Rules: basic notions, rule generation, interestingness measures
4) Further Topics
   - Hierarchical Association Rules: motivation, notions, algorithms, interestingness
   - Quantitative Association Rules: motivation, basic idea, partitioning numerical attributes, adaptation of the Apriori algorithm, interestingness
5) Extensions and Summary

Hierarchical Association Rules: Motivation

Problem of association rules over plain itemsets:
- High minsup: Apriori finds only few rules.
- Low minsup: Apriori finds unmanageably many rules.

Exploit item taxonomies (generalizations, is-a hierarchies), which exist in many applications, e.g.:

    clothes: outerwear (jackets, jeans), shirts
    shoes: sports shoes, boots

New task: find all generalized association rules between generalized items, i.e. body and head of a rule may contain items of any level of the hierarchy.

Generalized association rule: X => Y with X, Y ⊂ I, X ∩ Y = ∅, and no item in Y is an ancestor of any item in X (e.g., "jackets => clothes" is essentially trivial).

Hierarchical Association Rules: Motivating Example

Examples:
- jeans => boots: support < minsup
- jackets => boots: support < minsup
- outerwear => boots: support > minsup

Characteristics:
- support(outerwear => boots) is not necessarily equal to support(jackets => boots) + support(jeans => boots), e.g. if a transaction containing jackets, jeans and boots exists.
- Support for sets of generalizations (e.g., product groups) is higher than support for sets of individual items.
- If the support of the rule outerwear => boots exceeds minsup, then the support of the rule clothes => boots does, too.
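To make the connection to mining concrete: one standard trick (essentially the "extend transactions with ancestors" idea used by the algorithms in [SA 95]) is to add the ancestors of every item to each transaction and then run an ordinary frequent-itemset algorithm. The sketch below is illustrative; the taxonomy dictionary simply encodes the clothes/shoes hierarchy from the previous slide.

```python
# Minimal sketch: extend each transaction with all ancestors of its items,
# then run any frequent-itemset algorithm on the extended transactions.

parent = {                      # item -> direct ancestor (is-a hierarchy)
    "jackets": "outerwear", "jeans": "outerwear",
    "outerwear": "clothes", "shirts": "clothes",
    "sports shoes": "shoes", "boots": "shoes",
}

def ancestors(item):
    result = set()
    while item in parent:
        item = parent[item]
        result.add(item)
    return result

def extend(transaction):
    extended = set(transaction)
    for item in transaction:
        extended |= ancestors(item)
    return extended

print(sorted(extend({"jackets", "boots"})))
# ['boots', 'clothes', 'jackets', 'outerwear', 'shoes']
# support(outerwear => boots) can now exceed minsup even if
# support(jackets => boots) and support(jeans => boots) do not.
```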

Mining Multi-Level Associations

A top-down, progressive deepening approach:
- First find high-level strong rules, e.g.: milk => bread [20%, 60%].
- Then find their lower-level, weaker rules, e.g.: 1.5% milk => wheat bread [6%, 50%].

(Concept hierarchy: food → milk, bread; milk → 3.5% milk, 1.5% milk; bread → white bread, wheat bread; brands such as Fraser, Sunset, Wonder at the lowest level.)

Different min_support thresholds across the levels lead to different algorithms:
- adopting the same min_support across all levels
- adopting reduced min_support at lower levels

Minimum Support for Multiple Levels

Uniform support: the same minsup at every level, e.g. minsup = 5% at both levels. Example: milk (support = 10%) and 3.5% milk (support = 6%) are frequent, but 1.5% milk (support = 4%) is not.
+ the search procedure is simplified (monotonicity)
+ the user is required to specify only one support threshold

Reduced support (variable support): a lower minsup at lower levels, e.g. minsup = 5% at the upper level and minsup = 3% at the lower level. Example: milk (support = 10%), 3.5% milk (support = 6%) and 1.5% milk (support = 4%) are all frequent.
+ takes the lower frequency of items at lower levels into consideration

Multilevel Association Mining using Reduced Support

A top-down, progressive deepening approach with level-wise processing (breadth-first):
- First find high-level strong rules, e.g.: milk => bread [20%, 60%].
- Then find their lower-level, weaker rules, e.g.: 1.5% milk => wheat bread [6%, 50%].

Three approaches using reduced support (a sketch of the second one follows below):
- Level-by-level independent method: examine each node in the hierarchy, regardless of whether or not its parent node is found to be frequent.
- Level-cross filtering by single item: examine a node only if its parent node at the preceding level is frequent.
- Level-cross filtering by k-itemset: examine a k-itemset at a given level only if its parent k-itemset at the preceding level is frequent.
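A minimal sketch of the second approach (level-cross filtering by single item) under reduced support. The milk support values are the ones used on the previous slide; the bread values, the hierarchy dictionary and the helper names are illustrative assumptions.

```python
# Level-wise processing with reduced support and "level-cross filtering by
# single item": a node is only examined if its parent at the preceding
# level was frequent.

hierarchy = {"milk": ["1.5% milk", "3.5% milk"],
             "bread": ["white bread", "wheat bread"]}
minsup_per_level = {1: 0.05, 2: 0.03}      # reduced minsup at level 2

toy_support = {"milk": 0.10, "3.5% milk": 0.06, "1.5% milk": 0.04,
               "bread": 0.02, "white bread": 0.01, "wheat bread": 0.015}

def frequent_nodes(items, level, support):
    return {i for i in items if support(i) >= minsup_per_level[level]}

def mine_levels(support):
    level1 = frequent_nodes(hierarchy.keys(), 1, support)
    # level-cross filtering: only the children of frequent parents are examined
    candidates = [child for p in level1 for child in hierarchy[p]]
    level2 = frequent_nodes(candidates, 2, support)
    return level1, level2

print(mine_levels(lambda item: toy_support.get(item, 0.0)))
# ({'milk'}, {'1.5% milk', '3.5% milk'})   (set order may vary)
# bread (2%) is infrequent at level 1, so its children are never examined.
```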

Multilevel Associations: Variants

A top-down, progressive deepening approach with level-wise processing (breadth-first):
- First find high-level strong rules, e.g.: milk => bread [20%, 60%].
- Then find their lower-level, weaker rules, e.g.: 1.5% milk => wheat bread [6%, 50%].

Variations of mining multiple-level association rules:
- Level-crossed association rules, e.g.: 1.5% milk => Wonder wheat bread
- Association rules with multiple, alternative hierarchies, e.g.: 1.5% milk => Wonder bread

Multi-Level Association: Redundancy Filtering

Some rules may be redundant due to ancestor relationships between items. Example:
- Rule 1: milk => wheat bread [support = 8%, confidence = 70%]
- Rule 2: 1.5% milk => wheat bread [support = 2%, confidence = 72%]

We say that rule 1 is an ancestor of rule 2.

Redundancy: a rule is redundant if its support is close to the "expected" value based on the rule's ancestor (see [SA 95] R. Srikant, R. Agrawal: Mining Generalized Association Rules. VLDB, 1995).

Expected Support and Expected Confidence

How to compute the expected support? Given the rule X => Y and its ancestor rule X' => Y', the expected support of X => Y is defined as:

    E_{Z'}[P(Z)] = (P(z_1) / P(z'_1)) · ... · (P(z_j) / P(z'_j)) · P(Z')

where Z = X ∪ Y = {z_1, ..., z_n}, Z' = X' ∪ Y' = {z'_1, ..., z'_j, z_{j+1}, ..., z_n}, and each z'_i ∈ Z' is an ancestor of z_i ∈ Z.

[SA 95] R. Srikant, R. Agrawal: Mining Generalized Association Rules. VLDB, 1995.

Expected Support and Expected Confidence

How to compute the expected confidence? Given the rule X => Y and its ancestor rule X' => Y', the expected confidence of X => Y is defined as:

    E_{X' => Y'}[conf(X => Y)] = (P(y_1) / P(y'_1)) · ... · (P(y_j) / P(y'_j)) · conf(X' => Y')

where Y = {y_1, ..., y_n}, Y' = {y'_1, ..., y'_j, y_{j+1}, ..., y_n}, and each y'_i ∈ Y' is an ancestor of y_i ∈ Y.

[SA 95] R. Srikant, R. Agrawal: Mining Generalized Association Rules. VLDB, 1995.
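To make the redundancy test concrete, here is a small sketch applying the two formulas to the milk example from the redundancy-filtering slide. Only the rule supports and confidences come from the slide; the item probabilities (in particular, that 1.5% milk accounts for a quarter of all milk transactions) are illustrative assumptions.

```python
# Illustrative sketch of the redundancy test from [SA 95] on the milk example.

p_item = {"milk": 0.40, "1.5% milk": 0.10}   # assumed item probabilities

# Rule 1 (ancestor): milk => wheat bread        [support 8%, confidence 70%]
# Rule 2:            1.5% milk => wheat bread   [support 2%, confidence 72%]
sup1, conf1 = 0.08, 0.70
sup2, conf2 = 0.02, 0.72

# Expected support: scale the ancestor's support by P(z_i)/P(z'_i) for every
# specialized item (here only milk -> 1.5% milk).
expected_sup = (p_item["1.5% milk"] / p_item["milk"]) * sup1     # 0.02

# Expected confidence: only specialized head items contribute a factor; the
# head (wheat bread) is unchanged, so the expectation equals conf1.
expected_conf = conf1                                            # 0.70

print(f"support:    expected {expected_sup:.0%}, actual {sup2:.0%}")
print(f"confidence: expected {expected_conf:.0%}, actual {conf2:.0%}")
# Both actual values are close to their expectations, so rule 2 adds nothing
# beyond rule 1 and is pruned as redundant.
```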

Chapter 3: Frequent Itemset Mining

1) Introduction: transaction databases, market basket data analysis
2) Simple Association Rules: basic notions, rule generation, interestingness measures
3) Mining Frequent Itemsets: Apriori algorithm, hash trees, FP-tree
4) Further Topics
   - Hierarchical Association Rules: motivation, notions, algorithms, interestingness
   - Multidimensional and Quantitative Association Rules: motivation, basic idea, partitioning numerical attributes, adaptation of the Apriori algorithm, interestingness
5) Summary

Multi-Dimensional Association: Concepts

Single-dimensional rules: buys milk => buys bread

Multi-dimensional rules (two or more dimensions):
- Inter-dimension association rules (no repeated dimensions): age between 19-25 ∧ status is student => buys coke
- Hybrid-dimension association rules (repeated dimensions): age between 19-25 ∧ buys popcorn => buys coke

Techniques for Mining Multi-Dimensional Associations

Search for frequent k-predicate sets. Example: {age, occupation, buys} is a 3-predicate set.

Techniques can be categorized by how a quantitative attribute such as age is treated:
1. Static discretization of quantitative attributes: quantitative attributes are statically discretized using predefined concept hierarchies.
2. Quantitative association rules: quantitative attributes are dynamically discretized into bins based on the distribution of the data.
3. Distance-based association rules: a dynamic discretization process that considers the distance between data points.

Quantitative Association Rules

Up to now: associations between boolean attributes only. Now: numerical attributes, too.

Example:

Original database
    ID   age   marital status   # cars
    1    23    single           0
    2    38    married          2

Boolean database
    ID   age: 20..29   age: 30..39   m-status: single   m-status: married   ...
    1    1             0             1                  0                   ...
    2    0             1             0                  1                   ...
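A small illustrative sketch of this boolean-ization step: each numeric or categorical attribute is mapped to boolean "items", after which standard frequent itemset mining (e.g. Apriori) applies. The bin boundaries and column names simply mirror the example table above.

```python
# Static discretization: turn each record into a set of boolean "items".

records = [
    {"id": 1, "age": 23, "marital_status": "single",  "cars": 0},
    {"id": 2, "age": 38, "marital_status": "married", "cars": 2},
]

age_bins = [(20, 29), (30, 39)]

def to_boolean_items(record):
    items = set()
    for lo, hi in age_bins:
        if lo <= record["age"] <= hi:
            items.add(f"age:{lo}..{hi}")
    items.add(f"m-status:{record['marital_status']}")
    items.add(f"cars:{record['cars']}")
    return items

for r in records:
    print(r["id"], sorted(to_boolean_items(r)))
# 1 ['age:20..29', 'cars:0', 'm-status:single']
# 2 ['age:30..39', 'cars:2', 'm-status:married']
```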

Quantitative Association Rules: Ideas

Static discretization:
- Discretization of all attributes before mining the association rules, e.g. by using a generalization hierarchy for each attribute
- Substitute numerical attribute values by ranges or intervals

Dynamic discretization:
- Discretization of the attributes during association rule mining
- Goal (for example): maximization of confidence
- Unification of neighboring association rules into a generalized rule

Partitioning of Numerical Attributes

Problem: minimum support
- Too many intervals: too small support for each individual interval
- Too few intervals: too small confidence of the rules

Solution (see the sketch below):
- First, partition the domain into many intervals
- Afterwards, create new intervals by merging adjacent intervals

Numeric attributes are thus dynamically discretized such that the confidence or compactness of the mined rules is maximized.
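The sketch referenced above: a greedy, illustrative implementation of the "partition finely, then merge adjacent intervals" idea. The interval counts and the minsup threshold are made-up numbers; real algorithms use more elaborate merge criteria.

```python
# Merge adjacent intervals of a numeric attribute until each merged
# interval reaches the minimum support.

def merge_intervals(intervals, counts, n_total, minsup):
    """intervals: sorted list of (lo, hi) pairs; counts: records per interval."""
    merged, cur, cur_count = [], None, 0
    for (lo, hi), c in zip(intervals, counts):
        cur = (cur[0], hi) if cur else (lo, hi)    # extend or start an interval
        cur_count += c
        if cur_count / n_total >= minsup:          # frequent enough -> close it
            merged.append(cur)
            cur, cur_count = None, 0
    if cur:                                        # leftover tail
        if merged:
            merged[-1] = (merged[-1][0], cur[1])   # attach to the last interval
        else:
            merged.append(cur)
    return merged

age_bins = [(20, 24), (25, 29), (30, 34), (35, 39), (40, 44)]
counts   = [30, 20, 80, 15, 10]
print(merge_intervals(age_bins, counts, n_total=200, minsup=0.25))
# [(20, 29), (30, 44)]
```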

Quantitative Association Rules

2-D quantitative association rules: A_quan1 ∧ A_quan2 => A_cat

Cluster adjacent association rules to form general rules using a 2-D grid.

Example: age(X, "30-34") ∧ income(X, "24K-48K") => buys(X, "high resolution TV")

Chapter 3: Frequent Itemset Mining

1) Introduction: transaction databases, market basket data analysis
2) Mining Frequent Itemsets: Apriori algorithm, hash trees, FP-tree
3) Simple Association Rules: basic notions, rule generation, interestingness measures
4) Further Topics
   - Hierarchical Association Rules: motivation, notions, algorithms, interestingness
   - Quantitative Association Rules: motivation, basic idea, partitioning numerical attributes, adaptation of the Apriori algorithm, interestingness
5) Summary

Chapter 3: Summary

- Mining frequent itemsets: Apriori algorithm, hash trees, FP-tree
- Simple association rules: support, confidence, rule generation, interestingness measures (correlation/lift)
- Further topics
  - Hierarchical association rules: algorithms (top-down progressive deepening), multilevel support thresholds, redundancy and R-interestingness
  - Quantitative association rules: partitioning numerical attributes, adaptation of the Apriori algorithm, interestingness
- Extensions: multi-dimensional association rule mining

Further Applications

- Customer analysis
- Facilitator for other data mining techniques
- Indexing and retrieval: frequent patterns provide a concise data representation
- Web mining tasks: sequential pattern mining for traversal patterns, which helps in designing and organizing web sites
- Temporal applications, e.g. event detection
- Spatial and spatiotemporal analysis: association rules can characterize useful relationships between spatial and non-spatial properties
- Image and multimedia data mining: frequent image features help in several mining tasks for image data
- Chemical and biological applications: often, important motifs correspond to frequent patterns in graphs and structured data (toxicological analysis, chemical compound prediction, RNA analysis, ...)

Sequential Pattern Mining (Outlook to KDD 2)

Task 1: find all subsets of items that occur in a specific sequence in many transactions, e.g.: 97% of transactions contain the sequence {jogging, high ECG, sweating}.

Task 2: find all rules that correlate the occurrence of one set of items after that of another set of items in the transaction database, e.g.: 72% of users who perform a web search and then make a long eye gaze over the ads follow that with a successful ad click.

The order of the items matters: all possible permutations of items must be considered when checking possible frequent sequences, not only the combinations of items.

Applications: data with temporal order (streams), e.g. bioinformatics, web mining, text mining, sensor data mining, process mining, etc.

Sequential Pattern Mining vs. Frequent Itemset Mining

Both can be applied to similar datasets:
- Each customer has a customer id (Cid) and is aligned with transactions.
- Each transaction has a transaction id (Tid) and belongs to one customer.
- Based on the transaction ids, each customer is also aligned with a transaction sequence.

Transaction database:
    Cid   Tid   Items
    1     1     {butter}
    1     2     {milk}
    1     3     {sugar}
    1     4     {butter, sugar}
    1     5     {milk, sugar}
    2     6     {butter, milk, sugar}
    2     7     {eggs}
    3     8     {sugar}
    3     9     {butter, milk}
    3     10    {eggs}
    3     11    {milk}

Customer sequences:
    Cid   Sequence
    1     <{butter}, {milk}, {sugar}, {butter, sugar}, {milk, sugar}>
    2     <{butter, milk, sugar}, {eggs}>
    3     <{sugar}, {butter, milk}, {eggs}, {milk}>

Frequent itemset mining: no temporal importance in the order of items happening together.
    itemset          frequency
    {butter}         4
    {milk}           5
    {butter, milk}   2

Sequential pattern mining: the order of the items matters.
    sequence                     frequency
    <{butter}>                   4
    <{butter, milk}>             2
    <{butter}, {milk}>           4
    <{milk}, {butter}>           1
    <{butter}, {butter, milk}>   1
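To contrast the two containment notions in code, here is a small illustrative sketch: a sequential pattern is contained in a customer sequence if its elements match transactions in the given order (not necessarily consecutively), whereas itemset containment ignores order entirely. The customer sequence below is Cid 1 from the table above.

```python
# Containment check for sequential patterns: each pattern element must be a
# subset of some transaction, and the matched transactions must appear in
# the same order as the pattern elements.

def contains(sequence, pattern):
    """sequence, pattern: lists of sets (one set per transaction/element)."""
    i = 0
    for transaction in sequence:
        if i < len(pattern) and pattern[i] <= transaction:
            i += 1
    return i == len(pattern)

customer = [{"butter"}, {"milk"}, {"sugar"}, {"butter", "sugar"}, {"milk", "sugar"}]

print(contains(customer, [{"butter"}, {"milk"}]))   # True: butter, later milk
print(contains(customer, [{"milk"}, {"butter"}]))   # True: milk (2nd), butter (4th)
print(contains(customer, [{"butter", "milk"}]))     # False: never in one transaction
```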

Sequential Pattern Mining Algorithms

Breadth-first search based:
- GSP (Generalized Sequential Patterns) algorithm [1]
- SPADE [2]

Depth-first search based:
- PrefixSpan [3]
- SPAM [4]

[1] Srikant, Agrawal: Mining sequential patterns: Generalizations and performance improvements. EDBT 1996.
[2] Zaki: SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 2001, 42(1-2): 31-60.
[3] Pei et al.: Mining sequential patterns by pattern-growth: The PrefixSpan approach. TKDE 2004.
[4] Ayres et al.: Sequential pattern mining using a bitmap representation. SIGKDD 2002.