What Is Data Mining? CMPT 354: Database I -- Data Mining 2


Data Mining

What Is Data Mining?
- [Figure: mining knowledge from data]
- Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

The KDD Process
Data → (selection) → Target data → (preprocessing) → Preprocessed data → (transformation) → Transformed data → (data mining) → Patterns → (interpretation/evaluation) → Knowledge

What Kind of Patterns?
- Association rules and sequential patterns
- Classification
- Clusters
- Many others

Frequent Patterns and Association Rules
- (Time ∈ {Fri, Sat}) ∧ buy(X, diaper) → buy(X, beer)
  - Dads taking care of babies on weekends drink beer
- Itemsets should be frequent
  - So that the rule can be applied extensively
- Rules should be strong, i.e., confident
  - So that the rule has strong prediction capability

Sequential Patterns
- Frequent patterns in sequence databases
- Within 3 months: buy computer → buy CD-ROM → buy digital camera
- The (temporal) order is important

Utilizations
- Find regularities in data
  - What products were often purchased together?
  - What are the subsequent purchases after buying a PC?
  - Can we automatically classify web documents?
  - What kinds of patients are sensitive to this new drug?
- Applications include basket data analysis, cross-marketing, catalog design, sale campaign analysis, and web log (click stream) analysis

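Classification
A decision tree for PlayTennis

| Day | Outlook  | Temp | Humid  | Wind   | PlayTennis |
|-----|----------|------|--------|--------|------------|
| D1  | Sunny    | Hot  | High   | Weak   | No         |
| D2  | Sunny    | Hot  | High   | Strong | No         |
| D3  | Overcast | Hot  | High   | Weak   | Yes        |
| D4  | Rain     | Mild | High   | Weak   | Yes        |
| D5  | Rain     | Cool | Normal | Weak   | Yes        |
| D6  | Rain     | Cool | Normal | Strong | No         |
| D7  | Overcast | Cool | Normal | Strong | Yes        |
| D8  | Sunny    | Mild | High   | Weak   | No         |
| D9  | Sunny    | Cool | Normal | Weak   | Yes        |
| D10 | Rain     | Mild | Normal | Weak   | Yes        |
| D11 | Sunny    | Mild | Normal | Strong | Yes        |
| D12 | Overcast | Mild | High   | Strong | Yes        |
| D13 | Overcast | Hot  | Normal | Weak   | Yes        |
| D14 | Rain     | Mild | High   | Strong | No         |

Decision tree:
- Outlook = Sunny: test Humidity
  - Humidity = High: No
  - Humidity = Normal: Yes
- Outlook = Overcast: Yes
- Outlook = Rain: test Wind
  - Wind = Strong: No
  - Wind = Weak: Yes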

Utilizations
- Understanding the key features of large data sets
- Predictions
  - Credit card approval
  - Fraud detection
  - Intrusion detection

Clusters and Outliers
- [Figure: Cluster 1, Cluster 2, and outliers]
- Maximizing the intra-class similarity and minimizing the inter-class similarity

Utilizations
- Data summarization
- Market/customer segmentation
- Pattern recognition
- Data preprocessing and compression
- Exception detection
  - Fraud detection

Frequent Patterns: Basics
- Itemset: a set of items
  - E.g., acm = {a, c, m}
- Support of an itemset: the number of transactions that contain it
  - Sup(acm) = 3
- Given min_sup = 3, acm is a frequent pattern
- Frequent pattern mining: find all frequent patterns in a database
- For the rule c → am: Sup(c → am) = 3 and Conf(c → am) = Sup(acm) / Sup(c) = 3/4 = 75%

Transaction database TDB

| TID | Items bought           |
|-----|------------------------|
| 100 | f, a, c, d, g, i, m, p |
| 200 | a, b, c, f, l, m, o    |
| 300 | b, f, h, j, o          |
| 400 | b, c, k, s, p          |
| 500 | a, f, c, e, l, p, m, n |
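As a quick check of the numbers on this slide, the following minimal sketch recomputes support and confidence over TDB. The function names `support` and `confidence` are illustrative choices, not part of the original slides.

```python
# Transaction database TDB from the slide (TID -> items bought).
TDB = {
    100: {"f", "a", "c", "d", "g", "i", "m", "p"},
    200: {"a", "b", "c", "f", "l", "m", "o"},
    300: {"b", "f", "h", "j", "o"},
    400: {"b", "c", "k", "s", "p"},
    500: {"a", "f", "c", "e", "l", "p", "m", "n"},
}

def support(itemset, db):
    """Absolute support: number of transactions containing every item."""
    items = set(itemset)
    return sum(1 for t in db.values() if items <= t)

def confidence(lhs, rhs, db):
    """Confidence of the rule lhs -> rhs: sup(lhs and rhs together) / sup(lhs)."""
    return support(set(lhs) | set(rhs), db) / support(lhs, db)

print(support("acm", TDB))         # 3
print(confidence("c", "am", TDB))  # 0.75
```

Here each single-character item lets a string such as "acm" stand in for the itemset {a, c, m}.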

A Naïve Attempt
- Generate all possible itemsets, test their supports against the database
- 100 items → 2^100 − 1 possible itemsets
- How to test the supports of a huge number of itemsets against a large database, say containing 1 million transactions?
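To give a sense of scale, this small sketch evaluates the count of non-empty itemsets over 100 items; the variable names are illustrative.

```python
n_items = 100
# Every non-empty subset of the items is a possible itemset.
n_itemsets = 2 ** n_items - 1

print(n_itemsets)             # a 31-digit integer
print(f"{n_itemsets:.2e}")    # roughly 1.27e+30
```

Even at one support test per nanosecond, enumerating that many itemsets is far out of reach, which motivates the pruning ideas that follow.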

How to Get an Efficient Method?
- Reduce the number of itemsets that need to be checked
- Check the supports of selected itemsets efficiently

Apriori: Anti-monotonic Property
- Any subset of a frequent itemset must also be frequent: an anti-monotone property
  - A transaction containing {beer, diaper, nuts} also contains {beer, diaper}
  - If {beer, diaper, nuts} is frequent, then {beer, diaper} must also be frequent
- In other words, any superset of an infrequent itemset must also be infrequent
  - No superset of any infrequent itemset should be generated or tested
  - Many item combinations can be pruned!
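The pruning rule can be sketched as a subset check on each candidate. This is a minimal illustration: the helper name `has_infrequent_subset` and the set of frequent 2-itemsets below are hypothetical, chosen only to show one kept and one pruned candidate.

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_k):
    """True if some k-subset of a (k+1)-item candidate is not frequent."""
    k = len(candidate) - 1
    return any(frozenset(s) not in frequent_k
               for s in combinations(candidate, k))

# Hypothetical frequent 2-itemsets, for illustration only.
frequent_2 = {
    frozenset(p)
    for p in [("beer", "diaper"), ("beer", "nuts"),
              ("diaper", "nuts"), ("beer", "milk")]
}

# All 2-subsets are frequent, so this candidate is kept for counting.
print(has_infrequent_subset(frozenset({"beer", "diaper", "nuts"}), frequent_2))  # False
# {diaper, milk} is not frequent, so this candidate is pruned untested.
print(has_infrequent_subset(frozenset({"beer", "diaper", "milk"}), frequent_2))  # True
```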

Apriori-based Mining
- Candidate-generation-and-test
  - Generate length (k+1) candidate itemsets from length k frequent itemsets, and
  - Test the candidates against the database

Apriori Algorithm
A level-wise, candidate-generation-and-test approach (Agrawal & Srikant 1994)

Database D, with min_sup = 2:

| TID | Items      |
|-----|------------|
| 10  | a, c, d    |
| 20  | b, c, e    |
| 30  | a, b, c, e |
| 40  | b, e       |

- Scan D to count the 1-candidates: a: 2, b: 3, c: 3, d: 1, e: 3
- Frequent 1-itemsets: a: 2, b: 3, c: 3, e: 3
- 2-candidates: ab, ac, ae, bc, be, ce
- Scan D to count the 2-candidates: ab: 1, ac: 2, ae: 1, bc: 2, be: 3, ce: 2
- Frequent 2-itemsets: ac: 2, bc: 2, be: 3, ce: 2
- 3-candidates: bce
- Scan D to count the 3-candidates: bce: 2
- Frequent 3-itemsets: bce: 2

The Apriori Algorithm
C_k: candidate itemsets of size k
L_k: frequent itemsets of size k

    L_1 = {frequent items};
    for (k = 1; L_k ≠ ∅; k++) do
        C_{k+1} = candidates generated from L_k;
        for each transaction t in database do
            increment the count of all candidates in C_{k+1} that are contained in t;
        L_{k+1} = candidates in C_{k+1} with min_support;
    return ∪_k L_k;
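The pseudocode above can be sketched in Python as follows. This is a minimal, unoptimized illustration: the slides do not spell out how candidates are generated, so the join-and-prune step here (union pairs of frequent k-itemsets, then drop any candidate with an infrequent k-subset) is one common choice and should be read as an assumption. The function name `apriori` and its signature are likewise illustrative.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support} for all frequent itemsets, level by level."""
    transactions = [frozenset(t) for t in transactions]

    # L_1: count single items in one scan.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s: c for s, c in counts.items() if c >= min_sup}

    frequent = dict(level)
    k = 1
    while level:
        # C_{k+1}: join frequent k-itemsets, prune by the anti-monotone property.
        candidates = set()
        for a, b in combinations(level, 2):
            union = a | b
            if len(union) == k + 1 and all(
                frozenset(s) in level for s in combinations(union, k)
            ):
                candidates.add(union)
        # One scan of the database to count every candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # L_{k+1}: candidates that reach the minimum support.
        level = {s: c for s, c in counts.items() if c >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

# Database D from the worked example (TIDs 10, 20, 30, 40), min_sup = 2.
D = [{"a", "c", "d"}, {"b", "c", "e"}, {"a", "b", "c", "e"}, {"b", "e"}]
result = apriori(D, min_sup=2)
for itemset in sorted(result, key=lambda s: (len(s), sorted(s))):
    print("".join(sorted(itemset)), result[itemset])
```

Run on database D, this should list the same frequent itemsets as the worked example: four 1-itemsets, four 2-itemsets, and bce.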

Classification and Prediction
- Classification: predict categorical class labels
  - Build a model for a set of classes/concepts
  - Classify bank loan applications (safe/risky)
- Prediction: model continuous-valued functions
  - Predict the economic growth in 2004

A Two-step Process
- Model construction: describe a set of predetermined classes
  - Training dataset: tuples for model construction
  - Each tuple/sample belongs to a predefined class
  - The model is expressed as classification rules, decision trees, or mathematical formulae
- Model application: classify unseen objects
  - Estimate the accuracy of the model using an independent test set
  - If the accuracy is acceptable, apply the model to classify tuples with unknown class labels

Model Construction
Training data → classification algorithms → classifier (model)

Training data:

| Name | Rank           | Years | Tenured |
|------|----------------|-------|---------|
| Mike | Assistant Prof | 3     | No      |
| Mary | Assistant Prof | 7     | Yes     |
| Bill | Professor      | 2     | Yes     |
| Jim  | Associate Prof | 7     | Yes     |
| Dave | Assistant Prof | 6     | No      |
| Anne | Associate Prof | 3     | No      |

Classifier (model):
IF rank = professor OR years > 6 THEN tenured = yes

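Model Application
Testing data → classifier → estimated accuracy; the classifier is then applied to unseen data

Testing data:

| Name    | Rank           | Years | Tenured |
|---------|----------------|-------|---------|
| Tom     | Assistant Prof | 2     | No      |
| Merlisa | Associate Prof | 7     | No      |
| George  | Professor      | 5     | Yes     |
| Joseph  | Assistant Prof | 7     | Yes     |

Unseen data: (Jeff, Professor, 4). Tenured?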
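The two steps can be tried out directly. The sketch below encodes the rule from the model-construction slide as a function, estimates its accuracy on the four test tuples, and then classifies the unseen tuple. The function name `tenured` and the spelled-out rank strings are assumptions made for this example.

```python
def tenured(rank, years):
    """Classifier from the slides: IF rank = professor OR years > 6 THEN yes."""
    return "Yes" if rank == "Professor" or years > 6 else "No"

# Testing data: (name, rank, years, actual label).
test_data = [
    ("Tom", "Assistant Prof", 2, "No"),
    ("Merlisa", "Associate Prof", 7, "No"),
    ("George", "Professor", 5, "Yes"),
    ("Joseph", "Assistant Prof", 7, "Yes"),
]

correct = sum(tenured(rank, years) == label
              for _, rank, years, label in test_data)
accuracy = correct / len(test_data)

print(accuracy)                 # 0.75 (Merlisa is misclassified)
print(tenured("Professor", 4))  # Yes: the prediction for Jeff
```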

Decision Tree
- A node in the tree: a test of some attribute
- A branch: a possible value of the attribute
- Classification
  - Start at the root
  - Test the attribute
  - Move down the tree branch

Decision tree for PlayTennis:
- Outlook = Sunny: test Humidity
  - Humidity = High: No
  - Humidity = Normal: Yes
- Outlook = Overcast: Yes
- Outlook = Rain: test Wind
  - Wind = Strong: No
  - Wind = Weak: Yes
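The classification walk (start at the root, test the attribute, move down the branch) can be sketched with the PlayTennis tree stored as nested tuples and dictionaries. This representation, with an internal node as `(attribute, branches)` and a leaf as a plain string, is an assumption made for the example.

```python
# Internal node: (attribute, {value: subtree}); leaf: class label string.
tree = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def classify(node, example):
    """Follow the branch matching the example's value until a leaf is reached."""
    while not isinstance(node, str):
        attribute, branches = node
        node = branches[example[attribute]]
    return node

print(classify(tree, {"Outlook": "Sunny", "Humidity": "High", "Wind": "Weak"}))  # No
print(classify(tree, {"Outlook": "Rain", "Humidity": "High", "Wind": "Weak"}))   # Yes
```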

Training Dataset

| Outlook  | Temp | Humid  | Wind   | PlayTennis |
|----------|------|--------|--------|------------|
| Sunny    | Hot  | High   | Weak   | No         |
| Sunny    | Hot  | High   | Strong | No         |
| Overcast | Hot  | High   | Weak   | Yes        |
| Rain     | Mild | High   | Weak   | Yes        |
| Rain     | Cool | Normal | Weak   | Yes        |
| Rain     | Cool | Normal | Strong | No         |
| Overcast | Cool | Normal | Strong | Yes        |
| Sunny    | Mild | High   | Weak   | No         |
| Sunny    | Cool | Normal | Weak   | Yes        |
| Rain     | Mild | Normal | Weak   | Yes        |
| Sunny    | Mild | Normal | Strong | Yes        |
| Overcast | Mild | High   | Strong | Yes        |
| Overcast | Hot  | Normal | Weak   | Yes        |
| Rain     | Mild | High   | Strong | No         |

Basic Algorithm ID3
- Construct a tree in a top-down recursive divide-and-conquer manner
  - Which attribute is the best at the current node?
  - Create a node for each possible attribute value
  - Partition training data into descendant nodes
- Conditions for stopping recursion
  - All samples at a given node belong to the same class
  - No attribute remains for further partitioning
    - Majority voting is employed for classifying the leaf
  - There is no sample at the node
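The recursion can be sketched as below, using information gain (defined in the slides that follow) to pick the best attribute. This is a simplified illustration, not a full ID3 implementation: the names `entropy`, `gain`, and `id3` and the tuple-based tree representation are assumptions, and branches are created only for values present at the node, so the "no sample at the node" case is handled but never triggered here.

```python
from collections import Counter
from math import log2

COLUMNS = ("Outlook", "Temp", "Humid", "Wind", "PlayTennis")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]

def entropy(rows, target):
    n = len(rows)
    counts = Counter(r[target] for r in rows)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def gain(rows, attribute, target):
    """Expected reduction in entropy from partitioning rows on attribute."""
    result = entropy(rows, target)
    for value in {r[attribute] for r in rows}:
        subset = [r for r in rows if r[attribute] == value]
        result -= len(subset) / len(rows) * entropy(subset, target)
    return result

def id3(rows, attributes, target, default=None):
    if not rows:                      # no sample at the node
        return default
    labels = [r[target] for r in rows]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:         # all samples belong to the same class
        return labels[0]
    if not attributes:                # no attribute left: majority voting
        return majority
    best = max(attributes, key=lambda a: gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    branches = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        branches[value] = id3(subset, remaining, target, majority)
    return (best, branches)

tree = id3(DATA, ["Outlook", "Temp", "Humid", "Wind"], "PlayTennis")
print(tree)
```

On the PlayTennis training data this should produce a tree rooted at Outlook, with Humid tested under Sunny and Wind tested under Rain.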

Which Attribute Is the Best?
- The attribute most useful for classifying examples
- Information gain and Gini index
  - Statistical properties
  - Measure how well an attribute separates the training examples

Entropy
- Measures the homogeneity of examples

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{c} p_i \log_2 p_i$$

- S is the training data set, and p_i is the proportion of S belonging to class i
- The smaller the entropy, the purer the data set
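As an illustration of the formula, the following sketch computes entropy from a list of class labels. The function name `entropy` and the label lists are illustrative; the 9 Yes / 5 No split matches the PlayTennis training data.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in Counter(labels).values())

print(round(entropy(["Yes"] * 9 + ["No"] * 5), 3))  # about 0.94
print(entropy(["Yes", "No"]))                       # evenly mixed: 1.0
print(entropy(["Yes"] * 4))                         # pure set: 0.0
```

Classes with zero examples never appear in the counter, so the undefined term 0 · log2(0) is simply skipped.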

Information Gain
- The expected reduction in entropy caused by partitioning the examples according to an attribute

$$\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)$$

- Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v

Example
The PlayTennis training dataset has 14 examples: 9 Yes and 5 No.

$$\mathrm{Entropy}(S) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.94$$

For Wind, 8 examples have Wind = Weak (6 Yes, 2 No) and 6 examples have Wind = Strong (3 Yes, 3 No), so Entropy(S_Weak) = 0.811 and Entropy(S_Strong) = 1.00.

$$\begin{aligned}
\mathrm{Gain}(S, \mathrm{Wind}) &= \mathrm{Entropy}(S) - \sum_{v \in \{\mathrm{Weak}, \mathrm{Strong}\}} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \\
&= \mathrm{Entropy}(S) - \frac{8}{14}\,\mathrm{Entropy}(S_{\mathrm{Weak}}) - \frac{6}{14}\,\mathrm{Entropy}(S_{\mathrm{Strong}}) \\
&= 0.94 - \frac{8}{14} \times 0.811 - \frac{6}{14} \times 1.00 \\
&= 0.048
\end{aligned}$$
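The worked example can be reproduced numerically. The sketch below recomputes the gain for Wind and, for comparison, for the other three attributes of the PlayTennis training data; the helper names `entropy` and `gain` are illustrative.

```python
from collections import Counter
from math import log2

COLUMNS = ("Outlook", "Temp", "Humid", "Wind", "PlayTennis")
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]

def entropy(examples, target="PlayTennis"):
    n = len(examples)
    counts = Counter(e[target] for e in examples)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def gain(examples, attribute, target="PlayTennis"):
    """Gain(S, A) = Entropy(S) - sum_v |S_v|/|S| * Entropy(S_v)."""
    result = entropy(examples, target)
    for value in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == value]
        result -= len(subset) / len(examples) * entropy(subset, target)
    return result

for attribute in COLUMNS[:-1]:
    print(attribute, round(gain(DATA, attribute), 3))
```

Wind should come out at about 0.048, matching the slide, and Outlook should have the largest gain, which is consistent with it sitting at the root of the PlayTennis tree.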

Extracting Classification Rules
- Each path from the root to a leaf forms an IF-THEN rule
  - Each attribute-value pair along a path forms a conjunction
  - The leaf node holds the class prediction
  - IF age = "<=30" AND student = "no" THEN buys_computer = "no"
- Rules are easy to understand
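Rule extraction amounts to enumerating the root-to-leaf paths. The sketch below does this for the PlayTennis tree from the earlier slides rather than the buys_computer example, whose tree is not shown; the nested-tuple tree representation and the function name `extract_rules` are assumptions made for this illustration.

```python
# Internal node: (attribute, {value: subtree}); leaf: class label string.
tree = ("Outlook", {
    "Sunny": ("Humidity", {"High": "No", "Normal": "Yes"}),
    "Overcast": "Yes",
    "Rain": ("Wind", {"Strong": "No", "Weak": "Yes"}),
})

def extract_rules(node, conditions=()):
    """Return one IF-THEN rule per root-to-leaf path."""
    if isinstance(node, str):
        # Leaf: the conjunction of tests on the path predicts this class.
        lhs = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {lhs} THEN PlayTennis = {node}"]
    attribute, branches = node
    rules = []
    for value, child in branches.items():
        rules.extend(extract_rules(child, conditions + ((attribute, value),)))
    return rules

rules = extract_rules(tree)
for rule in rules:
    print(rule)
```

The tree has five leaves, so five rules are produced, the first being IF Outlook = Sunny AND Humidity = High THEN PlayTennis = No.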

Summary
- Mining patterns
  - Frequent patterns
  - Classification
  - Clustering
- Frequent patterns and Apriori algorithm
- Decision tree and ID3 algorithm