
CMPUT 391 Database Management Systems
Data Mining
Textbook: Chapter 17.7-17.11 (without 17.10)
University of Alberta

Overview: Motivation, KDD and Data Mining, Association Rules, Clustering, Classification

The Current Situation
Data rich: technology is available to help us collect data (bar codes, scanners, satellites, cameras, etc.) and to store it (databases, data warehouses, a variety of repositories).
Information poor: we need powerful data analysis processes to turn the data into information (competitive edge, research, etc.).
Examples of data sources: sales/service transactions, sky and earth observation, molecular biology, medical data, ...

Overview: Motivation, KDD and Data Mining, Association Rules, Clustering, Classification

KDD: Knowledge Discovery in Databases
A confluence of multiple disciplines: machine learning, statistics, database technology, visualization, ...
KDD is the process of non-trivial extraction of implicit, previously unknown, and potentially useful information from large collections of data.

The KDD Process
Data mining is the core of the knowledge discovery process.
[Figure: the KDD pipeline - data cleaning and data integration into a data warehouse, selection and transformation of task-relevant data, data mining, and pattern evaluation.]
KDD is an iterative process.

Basic Data Mining Functionalities
- Association rules: find frequent associations in transaction databases.
- Clustering: find natural groups in the data.
- Classification: find models to predict class membership for new objects.
- Concept characterization and discrimination: generalize, summarize, and contrast data characteristics.
- Other methods: outlier detection, sequential patterns, methods for special data types (e.g., spatial data mining, web mining, text mining, ...).

Overview: Motivation, KDD and Data Mining, Association Rules, Clustering, Classification

Basic Concepts
I = {i_1, i_2, ..., i_n} is the set of all possible items.
A transaction is a set of items: T = {i_a, i_b, ..., i_t}, T ⊆ I.
D is a set (database) of transactions, the task-relevant data.
An association rule is of the form P ⇒ Q, where P ⊆ I, Q ⊆ I, and P ∩ Q = ∅.

Rule Measures: Support and Confidence
Support of a rule P ⇒ Q: s_D(P ⇒ Q) = s_D(P ∪ Q), the probability that a transaction in D contains the items from both itemsets P and Q.
Confidence of a rule P ⇒ Q: c_D(P ⇒ Q) = s_D(P ∪ Q) / s_D(P), the probability that a transaction in D that already contains the items in P also contains the items in Q.
Both are determined by counting the corresponding relative frequencies in the database.
[Figure: Venn diagram - customers buying bread, customers buying milk, and customers buying both.]

Support and Confidence: An Example
Transactions (Transaction ID: items bought):
2000: A, B, C
1000: A, C
4000: A, D
5000: B, E, F
With min. support 50% and min. confidence 50%, the frequent itemsets and their supports are:
{A}: 75%, {B}: 50%, {C}: 50%, {A, C}: 50%
For the rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C}) / support({A}) = 66.6%
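To make the counting concrete, here is a minimal Python sketch of the two measures, using the transactions from this slide (the function names support and confidence are illustrative, not from the textbook):

# Transactions from the example above (the transaction IDs do not matter for counting).
transactions = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

def support(itemset, transactions):
    # Relative frequency of transactions that contain every item in `itemset`.
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(lhs, rhs, transactions):
    # support(lhs ∪ rhs) / support(lhs): estimated probability of rhs given lhs.
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

print(support({"A", "C"}, transactions))       # 0.5 -> 50%
print(confidence({"A"}, {"C"}, transactions))  # 0.666... -> 66.6%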

Strong Rules
Thresholds: minimum support minsup and minimum confidence minconf.
Itemsets are called frequent if their support is greater than or equal to minsup.
Task: find all rules P ⇒ Q [s, c] such that s_D(P ⇒ Q) ≥ minsup and c_D(P ⇒ Q) ≥ minconf.

Mining Association Rules
Input: a database of transactions, where each transaction is a list of items (e.g., the items purchased by a customer in one visit).
Task: find all rules P ⇒ Q [s, c] that associate the presence of one set of items with that of another set of items, such that s_D(P ⇒ Q) ≥ minsup and c_D(P ⇒ Q) ≥ minconf.
Example: 98% of people who purchase tires and auto accessories also get automotive services done.

Mining Frequent Itemsets: the Key Step
Iteratively find the frequent itemsets, i.e., sets of items that have minimum support, with cardinality from 1 to k (k-itemsets).
This is based on the Apriori principle: any subset of a frequent itemset must also be frequent. E.g., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
Method based on the Apriori principle: when generating candidate k-itemsets for the next iteration, only consider those where all subsets of length k-1 have been determined to be frequent in the previous step.
Finally, use the frequent itemsets to generate the association rules.

The Apriori Algorithm
Join step: C_k is generated by joining L_{k-1} with itself.
Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
Pseudo-code (C_k: candidate itemsets of size k; L_k: frequent itemsets of size k):
  L_1 = {frequent items};
  for (k = 1; L_k != ∅; k++) do begin
      C_{k+1} = candidates generated from L_k;
      for each transaction t in the database do
          increment the count of all candidates in C_{k+1} that are contained in t;
      L_{k+1} = candidates in C_{k+1} with support >= min_support;
  end
  return ∪_k L_k;
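The level-wise loop above can be sketched in Python roughly as follows. This is a simplified illustration, not the textbook's implementation, and it uses a naive candidate-generation step; the proper join-and-prune step is shown on the next two slides.

def apriori(transactions, minsup):
    # Return all itemsets (as frozensets) with relative support >= minsup.
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets.
    current = {frozenset([i]) for i in items
               if sum(1 for t in transactions if i in t) / n >= minsup}
    frequent = set(current)
    while current:
        k = len(next(iter(current)))
        # Naive C_{k+1}: union of two frequent k-itemsets that yields a (k+1)-itemset.
        candidates = {a | b for a in current for b in current if len(a | b) == k + 1}
        # One scan of the database per level: count each candidate's occurrences.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c for c, cnt in counts.items() if cnt / n >= minsup}
        frequent |= current
    return frequent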

Candidate Generation: 1. Join Step
Suppose the items in each itemset of L_{k-1} are listed in some fixed order.
Step 1: self-joining L_{k-1}:
  insert into C_k
  select p.item_1, p.item_2, ..., p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, ..., p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Example: p = (1, 2, 3) ∈ L_{k-1} and q = (1, 2, 4) ∈ L_{k-1} join to the candidate (1, 2, 3, 4) ∈ C_k.

Candidate Generation: 2. Prune Step
Step 2: pruning:
  forall itemsets c in C_k do
      forall (k-1)-subsets s of c do
          if (s is not in L_{k-1}) then delete c from C_k
Example: L_3 = {(1 2 3), (1 2 4), (1 3 4), (1 3 5), (2 3 4)}.
Candidates after the join step: {(1 2 3 4), (1 3 4 5)}.
In the pruning step, (1 3 4 5) is deleted because (3 4 5) ∉ L_3.
Result: C_4 = {(1 2 3 4)}.
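The join and prune steps together form the candidate-generation procedure. A possible Python rendering of the pseudo-code, assuming itemsets are kept as sorted tuples (the name apriori_gen is a common convention, not from the slides):

from itertools import combinations

def apriori_gen(Lk):
    # Generate candidate (k+1)-itemsets from the frequent k-itemsets in Lk,
    # where every itemset is a sorted tuple of items.
    Lk = sorted(Lk)
    k = len(Lk[0])
    Lk_set = set(Lk)
    candidates = set()
    for i in range(len(Lk)):
        for j in range(i + 1, len(Lk)):
            p, q = Lk[i], Lk[j]
            # Join step: first k-1 items agree and the last item of p is smaller.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                c = p + (q[-1],)
                # Prune step: every k-subset of c must itself be frequent.
                if all(s in Lk_set for s in combinations(c, k)):
                    candidates.add(c)
    return candidates

L3 = {(1, 2, 3), (1, 2, 4), (1, 3, 4), (1, 3, 5), (2, 3, 4)}
print(apriori_gen(L3))  # {(1, 2, 3, 4)}: (1, 3, 4, 5) is pruned because (3, 4, 5) is not in L3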

The Apriori Algorithm -- Example (minsup = 50%)
Database D (TID: items):
  100: 1 3 4
  200: 2 3 5
  300: 1 2 3 5
  400: 2 5
Scan D → C_1 (itemset: support count): {1}: 2, {2}: 3, {3}: 3, {4}: 1, {5}: 3
L_1: {1}: 2, {2}: 3, {3}: 3, {5}: 3
C_2 (generated from L_1): {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C_2 counts: {1 2}: 1, {1 3}: 2, {1 5}: 1, {2 3}: 2, {2 5}: 3, {3 5}: 2
L_2: {1 3}: 2, {2 3}: 2, {2 5}: 3, {3 5}: 2
C_3 (generated from L_2): {2 3 5}
Scan D → L_3: {2 3 5}: 2
Note: {1 2 3}, {1 2 5}, and {1 3 5} are not in C_3, because their subsets {1 2} and {1 5} are not frequent (pruning).
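For reference, running the apriori sketch from the algorithm slide on this database with minsup = 50% yields the same frequent itemsets as the trace above:

D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
print(sorted(map(sorted, apriori(D, minsup=0.5))))
# [[1], [1, 3], [2], [2, 3], [2, 3, 5], [2, 5], [3], [3, 5], [5]]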

Generating Association Rules from Frequent Itemsets
Only strong association rules are generated: frequent itemsets satisfy the minimum support threshold, and strong rules additionally satisfy the minimum confidence threshold.
Confidence(P ⇒ Q) = Prob(Q | P) = Support(P ∪ Q) / Support(P)
For each frequent itemset f, generate all non-empty subsets of f.
  For every non-empty subset s of f do
      output the rule s ⇒ (f - s) if support(f) / support(s) ≥ min_confidence
  end
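A small Python sketch of this rule-generation loop; it assumes the frequent itemsets and their supports have already been computed (the dict-based interface below is an assumption of the sketch, not from the slides). Because every subset of a frequent itemset is frequent, every support lookup succeeds.

from itertools import combinations

def generate_rules(supports, min_conf):
    # supports: dict mapping each frequent itemset (frozenset) to its relative support.
    # Emits (lhs, rhs, support, confidence) for every strong rule lhs => rhs.
    rules = []
    for f in supports:
        if len(f) < 2:
            continue
        for r in range(1, len(f)):                      # all non-empty proper subsets of f
            for lhs in map(frozenset, combinations(f, r)):
                conf = supports[f] / supports[lhs]      # support(f) / support(lhs)
                if conf >= min_conf:
                    rules.append((set(lhs), set(f - lhs), supports[f], conf))
    return rules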

Interestingness of Association Rules: Motivating Example (Aggarwal & Yu, PODS 98)
Among 5000 students: 3000 play basketball, 3750 eat cereal, and 2000 both play basketball and eat cereal.
The rule play basketball ⇒ eat cereal [40%, 66.7%] is misleading, because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
The rule play basketball ⇒ not eat cereal [20%, 33.3%] is far more accurate, although it has lower support and confidence.
Contingency table:
             basketball   not basketball   sum (row)
cereal          2000          1750           3750
not cereal      1000           250           1250
sum (col.)      3000          2000           5000

Interestingness of Association Rules
Delete misleading association rules.
Condition for a rule A ⇒ B: P(A ∪ B) / P(A) > P(B) + d, for a suitable threshold d > 0.
Measure for the interestingness of a rule: P(A ∪ B) / P(A) - P(B). The larger the value, the more interesting the relation between A and B expressed by the rule.
Other measures: correlation between A and B.
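Plugging the basketball/cereal numbers into this measure shows why the first rule is misleading (a small illustrative calculation, not from the slides):

n = 5000.0
p_basketball = 3000 / n          # P(A)
p_cereal = 3750 / n              # P(B)
p_both = 2000 / n                # P(A and B)

confidence = p_both / p_basketball            # 0.667, looks convincing on its own
interest = p_both / p_basketball - p_cereal   # -0.083: below zero, so the rule is misleading

p_not_cereal = 1250 / n
p_bb_not_cereal = 1000 / n
interest2 = p_bb_not_cereal / p_basketball - p_not_cereal  # +0.083: the negated rule is the interesting one
print(confidence, interest, interest2)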

Overview: Motivation, KDD and Data Mining, Association Rules, Clustering, Classification

What is Clustering in Data Mining?
Clustering is the process of partitioning a set of data (or objects) into a set of meaningful sub-classes, called clusters. It helps users understand the natural grouping or structure in a data set.
A cluster is a collection of data objects that are similar to one another and thus can be treated collectively as one group.
Clustering is unsupervised classification: there are no predefined classes.

What Is Good Clustering?
A good clustering method will produce high-quality clusters in which the intra-class similarity (within a cluster) is high and the inter-class similarity (between clusters) is low.
The quality of a clustering result also depends on the similarity measure used by the method, and on the definition and representation of a cluster: different clustering algorithms may have different underlying notions of clusters.

Major Clustering Techniques
- Partitioning algorithms: construct various partitions and then evaluate them by some criterion.
- Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion; there is an agglomerative approach and a divisive approach.
- Density-based: based on connectivity and density functions.
- Grid-based: based on a multiple-level granularity structure.
- Model-based: a model is hypothesized for each of the clusters, and the idea is to find the best fit of the data to those models.

Partitioning Algorithms: Basic Concept
Partitioning method: given a number k, partition a database D of n objects into a set of k clusters so that a chosen objective function is minimized (e.g., the sum of distances to the centers of the clusters).
Global optimum: exhaustively enumerating all partitions is too expensive! Instead, heuristic methods based on iterative refinement of an initial partition are used.
[Figure: a bad clustering vs. an optimal clustering of the same points, with cluster centroids marked by x.]

The K-Means Clustering Method
Given k, the k-means algorithm is implemented in four steps:
1. Partition the objects into k non-empty subsets.
2. Compute seed points as the centroids of the clusters of the current partition (for k-means, the centroid of a cluster is the mean point of all points in the cluster).
3. Assign each object to the cluster with the nearest seed point.
4. Go back to step 2; stop when there are no more new assignments.
[Figure: four snapshots of k-means iterations on a 2-D point set.]
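A minimal Python sketch of these four steps for two-dimensional points (illustrative only; it picks the initial seeds at random and uses Euclidean distance):

import math
import random

def kmeans(points, k, max_iter=100):
    # points: list of (x, y) tuples; returns (centroids, clusters).
    centroids = random.sample(points, k)                    # step 1/2: initial seed points
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                                     # step 3: assign to the nearest seed point
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [tuple(sum(xs) / len(xs) for xs in zip(*c)) if c else centroids[i]
                         for i, c in enumerate(clusters)]    # step 2: recompute cluster means
        if new_centroids == centroids:                       # step 4: stop when nothing changes
            break
        centroids = new_centroids
    return centroids, clusters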

Comments on the K-Means Method
Strength: relatively efficient, O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations; normally k, t << n.
Weaknesses:
- Applicable only when a mean is defined; what about categorical data?
- Need to specify k, the number of clusters, in advance.
- Not suitable for discovering clusters with non-convex shapes.
- Often terminates at a local optimum.

Hierarchical Clustering
Hierarchical decomposition of the data set (with respect to a given similarity measure) into a set of nested clusters.
The result is represented by a so-called dendrogram; nodes in the dendrogram represent possible clusters.
The dendrogram can be constructed bottom-up (agglomerative approach) or top-down (divisive approach).
[Figure: objects a, b, c, d, e merged step by step in the agglomerative direction, and split step by step in the divisive direction.]

Hierarchical Clustering: Example
Interpretation of the dendrogram:
- The root represents the whole data set.
- A leaf represents a single object in the data set.
- An internal node represents the union of all objects in its sub-tree.
- The height of an internal node represents the distance/similarity between its two child nodes.
[Figure: a 2-D data set of nine points and the corresponding dendrogram, with the vertical axis showing the distance between groups.]

Agglomerative Hierarchical Clustering
Single-link method and variants:
- Start by placing each object in its own cluster.
- Keep merging the closest (most similar) pairs of clusters into larger clusters until all objects are in a single cluster.
Most hierarchical methods belong to this category; they differ mainly in their definition of between-cluster similarity.
Single-link: the similarity between two clusters is defined as the similarity between their closest (i.e., most similar) pair of objects.
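A naive Python sketch of single-link agglomerative clustering (cubic in the number of objects, so purely illustrative):

import math

def single_link(points, num_clusters):
    # points: list of (x, y) tuples; merge until only num_clusters clusters remain.
    clusters = [[p] for p in points]                 # start: each object in its own cluster
    while len(clusters) > num_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single-link distance: distance of the closest pair of objects.
                d = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters[j])              # merge the closest pair of clusters
        del clusters[j]
    return clusters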

Overview: Motivation, KDD and Data Mining, Association Rules, Clustering, Classification

What is Classification?
The goal of data classification is to organize and categorize data into distinct classes.
A model is first created based on the data distribution; the model is then used to classify new data. Given the model, a class can be predicted for new data.
[Figure: new objects being assigned to one of n predefined classes.]

Classification is a three-step process
1. Model construction (learning): each tuple is assumed to belong to a predefined class, as determined by one of the attributes, called the class label. The set of all tuples used for construction of the model is called the training set.
Training set (excerpt):
Outlook    Temperature   Humidity   Windy   Class
sunny      hot           high       false   N
sunny      hot           high       true    N
overcast   hot           high       false   P
The model can be represented in the following forms: classification rules (IF-THEN statements), decision trees, or mathematical formulae.

1. Classification Process (Learning)
The training data is fed to a classification algorithm, which produces the classifier (model).
Training data:
Name      Income   Age        Credit rating
Bruce     Low      <30        bad
Dave      Medium   [30..40]   good
William   High     <30        good
Marie     Medium   >40        good
Anne      Low      [30..40]   good
Chris     Medium   <30        bad
Resulting classifier (model): IF Income = High OR Age > 30 THEN CreditRating = Good

Classification is a three-step process
2. Model evaluation (accuracy): estimate the accuracy rate of the model based on a test set.
- The known label of each test sample is compared with the classified result from the model.
- The accuracy rate is the percentage of test set samples that are correctly classified by the model.
- The test set is independent of the training set, otherwise over-fitting will occur.
Test set (excerpt):
Outlook    Temperature   Humidity   Windy   Class
rainy      hot           low        false   P
sunny      hot           high       true    N
overcast   hot           high       false   P
For each test tuple, the class predicted by the model is compared with the known class label.

2. Classification Process (Accuracy Evaluation)
The classifier (model) IF Income = High OR Age > 30 THEN CreditRating = Good is applied to the testing data. How accurate is the model?
Testing data:
Name   Income   Age        Credit rating
Tom    Medium   <30        bad
Jane   High     <30        bad
Wei    High     >40        good
Hua    Medium   [30..40]   good
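The accuracy computation can be written out directly. A small sketch applying the rule-based model to the testing data above (the predict helper and the string encoding of the age groups are assumptions of this illustration):

def predict(income, age):
    # The model from the learning step: IF Income = High OR Age > 30 THEN CreditRating = Good
    return "good" if income == "High" or age in ("[30..40]", ">40") else "bad"

test_set = [                       # (name, income, age, known credit rating)
    ("Tom",  "Medium", "<30",      "bad"),
    ("Jane", "High",   "<30",      "bad"),
    ("Wei",  "High",   ">40",      "good"),
    ("Hua",  "Medium", "[30..40]", "good"),
]

correct = sum(1 for _, income, age, label in test_set if predict(income, age) == label)
print(f"accuracy = {correct / len(test_set):.0%}")   # Jane is misclassified, so 3/4 = 75%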

Classification is a three-step process
3. Model use (classification): the model is used to classify new objects whose class label is not known. Using the attributes of the new object and the model, a class label is assigned to the new object.
New data (class to be predicted):
Outlook    Temperature   Humidity   Windy   Class
rainy      hot           low        false   ?
rainy      mild          high       false   ?
overcast   hot           high       true    ?

3. Classification Process (Classification)
The classifier (model) is applied to new data to predict the unknown class label.
New data:
Name   Income   Age        Credit rating
Paul   High     [30..40]   ?

Classification Methods
Decision tree induction, neural networks, Bayesian classification, association-based classification, k-nearest neighbour, genetic algorithms, etc.

What is a Decision Tree?
A decision tree is a flow-chart-like tree structure:
- An internal node denotes a test on an attribute.
- A branch represents an outcome of the test; all tuples in a branch have the same value for the tested attribute.
- A leaf node represents a class label or a class label distribution.
[Figure: a tree whose internal nodes test attributes (Atr = ?) and whose leaves carry class labels (CL).]

Training Dataset: An Example from Quinlan's ID3
Outlook    Temperature   Humidity   Windy   Class
sunny      hot           high       false   N
sunny      hot           high       true    N
overcast   hot           high       false   P
rain       mild          high       false   P
rain       cool          normal     false   P
rain       cool          normal     true    N
overcast   cool          normal     true    P
sunny      mild          high       false   N
sunny      cool          normal     false   P
rain       mild          normal     false   P
sunny      mild          normal     true    P
overcast   mild          high       true    P
overcast   hot           normal     false   P
rain       mild          high       true    N

A Sample Decision Tree (for the training data on the previous slide)
Outlook?
- sunny → Humidity?
    - high → N
    - normal → P
- overcast → P
- rain → Windy?
    - true → N
    - false → P
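The sample tree can be encoded and applied programmatically; a small Python sketch, where the nested-dict representation is an assumption of this illustration rather than anything prescribed by the slides:

# Internal nodes map an attribute name to its branches; leaves are class labels.
tree = {"Outlook": {
    "sunny":    {"Humidity": {"high": "N", "normal": "P"}},
    "overcast": "P",
    "rain":     {"Windy": {"true": "N", "false": "P"}},
}}

def classify(node, sample):
    # Trace a path from the root to a leaf using the sample's attribute values.
    while isinstance(node, dict):
        attribute, branches = next(iter(node.items()))
        node = branches[sample[attribute]]
    return node

print(classify(tree, {"Outlook": "sunny", "Humidity": "high", "Windy": "false"}))  # N
print(classify(tree, {"Outlook": "rain", "Humidity": "high", "Windy": "false"}))   # P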

Decision-Tree Classification Methods
The basic top-down decision tree generation approach usually consists of two phases:
1. Tree construction: at the start, all the training examples are at the root; examples are then partitioned recursively, based on selected attributes.
2. Tree pruning: aims at removing tree branches that may reflect noise in the training data and lead to errors when classifying test data, in order to improve classification accuracy.

Decision Tree Construction
Recursive process: the tree starts as a single node representing all the data.
The recursion stops when:
a) all samples in the node belong to the same class;
b) there are no remaining attributes on which to split (the node then becomes a leaf labeled with the majority class label); or
c) there are no samples left for an attribute value.
Otherwise, select a suitable attribute and partition the data into subsets according to the attribute values of the selected attribute. For each of these subsets, create a new child node under the current parent node and recursively apply the method to the new child nodes.

Partitioning the Data at a Given Node
Split criterion: use a goodness/impurity function to determine the attribute that results in the purest subsets with respect to the class label. Different goodness functions exist: information gain, Gini index, etc.
Branching scheme: binary splitting (numerical attributes, Gini index) versus multiway splitting (categorical attributes, information gain).
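For the information-gain criterion, the computation is short enough to sketch directly: entropy-based gain for a categorical attribute, where representing each row as a dict with a "Class" key is an assumption of this sketch.

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy of a list of class labels.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain(rows, attribute, class_key="Class"):
    # Information gain of splitting `rows` (dicts of attribute -> value) on `attribute`:
    # entropy before the split minus the weighted entropy of the resulting subsets.
    labels = [r[class_key] for r in rows]
    subsets = {}
    for r in rows:
        subsets.setdefault(r[attribute], []).append(r[class_key])
    remainder = sum(len(s) / len(rows) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder

On the weather training set shown earlier, Outlook gives the highest gain of the four attributes, which is why it ends up at the root of the sample tree.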

Example for Algorithm (ID3)
All attributes are categorical.
  Create a node N;
  if the samples are all of the same class C, then return N as a leaf node labeled with C;
  if the attribute list is empty, then return N as a leaf node labeled with the most common class;
  select the split-attribute with the highest information gain and label N with it;
  for each value A_i of the split-attribute, grow a branch from node N:
      let S_i be the set of samples in which all tuples have the value A_i for the split-attribute;
      if S_i is empty, then attach a leaf labeled with the most common class;
      else recursively run the algorithm on S_i;
  until all branches reach leaf nodes.
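A compact Python sketch of this recursion, reusing the gain helper and the nested-dict tree representation from the earlier sketches (both are assumptions of these illustrations, not the textbook's code). Run on the 14-example weather data, it reproduces the sample decision tree shown before. Because it only branches on attribute values actually present in the current rows, the "S_i is empty" case does not arise in this simplified version.

from collections import Counter

def id3(rows, attributes, class_key="Class"):
    # rows: list of dicts mapping attribute name -> value, plus the class label under class_key.
    labels = [r[class_key] for r in rows]
    if len(set(labels)) == 1:                      # all samples in the same class -> leaf
        return labels[0]
    if not attributes:                             # no attributes left -> majority-class leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(rows, a, class_key))   # highest information gain
    node = {best: {}}
    for value in {r[best] for r in rows}:
        subset = [r for r in rows if r[best] == value]
        node[best][value] = id3(subset, [a for a in attributes if a != best], class_key)
    return node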

How to Use a Tree?
Directly: test the attribute values of an unknown sample against the tree; a path is traced from the root to a leaf, which holds the class label.
Indirectly: the decision tree is converted to classification rules; one rule is created for each path from the root to a leaf. IF-THEN rules are easier for humans to understand.
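The indirect use, converting the tree to rules, is also easy to sketch with the nested-dict representation from the earlier examples (again an assumption of this illustration):

def tree_to_rules(node, conditions=()):
    # Yield one (conditions, class label) pair per root-to-leaf path.
    if not isinstance(node, dict):                 # leaf: emit the accumulated path
        yield conditions, node
        return
    attribute, branches = next(iter(node.items()))
    for value, subtree in branches.items():
        yield from tree_to_rules(subtree, conditions + ((attribute, value),))

for conds, label in tree_to_rules(tree):           # `tree` as defined in the sample-tree sketch
    print("IF " + " AND ".join(f"{a} = {v}" for a, v in conds) + f" THEN Class = {label}")
# e.g. IF Outlook = sunny AND Humidity = high THEN Class = N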