Classification. Instructor: Wei Ding

Similar documents
Part I. Instructor: Wei Ding

CSE4334/5334 DATA MINING

Classification: Basic Concepts, Decision Trees, and Model Evaluation

CS Machine Learning

Example of DT Apply Model Example Learn Model Hunt s Alg. Measures of Node Impurity DT Examples and Characteristics. Classification.

Data Mining Concepts & Techniques

Machine Learning. Decision Trees. Le Song /15-781, Spring Lecture 6, September 6, 2012 Based on slides from Eric Xing, CMU

Extra readings beyond the lecture slides are important:

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Classification Salvatore Orlando

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation. Lecture Notes for Chapter 4. Introduction to Data Mining

Data Mining Classification - Part 1 -

DATA MINING LECTURE 11. Classification Basic Concepts Decision Trees Evaluation Nearest-Neighbor Classifier

Classification and Regression

DATA MINING LECTURE 9. Classification Basic Concepts Decision Trees Evaluation

Data Mining D E C I S I O N T R E E. Matteo Golfarelli

Lecture Notes for Chapter 4

Lecture 7: Decision Trees

DATA MINING LECTURE 9. Classification Decision Trees Evaluation

Given a collection of records (training set )

CSE 5243 INTRO. TO DATA MINING

Data Mining. 3.2 Decision Tree Classifier. Fall Instructor: Dr. Masoud Yaghini. Chapter 5: Decision Tree Classifier

Lecture outline. Decision-tree classification

Decision Trees Dr. G. Bharadwaja Kumar VIT Chennai

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Part I. Classification & Decision Trees. Classification. Classification. Week 4 Based in part on slides from textbook, slides of Susan Holmes

COMP 465: Data Mining Classification Basics

Data Warehousing & Data Mining

Data Mining: Concepts and Techniques Classification and Prediction Chapter 6.1-3

Supervised Learning. Decision trees Artificial neural nets K-nearest neighbor Support vectors Linear regression Logistic regression...

ARTIFICIAL INTELLIGENCE (CS 370D)

Classification: Basic Concepts, Decision Trees, and Model Evaluation

Decision trees. Decision trees are useful to a large degree because of their simplicity and interpretability

Nesnelerin İnternetinde Veri Analizi

Classification and Regression Trees

CS229 Lecture notes. Raphael John Lamarre Townshend

PUBLIC: A Decision Tree Classifier that Integrates Building and Pruning

ISSUES IN DECISION TREE LEARNING

Data Preprocessing. Slides by: Shree Jaswal

Nominal Data. May not have a numerical representation Distance measures might not make sense. PR and ANN

Decision tree learning

Decision Tree CE-717 : Machine Learning Sharif University of Technology

DATA MINING LECTURE 10B. Classification k-nearest neighbor classifier Naïve Bayes Logistic Regression Support Vector Machines

CS 229 Midterm Review

Classification: Basic Concepts and Techniques

Business Club. Decision Trees

Algorithms: Decision Trees

Data Mining. 3.5 Lazy Learners (Instance-Based Learners) Fall Instructor: Dr. Masoud Yaghini. Lazy Learners

Classification and Regression Trees

Classification: Decision Trees

DATA ANALYSIS I. Types of Attributes Sparse, Incomplete, Inaccurate Data

Lars Schmidt-Thieme, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany

Credit card Fraud Detection using Predictive Modeling: a Review

9 Classification: KNN and SVM

Nominal Data. May not have a numerical representation Distance measures might not make sense PR, ANN, & ML

Implementierungstechniken für Hauptspeicherdatenbanksysteme Classification: Decision Trees

Machine Learning. A. Supervised Learning A.7. Decision Trees. Lars Schmidt-Thieme

Lazy Decision Trees Ronny Kohavi

Classification with Decision Tree Induction

Logical Rhythm - Class 3. August 27, 2018

Data Mining Practical Machine Learning Tools and Techniques

Machine Learning in Real World: C4.5

Dynamic Load Balancing of Unstructured Computations in Decision Tree Classifiers

Classification. Instructor: Wei Ding

7. Decision or classification trees

Decision Tree Learning

Data Mining: Data. What is Data? Lecture Notes for Chapter 2. Introduction to Data Mining. Properties of Attribute Values. Types of Attributes

Data Mining: Data. Lecture Notes for Chapter 2. Introduction to Data Mining

Nearest neighbor classification DSE 220

List of Exercises: Data Mining 1 December 12th, 2015

8. Tree-based approaches

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

Data Mining in Bioinformatics Day 1: Classification

Chapter ML:III. III. Decision Trees. Decision Trees Basics Impurity Functions Decision Tree Algorithms Decision Tree Pruning

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Notes based on: Data Mining for Business Intelligence

Data Mining and Knowledge Discovery: Practice Notes

Network Traffic Measurements and Analysis

Introduction to Machine Learning

Exam Advanced Data Mining Date: Time:

Machine Learning. Decision Trees. Manfred Huber

Cse352 Artifficial Intelligence Short Review for Midterm. Professor Anita Wasilewska Computer Science Department Stony Brook University

International Journal of Software and Web Sciences (IJSWS)

9/6/14. Our first learning algorithm. Comp 135 Introduction to Machine Learning and Data Mining. knn Algorithm. knn Algorithm (simple form)

Classification/Regression Trees and Random Forests

KDD E A MINERAÇÃO DE DADOS. Daniela Barreiro Claro

Data Mining and Knowledge Discovery: Practice Notes

Computer Vision Group Prof. Daniel Cremers. 8. Boosting and Bagging

Data Mining. 3.3 Rule-Based Classification. Fall Instructor: Dr. Masoud Yaghini. Rule-Based Classification

Statistics 202: Statistical Aspects of Data Mining

2. Data Preprocessing

Improved Post Pruning of Decision Trees

10 Classification: Evaluation

Data Mining and Knowledge Discovery Practice notes: Numeric Prediction, Association Rules

Data Mining Part 5. Prediction

Data Mining. Decision Tree. Hamid Beigy. Sharif University of Technology. Fall 1396

Discriminate Analysis

Data Mining Lecture 8: Decision Trees

.. Cal Poly CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. for each element of the dataset we are given its class label.

Outline. RainForest A Framework for Fast Decision Tree Construction of Large Datasets. Introduction. Introduction. Introduction (cont d)

Transcription:

Classification Decision Tree Instructor: Wei Ding Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 Preliminaries Each data record is characterized by a tuple (x, y), where x is the attribute t set and y is a special attribute, designated as the class label. Input Attribute set x Classification Model Output Class label y Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 2

10 10 Predictive Modeling A classification model can be used to predict the class label l of unknown records. Most suited for predicting or describing data sets with binary or nominal categories. Less effective ect e for ordinal categories es because they do not consider the implicit order among the categories. es For example, e, to classify a person as a member of high-, medium-, or low-income group. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 3 Learning Algorithm Key objective of the learning algorithm: to build models with good generalization capability; i.e., models that accurately predict the class labels of previously unknown records. Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K No 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes Learn Model Tid Attrib1 Attrib2 Attrib3 Class 11 No Small 55K? 12 Yes Medium 80K? 13 Yes Large 110K? 14 No Small 95K? 15 No Large 67K? Apply Model Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 4

Evaluation of a Classification Model Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 5 Confusion Matrix 9=7+2 5=2+3 Accuracy: 9/14 Error rate: 5/14 Confusion matrix: a table depicts the counts of test records correctly and incorrectly predicated by the model. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 6

10 10 Decision Tree Classification Task Tid Attrib1 Attrib2 Attrib3 Class 1 Yes Large 125K No 2 No Medium 100K No 3 No Small 70K No 4 Yes Medium 120K No 5 No Large 95K Yes 6 No Medium 60K No 7 Yes Large 220K No 8 No Small 85K Yes 9 No Medium 75K No 10 No Small 90K Yes Learn Model Tid Attrib1 Attrib2 Attrib3 Class 11 No Small 55K? 12 Yes Medium 80K? 13 Yes Large 110K? 14 No Small 95K? 15 No Large 67K? Apply Model Decision Tree Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 7 How to Build a Decision Tree There are exponentially many decision trees that can be constructed from a given set of attributes. Finding the optimal tree is computationally infeasible because of the exponential size of the search space. Goal: to find efficient algorithm to induce a reasonable accurate, albeit suboptimal, decision tree in a reasonable amount of time. Usually employ a greedy strategy that grows a decision tree by making a serious of locally optimum decisions about which attribute to use for partitioning the data. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 8

10 Decision Tree Induction Many Algorithms: Hunt s Algorithm (one of the earliest) CART ID3, C4.5 SLIQ,SPRINTSPRINT Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 9 General Structure of Hunt s Algorithm Let D t be the set of training records that reach a node t General Procedure: If D t contains records that belong the same class y t, then t is a leaf node labeled as y t If D t is an empty set, then t is a leaf node labeled l by the same class label as the majority class of training records associated with its parent node. If D t contains records that belong to more than one class, use an attribute t test t to split the data into smaller subsets. Recursively apply the procedure to each subset. Tid Refund Marital Status Taxable Income 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes? D t Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 10

10 Hunt s Algorithm Tid Refund Marital Status Taxable Income Don t Yes Don t Refund Single, Divorced Refund Yes Don t No Marital Status Married Don t No Don t Yes Don t Refund Single, Divorced Taxable Income No Marital Status < 80K >= 80K Married Don t 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes Don t Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 11 Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 12

Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 13 How to Specify Test Condition? Depends on attribute types Nominal Ordinal Continuous Depends on number of ways to split 2-way split Multi-way split Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 14

Splitting Based on Nominal Attributes Multi-way split: Use as many partitions as distinct values. CarType Family Luxury Sports Binary split: Divides values into two subsets. Need to find optimal partitioning. {Sports, Luxury} CarType {Family} OR {Family, Luxury} CarType {Sports} Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 15 Splitting Based on Ordinal Attributes Multi-way split: Use as many partitions as distinct values. Size Small Medium Large Binary split: Divides id values into two subsets. Need to find optimal partitioning. {Small, Medium} Size {Large} OR {Medium, Large} Size {Small} {Small, What about this split? { Large} Size {Medium} Cannot violate the order property Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 16

Splitting Based on Continuous Attributes Different ways of handling Discretization to form an ordinal categorical attribute Static discretize once at the beginning Dynamic ranges can be found by equal interval bucketing, equal frequency bucketing (percentiles), or clustering. Binary Decision: (A < v) or (A v) consider all possible splits and finds the best cut Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 17 Splitting Based on Continuous Attributes Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 18

Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 19 How to determine the Best Split Before Splitting: 10 records of class 0, 10 records of class 1 Which test condition is the best? Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 20

How to determine the Best Split Greedy approach: Nodes with homogeneous class distribution are preferred Need a measure of node impurity: Non-homogeneous, High degree of impurity Homogeneous, Low degree of impurity Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 21 Measures of Node Impurity Gini Index Entropy Misclassification error Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 22

How to Find the Best Split Before Splitting: C0 N00 C1 N01 M0 A? B? Yes No Yes No Node N1 Node N2 Node N3 Node N4 C0 C1 N10 N11 C0 C1 N20 N21 C0 C1 N30 N31 C0 C1 N40 N41 M1 M2 M3 M4 M12 Gain = M0 M12 vs M0 M34 M34 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 23 Measure of Impurity: GINI Gini Index for a given node t : GINI( t) = 1 j [ p( j t)] (NOTE: p( j t) is the relative frequency of class j at node t). Maximum (1-1/n c ) when records are equally distributed among all classes, implying least interesting information Minimum (0.0) when all records belong to one class, implying most interesting information 2 C1 0 C2 6 Gini=0.000 C1 1 C2 5 Gini=0.278 C1 2 C2 4 Gini=0.444 C1 3 C2 3 Gini=0.500 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 24

Examples for computing GINI GINI( t) = 1 [ p( j t)] j 2 C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1 C2 6 Gini = 1 P(C1) 2 P(C2) 2 = 1 0 1 = 0 C1 1 P(C1) = 1/6 P(C2) = 5/6 C2 5 Gini = 1 (1/6) 2 (5/6) 2 = 0.278 C1 2 C2 4 P(C1) = 2/6 P(C2) = 4/6 Gini = 1 (2/6) 2 (4/6) 2 = 0.444 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 25 Splitting Based on GINI Used in CART, SLIQ, SPRINT. When a node p is split into k partitions (children), the quality of split is computed as, k ni GINI split = GINI ( i) n i= 1 where, n i = number of records at child i, n = number of records at node p. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 26

Binary Attributes: Computing GINI Index Splits into two partitions Effect of Weighing partitions: Larger and Purer Partitions are sought for. Gini(N1) = 1 (5/7) 2 (2/7) 2 = 0.408163 Gini(N2) = 1 (1/5) 2 (4/5) 2 = 032 0.32 B? Yes No Node N1 Node N2 N1 N2 C1 5 1 C2 2 4 Gini=0.333 Parent C1 6 C2 6 Gini = 0.500 Gini(Children) =7/12*0 0.408163 + 5/12 * 0.32 = 0.371428 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 27 Categorical Attributes: Computing Gini Index For each distinct value, gather counts for each class in the dataset Use the count matrix to make decisions Multi-way split Two-way split (find best partition of values) CarType Family Sports Luxury CarType {Sports, Luxury} {Family} C1 3 1 CarType {Sports} {Family, Luxury} C1 2 2 C1 1 2 1 C2 4 1 1 C2 2 4 C2 1 5 Gini 0393 0.393 Gini 0.400 Gini 0419 0.419 Note: the values of Gini index on this slide are incorrect! Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 28

10 Continuous Attributes: Computing Gini Index Use Binary Decisions based on one value Several Choices for the splitting value Number of possible splitting values = Number of distinct values Each splitting value has a count matrix associated with it Class counts in each of the partitions, A < v and A v Simple method to choose best v For each v, scan the database to gather count matrix and compute its Gini index Computationally Inefficient! Repetition of work. Tid Refund Marital Status Taxable Income 1 Yes Single 125K No 2 No Married 100K No 3 No Single 70K No 4 Yes Married 120K No 5 No Divorced 95K Yes 6 No Married 60K No 7 Yes Divorced 220K No 8 No Single 85K Yes 9 No Married 75K No 10 No Single 90K Yes Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 29 Continuous Attributes: Computing Gini Index... For efficient computation: for each attribute, Sort the attribute on values Linearly scan these values, each time updating the count matrix and computing gini index Choose the split position that has the least gini index No No No Yes Yes Yes No No No No Sorted Values Split Positions Taxable Income 60 70 75 85 90 95 100 120 125 220 55 65 72 80 87 92 97 110 122 172 230 <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > <= > Yes 0 3 0 3 0 3 0 3 1 2 2 1 3 0 3 0 3 0 3 0 3 0 No 0 7 1 6 2 5 3 4 3 4 3 4 3 4 4 3 5 2 6 1 7 0 Gini 0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420 Note: the values of Gini index on this slide are incorrect! Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 30

Alternative Splitting Criteria based on INFO Entropy at a given node t: Entropy ( t) = p( j t)log p( j t) j (NOTE: p( j t) is the relative frequency of class j at node t). Measures homogeneity of a node. Maximum (log n c ) when records are equally distributed among all classes implying least information Minimum (0.0) when all records belong to one class, implying most information Entropy based computations are similar to the GINI index computations Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 31 Examples for computing Entropy Entropy t) = p( j t)log p( j t) j ( j 2 C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1 C2 6 Entropy = 0 log 0 1 log 1 = 0 0 = 0 C1 1 P(C1) = 1/6 P(C2) = 5/6 C2 5 Entropy = (1/6) log 2 (1/6) (5/6) log 2 (1/6) = 0.65 C1 2 C2 4 P(C1) = 2/6 P(C2) = 4/6 Entropy = (2/6) log 2 (2/6) (4/6) log 2 (4/6) = 0.92 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 32

Splitting Based on INFO... Information Gain: k ni GAIN = Entropy( p) Entropy( i) split i= 1 n Parent Node, p is split into k partitions; n i is number of records in partition i Measures Reduction in Entropy achieved because of the split. Choose the split that achieves most reduction (maximizes GAIN) Used in ID3 and C4.5 Disadvantage: Tends to prefer splits that result in large number of partitions, each being small but pure. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 33 Splitting Based on INFO... Gain Ratio: GainRATIO GAIN Split k = split i= SplitINFO ni SplitINFO = log 1 n ni n Parent Node, p is split into k partitions n i is the number of records in partition i Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher entropy partitioning (large number of small partitions) is penalized! Used in C4.5 Designed to overcome the disadvantage of Information Gain Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 34

Splitting Criteria based on Classification Error Classification error at a node t : Error( t) = 1 max P( i t) i Measures misclassification error made by a node. Maximum (1-1/n c ) when records are equally distributed among all classes, implying py least interesting information Minimum (0.0) when all records belong to one class, implying most interesting information Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 35 Examples for Computing Error Error( t ) = 1 max P ( i t ) ( i C1 0 P(C1) = 0/6 = 0 P(C2) = 6/6 = 1 C2 6 Error = 1 max (0, 1) = 1 1 = 0 C1 1 P(C1) = 1/6 P(C2) = 5/6 C2 5 Error = 1 max (1/6, 5/6) = 1 5/6 = 1/6 C1 2 C2 4 P(C1) = 2/6 P(C2) = 4/6 Error = 1 max (2/6, 4/6) = 1 4/6 = 1/3 Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 36

Comparison among Splitting Criteria For a 2-class problem: Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 37 Misclassification Error vs Gini Yes A? Parent No C1 7 C2 3 Node N1 Node N2 Gini i = 0.42 Gini(N1) N1 N2 = 1 (3/3) 2 (0/3) 2 Gini(Children) C1 3 4 = 0 =3/10*0 0 C2 0 3 + 7/10 * 0.489 Gini(N2) Gini=0.361 = 0.342 =1 (4/7) 2 (3/7) 2 = 0.489 Gini improves!! Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 38

Tree Induction Greedy strategy. Split the records based on an attribute test that optimizes certain criterion. Issues Determine how to split the records How to specify the attribute test condition? How to determine the best split? Determine when to stop splitting Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 39 Stopping Criteria for Tree Induction Stop expanding a node when all the records belong to the same class Stop expanding a node when all the records have similar attribute values Early termination (to be discussed later) Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 40

Decision Tree Based Classification Advantages: Inexpensive to construct Extremely fast at classifying unknown records Easy to interpret for small-sized trees Accuracy is comparable to other classification techniques for many simple data sets Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 41 Example: C4.5 (Weka Version J48) Simple depth-first construction. Uses Information Gain Sorts Continuous Attributes at each node. Needs entire data to fit in memory. Unsuitable for Large Datasets. Needs out-of-core sorting. Use Weka J48 or you can download the software from: http://www.cse.unsw.edu.au/~quinlan/c4.5r8.tar.gz Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 42