Data Mining Concepts & Techniques
Lecture No. 03: Data Processing, Data Mining
Naeem Ahmed
Email: naeemmahoto@gmail.com
Department of Software Engineering, Mehran University of Engineering and Technology, Jamshoro

Data Mining: Classification
Outline: Basic concepts; Decision trees; Hunt's Algorithm
Acknowledgements: Introduction to Data Mining, Tan, Steinbach, Kumar

Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set to validate it.

Classification: Definition
Classification is the task of learning a target function f that maps each attribute set x to one of the predefined class labels y. The target function is formally known as the classification model, and it serves two purposes:
- Descriptive modeling: an explanatory tool to distinguish between objects of different classes, e.g., the class Mammal for humans and whales.
- Predictive modeling: predicts the class label of unknown records, e.g., predicting the likely remarks for a student based on known patterns of other students.

Classification
Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small     70K     No
4    Yes      Medium   120K     No
5    No       Large     95K     Yes
6    No       Medium    60K     No
7    Yes      Large    220K     No
8    No       Small     85K     Yes
9    No       Medium    75K     No
10   No       Small     90K     Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small     55K     ?
12   Yes      Medium    80K     ?
13   Yes      Large    110K     ?
14   No       Small     95K     ?
15   No       Large     67K     ?

The learning algorithm performs induction on the training set (Learn Model) to produce a model, which is then applied by deduction (Apply Model) to the test set.
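
A minimal sketch of this learn-then-apply loop, assuming scikit-learn (the lecture does not name a library) and a hand-made integer encoding of the categorical attributes:

```python
from sklearn.tree import DecisionTreeClassifier

# Training set from the slide, hand-encoded:
# Attrib1: Yes=1, No=0; Attrib2: Small=0, Medium=1, Large=2; Attrib3 in thousands
X_train = [[1, 2, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
           [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y_train = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

model = DecisionTreeClassifier().fit(X_train, y_train)   # induction: Learn Model

# Test set (Tid 11-15), same encoding; deduction: Apply Model to fill in the "?"
X_test = [[0, 0, 55], [1, 1, 80], [1, 2, 110], [0, 0, 95], [0, 2, 67]]
print(model.predict(X_test))
```

Induction fits the model to the labeled records; deduction then predicts the "?" labels of the test set. Any classifier with a fit/predict interface slots into the same pattern.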

Evaluation Metrics
- Precision (exactness): what percentage of the records that the classifier labeled as positive are actually positive?
- Recall (completeness): what percentage of the positive records did the classifier label as positive? A perfect score is 1.0.
- F-measure: the harmonic mean of precision and recall. The higher the F-measure, the better the classifier.

Evaluation Metrics: Confusion Matrix
- TP: if an instance is positive and it is classified as positive, it is counted as a true positive.
- FN: if an instance is positive and it is classified as negative, it is counted as a false negative.
- TN: if an instance is negative and it is classified as negative, it is counted as a true negative.
- FP: if an instance is negative and it is classified as positive, it is counted as a false positive.
P is the total number of positive instances and N the total number of negative instances.
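
A worked illustration of these metrics as a small sketch; the counts below are hypothetical, not from the lecture:

```python
def classification_metrics(tp, fp, fn):
    """Precision, recall, and F-measure from confusion-matrix counts.
    (TN never enters these three metrics.)"""
    precision = tp / (tp + fp)                                  # exactness
    recall = tp / (tp + fn)                                     # completeness
    f_measure = 2 * precision * recall / (precision + recall)   # harmonic mean
    return precision, recall, f_measure

# Hypothetical counts, for illustration only
print(classification_metrics(tp=70, fp=10, fn=30))   # (0.875, 0.7, 0.777...)
```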

Classification: Examples
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
- Decision tree based methods
- Rule-based methods
- Memory-based reasoning
- Neural networks
- Naïve Bayes and Bayesian belief networks
- Support vector machines

Decision Tree
- An inductive learning task: uses particular facts to make more generalized conclusions
- A predictive model based on a branching series of Boolean tests
- Each of these smaller Boolean tests is less complex than a single one-stage classifier

Decision Tree
Training Data:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single           70K            No
4    Yes     Married         120K            No
5    No      Divorced         95K            Yes
6    No      Married          60K            No
7    Yes     Divorced        220K            No
8    No      Single           85K            Yes
9    No      Married          75K            No
10   No      Single           90K            Yes

Model (decision tree; the splitting attributes are Refund, MarSt, and TaxInc):
- Refund = Yes -> NO
- Refund = No -> test MarSt:
  - MarSt = Married -> NO
  - MarSt = Single or Divorced -> test TaxInc:
    - TaxInc < 80K -> NO
    - TaxInc >= 80K -> YES

Decision Tree: Applying the Model to Test Data
Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
Start from the root of the tree:
1. Refund? The record has Refund = No, so follow the No branch to MarSt.
2. MarSt? The record has Marital Status = Married, so follow the Married branch to a leaf.
3. The leaf predicts NO, so assign Cheat = No to the record.
(The TaxInc test is never reached for this record.)
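
The traversal is mechanical enough to hard-code. A minimal sketch of the learned tree as a plain function (branch structure taken from the model slide above; the >= 80K upper branch follows the Hunt's Algorithm slide):

```python
def classify(record):
    """The decision tree from the model slide, as a plain function.
    Taxable Income is in thousands (K), matching the slides."""
    if record["Refund"] == "Yes":
        return "No"
    if record["Marital Status"] == "Married":
        return "No"
    # Single or Divorced: test Taxable Income against the 80K cut
    return "Yes" if record["Taxable Income"] >= 80 else "No"

# Test record from the walkthrough: Refund=No, Married, 80K
print(classify({"Refund": "No", "Marital Status": "Married",
                "Taxable Income": 80}))        # -> No
```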

Decision Tree Algorithms
The basic idea behind any decision tree algorithm is as follows:
- Choose the best attribute(s) to split the remaining instances and make that attribute a decision node
- Repeat this process recursively for each child
- Stop when:
  - all the instances have the same target attribute value,
  - there are no more attributes, or
  - there are no more instances

Decision Tree Algorithms
Many algorithms:
- Hunt's Algorithm (one of the earliest)
- CART
- ID3, C4.5
- SLIQ, SPRINT

Hunt's Algorithm
Let D_t be the set of training records that reach a node t.
General procedure:
- If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.
- If D_t is an empty set, then t is a leaf node labeled by the default class y_d.
- If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
(Illustrated on the 10-record Refund / Marital Status / Taxable Income / Cheat training data shown earlier; at the root, D_t is the full training set.)

Hunt's Algorithm: Example
Growing the tree on the training data above:
1. Start with a single leaf predicting the majority class, Don't Cheat.
2. Split on Refund: Refund = Yes -> Don't Cheat; Refund = No -> Don't Cheat (this subset is still impure).
3. Under Refund = No, split on Marital Status: Married -> Don't Cheat; Single or Divorced -> Cheat (still impure).
4. Under Single or Divorced, split on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.
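
A minimal recursive sketch of the general procedure, under simplifying assumptions: attributes are categorical and are consumed in a fixed order rather than chosen by an impurity measure (real algorithms pick the best split, as the following slides discuss):

```python
from collections import Counter

def hunts(records, attributes, default):
    """Grow a tree from records = [(attribute_dict, label), ...].
    attributes lists the attribute names still available for splitting."""
    if not records:                       # empty D_t: leaf labeled with default y_d
        return default
    labels = [y for _, y in records]
    if len(set(labels)) == 1:             # all records in D_t share class y_t: leaf
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                    # no tests left: majority-class leaf
        return majority
    attr, rest = attributes[0], attributes[1:]    # naive attribute choice
    branches = {}
    for value in {x[attr] for x, _ in records}:   # one branch per observed value
        subset = [(x, y) for x, y in records if x[attr] == value]
        branches[value] = hunts(subset, rest, majority)
    return (attr, branches)               # internal node: (test attribute, branches)
```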

Tree Induction
Greedy strategy: split the records based on the attribute test that optimizes a certain criterion.
Issues:
- Determine how to split the records: how to specify the attribute test condition, and how to determine the best split?
- Determine when to stop splitting

How to Specify the Test Condition?
- Depends on the attribute type: nominal, ordinal, or continuous
- Depends on the number of ways to split: 2-way split or multi-way split

Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values, e.g., CarType -> {Family}, {Sports}, {Luxury}
- Binary split: divides the values into two subsets and requires finding the optimal partitioning, e.g., CarType -> {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}

Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as there are distinct values, e.g., Size -> {Small}, {Medium}, {Large}
- Binary split: divides the values into two subsets and requires finding the optimal partitioning, e.g., Size -> {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}
- What about the split {Small, Large} vs. {Medium}? It violates the order of the attribute values, so it is not a valid ordinal split (see the sketch below).
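
A tiny sketch of why the order matters: valid binary splits of an ordinal attribute are exactly the single-cut ("prefix") groupings, which the enumeration below produces, so {Small, Large} vs. {Medium} never appears:

```python
values = ["Small", "Medium", "Large"]          # ordinal values, in order

# Valid binary splits keep the order intact: cut once between neighbors.
splits = [(values[:i], values[i:]) for i in range(1, len(values))]
print(splits)
# [(['Small'], ['Medium', 'Large']), (['Small', 'Medium'], ['Large'])]
```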

Splitting Based on Continuous Attributes
Different ways of handling:
- Discretization to form an ordinal categorical attribute: static (discretize once at the beginning) or dynamic (ranges found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering)
- Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut. This can be more compute intensive.

Splitting Based on Continuous Attributes
(i) Binary split: Taxable Income > 80K? Yes / No
(ii) Multi-way split: Taxable Income in < 10K, [10K, 25K), [25K, 50K), [50K, 80K), or > 80K
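
A sketch of the brute-force cut search on the Taxable Income attribute from the training table above, scored with the Gini index defined on the next slides (candidate cuts are taken at observed values for brevity; implementations typically use midpoints between sorted values):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts) if n else 0.0

# Taxable Income (in K) and the Cheat label from the training data above
income = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]
cheat  = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

def weighted_gini(v):
    """Weighted Gini of the binary split (A < v) vs. (A >= v)."""
    sides = [[c for x, c in zip(income, cheat) if (x < v) == side]
             for side in (True, False)]
    return sum(len(s) / len(income) * gini([s.count("Yes"), s.count("No")])
               for s in sides)

best = min(sorted(set(income)), key=weighted_gini)
print(best, round(weighted_gini(best), 3))    # -> 100 0.3
```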

How to Determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1. Three candidate test conditions:
- Own Car? (Yes / No): children with C0:6, C1:4 and C0:4, C1:6
- Car Type? (Family / Sports / Luxury): children with C0:1, C1:3; C0:8, C1:0; and C0:1, C1:7
- Student ID? (c1 ... c20): twenty children, each holding a single record (C0:1, C1:0 or C0:0, C1:1)
Which test condition is the best?

How to Determine the Best Split
Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity:
- C0:5, C1:5 -> non-homogeneous, high degree of impurity
- C0:9, C1:1 -> homogeneous, low degree of impurity

Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error

How to Find the Best Split
Before splitting, the parent node has class counts C0: N00 and C1: N01, with impurity M0. A candidate test A? splits the records into nodes N1 (C0: N10, C1: N11) and N2 (C0: N20, C1: N21), with impurities M1 and M2 and weighted impurity M12. A candidate test B? splits them into N3 (C0: N30, C1: N31) and N4 (C0: N40, C1: N41), with impurities M3 and M4 and weighted impurity M34.
Gain = M0 - M12 for A vs. M0 - M34 for B: choose the test with the larger gain.

Measure of Impurity: GINI
Gini index for a given node t:

    GINI(t) = 1 - Σ_j [p(j | t)]²

where p(j | t) is the relative frequency of class j at node t.
- Maximum (1 - 1/n_c, with n_c the number of classes) when records are equally distributed among all classes, implying the least interesting information
- Minimum (0.0) when all records belong to one class, implying the most interesting information

Examples: C1:0, C2:6 -> Gini = 0.000; C1:1, C2:5 -> Gini = 0.278; C1:2, C2:4 -> Gini = 0.444; C1:3, C2:3 -> Gini = 0.500

Examples of Computing GINI

    GINI(t) = 1 - Σ_j [p(j | t)]²

- C1:0, C2:6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - P(C1)² - P(C2)² = 1 - 0 - 1 = 0
- C1:1, C2:5: P(C1) = 1/6, P(C2) = 5/6; Gini = 1 - (1/6)² - (5/6)² = 0.278
- C1:2, C2:4: P(C1) = 2/6, P(C2) = 4/6; Gini = 1 - (2/6)² - (4/6)² = 0.444
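
The same computations as a small sketch over per-class counts (the counts are the class distributions from this slide):

```python
def gini(counts):
    """Gini index of a node from its per-class record counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))            # 0.0   : pure node, minimum impurity
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(gini([3, 3]))            # 0.5   : maximum for two classes (1 - 1/2)
```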

Splitting Based on GINI
Used in CART, SLIQ, and SPRINT. When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = Σ_{i=1..k} (n_i / n) · GINI(i)

where n_i = number of records at child i and n = number of records at node p.

Binary Attributes: Computing the GINI Index
Splits into two partitions. Effect of weighing partitions: larger and purer partitions are sought.
Parent (before split B?): C1 = 6, C2 = 6, Gini = 0.500
After the split, node N1 holds C1 = 5, C2 = 2 and node N2 holds C1 = 1, C2 = 4:
Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408
Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320
Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
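
A sketch reproducing the split-quality computation above (self-contained; gini is the same function as in the previous sketch):

```python
def gini(counts):
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini of a split; children = per-class counts of each partition."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Split B? from the slide: N1 holds (C1=5, C2=2), N2 holds (C1=1, C2=4)
print(round(gini_split([[5, 2], [1, 4]]), 3))   # 0.371 (parent Gini = 0.500)
```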

Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class
- Stop expanding a node when all the records have similar attribute values
- Early termination (to be discussed later)

Decision Tree Based Classification
Advantages:
- Inexpensive to construct
- Extremely fast at classifying unknown records
- Easy to interpret for small-sized trees
- Accuracy comparable to other classification techniques for many simple data sets
Practical issues:
- Underfitting and overfitting
- Missing values
- Cost of classification