Classification. Slide sources:

Classification
Slide sources:
Gideon Dror, Academic College of TA Yaffo
Nathan Ifill, Leicester, MA4102 Data Mining and Neural Networks
Andrew Moore, CMU: http://www.cs.cmu.edu/~awm/tutorials

Outline
Problem setting
Examples
Classification algorithms
Performance measures
Performance assessment
Generalization and overfitting
Dimensional reduction and feature selection

Problem setting
Input: a training set $(x_1, y_1), \ldots, (x_m, y_m) \in X \times Y$ sampled from some distribution $D$. A pair $(x_i, y_i)$ is called a training example.
$Y$ is a discrete set of class labels.
$X$ is normally $\mathbb{R}^n$, with $x_i = (x_{i1}, \ldots, x_{in})$; the $x_{ij}$ are called features.
We assume the existence of a function $f^*: X \to Y$ that maps data to correct labels.
Goal of classification: find $f^*$.
Often $f^*$ doesn't exist (insufficient information in $X$, or noise). We then want a best $f$, e.g., one minimizing the expected error $E_D[\mathbb{1}(f(x) \neq y)]$.
The $x_i$ are also called inputs, the $y_i$ outputs.

Learning to Classify
Learning of binary classification
Given: a set of $m$ examples $(x_i, y_i)$, $i = 1, 2, \ldots, m$, sampled from some distribution $D$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$.
Find: a function $f: \mathbb{R}^n \to \{-1, +1\}$ that classifies well examples $x$ sampled from $D$.
Comments:
$f$ is usually a statistical model whose parameters are learnt from the set of examples.
$y_i = +1$: positive examples; $y_i = -1$: negative examples.

Examples
Gene expression data
Face detection
Customer discovery
Spam detection
Many more.

Gene expression (GE) data
Separate malignant from healthy tissues based on the mRNA expression profile of the tissue.

Face detection
Discriminate human faces from non-faces.

Other examples
Customer discovery: predict whether a customer is likely to purchase a certain good according to a customer profile.
Spam detection: predict whether a mail message is spam or a legitimate message.
Fraud detection: verify whether a credit card transaction is fraudulent or not.

Classification problem
[Figure: examples plotted in the $(x_1, x_2)$ plane]

Classification algorithms
Fisher linear discriminant
KNN
Decision tree
Neural networks
SVM
Naïve Bayes
Adaboost
Many, many more.
Each one has its own properties with respect to bias, speed, accuracy, and transparency.

Fisher Linear Discriminant
Find the direction $w$ that maximizes interclass variability and minimizes intraclass variability.
No hyperparameters.
[Figure: two classes in the $(x_1, x_2)$ plane with the direction $w$]
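As a rough illustration of the idea, the sketch below uses the standard closed form for two classes, $w \propto S_W^{-1}(\mu^+ - \mu^-)$, where $S_W$ is the sum of the two within-class scatter matrices. The slide does not give this formula; the function names and the toy 2-D data are assumptions made for the example.

```python
def mean_vec(points):
    # Mean of a list of 2-D points.
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(2)]

def scatter(points, mu):
    # 2x2 within-class scatter matrix: sum over points of (x - mu)(x - mu)^T.
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - mu[0], p[1] - mu[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s

def fisher_direction(pos, neg):
    # Assumed closed form: w = S_W^{-1} (mu_pos - mu_neg), 2-D case only.
    mu_p, mu_n = mean_vec(pos), mean_vec(neg)
    sp, sn = scatter(pos, mu_p), scatter(neg, mu_n)
    sw = [[sp[a][b] + sn[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    diff = [mu_p[0] - mu_n[0], mu_p[1] - mu_n[1]]
    # Multiply diff by the explicit inverse of the 2x2 matrix sw.
    return [(sw[1][1] * diff[0] - sw[0][1] * diff[1]) / det,
            (-sw[1][0] * diff[0] + sw[0][0] * diff[1]) / det]

# Hypothetical toy data: two square-shaped clusters.
pos = [(2.0, 2.0), (3.0, 3.0), (3.0, 2.0), (2.0, 3.0)]
neg = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
w = fisher_direction(pos, neg)

# Projecting onto w separates the two classes along a single axis.
proj_pos = [w[0] * x1 + w[1] * x2 for x1, x2 in pos]
proj_neg = [w[0] * x1 + w[1] * x2 for x1, x2 in neg]
```

On this symmetric toy data the direction points along the diagonal joining the two class means, and every projected positive example lies above every projected negative one.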

KNN: K nearest neighbors
Find the k nearest neighbors of the test example, and infer its class using their known classes.
1. Compute distances $d(x, x_0)$ for all $x \in X$.
2. Keep the k nearest $x$.
3. Check the labels of the k nearest $x$.
4. The class of the new sample $x_0$ is the majority label of the k nearest $x$.
[Figure: example with k = 3, n = 2; axes $x_1$, $x_2$]
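The four steps above can be sketched in a few lines. This is a minimal version assuming Euclidean distance and a small in-memory training set; the function name and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, x0, k=3):
    # 1. Compute the distance from x0 to every training point (Euclidean, by assumption).
    dists = [(math.dist(x, x0), label) for x, label in zip(train_X, train_y)]
    # 2. Keep the k nearest points.
    nearest = sorted(dists, key=lambda pair: pair[0])[:k]
    # 3. Check their labels and 4. return the majority label.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical toy data with n = 2 features.
train_X = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
train_y = [-1, -1, -1, +1, +1, +1]

label_a = knn_predict(train_X, train_y, (0.5, 0.5), k=3)
label_b = knn_predict(train_X, train_y, (5.5, 5.5), k=3)
```

Every prediction scans the whole training set, which is why the computation becomes expensive when the training set is large.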

Example

KNN properties
Non-parametric: no model is assumed (or constructed).
Usually works very well when there is a natural distance between examples.
When the training set is large, the calculation is time-consuming.
A single hyper-parameter, k. The choice of k is important:
  Large k: stable estimates, but far elements may be used.
  Small k: unstable estimates; only close elements are used.
In general, a low k gives very irregular decision boundaries.

Disadvantages
Classes with more frequent examples dominate the predictions for unknown instances. Assigning weights to the neighbors helps to reduce this problem.
The algorithm can be computationally intensive, depending on the size of the training set.

Choosing k
Both low and high values of k have their advantages.
The best value of k depends on the data.
Cross-validation can be used to compare values of k.
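One way to compare values of k is leave-one-out cross-validation, sketched below under the assumption of Euclidean distance and a tiny hypothetical data set. The helper names are illustrative, not taken from the slides.

```python
import math
from collections import Counter

def knn_predict(train, x0, k):
    # train is a list of (x, y) pairs; majority vote among the k nearest points.
    nearest = sorted(train, key=lambda ex: math.dist(ex[0], x0))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]

def loo_error(data, k):
    # Leave-one-out: predict each example from all the others.
    errors = 0
    for i, (x, y) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        if knn_predict(rest, x, k) != y:
            errors += 1
    return errors / len(data)

# Hypothetical data: two well-separated clusters of four points each.
data = [((0.0, 0.0), -1), ((0.0, 1.0), -1), ((1.0, 0.0), -1), ((1.0, 1.0), -1),
        ((4.0, 4.0), +1), ((4.0, 5.0), +1), ((5.0, 4.0), +1), ((5.0, 5.0), +1)]

errors = {k: loo_error(data, k) for k in (1, 3, 7)}
best_k = min(errors, key=errors.get)
```

With k = 7 every held-out point is outvoted by the four points of the other cluster, so the error is maximal; this is the "large k may use far elements" effect in its most extreme form.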

Decision boundaries: low and high k
[Figure: two panels, "1 Nearest Neighbor Classifier" and "15 Nearest Neighbor Classifier"]

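Decision tree
Leaves represent classifications, and branches represent tests on features that lead to those classifications.
[Figure: a tree with the tests $x_1 > \alpha_1$ and $x_2 > \alpha_2$ and YES/NO branches, next to the corresponding partition of the $(x_1, x_2)$ plane at $\alpha_1$ and $\alpha_2$]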

Example
Given real-valued data, predict whether the miles per gallon (MPG) of car models is good or bad.

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          97            75          2265    18.2          77         asia
bad   6          199           90          2648    15            70         america
bad   4          121           110         2600    12.8          77         europe
bad   8          350           175         4100    13            73         america
bad   6          198           95          3102    16.5          74         america
bad   4          108           94          2379    16.5          73         asia
bad   4          113           95          2228    14            71         asia
bad   8          302           139         3570    12.8          78         america
:     :          :             :           :       :             :          :
good  4          120           79          2625    18.6          82         america
bad   8          455           225         4425    10            70         america
good  4          107           86          2464    15.5          76         europe
bad   5          131           103         2830    15.9          78         europe

Copyright Andrew W. Moore

Splitting data by a threshold
Suppose X is real valued. Define the information gain for predicting the outcome Y due to splitting X at value t:
$IG(Y \mid X{:}t) = H(Y) - H(Y \mid X{:}t)$
where
$H(Y \mid X{:}t) = H(Y \mid X < t)\, P(X < t) + H(Y \mid X \ge t)\, P(X \ge t)$
For categorical data use $P(X = \text{cat})$ and $P(X \ne \text{cat})$.
Then define
$IG^*(Y \mid X) = \max_t IG(Y \mid X{:}t)$
For each attribute, use $IG^*(Y \mid X)$ to assess its suitability as a split.

Computational Issues
You can compute $IG^*(Y \mid X)$ in time $R \log R + 2 R n_y$, where R is the number of records in the node under consideration and $n_y$ is the arity (number of distinct values) of Y.
How? Sort the records by increasing value of X. Then create a $2 \times n_y$ contingency table corresponding to the computation of $IG(Y \mid X{:}x_{\min})$. Then iterate through the records, testing each threshold between adjacent values of X and incrementally updating the contingency table as you go.
For a minor additional speedup, only test thresholds between adjacent records whose values of Y differ.
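The sort-then-sweep procedure can be sketched as follows. Two details are assumptions of this sketch rather than statements from the slides: candidate thresholds are taken as midpoints between adjacent distinct values of X, and entropy is measured in bits. The speedup of skipping thresholds between records with equal Y is left out for clarity.

```python
import math
from collections import Counter

def entropy(counts, total):
    # Shannon entropy in bits of a table of label counts.
    return -sum((c / total) * math.log2(c / total)
                for c in counts.values() if c > 0)

def best_threshold(xs, ys):
    """Return (IG*, t) for the best split X < t versus X >= t."""
    records = sorted(zip(xs, ys))            # the R log R step
    n = len(records)
    left = Counter()                         # label counts for X < t
    right = Counter(y for _, y in records)   # label counts for X >= t
    h_y = entropy(right, n)
    best_gain, best_t = 0.0, None
    for i in range(1, n):
        x_prev, y_prev = records[i - 1]
        # Incremental update: move one record from the right side to the left side.
        left[y_prev] += 1
        right[y_prev] -= 1
        x_cur = records[i][0]
        if x_cur == x_prev:
            continue                         # no threshold fits between equal values
        t = (x_prev + x_cur) / 2             # midpoint threshold (assumption)
        h_cond = (i / n) * entropy(left, i) + ((n - i) / n) * entropy(right, n - i)
        gain = h_y - h_cond
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_gain, best_t

# Hypothetical data: the label flips cleanly between X = 3 and X = 10.
xs = [1, 2, 3, 10, 11, 12]
ys = ["bad", "bad", "bad", "good", "good", "good"]
gain, t = best_threshold(xs, ys)
```

Each step of the sweep touches only one cell pair of the contingency table, so the loop costs on the order of $R n_y$ after the initial sort.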

Example with MPG [figure]

Unpruned tree using reals [figure]

Pruned tree using reals [figure]

Basic Decision Tree Building
BuildTree(DataSet, Output)
  If all output values are the same in DataSet, return a leaf node that says "predict this unique output".
  If all input values are the same, return a leaf node that says "predict the majority output".
  Else find the attribute X with the highest Info Gain.
    Numerical attribute:
      Compute the value t corresponding to IG*(Y | X).
      Create and return a non-leaf node with two children.
      Let DS_1 = all records in DataSet for which X < t. Let DS_2 = the rest.
      The i-th child is built by calling BuildTree(DS_i, Output).
    Categorical attribute:
      If X has n_X distinct values (i.e. X has arity n_X), create and return a non-leaf node with n_X children.
      The i-th child is built by calling BuildTree(DS_i, Output), where DS_i = all those records in DataSet for which X = the i-th distinct value of X.
Prune the tree to avoid overfitting.
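A compact sketch of the recursion for numerical attributes is given below. It is not Moore's implementation: categorical attributes and pruning are omitted, thresholds are midpoints between adjacent distinct values, and a node with no positive-gain split becomes a majority leaf; all three are simplifying assumptions. The rows are the visible (cylinders, horsepower, weight) entries of the MPG table.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, attribute index, threshold) with the highest information gain."""
    best = (0.0, None, None)
    h = entropy(labels)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2
            left = [y for r, y in zip(rows, labels) if r[j] < t]
            right = [y for r, y in zip(rows, labels) if r[j] >= t]
            h_cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            if h - h_cond > best[0]:
                best = (h - h_cond, j, t)
    return best

def build_tree(rows, labels):
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1:               # all outputs are the same
        return majority
    if len(set(map(tuple, rows))) == 1:     # all inputs are the same
        return majority
    gain, j, t = best_split(rows, labels)
    if j is None:                           # no split with positive gain (assumption)
        return majority
    left = [(r, y) for r, y in zip(rows, labels) if r[j] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] >= t]
    return {"attr": j, "threshold": t,
            "lt": build_tree([r for r, _ in left], [y for _, y in left]),
            "ge": build_tree([r for r, _ in right], [y for _, y in right])}

def predict(tree, row):
    # Follow the tests down to a leaf; leaves are plain labels.
    while isinstance(tree, dict):
        tree = tree["lt"] if row[tree["attr"]] < tree["threshold"] else tree["ge"]
    return tree

# (cylinders, horsepower, weight) from the visible rows of the MPG table.
rows = [(4, 75, 2265), (6, 90, 2648), (4, 110, 2600), (8, 175, 4100),
        (6, 95, 3102), (4, 94, 2379), (4, 95, 2228), (8, 139, 3570),
        (4, 79, 2625), (8, 225, 4425), (4, 86, 2464), (5, 103, 2830)]
labels = ["good", "bad", "bad", "bad", "bad", "bad", "bad", "bad",
          "good", "bad", "good", "bad"]

tree = build_tree(rows, labels)
predictions = [predict(tree, r) for r in rows]
```

On these twelve rows a single horsepower test already separates good from bad, so the resulting tree has one internal node. That says nothing about the full data set, which the slide truncates.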

Decision tree learning
Building the most compact tree compatible with the training examples is NP-hard.
There are many heuristic methods for constructing good trees: ID3, C4.5, CART.
Most methods use some greedy rule (e.g. find the feature that best separates positive from negative examples).
The simplest decision tree algorithms have no hyperparameters.

Neural network
Find the best separating plane between two classes.
Find an optimal curve separating the two classes.
Complicated structure, with many parameters and several hyper-parameters that are non-trivial to tune.
Prone to overfitting.
[Figure: two classes in the $(x_1, x_2)$ plane]

Performance measures
Simple measures, based on a threshold:
Accuracy: ACC = (TP + TN) / N (the error rate is 1 - ACC)
Sensitivity / Recall / TP rate: SEN = TP / (TP + FN)
Specificity: SPE = TN / (TN + FP)
Balanced error rate: BER = (FN / (TP + FN) + FP / (FP + TN)) / 2
Note: BER = 1 - (SEN + SPE) / 2
Precision: PRE = TP / (TP + FP)
FP rate: FPR = FP / (TN + FP) = 1 - SPE

                  Predicted Class
                  Pos     Neg
True Class  Pos   TP      FN
            Neg   FP      TN
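The measures can be computed directly from the four cells of the confusion matrix, as in the sketch below. Here (SEN + SPE) / 2 is reported under its usual name, balanced accuracy, and the balanced error rate is taken as its complement, the average of the two per-class error rates. The function name and the example counts are made up for illustration.

```python
def classification_metrics(tp, fn, fp, tn):
    # All measures are derived from the four confusion-matrix counts.
    n = tp + fn + fp + tn
    sen = tp / (tp + fn)          # sensitivity / recall / TP rate
    spe = tn / (tn + fp)          # specificity
    bal_acc = (sen + spe) / 2     # balanced accuracy
    return {
        "ACC": (tp + tn) / n,
        "SEN": sen,
        "SPE": spe,
        "PRE": tp / (tp + fp),
        "FPR": fp / (tn + fp),    # equals 1 - SPE
        "BAL_ACC": bal_acc,
        "BER": 1 - bal_acc,       # mean of the per-class error rates
    }

# Hypothetical counts: 50 true positives-class and 50 true negatives-class examples.
m = classification_metrics(tp=40, fn=10, fp=5, tn=45)
```

For imbalanced classes ACC can look good while SEN is poor, which is the reason for reporting the per-class and balanced measures alongside it.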

The ROC Curve
Avoid the threshold:
ROC curve: (TPR, FPR) as a function of the threshold.
Compute the area under the ROC curve (AUC).
The AUC measures the probability that, for a random pair (pos, neg), the classifier will assign a higher score to the pos example.
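Both views of the AUC can be checked against each other on a small example. The sketch below sweeps the threshold to trace the ROC points, integrates with the trapezoid rule, and separately counts correctly ordered (pos, neg) pairs; counting ties as one half is a common convention assumed here. The scores and function names are hypothetical.

```python
def roc_points(scores, labels):
    # One (FPR, TPR) point per threshold; predict positive when score >= threshold.
    n_pos = sum(1 for y in labels if y == +1)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == +1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == -1)
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc_trapezoid(points):
    # Area under the piecewise-linear curve through the ROC points.
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

def auc_pairwise(scores, labels):
    # Fraction of (pos, neg) pairs where the positive example scores higher.
    pos = [s for s, y in zip(scores, labels) if y == +1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5       # tie convention (assumption)
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for three positive and three negative examples.
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [+1, +1, +1, -1, -1, -1]

area = auc_trapezoid(roc_points(scores, labels))
pair_prob = auc_pairwise(scores, labels)
```

Only one of the nine (pos, neg) pairs is ordered wrongly here (0.4 versus 0.6), and both computations agree on the resulting value.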

Performance assessment
Now that we have performance measures, what should we apply them to?
Resubstitution estimation: compute the error rate/AUC on the learning set. Problem: downward bias in the estimated error.
Test set estimation: partition the training set into two sets, $L_1$ and $L_2$; the classifier is built using $L_1$, and the error rate is computed on $L_2$. $L_1$ and $L_2$ must be i.i.d. Problem: reduced effective sample size.

Performance assessment (II)
m-fold cross-validation (CV) estimation:
Randomly divide the training set into m subsets of (nearly) equal size.
Repeat m times: build a classifier leaving one subset out; compute the error rate on the left-out subset.
Average the error rates.
A very popular method. It is also typically used for tuning hyper-parameters.
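The procedure can be sketched generically, with the classifier passed in as a function. The interface `train_and_predict(train, xs)` is a hypothetical one chosen for this example, as are the 1-nearest-neighbor classifier and the 1-D toy data; the sketch also assumes m is no larger than the number of examples.

```python
import random

def cross_val_error(data, m, train_and_predict, seed=0):
    """data: list of (x, y) pairs; train_and_predict(train, xs) returns predicted labels."""
    idx = list(range(len(data)))
    random.Random(seed).shuffle(idx)            # random division into folds
    folds = [idx[i::m] for i in range(m)]       # m subsets of nearly equal size
    fold_errors = []
    for held_out in folds:
        held = set(held_out)
        train = [data[i] for i in idx if i not in held]
        test = [data[i] for i in held_out]
        preds = train_and_predict(train, [x for x, _ in test])
        wrong = sum(p != y for p, (_, y) in zip(preds, test))
        fold_errors.append(wrong / len(test))   # error rate on the left-out subset
    return sum(fold_errors) / m                 # average over the m folds

def one_nn(train, xs):
    # 1-nearest-neighbor on 1-D inputs, used only as a stand-in classifier.
    return [min(train, key=lambda ex: abs(ex[0] - x))[1] for x in xs]

# Hypothetical 1-D data: two well-separated groups of ten points.
data = ([(float(v), -1) for v in range(0, 10)] +
        [(float(v), +1) for v in range(20, 30)])

err = cross_val_error(data, m=5, train_and_predict=one_nn)
```

To tune a hyper-parameter, the same function would be called once per candidate value and the value with the lowest averaged error kept.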

Generalization and overfitting
[Figure: examples in the $(x_1, x_2)$ plane]

Control on model complexity
Regularization is intended to reduce the complexity of the model in order to achieve better generalization.
Regularization in decision trees (pruning, ensembling)
Regularization in neural networks (penalty term)
Regularization in SVM

Dimensionality Reduction and Feature Selection

Why dimensionality reduction
May improve the performance of the classification algorithm by removing irrelevant features.
Defying the curse of dimensionality: simpler models result in improved generalization.
The classification algorithm may not scale up to the size of the full feature set, either in space or in time.
Allows us to better understand the domain.
Cheaper to collect and store data based on a reduced feature set.
Approaches to dimensionality reduction: feature construction, feature selection.

Feature construction
Transform the n features into $l \ll n$ informative features.
Linear methods: PCA, ICA, Fisher linear discriminant.
Non-linear methods: t-SNE, non-linear component analysis (NLCA), kernel PCA, local linear embedding (LLE).

Feature selection
Given examples $(x_i, y_i)$ where $x_i \in \mathbb{R}^n$, select a minimal subset of features that maximizes performance, e.g. accuracy.
Exhaustive search is computationally prohibitive, except for small n.
This is an optimization problem in which the classification error is the function to be minimized. It is typically hard to solve exactly, so heuristics are used.

Feature selection methods
Filter methods: feature selection is performed first, and its output is fed to the classifier.
Wrapper methods: feature selection and the classifier form a loop; the feature set is revised iteratively based on classifier performance.
Embedded methods: selection is embedded in the classifier and is not separated into two phases.

Filtering
Order all features according to the strength of their association with the target $y$.
Various measures of association may be used:
Pearson correlation: $R(X_i) = \mathrm{cov}(X_i, Y) / (\sigma_{X_i} \sigma_Y)$
$\chi^2$ (discrete variables $X_i$)
Fisher criterion: $F(X_i) = |\mu^+_{X_i} - \mu^-_{X_i}| / ((\sigma^+_{X_i})^2 + (\sigma^-_{X_i})^2)$
Mutual information: $MI(X_i, Y) = \sum p(X_i, Y) \log \frac{p(X_i, Y)}{p(X_i)\, p(Y)}$
Choose the first k features and feed them to the classifier.
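A filter of this kind can be sketched with the Pearson correlation as the association measure. Ranking by the absolute value of the correlation is an assumption of this sketch (a strongly negative correlation is as informative as a positive one), and the feature values are invented for illustration.

```python
import math

def pearson(xs, ys):
    # Sample Pearson correlation; the 1/n factors cancel between numerator and denominator.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def filter_top_k(X, y, k):
    # Score each feature in isolation, then keep the k highest-scoring ones.
    n_features = len(X[0])
    scores = [abs(pearson([row[j] for row in X], y)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores

# Hypothetical data: feature 0 tracks the label, feature 1 is unrelated,
# feature 2 is weakly related.
y = [-1, -1, -1, -1, 1, 1, 1, 1]
X = [(0.1, 1, 0), (0.2, -1, 1), (0.0, 1, 0), (0.3, -1, 1),
     (0.9, 1, 1), (1.0, -1, 1), (0.8, 1, 0), (1.1, -1, 1)]

selected, scores = filter_top_k(X, y, k=2)
```

The selected columns would then be passed to whatever classifier is used downstream; the filter itself never consults the classifier.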

Filtering pros and cons
Usually works well when features are independent, since each feature is considered in isolation.
When the target depends on several features jointly rather than on each one individually, filtering will not perform very well. The XOR problem is an example.
Feature relevance is estimated independently of the classifier.
Still, on many problems filtering is very effective.
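The XOR case can be verified numerically with mutual information as the filter score. In this sketch (log base 2, four equally likely input combinations, names chosen for the example), each feature alone carries no information about the label, while the pair determines it completely.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # MI(X, Y) = sum over (x, y) of p(x, y) * log2( p(x, y) / (p(x) p(y)) ),
    # with probabilities estimated from counts.
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum((c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

# XOR: the label is 1 exactly when the two binary features differ.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [a ^ b for a, b in X]

mi_x1 = mutual_information([a for a, _ in X], y)   # first feature alone
mi_x2 = mutual_information([b for _, b in X], y)   # second feature alone
mi_joint = mutual_information(X, y)                # both features together
```

A filter that scores features one at a time would rank both XOR features as useless, even though together they predict the label perfectly.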

The broader context
https://www.mathworks.com/help/stats