Classification
Slide sources:
- Gideon Dror, Academic College of TA Yaffo
- Nathan Ifill, Leicester, MA4102 Data Mining and Neural Networks
- Andrew Moore, CMU: http://www.cs.cmu.edu/~awm/tutorials

Outline
- Problem setting
- Examples
- Classification algorithms
- Performance measures
- Performance assessment
- Generalization and overfitting
- Dimensionality reduction and feature selection

Problem setting
- Input: a training set $(x_1, y_1), \ldots, (x_m, y_m) \in X \times Y$ sampled from some distribution $D$. A pair $(x_i, y_i)$ is called a training example.
- $Y$: a discrete set of class labels.
- $X$ is normally $\mathbb{R}^n$, with $x_i = (x_{i1}, \ldots, x_{in})$. The $x_{ij}$ are called features.
- We assume the existence of a function $f^*: X \to Y$ that maps data to correct labels.
- Goal of classification: find $f^*$.
- Often $f^*$ doesn't exist (insufficient information in $X$, or noise). Then we want a best $f$, e.g., one minimizing $E_D[\mathbf{1}(f(x) \ne y)]$.
- The $x_i$ are also called inputs, the $y_i$ outputs.

Learning to Classify
Learning of binary classification
- Given: a set of $m$ examples $(x_i, y_i)$, $i = 1, 2, \ldots, m$, sampled from some distribution $D$, where $x_i \in \mathbb{R}^n$ and $y_i \in \{-1, +1\}$.
- Find: a function $f: \mathbb{R}^n \to \{-1, +1\}$ that classifies examples $x_i$ sampled from $D$ well.
Comments:
- $f$ is usually a statistical model whose parameters are learnt from the set of examples.
- $y_i = +1$: positive examples; $y_i = -1$: negative examples.

Examples
- Gene expression data
- Face detection
- Customer discovery
- Spam detection
- Many more.

GE data: separate malignant from healthy tissues based on the mRNA expression profile of the tissue.

Face detection: discriminate human faces from non-faces.

Other examples
- Customer discovery: predict whether a customer is likely to purchase a certain good, according to a customer profile.
- Spam detection: predict whether a mail message is spam or a legitimate message.
- Fraud detection: verify whether a credit card transaction is fraudulent or not.

Classification problem
[Figure: axes $x_1$, $x_2$]

Classification algorithms
- Fisher linear discriminant
- KNN
- Decision tree
- Neural networks
- SVM
- Naïve Bayes
- Adaboost
- Many, many more.
Each one has its own properties with respect to bias, speed, accuracy, and transparency.

Fisher Linear Discriminant
- Find the direction $w$ that maximizes interclass variability and minimizes intraclass variability.
- No hyperparameters.
[Figure: direction $w$ in the $(x_1, x_2)$ plane]
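
A minimal sketch of the Fisher direction for two classes of 2-D points, using the standard closed form $w \propto S_W^{-1}(\mu^+ - \mu^-)$, where $S_W$ is the within-class scatter matrix. The function name `fisher_direction`, the restriction to two features, and the toy points are assumptions made for illustration; they are not from the slides.

```python
def fisher_direction(pos, neg):
    """Return the Fisher direction w for two classes of 2-D points.

    w is proportional to S_W^{-1} (mean_pos - mean_neg), where S_W is the
    within-class scatter matrix summed over both classes.
    """
    def mean(points):
        n = len(points)
        return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

    def scatter(points, mu):
        # Entries (sxx, sxy, syy) of the 2x2 scatter matrix around the class mean.
        sxx = sum((p[0] - mu[0]) ** 2 for p in points)
        sxy = sum((p[0] - mu[0]) * (p[1] - mu[1]) for p in points)
        syy = sum((p[1] - mu[1]) ** 2 for p in points)
        return sxx, sxy, syy

    mu_p, mu_n = mean(pos), mean(neg)
    a1, b1, c1 = scatter(pos, mu_p)
    a2, b2, c2 = scatter(neg, mu_n)
    a, b, c = a1 + a2, b1 + b2, c1 + c2      # S_W = [[a, b], [b, c]]
    det = a * c - b * b
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    dx, dy = mu_p[0] - mu_n[0], mu_p[1] - mu_n[1]
    # Apply the explicit inverse of the symmetric 2x2 matrix S_W.
    return ((c * dx - b * dy) / det, (-b * dx + a * dy) / det)


# Toy data (illustrative only): two square clusters.
pos = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0), (3.0, 2.0)]
neg = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
w = fisher_direction(pos, neg)
print(w)  # (1.0, 1.0)
```

Projecting each point onto `w` (the dot product with `w`) gives a one-dimensional score on which the two toy classes no longer overlap.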

KNN: K nearest neighbors
Find the $k$ nearest neighbors of the test example, and infer its class using their known classes.
E.g. $k = 3$, $n = 2$ [Figure: axes $x_1$, $x_2$]
1. Compute distances $d(x, x_0)$ for all training inputs $x \in X$
2. Keep the $k$ nearest $x$
3. Check the labels of the $k$ nearest $x$
4. The class of the new sample $x_0$ is the majority label of the $k$ nearest $x$
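
The four steps can be sketched in a few lines of plain Python. This is an illustrative sketch, assuming Euclidean distance and labels in {-1, +1}; the function name `knn_predict` and the toy points are hypothetical.

```python
from collections import Counter
import math


def knn_predict(train_x, train_y, x0, k=3):
    """Classify x0 by majority vote among its k nearest training examples."""
    # 1. Compute the distance from x0 to every training example (Euclidean here).
    dists = [(math.dist(x, x0), y) for x, y in zip(train_x, train_y)]
    # 2. Keep the k nearest.
    nearest = sorted(dists, key=lambda pair: pair[0])[:k]
    # 3. Check their labels and 4. return the majority label.
    # With an even k, ties go to the label seen first among the nearest.
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data (illustrative only): two well-separated groups in the plane.
train_x = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
train_y = [-1, -1, -1, +1, +1, +1]

print(knn_predict(train_x, train_y, (0.5, 0.5), k=3))  # -1
print(knn_predict(train_x, train_y, (5.5, 5.5), k=3))  # 1
```

Every prediction scans the whole training set, which is why the method becomes slow when the training set is large.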

Example

KNN properties
- Non-parametric: no model is assumed (or constructed).
- Usually works very well when there is a natural distance between examples.
- When the training set is large, the calculation is time-consuming.
- A single hyper-parameter, $k$. The choice of $k$ is important:
  - Large $k$: stable estimates, but may use far elements
  - Small $k$: unstable estimates, only close elements are used
- In general, low $k$ gives very irregular decision boundaries.

Disadvantages
- Classes with more frequent examples dominate predictions of unknown instances. Assigning weights helps to reduce this problem.
- The algorithm can be computationally intensive, depending on the size of the training set.

Choosing k
- Both low and high values of $k$ have their advantages.
- The best value of $k$ depends on the data.
- Cross-validation can be used to compare values of $k$.

Decision boundaries: low and high k
[Figure panels: 1 Nearest Neighbor Classifier; 15 Nearest Neighbor Classifier]

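Decision tree
Leaves represent classifications, and branches represent tests on features that lead to those classifications.
[Figure: tree with tests $X_1 > \alpha_1$ and $X_2 > \alpha_2$ (YES/NO branches), and the $(x_1, x_2)$ plane with thresholds $\alpha_1$, $\alpha_2$]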

Example
Given real-valued data, predict whether the miles per gallon (MPG) of car models is good or bad.

mpg   cylinders  displacement  horsepower  weight  acceleration  modelyear  maker
good  4          97            75          2265    18.2          77         asia
bad   6          199           90          2648    15            70         america
bad   4          121           110         2600    12.8          77         europe
bad   8          350           175         4100    13            73         america
bad   6          198           95          3102    16.5          74         america
bad   4          108           94          2379    16.5          73         asia
bad   4          113           95          2228    14            71         asia
bad   8          302           139         3570    12.8          78         america
:     :          :             :           :       :             :          :
good  4          120           79          2625    18.6          82         america
bad   8          455           225         4425    10            70         america
good  4          107           86          2464    15.5          76         europe
bad   5          131           103         2830    15.9          78         europe

Copyright Andrew W. Moore Slide 19

Splitting data by a threshold
Suppose $X$ is real valued. Define the information gain for predicting the outcome $Y$ due to splitting $X$ at value $t$:
$IG(Y \mid X:t) = H(Y) - H(Y \mid X:t)$
where
$H(Y \mid X:t) = H(Y \mid X < t)\,P(X < t) + H(Y \mid X \ge t)\,P(X \ge t)$
For categorical data use $P(X = \text{cat})$, $P(X \ne \text{cat})$.
Then define $IG^*(Y \mid X) = \max_t IG(Y \mid X:t)$.
For each attribute, use $IG^*(Y \mid X)$ to assess its suitability as a split.
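
A direct sketch of these definitions in Python. It is applied to the horsepower column and mpg labels of the twelve rows visible in the MPG example table. The function names, and the choice of midpoints between adjacent distinct values as candidate thresholds, are assumptions of this sketch.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy H(Y), in bits, of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_gain(xs, ys, t):
    """IG(Y | X:t) = H(Y) - [H(Y | X < t) P(X < t) + H(Y | X >= t) P(X >= t)]."""
    left = [y for x, y in zip(xs, ys) if x < t]
    right = [y for x, y in zip(xs, ys) if x >= t]
    n = len(ys)
    conditional = entropy(left) * len(left) / n + entropy(right) * len(right) / n
    return entropy(ys) - conditional


def best_threshold(xs, ys):
    """Return (IG*, t), trying midpoints between adjacent distinct values of X."""
    values = sorted(set(xs))
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    return max(((info_gain(xs, ys, t), t) for t in candidates), default=(0.0, None))


# Horsepower and mpg label for the twelve rows shown in the MPG table.
horsepower = [75, 90, 110, 175, 95, 94, 95, 139, 79, 225, 86, 103]
mpg = ["good", "bad", "bad", "bad", "bad", "bad", "bad", "bad",
       "good", "bad", "good", "bad"]

gain, t = best_threshold(horsepower, mpg)
print(t)  # 88.0
```

On these twelve rows every "good" car has horsepower below 88 and every "bad" car above it, so the best split removes all uncertainty and the gain equals $H(Y)$, about 0.811 bits.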

Computational Issues
You can compute $IG^*(Y \mid X)$ in time $R \log R + 2 R n_y$, where
- $R$ is the number of records in the node under consideration
- $n_y$ is the arity (number of distinct values) of $Y$
How? Sort the records by increasing value of $X$. Then create a $2 \times n_y$ contingency table corresponding to the computation of $IG(Y \mid X:x_{\min})$. Then iterate through the records, testing each threshold between adjacent values of $X$ and incrementally updating the contingency table as you go. For a minor additional speedup, only test between adjacent records whose values of $Y$ differ.
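
The sort-then-scan procedure can be sketched as follows. The two `Counter` objects play the role of the two rows of the $2 \times n_y$ contingency table. The function names and the use of midpoints as thresholds are assumptions of this sketch, and the optional speedup of skipping records with equal $Y$ is left out.

```python
from collections import Counter
from math import log2


def entropy_from_counts(counts, total):
    """Entropy, in bits, of a label distribution given as class counts."""
    if total == 0:
        return 0.0
    return -sum((c / total) * log2(c / total) for c in counts.values() if c > 0)


def best_split_sorted(xs, ys):
    """Return (IG*, threshold) with one sort and one pass over the records."""
    records = sorted(zip(xs, ys), key=lambda r: r[0])   # the R log R step
    n = len(records)
    right = Counter(y for _, y in records)   # table row for X >= t (all records)
    left = Counter()                         # table row for X < t (empty so far)
    h_y = entropy_from_counts(right, n)
    best_gain, best_t = 0.0, None
    for i in range(n - 1):
        x, y = records[i]
        left[y] += 1                         # move one record across the threshold
        right[y] -= 1
        next_x = records[i + 1][0]
        if x == next_x:
            continue                         # no threshold fits between equal values
        n_left = i + 1
        conditional = (entropy_from_counts(left, n_left) * n_left
                       + entropy_from_counts(right, n - n_left) * (n - n_left)) / n
        gain = h_y - conditional
        if gain > best_gain:
            best_gain, best_t = gain, (x + next_x) / 2
    return best_gain, best_t


# Horsepower and mpg label for the twelve rows shown in the MPG table.
horsepower = [75, 90, 110, 175, 95, 94, 95, 139, 79, 225, 86, 103]
mpg = ["good", "bad", "bad", "bad", "bad", "bad", "bad", "bad",
       "good", "bad", "good", "bad"]

gain, t = best_split_sorted(horsepower, mpg)
print(t)  # 88.0
```

Each step of the loop touches only the two counters, which costs at most $2 n_y$ operations per record after the initial sort.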

Example with MPG

Unpruned tree using reals

Pruned tree using reals

Basic Decision Tree Building
BuildTree(DataSet, Output)
- If all output values are the same in DataSet, return a leaf node that says "predict this unique output".
- If all input values are the same, return a leaf node that says "predict the majority output".
- Else find the attribute $X$ with the highest information gain.
  - Numerical attribute: compute the value $t$ corresponding to $IG^*(Y \mid X)$. Create and return a non-leaf node with two children. Let $DS_1$ = all records in DataSet for which $X < t$, and let $DS_2$ = the rest. The $i$th child is built by calling BuildTree($DS_i$, Output).
  - Categorical attribute: if $X$ has $n_X$ distinct values (i.e. $X$ has arity $n_X$), create and return a non-leaf node with $n_X$ children. The $i$th child is built by calling BuildTree($DS_i$, Output), where $DS_i$ = all records in DataSet for which $X$ = the $i$th distinct value of $X$.
- Prune the tree to avoid overfitting.
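
A compact sketch of the recursive procedure, restricted to numerical attributes and without the pruning step. The dictionary-based node layout, the function names, and the toy data are assumptions made for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (gain, feature index, threshold) with the highest information gain."""
    best = (0.0, None, None)
    base, n = entropy(labels), len(labels)
    for j in range(len(rows[0])):
        values = sorted({r[j] for r in rows})
        for a, b in zip(values, values[1:]):
            t = (a + b) / 2                  # midpoint, so both sides are non-empty
            left = [y for r, y in zip(rows, labels) if r[j] < t]
            right = [y for r, y in zip(rows, labels) if r[j] >= t]
            gain = base - (len(left) * entropy(left)
                           + len(right) * entropy(right)) / n
            if gain > best[0]:
                best = (gain, j, t)
    return best


def build_tree(rows, labels):
    """rows: list of tuples of numbers; labels: list of class labels."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or len(set(rows)) == 1:
        return majority                      # leaf: unique or majority output
    gain, j, t = best_split(rows, labels)
    if j is None:
        return majority                      # no split has positive gain
    left = [(r, y) for r, y in zip(rows, labels) if r[j] < t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] >= t]
    return {
        "feature": j,
        "threshold": t,
        "lt": build_tree([r for r, _ in left], [y for _, y in left]),
        "ge": build_tree([r for r, _ in right], [y for _, y in right]),
    }


def predict(tree, row):
    while isinstance(tree, dict):            # descend until a leaf label is reached
        tree = tree["lt"] if row[tree["feature"]] < tree["threshold"] else tree["ge"]
    return tree


# Toy data (illustrative only): positive when both features are large.
rows = [(1, 1), (1, 3), (3, 1), (3, 3), (4, 4), (4, 1)]
labels = [-1, -1, -1, +1, +1, -1]
tree = build_tree(rows, labels)
print(predict(tree, (5, 5)))  # 1
```

Because the split is chosen greedily, this sketch stops with a majority leaf whenever no single threshold has positive gain, even if a deeper tree could separate the data.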

Decision tree learning
- Building the most compact tree compatible with the training examples is NP-hard.
- There are many heuristic methods for constructing good trees: ID3, C4.5, CART.
- Most methods use some greedy rule (e.g. find the feature that best separates positive from negative examples).
- The simplest decision tree algorithms have no hyperparameters.

Neural network
- Find the best separating plane between two classes.
- Find an optimal curve separating the two classes.
- Complicated structure, with many parameters and several hyper-parameters; non-trivial to tune.
- Prone to overfitting.
[Figure: axes $x_1$, $x_2$]

Performance measures
Simple measures, based on a threshold:
- Accuracy: $ACC = (TP + TN)/N$; the error rate is $1 - ACC$
- Balanced error rate: $BER = \left(FN/(TP+FN) + FP/(FP+TN)\right)/2$
- Sensitivity / recall / TP rate: $SEN = TP/(TP+FN)$
- Specificity: $SPE = TN/(TN+FP)$
- Note: $BER = 1 - (SEN + SPE)/2$
- Precision: $PRE = TP/(TP+FP)$
- FP rate: $FPR = FP/(TN+FP) = 1 - SPE$

                  Predicted Pos   Predicted Neg
True class Pos    TP              FN
True class Neg    FP              TN
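
A small sketch that computes these measures from predicted and true labels. BER is computed here as the average of the two per-class error rates, i.e. $1 - (SEN + SPE)/2$. The function names and the toy labels are assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FN, FP, TN for binary labels."""
    tp = fn = fp = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, fp, tn


def measures(tp, fn, fp, tn):
    """Threshold-based measures; assumes each denominator is non-zero."""
    n = tp + fn + fp + tn
    sen = tp / (tp + fn)                     # sensitivity / recall / TP rate
    spe = tn / (tn + fp)                     # specificity
    return {
        "ACC": (tp + tn) / n,
        "SEN": sen,
        "SPE": spe,
        "PRE": tp / (tp + fp),
        "FPR": fp / (tn + fp),
        "BER": 1 - (sen + spe) / 2,          # mean of the per-class error rates
    }


# Toy labels (illustrative only): 4 positives, 6 negatives.
y_true = [1, 1, 1, 1, -1, -1, -1, -1, -1, -1]
y_pred = [1, 1, 1, -1, 1, -1, -1, -1, -1, -1]

counts = confusion_counts(y_true, y_pred)
m = measures(*counts)
print(counts)  # (3, 1, 1, 5)
```

On this toy example accuracy is 0.8 while sensitivity is only 0.75, which shows how the larger class dominates accuracy.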

The ROC Curve
- Avoid the threshold: the ROC curve plots TPR against FPR as the threshold varies.
- Compute the area under the ROC curve (AUC). It measures the probability that, for a random pair (pos, neg), the classifier will assign a higher score to the pos example.
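
The pairwise interpretation of AUC can be computed directly by comparing every (pos, neg) pair. This is a quadratic-time sketch for illustration. Counting tied scores as half a win is a common convention assumed here, and the function name and toy scores are hypothetical.

```python
def auc(scores, labels):
    """Fraction of (pos, neg) pairs in which the pos example scores higher.

    Labels are +1 / -1. Tied scores count as half a win (an assumed convention).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy classifier scores (illustrative only).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, -1, 1, -1, -1]

print(auc(scores, labels))  # 8 of the 9 pairs are ordered correctly
```

A classifier that ranks every positive above every negative gets an AUC of 1, and a constant score gives 0.5 under the tie convention used here.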

Performance assessment
Now that we have performance measures, what should we apply them to?
- Resubstitution estimation: compute the error rate/AUC on the learning set. Problem: downward bias.
- Test set estimation: partition the training set into two sets, $L_1$ and $L_2$; the classifier is built using $L_1$, and the error rate is computed on $L_2$. $L_1$ and $L_2$ must be i.i.d. Problem: reduced effective sample size.

Performance assessment (II)
m-fold cross-validation (CV) estimation:
- Randomly divide the training set into $m$ subsets of (nearly) equal size.
- Repeat $m$ times: build a classifier leaving one subset out; compute the error rate on the left-out subset.
- Average the error rates.
Very popular method. It is typically also used for tuning hyper-parameters.
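
A minimal sketch of m-fold cross-validation. The `fit_predict` callback interface, the toy nearest-class-mean classifier `nearest_mean`, and the one-dimensional data are assumptions made for illustration. The sketch also assumes there are at least $m$ examples, so that no fold is empty.

```python
import random


def cross_val_error(xs, ys, fit_predict, m=5, seed=0):
    """Average error rate over m folds.

    fit_predict(train_x, train_y, test_x) must return one predicted label per
    test input. Assumes len(xs) >= m so that every fold is non-empty.
    """
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)         # random division into folds
    folds = [idx[i::m] for i in range(m)]    # m subsets of nearly equal size
    errors = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        preds = fit_predict([xs[i] for i in train], [ys[i] for i in train],
                            [xs[i] for i in held_out])
        wrong = sum(p != ys[i] for p, i in zip(preds, held_out))
        errors.append(wrong / len(held_out))
    return sum(errors) / m                   # average the per-fold error rates


def nearest_mean(train_x, train_y, test_x):
    """Toy classifier on 1-D inputs: assign each input to the nearest class mean."""
    means = {}
    for label in set(train_y):
        members = [x for x, y in zip(train_x, train_y) if y == label]
        means[label] = sum(members) / len(members)
    return [min(means, key=lambda c: abs(x - means[c])) for x in test_x]


# Toy data (illustrative only): two well-separated groups on the line.
xs = list(range(10)) + list(range(100, 110))
ys = [-1] * 10 + [+1] * 10

print(cross_val_error(xs, ys, nearest_mean, m=5))  # 0.0
```

To tune a hyper-parameter such as $k$ in KNN, one would call `cross_val_error` once per candidate value and keep the value with the lowest average error.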

Generalization and overfitting
[Figure: axes $x_1$, $x_2$]

Control on model complexity
Regularization is intended to reduce the complexity of the model in order to obtain better generalization.
- Regularization in decision trees (pruning, ensembling)
- Regularization in neural networks (penalty term)
- Regularization in SVM

Dimensionality Reduction and Feature Selection

Why dimensionality reduction?
- May improve the performance of the classification algorithm by removing irrelevant features.
- Defying the curse of dimensionality: simpler models result in improved generalization.
- The classification algorithm may not scale up to the size of the full feature set, either in space or in time.
- Allows us to better understand the domain.
- Cheaper to collect and store data based on a reduced feature set.
Approaches to dimensionality reduction: feature construction, feature selection.

Feature construction
Transform the $n$ features into $l \ll n$ informative features.
Linear methods:
- PCA
- ICA
- Fisher linear discriminant
Non-linear methods:
- t-SNE
- Non-linear component analysis (NLCA)
- Kernel PCA
- Locally linear embedding (LLE)

Feature selection
- Given examples $(x_i, y_i)$ where $x_i \in \mathbb{R}^n$, select a minimal subset of features that maximizes performance, e.g. accuracy.
- Exhaustive search is computationally prohibitive, except for small $n$.
- It is an optimization problem in which the classification error is the function to be minimized. It is typically hard to solve exactly, so heuristics are used.

Feature selection methods
- Filter methods. [Diagram: feature selection → classifier]
- Wrapper methods: iteratively revise the feature set based on classifier performance. [Diagram: feature selection ⇄ classifier]
- Embedded methods: selection is embedded in the classification; it is not separated into two iterated phases. [Diagram: classifier]

Filtering
Order all features according to the strength of their association with the target $y$. Various measures of association may be used:
- Pearson correlation: $R(X_i) = \mathrm{cov}(X_i, Y) / (\sigma_{X_i} \sigma_Y)$
- $\chi^2$ (discrete variables $X_i$)
- Fisher criterion: $F(X_i) = \left|\mu^+_{X_i} - \mu^-_{X_i}\right| / \left((\sigma^+_{X_i})^2 + (\sigma^-_{X_i})^2\right)$
- Mutual information: $MI(X_i, Y) = \sum p(X_i, Y) \log\left(p(X_i, Y) / (p(X_i)\,p(Y))\right)$
Choose the first $k$ features and feed them to the classifier.
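
A sketch of filter-style ranking with two of these measures. The Fisher criterion is implemented here as the absolute difference of the class means divided by the sum of the class variances, which is this sketch's reading of the formula. The use of population variances, the function names, and the toy data are assumptions, and features are assumed to be non-constant.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between one feature and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)                   # the 1/n factors cancel


def fisher_criterion(xs, ys):
    """|mean+ - mean-| / (var+ + var-) for labels +1 / -1."""
    def mean_var(values):
        m = sum(values) / len(values)
        return m, sum((v - m) ** 2 for v in values) / len(values)

    mp, vp = mean_var([x for x, y in zip(xs, ys) if y == 1])
    mn, vn = mean_var([x for x, y in zip(xs, ys) if y == -1])
    return abs(mp - mn) / (vp + vn)


def rank_features(rows, ys, score, k):
    """Indices of the k features with the largest absolute score."""
    scored = [(abs(score([r[j] for r in rows], ys)), j)
              for j in range(len(rows[0]))]
    return [j for _, j in sorted(scored, reverse=True)[:k]]


# Toy data (illustrative only): feature 0 is informative, feature 1 is not,
# feature 2 is weakly informative.
rows = [(1, 5, 1), (2, 3, 4), (3, 4, 2), (7, 4, 3), (8, 5, 5), (9, 3, 6)]
ys = [-1, -1, -1, +1, +1, +1]

print(rank_features(rows, ys, pearson, k=2))           # [0, 2]
print(rank_features(rows, ys, fisher_criterion, k=2))  # [0, 2]
```

The selected column indices would then be used to build the reduced inputs that are fed to the classifier.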

Filtering pros and cons
- Usually works well when features are independent, since each feature is considered in isolation.
- When the target depends on interactions between features, filtering will not perform very well. The XOR problem is an example.
- Feature relevance is estimated independently of the classifier.
- Still, on many problems filtering is very effective.
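
The XOR case can be checked by hand. In the sketch below, each feature taken alone has zero covariance with the target, so a correlation-based filter would score both as useless, yet the two features together determine the label exactly. The helper name `covariance` is hypothetical.

```python
def covariance(xs, ys):
    """Population covariance between a feature and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n


# The four XOR points: the target is 1 exactly when the two inputs differ.
rows = [(0, 0), (0, 1), (1, 0), (1, 1)]
ys = [a ^ b for a, b in rows]                # [0, 1, 1, 0]

cov_x1 = covariance([r[0] for r in rows], ys)
cov_x2 = covariance([r[1] for r in rows], ys)
print(cov_x1, cov_x2)  # 0.0 0.0

# Taken jointly, the pair (x1, x2) maps to a single label for every input.
lookup = dict(zip(rows, ys))
print(lookup[(1, 0)])  # 1
```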

The broader context
https://www.mathworks.com/help/stats