Advanced Video Content Analysis and Video Compression (5LSH0), Module 8B

Fons van der Sommen, Video Coding and Architectures Research group, TU/e (f.v.d.sommen@tue.nl)

Slides 1-2: Supervised learning / Classification
Categorized / labeled data, for example:
- Objects in a picture: chair, desk, person, ...
- Handwritten digits: 3, 6, 5
- Medical diagnosis: OCT image, T2 colon cancer
Goal: identify the class of a new data point, using statistical modeling and machine learning on distinctive properties (features).

Slides 3-6: Supervised learning: example (1/5)-(4/5)
Separate lemons from oranges:
- Orange: color orange, spherical shape, diameter ±8 cm, weight ±0.1 kg
- Lemon: color yellow, ellipsoid shape, diameter ±8 cm, weight ±0.1 kg
Use color and shape as features: in the (shape, color) feature space the oranges and lemons form two separate clusters. Model the given (training) data; the resulting classifier assigns a new data point to one of the classes: "It's an orange!"
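A minimal sketch of this idea (my own illustration, not from the slides; the feature values and the nearest-centroid model are assumptions):

```python
# Lemons vs. oranges as feature vectors; values are made up for illustration.
import numpy as np
from sklearn.neighbors import NearestCentroid

# Each fruit is a point in a 2-D feature space: (shape, color).
# shape: 0 = sphere ... 1 = ellipsoid; color: 0 = yellow ... 1 = orange.
X_train = np.array([[0.1, 0.9], [0.2, 0.8], [0.15, 0.95],   # oranges
                    [0.8, 0.1], [0.9, 0.2], [0.85, 0.15]])  # lemons
y_train = np.array(["orange", "orange", "orange",
                    "lemon", "lemon", "lemon"])

clf = NearestCentroid()          # model each class by its cluster center
clf.fit(X_train, y_train)        # "model the given (training) data"

new_point = np.array([[0.2, 0.85]])
print(clf.predict(new_point))    # -> ['orange']  ("It's an orange!")
```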

Slide 7: Supervised learning: example (5/5)
What if we had chosen the wrong features? With diameter and weight as features the two classes overlap completely, and a new data point cannot be assigned to either class.

Slide 8: Supervised learning - Summary
- Choose distinctive features.
- Make a model based on labeled data (a.k.a. supervised learning).
- Use the learned model to predict the class of new, unseen data points.

Slide 9: Models for classification
- Random Forests
- k Nearest Neighbours (k-NN)
- Boosting
- Neural Networks, Convolutional Neural Networks (CNN) & Deep learning (find the Stanford lectures CS231n on YouTube, 19 videos!)

Slide 10: Support Vector Machines (1)
Find a hyperplane that separates the classes with a maximum margin.

Slide 11: Support Vector Machines (2)
- Based on empirical risk minimization (1960s)
- Non-linearity added in 1992 (Boser, Guyon & Vapnik)
- Soft-margin SVM introduced in 1995 (Cortes & Vapnik)
- Has become very popular since then: easy to use, many open libraries available, fast learning and very fast classification, good generalization properties

Slide 12: Support Vector Machines (3)
How to find the optimal hyperplane? Not every separating hyperplane is optimal: a hyperplane that merely separates the training data may lie very close to some of the samples.

Slides 13-14: Support Vector Machines (4)
How to find the optimal hyperplane? The separating hyperplane is w·x + b = 0, with margin boundaries w·x + b = +1 and w·x + b = -1. The width of the margin is

    m = 2 / ||w||

so maximizing the margin is equivalent to minimizing ||w||.
Support vectors: vectors for which the constraint y_i(w·x_i + b) >= 1 is exactly met (equality). These vectors "support" the dividing hyperplane; removal of one of them typically leads to a different optimal hyperplane.

Slide 15: Support Vector Machines (5)
We can rewrite the optimization problem as a Quadratic Programming problem:

    minimize (1/2)||w||^2  subject to  y_i(w·x_i + b) >= 1 for all i.

Efficient methods are available to solve this problem!

Slide 16: Support Vector Machines (6)
The data is usually not linearly separable. Introduce slack variables ξ_i >= 0 and put a cost C on crossing the margin, so the optimization problem becomes

    minimize (1/2)||w||^2 + C Σ_i ξ_i  subject to  y_i(w·x_i + b) >= 1 - ξ_i,  ξ_i >= 0.

Slide 17: Non-linear SVMs (1)
A more complex extension: non-linear SVMs. Basic idea: map the data to a higher-dimensional space, in which we can apply a linear SVM.

Slide 18: Non-linear SVMs (2)
Map the data to a higher dimension. Example: data that is not linearly separable in its original space can become linearly separable after adding a new dimension as a function of the data (for instance x ↦ (x, x²)).
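The soft-margin problem of slides 15-16 is solved by off-the-shelf QP solvers; a minimal sketch using scikit-learn's SVC (my choice of library and data, not named on the slides):

```python
# Soft-margin linear SVM on overlapping synthetic data.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.7, (50, 2)),    # class -1
               rng.normal(+1, 0.7, (50, 2))])   # class +1 (overlapping)
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1.0)   # C = cost of crossing the margin
clf.fit(X, y)                       # solves the QP internally

print(clf.support_vectors_.shape)   # the support vectors found
print(clf.predict([[0.5, 0.5]]))    # sign(w.x + b) for a new point
```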

Slide 19: Non-linear SVMs (3)
Problems with the mapping φ:
- How to find a good mapping?
- The number of dimensions can blow up: computationally expensive.
- Data becomes sparser in a higher-dimensional space.
Solution: kernel functions! The mapping only occurs in the dual problem as an inner product φ(x_i)·φ(x_j).

Slide 20: Non-linear SVMs (4)
Kernel functions: do not define φ explicitly, only define the inner product K(x_i, x_j) = φ(x_i)·φ(x_j) (the kernel function). Note that φ can be infinite-dimensional, but since we only use the inner product, it is not more computationally expensive!
Can we choose any kernel function we like? No! It needs to satisfy Mercer's condition. Typically you pick your kernel from a set of commonly used options.

Slide 21: Non-linear SVMs (5)
Popular kernel functions:
- Polynomial: K(x_i, x_j) = (x_i·x_j + 1)^d
- Radial Basis Function (RBF): K(x_i, x_j) = exp(-||x_i - x_j||^2 / (2σ^2))
- Sigmoid: K(x_i, x_j) = tanh(κ x_i·x_j + δ)
Find the optimal kernel parameters using cross-validation on the training data.

Slide 22: Classification of new data - Linear SVM
Straightforward: we have the function f(x) = w·x + b, and from training we know w and b. The constraints enforced that f(x_i) >= 1 for positive samples and f(x_i) <= -1 for negative samples. For the soft-margin SVM these constraints can be violated. Hence, we use the sign of f(x) to predict the label: y = sign(w·x + b).

Slide 23: Classification of new data - Non-linear SVM
Classification function: f(x) = w·φ(x) + b. Problem: w exists in some high-dimensional space, and typically the mapping φ to this space is unknown, since we use a kernel function K(x_i, x_j). Solution: w can be written as w = Σ_i α_i y_i φ(x_i) (representer theorem), hence:

    f(x) = Σ_i α_i y_i K(x_i, x) + b

Slide 24: Classification of new data - Non-linear SVM
Classify a new point x using sign(Σ_i α_i y_i K(x_i, x) + b). The Lagrange multipliers α_i are only non-zero for the support vectors. Required at test time: only α_i, y_i, b and the support vectors.
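A plain-numpy sketch of the kernels of slide 21 and the decision rule of slide 24 (parameter names and default values are my assumptions):

```python
import numpy as np

def poly_kernel(xi, xj, d=2):
    # K(xi, xj) = (xi.xj + 1)^d
    return (xi @ xj + 1.0) ** d

def rbf_kernel(xi, xj, sigma=1.0):
    # K(xi, xj) = exp(-||xi - xj||^2 / (2 sigma^2))
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, kappa=1.0, delta=0.0):
    # K(xi, xj) = tanh(kappa * xi.xj + delta)
    return np.tanh(kappa * (xi @ xj) + delta)

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """sign( sum_i alpha_i * y_i * K(x_i, x) + b ), support vectors only."""
    f = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors)) + b
    return np.sign(f)
```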

Slides 25-28: Cost parameter & generalization (1)-(4)
[Figures: the optimal hyperplane and the resulting SVM decision for C = 100, C = 10, C = 1 and C = 0.1. A high C penalizes margin violations heavily; lowering C allows more points to cross the margin, yielding a wider margin.]

Slide 29: Non-linear SVM examples
[Figures: non-linear SVM decision boundaries on example data sets.]

Slide 30: Summary
- Fast and efficient method for binary classification.
- Splits the classes by maximizing the margin.
- The optimal hyperplane can be computed using Quadratic Programming.
- Cost parameter C penalizes points crossing the margin.
- Non-linear SVMs can also handle more complex class distributions by mapping the data to another space.
- Kernel functions typically increase complexity.
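As slide 21 suggests, C and the kernel scale are typically chosen by cross-validation on the training data; a hedged sketch using scikit-learn's GridSearchCV on a synthetic data set (library and data set are my choices):

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons

X, y = make_moons(n_samples=200, noise=0.25, random_state=0)

grid = GridSearchCV(
    SVC(kernel="rbf"),
    param_grid={"C": [0.1, 1, 10, 100], "gamma": [0.1, 1, 10]},
    cv=10,                      # 10-fold cross-validation on training data
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)   # chosen C/gamma and CV accuracy
```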

Slide 31: Random Forest (1)
Build decision trees on subsets of the data and let the trees vote on the class of a new sample, e.g. five trees predicting: Orange (60%) / Lemon (40%), Orange (95%) / Lemon (5%), Orange (72%) / Lemon (28%), Orange (35%) / Lemon (65%), Orange (84%) / Lemon (16%).

Slide 32: Random Forest (2)
- General model for machine learning: density estimation, regression, classification, ...
- Robustness through randomness: a random subset is used to train each tree; for training a tree, each node receives a random set of split options.
- Probabilistic output: models uncertainty!
- Automatic feature selection; naturally multi-class.
- Runs efficiently: trees can run in parallel.

Slide 33: A general tree structure
A forest consists of trees. Start at the root node; a true/false question is asked at each internal (split) node; stop when a terminal (leaf) node is reached: the prediction. [Figure: binary tree with nodes numbered 0 (root) through 14 (leaves).]

Slide 34: Example: GUESS WHO* (*credits to Mark Janse)
Root: "Is it a male?"; next splits: "Does he have a beard?", "Does he wear glasses?"; leaves: Jake, Joshua, Mike or Justin.

Slide 35: How to build a tree?
A tree is a special type of graph: a collection of nodes and edges, here a Directed Acyclic Graph (DAG) with internal (split) nodes and terminal (leaf) nodes. The upper/start node is called the root. Each internal node has one incoming edge and two outgoing edges; all nodes (except the root) have exactly one incoming edge.

Slide 36: Mathematical notation
- Data point: v = (x_1, x_2, ..., x_d), with label y
- Features x_i, dimensionality d
- Binary split function: h(v, θ_j): R^d → {0, 1}, sending a point to the left or right child
- Split parameters of node j: θ_j ∈ T, the set of possible parameters
- Training points reaching node j: S_j; S_0 is the complete training set, S_j^L and S_j^R are the subsets sent left and right

Slides 37-38: How to split the data?
The split function h(v, θ_j) depends on parameters θ = (φ, ψ, τ):
- φ: feature selection function (selects the features used by the split)
- ψ: geometric primitive (e.g. a line)
- τ: thresholds
The parameter space T contains the options we have for φ, ψ and τ. Possible weak learners (2-D examples):
- Axis-aligned hyperplane: threshold a single feature; note that setting either threshold to ±∞ corresponds to using only one threshold.
- Oriented hyperplane: a general line in feature space.
- Quadratic surface: a matrix ψ representing a conic.

Slide 39: How to determine the best split?
Maximize the information gain*:

    I(S_j, θ) = H(S_j) - Σ_{i∈{L,R}} (|S_j^i| / |S_j|) H(S_j^i)

with Shannon's entropy H(S) = -Σ_c p(c) log p(c). Node training:

    θ_j* = argmax_{θ ∈ T_j} I(S_j, θ)

*One of many options. Other popular choices: (1) Gini's diversity index, (2) misclassification error.

Slides 40-42: What is the best split?
[Figures: a data set with entropy H(S) = 1.386 is split by two candidate splits. One split (48 vs. 52 points, child entropies 0.943 and 0.990) gives information gain I = 0.419; the other (50 vs. 50 points, child entropies 0.693 and 0.693) gives I = 0.693 and is the BEST SPLIT of these two options.]
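A small numpy sketch of slides 39-42 (my own illustration): Shannon entropy with the natural logarithm, which reproduces H(S) = 1.386 for four equally frequent classes, and the information gain of a candidate split:

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_c p(c) * log p(c), natural logarithm."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def information_gain(labels, left_mask):
    """I(S, theta) = H(S) - sum_{i in {L,R}} |S_i|/|S| * H(S_i)."""
    left, right = labels[left_mask], labels[~left_mask]
    w_l, w_r = len(left) / len(labels), len(right) / len(labels)
    return entropy(labels) - w_l * entropy(left) - w_r * entropy(right)

# Four equally frequent classes: H(S) = ln(4) = 1.386, as on slide 40.
labels = np.repeat([0, 1, 2, 3], 25)
print(entropy(labels))  # ~1.386
```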

Slide 43: What is the best split?
The one that yields the highest information gain from a given set of candidate splits:

    θ_j* = argmax_{θ ∈ T_j} I(S_j, θ)

with the split function parameters θ drawn from a limited set T_j of parameter settings: Randomized Node Optimization (RNO). The random subset of the data per tree is called bagging.

Slide 44: How to train a decision tree?
- Start with a random subset of all the data at the root node.
- Find the split parameters, from a set of randomly chosen options, that maximize some split metric.
- Repeat this for the outgoing nodes and stop growing a branch until one of the following two criteria holds:
  - A pre-defined tree depth D is reached (# nodes of a branch); alternatively, a pre-defined total number of nodes is reached.
  - All training samples in the node are from the same class.
A code sketch of this training loop follows after the example slides below.

Slides 45-48: Example: growing a tree (1)-(4)
Let's grow a tree with depth D = 2 on a subset of all available data. Start at the root node; at each node, three candidate splits (Option 1, Option 2, Option 3) are evaluated and the one with the highest information gain is kept. [Figures: the candidate splits at the root node and at its two children.]
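A simplified sketch of the training loop (my own code, not the lecture's; it reuses entropy/information_gain from the previous sketch and uses axis-aligned candidate splits only):

```python
import numpy as np

def grow_tree(X, y, depth=0, max_depth=2, n_candidates=3, rng=None):
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    # Stop: maximum depth D reached, or node is pure (single class).
    if depth == max_depth or len(classes) == 1:
        return {"leaf": dict(zip(classes, counts / counts.sum()))}

    best = None
    for _ in range(n_candidates):                      # RNO: random options
        f = int(rng.integers(X.shape[1]))              # random feature (phi)
        t = rng.uniform(X[:, f].min(), X[:, f].max())  # random threshold (tau)
        mask = X[:, f] < t
        if mask.all() or not mask.any():               # skip degenerate splits
            continue
        gain = information_gain(y, mask)
        if best is None or gain > best[0]:
            best = (gain, f, t)

    if best is None:                                   # no usable split found
        return {"leaf": dict(zip(classes, counts / counts.sum()))}

    _, f, t = best
    mask = X[:, f] < t
    return {"feature": f, "threshold": t,
            "left":  grow_tree(X[mask], y[mask], depth + 1, max_depth, n_candidates, rng),
            "right": grow_tree(X[~mask], y[~mask], depth + 1, max_depth, n_candidates, rng)}
```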

Slide 49: Example: growing a tree (5)
[Figure: the resulting tree of depth D = 2 and the corresponding partitioning of the feature space.]

Slides 50-52: Example: classify a new data point (1)-(3)
A new data point v is dropped down the tree: at each split node the test h(v, θ_j) sends it left or right, until it reaches a leaf, whose stored class distribution provides the prediction.

Slides 53-54: Decision forest model
The components of the decision forest model:
- Node test parameters: features / split function / thresholds, θ = (φ, ψ, τ)
- Node objective function: the (energy) function to minimize
- Node weak learner: the split node test function h(v, θ_j)
- Leaf predictor model: e.g. point estimate / full distribution
- Randomness model: methods for inserting randomness, e.g. bagging, RNO
- Stopping criteria: when to stop splitting the data, e.g. max tree depth
- Forest size: the number of trees T in the forest (a collection of trees: a forest!)
- Ensemble model: how to combine the output of all the trees in the forest, e.g. averaging p(c|v) = (1/T) Σ_t p_t(c|v)

Slides 55-57: Decision forest model - How to add randomness? (Randomness model)
1. Bagging (randomized training set): a subset of all data points per tree. S_0 is the full training set; a randomly sampled subset is used for training each tree t (forest training).
2. Randomized Node Optimization (RNO): features chosen with the selection function φ, split function depending on the weak learner orientation ψ, thresholds τ given in T_j. T is the full set of all possible node test parameter values θ = (φ, ψ, τ); T_j is the set of randomly sampled parameter values used to train node j. Randomness control parameter: ρ = |T_j|, with ρ = |T| giving low randomness and ρ = 1 giving high randomness.

Slide 58: Decision forest model - How to compute a prediction from a trained tree?
Probability distribution at the leaf: p_t(c|v). Point estimate, e.g. M.A.P.: c* = argmax_c p_t(c|v). Generally the full distribution is preserved until the decision moment, to incorporate uncertainty.

Slides 59-60: Decision forest model - How to combine tree output?
The outputs p_1(c|v), ..., p_T(c|v) of trees 1...T can be combined by:
- Averaging: p(c|v) = (1/T) Σ_t p_t(c|v)
- Multiplication: p(c|v) = (1/Z) Π_t p_t(c|v), where Z is a partition function to ensure probabilistic normalization. Multiplication is overconfident and less robust to noise.
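A short numpy sketch of the two ensemble rules (my own illustration, reusing the five tree outputs of slide 31):

```python
import numpy as np

# p[t, c] = p_t(c | v): each row is one tree's leaf distribution over classes.
p = np.array([[0.60, 0.40],
              [0.95, 0.05],
              [0.72, 0.28],
              [0.35, 0.65],
              [0.84, 0.16]])

avg = p.mean(axis=0)                 # (1/T) * sum_t p_t(c|v)
prod = p.prod(axis=0)
prod /= prod.sum()                   # 1/Z ensures probabilistic normalization

print(avg)   # averaging: Orange ~0.69, a robust ensemble vote
print(prod)  # multiplication: ~0.995, overconfident toward one class
```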

Slides 61-64: Example: generalization / the effect of randomness
[Figures: training points and the resulting forest decision boundaries for the three weak learners (axis-aligned, oriented line, conic section), at tree depths D = 5 and D = 13. Deeper trees and more flexible weak learners fit the training points more tightly; randomness smooths the decision boundaries.]

Slides 65-66: Random Forests - Classification example (1)-(2)
[Figures: 2 classes in feature space with the random forest decision (N = 100 trees, max number of nodes = 5, # candidate splits per node = 3); 4 classes in feature space with the random forest decision (N = 100 trees, max number of nodes = 4, # candidate splits per node = 3).]

Slide 67: Random Forests - Classification example (3)
[Figure: 4 classes in feature space with the random forest decision; N = 100 trees, max number of nodes = 10, # candidate splits per node = 8.]

Slide 68: Example: handwritten digit classification
The MNIST Database of Handwritten Digit Images for Machine Learning Research, DOI:10.1109/MSP.2012.2211477.

Slide 69: Example: handwritten digit classification
- Task: classify handwritten digits 1, 2, 3, 4, 5
- 500 samples per digit: 250 for training, 250 for testing
- HOG* features as data points, v ∈ R^144
- Forest parameters: forest size of 300 trees; axis-aligned weak learner; randomness parameter ρ = 5, with ... = 10; selection function φ randomly samples dimensions from v
*Histogram of Oriented Gradients, Dalal & Triggs, CVPR 2005

Slides 70-72: Example: handwritten digit classification
[Figures: forest predictions per class, sorted by prediction confidence; inverted digits are wrongly classified.]
- Class 1: Precision = 0.96 / Recall = 0.97
- Class 2: Precision = 0.99 / Recall = 0.92
- Class 3: Precision = 0.96 / Recall = 0.97
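A loosely analogous sketch (my own code, not the lecture's: scikit-learn's 8x8 digits instead of MNIST, and simplified HOG settings via scikit-image, both my substitutions):

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from skimage.feature import hog
import numpy as np

digits = load_digits()                        # 8x8 digit images
X = np.array([hog(img, pixels_per_cell=(4, 4), cells_per_block=(1, 1))
              for img in digits.images])      # HOG feature vector per image
y = digits.target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

forest = RandomForestClassifier(
    n_estimators=300,        # forest size: 300 trees
    max_features=5,          # randomly sampled feature dims per split (RNO-like)
    random_state=0,
)
forest.fit(X_tr, y_tr)
proba = forest.predict_proba(X_te)            # averaged leaf distributions
print(forest.score(X_te, y_te))               # test accuracy
```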

Slides 73-74: Example: handwritten digit classification
[Figures: forest predictions per class, sorted by prediction confidence; inverted digits are wrongly classified.]
- Class 4: Precision = 0.97 / Recall = 1.00
- Class 5: Precision = 0.95 / Recall = 0.97

Slide 75: Recommended literature
- Criminisi, A., Decision Forests for Computer Vision and Medical Image Analysis, 2013, doi:10.1007/978-1-4471-4929-3. Ch. 3: Introduction; Ch. 4: Classification Forests. C++ library: Sherwood, http://research.microsoft.com/enus/projects/decisionforests/
- Breiman, L., "Random forests", Mach. Learn. 45(1), 2001, doi:10.1023/a:1010933404324.

Slide 76: Conclusions
- Random Forests offer an attractive method for classification: inherently multi-class, probabilistic output, efficient implementations available.
- A forest is a collection of decision trees; each tree is trained with a different subset of the training data (bagging).
- A tree is a collection of nodes and edges. Each internal node splits the incoming data using the node split function h(v, θ), which encompasses the selection function φ, geometric primitive ψ and thresholds τ. Each node receives a random subset of the parameter space for training (RNO).
- Randomness increases robustness. The randomness control parameter ρ determines the amount of randomness: maximum randomness when ρ = 1, minimum randomness when ρ = |T|.
- Tree depth D controls the forest confidence, hence a high D can lead to overfitting.

Slide 77: So, now we have a model; how good is it?
We have labeled data (ground truth), so we can validate! Model validation:
- Separate sets for training and testing the model.
- Train the model using the training set.
- Use the test set to evaluate the performance.
- Compute figures of merit, which indicate the performance.
What is a good performance metric? And how should we split the data?

Slide 78: Some popular figures of merit
- Accuracy = (#TP + #TN) / (#TP + #FN + #TN + #FP)
- Sensitivity = #TP / (#TP + #FN), a.k.a. True Positive Rate
- Specificity = #TN / (#TN + #FP), a.k.a. True Negative Rate
where:
- True Positive (TP): positive sample classified as positive
- True Negative (TN): negative sample classified as negative
- False Positive (FP): negative sample classified as positive
- False Negative (FN): positive sample classified as negative
A small sketch computing these counts follows below.
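A small sketch of the figures of merit of slide 78 (my own illustration, with made-up labels):

```python
import numpy as np

def figures_of_merit(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))  # positives classified positive
    tn = np.sum((y_true == 0) & (y_pred == 0))  # negatives classified negative
    fp = np.sum((y_true == 0) & (y_pred == 1))  # negatives classified positive
    fn = np.sum((y_true == 1) & (y_pred == 0))  # positives classified negative
    return {
        "accuracy":    (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1])
print(figures_of_merit(y_true, y_pred))  # sensitivity 0.80, specificity 0.60
```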

Slide 79: Receiver Operating Characteristic (ROC)
Sensitivity / specificity give the performance for just one possible setting (i.e. decision threshold) of the model. We can vary this threshold and recompute these performance metrics. This yields a curve of possible combinations of sensitivity and specificity, called the ROC curve. Generally true: as sensitivity goes up, specificity goes down, and vice versa.

Slides 80-83: How to compute the ROC curve?
For each sample we have a predicted class and a score. Sort the samples according to score and move the decision threshold; at each position, recompute sensitivity and specificity and plot the point (1 - specificity, sensitivity). [Figures: five positive and five negative samples sorted by predicted score, with the threshold at successive positions, e.g.:]
- Sensitivity = 5 / (5+0) = 1.00, Specificity = 3 / (3+2) = 0.60
- Sensitivity = 4 / (4+1) = 0.80, Specificity = 3 / (3+2) = 0.60
- Sensitivity = 4 / (4+1) = 0.80, Specificity = 4 / (4+1) = 0.80
- Sensitivity = 0 / (0+5) = 0.00, Specificity = 5 / (5+0) = 1.00
The Area Under the Curve (AUC) summarizes the ROC curve in one number; here AUC = 0.84.

Slide 84: How should we split the data?
Large data set: randomly sample half the samples for training and half for testing. Training and testing is time-consuming for large data sets, but the test set is probably a good reflection of the training set. [Figure: data set with labels split into training data and test data; the model is trained on the training data, and its predicted labels on the test data are compared against the ground-truth labels to measure performance.]
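A numpy sketch of this threshold sweep and the AUC (my own illustration; using the trapezoidal rule for the area is my choice):

```python
import numpy as np

def roc_curve_points(y_true, scores):
    thresholds = np.sort(np.unique(scores))[::-1]
    pts = [(0.0, 0.0)]                         # threshold above all scores
    for t in thresholds:                       # move the threshold down
        y_pred = (scores >= t).astype(int)
        tp = np.sum((y_true == 1) & (y_pred == 1))
        fn = np.sum((y_true == 1) & (y_pred == 0))
        tn = np.sum((y_true == 0) & (y_pred == 0))
        fp = np.sum((y_true == 0) & (y_pred == 1))
        pts.append((fp / (fp + tn), tp / (tp + fn)))  # (1 - spec., sens.)
    pts.append((1.0, 1.0))                     # threshold below all scores
    return np.array(pts)

y_true = np.array([1, 1, 0, 1, 1, 0, 1, 0, 0, 0])   # sorted by score below
scores = np.array([0.95, 0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3, 0.1])
pts = roc_curve_points(y_true, scores)
auc = np.trapz(pts[:, 1], pts[:, 0])           # area under the ROC curve
print(auc)                                     # 0.84 for this made-up example
```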

Slide 85: How should we split the data?
Different choices might lead to different results. K-fold cross-validation:
- Split the data in K equally sized parts.
- Use K-1 parts for training and the left-out part of the data for testing.
- Repeat this for each part and average the performance.
A code sketch of this procedure follows below.

Slide 86: Leave-One-Out Cross-Validation
Leave one sample out of the complete set and use the remaining set to train the model. Test the model on the left-out sample. Repeat this for all samples. This gives the best performance indication for a small data set: you want to use as much of the little data you have for training the model.

Slides 87-90: EXAMPLE: 4-fold cross-validation (1)-(4)
Split the data in 4 equally sized partitions; each fold in turn serves as the test set while the remaining three form the training set.
- Fold 1: Accuracy = 0.86
- Fold 2: Accuracy = 0.86
- Fold 3: Accuracy = 0.84
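A sketch of K-fold cross-validation with scikit-learn (my own illustration, on synthetic data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

scores = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=4)
print(scores)                                         # accuracy per fold
print(f"{scores.mean():.2f} +/- {scores.std():.3f}")  # mean +/- stdev, cf. slide 92
```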

Slides 91-92: EXAMPLE: 4-fold cross-validation (5)-(6)
- Fold 4: Accuracy = 0.88
Combining the four folds (Acc. = 0.86, 0.86, 0.84, 0.88): 4-fold cross-validation accuracy = 0.86 ± 0.016 (mean ± stdev).

Slides 93-94: Generalization: under- and overfitting
Why don't we evaluate on the training set? Example: a classifier with no errors on the training set (100% accuracy). Is this a good classifier? NO! Very poor generalization: on new, identically distributed data it reaches only 81% accuracy. Overfitting!

Slide 95: Generalization: under- and overfitting
Example: a classifier with many errors on the training set (86% accuracy). Is this a good classifier? NO! Model complexity too low: underfitting! On new, identically distributed data: 84% accuracy (≈ train accuracy!).

Slide 96: Generalization: under- and overfitting
Example: accuracy on the training set: 94%; accuracy on the test set: 95%. Approximately equal train and test error: good generalization! YES!

Slide 97: Generalization: under- and overfitting
Model complexity: what is a good model? A model with good generalization: good prediction accuracy on both the training and the test set! [Figure: prediction error vs. model complexity; the training error keeps decreasing with complexity, while the prediction (test) error reaches a minimum at sufficient complexity and rises again beyond it.]

Slide 98: Generalization: under- and overfitting
Model complexity: what is a good model? Example: non-linear SVM with a fixed cost parameter C. Complexity increases with reducing the size of the kernel scale (flexibility). Use 10-fold cross-validation to estimate the test error; validate on the training set for computing the train error. [Figure: prediction error (%) vs. model complexity, from low to high complexity.]

Slide 99: Summary
- In supervised learning the ground truth is available, so we can evaluate the prediction performance of the model.
- Split the data in two sets (training set and test set).
- Use figures of merit for measuring the performance: Accuracy, Sensitivity, Specificity, AUC, ...
- Use K-fold cross-validation for reliable evaluation.
- Increasing the model complexity may lead to overfitting! Poor generalization: low training-set error, high test-set error.