Advanced Video Content Analysis and Video Imaging (5LSH0), Module 09: Semantic-level content analysis and classification II


Semantic-level content analysis and classification II
Sveta Zinger & Fons van der Sommen
Video Coding and Architectures Research group, TU/e (s.zinger@tue.nl)

Sequential data / Introduction (1)
When do we encounter sequential data? In measurements of time series: rainfall measurements on successive days, daily values of a currency exchange rate, acoustic features used for speech recognition, sequences of DNA elements, sequences of characters in a language.

Sequential data / Introduction (2), (3)
[Figures: example of sequential data, the spectrogram of the spoken words "Bayes' theorem".]

Sequential data / Introduction (4)
Sequential distributions. Stationary: the data evolves in time, but the distribution from which it is generated remains the same. Nonstationary (not treated here): the distribution itself evolves with time.

Markov models (1)
Prediction of the next value in a time series: recent observations are likely to be more informative than older ones. It is impractical to consider a general dependence of future observations on all previous observations, because the complexity of the model grows as the number of observations increases. Markov models assume that future predictions are independent of all but the most recent observations.
If we ignore the sequential aspects of the data and treat the observations as i.i.d. (independent and identically distributed), this corresponds to a graph without links and fails to exploit the sequential patterns in the data, i.e. the correlations between observations that are close in the sequence.

Markov models (2)
Example of sequential patterns in data: we observe a binary variable denoting whether it rained on a particular day, and we want to predict whether it will rain on the next day. If we treat the data as i.i.d., we only have the relative frequency of rainy days. In practice the weather exhibits trends that may last for several days, so knowing that it rains today helps to predict rain for tomorrow.

Markov models (3)
The product rule applied to the joint distribution of a sequence of observations gives
p(x_1, \ldots, x_N) = \prod_{n=1}^{N} p(x_n \mid x_1, \ldots, x_{n-1}).
If each of the conditional distributions on the right-hand side is independent of all previous observations except the most recent one, we obtain a first-order Markov chain.

Markov models (4)
A first-order Markov chain of observations {x_n}, in which the distribution p(x_n \mid x_{n-1}) of a particular observation x_n is conditioned on the value of the previous observation x_{n-1}.

Markov models (5)
A first-order Markov chain defines the joint distribution of a sequence of N observations as
p(x_1, \ldots, x_N) = p(x_1) \prod_{n=2}^{N} p(x_n \mid x_{n-1}).
Given all previous observations, the conditional distribution of an observation is
p(x_n \mid x_1, \ldots, x_{n-1}) = p(x_n \mid x_{n-1}),
so the distribution of predictions depends only on the value of the immediately preceding observation and is independent of all earlier observations.

Markov models (6)
A homogeneous Markov chain assumes a stationary time series, so the conditional distributions are constrained to be equal. If the conditional distributions depend on adjustable parameters, then all of the conditional distributions in the chain share the same values of those parameters.

Markov models (7)
Higher-order Markov chains: trends in the data over several successive observations provide important information for predicting the next value, and a first-order Markov chain is still very restrictive, so we can move to higher-order Markov chains. An M-th order Markov chain increases flexibility, but also increases the number of parameters in the model, which becomes impractical for large values of M.
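As a concrete illustration of the first-order joint distribution above, the following minimal numpy sketch evaluates p(x_1, ..., x_N) for the rain example. The transition matrix A, the initial distribution pi and the observed sequence are made-up values, not taken from the slides.

```python
import numpy as np

# States: 0 = dry, 1 = rain. Hypothetical parameters for illustration only:
# A[j, k] = p(x_n = k | x_{n-1} = j), pi[k] = p(x_1 = k).
A = np.array([[0.8, 0.2],
              [0.4, 0.6]])
pi = np.array([0.7, 0.3])

def joint_log_prob(sequence, pi, A):
    """log p(x_1, ..., x_N) = log p(x_1) + sum_n log p(x_n | x_{n-1})
    for a first-order homogeneous Markov chain."""
    logp = np.log(pi[sequence[0]])
    for prev, curr in zip(sequence[:-1], sequence[1:]):
        logp += np.log(A[prev, curr])
    return logp

# Probability of observing dry, dry, rain, rain, dry
seq = [0, 0, 1, 1, 0]
print(np.exp(joint_log_prob(seq, pi, A)))   # 0.7 * 0.8 * 0.2 * 0.6 * 0.4
```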

Markov models (8)
Second-order Markov chain: the joint distribution is given by
p(x_1, \ldots, x_N) = p(x_1) p(x_2 \mid x_1) \prod_{n=3}^{N} p(x_n \mid x_{n-1}, x_{n-2}),
so the conditional distribution of a particular observation depends on the values of the two previous observations.

Markov models (9)
How to build a model that is not limited by the Markov assumption and still has a limited number of parameters? Latent variables permit a rich class of models to be constructed from simple components: for each observation x_n we introduce a corresponding latent variable z_n, and we now assume that it is the latent variables that form a Markov chain.

Markov models (10)
Representation of sequential data using a Markov chain of latent variables, with each observation conditioned on the state of the corresponding latent variable: this is the foundation for the HMM (Hidden Markov Model) and for linear dynamical systems.

Markov models (11)
In a Markov chain with latent variables there is always a path connecting any two observed variables via the latent variables, and this path is never blocked. Predictions therefore depend on all previous observations, and the observed variables do not satisfy the Markov property of any order.

Hidden Markov Models (1)
A Hidden Markov Model (HMM) can be viewed as a Markov chain with discrete latent variables. Examining a single slice of the HMM, it corresponds to a mixture distribution with component densities given by p(x \mid z); the choice of mixture component depends on the choice made for the previous observation.

Hidden Markov Models (2)
Applications of HMMs: speech recognition, natural language modeling, on-line handwriting recognition, analysis of biological sequences (DNA).

Hidden Markov Models (3)
Definition of the transition probabilities: we allow the probability distribution of z_n to depend on the state of the previous latent variable z_{n-1} through a conditional distribution p(z_n \mid z_{n-1}). Since the latent variables are K-dimensional binary variables, this conditional distribution corresponds to a table of transition probabilities with elements
A_{jk} = p(z_{nk} = 1 \mid z_{n-1,j} = 1).

Hidden Markov Models (4)
The transition matrix can be illustrated diagrammatically by drawing the states as nodes. The transition diagram shows a model whose latent variables have three possible states corresponding to the three boxes; the black lines denote the elements of the transition matrix.

Hidden Markov Models (5)
If we unfold the state transition diagram over time, we obtain a lattice, or trellis, representation of the latent states. Each column of this diagram corresponds to one of the latent variables z_n.

Hidden Markov Models (6)
The joint probability distribution over both latent and observed variables is
p(X, Z \mid \Theta) = p(z_1 \mid \pi) \left[ \prod_{n=2}^{N} p(z_n \mid z_{n-1}, A) \right] \prod_{m=1}^{N} p(x_m \mid z_m, \phi),
where p(z_1 \mid \pi) is the marginal distribution of the initial latent variable, with \pi the vector of probabilities for the initial latent variable, and p(x_m \mid z_m, \phi) are the conditional distributions of the observed variables, the emission probabilities, governed by the distribution parameters \phi.

Hidden Markov Models (7)
Variants of the standard HMM are obtained, for instance, by imposing constraints on the form of the transition matrix. The left-to-right HMM is of particular practical importance: it sets the elements A_{jk} of the transition matrix to zero if k < j.

Hidden Markov Models (8)
Example of the state transition diagram for a three-state left-to-right hidden Markov model. Left-to-right HMMs are used for speech recognition and on-line character recognition.
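To make the joint distribution above operational, the sketch below sums p(X, Z | Θ) over all latent paths with the standard forward recursion, giving the likelihood of an observation sequence under a discrete-emission HMM. The parameters pi, A and B are hypothetical values chosen purely for illustration.

```python
import numpy as np

# Minimal forward-algorithm sketch for an HMM with discrete emissions.
# A[j, k] = p(z_n = k | z_{n-1} = j), B[k, s] = p(x_n = s | z_n = k), pi[k] = p(z_1 = k).
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.2, 0.8]])
B  = np.array([[0.9, 0.1],
               [0.3, 0.7]])

def forward_likelihood(obs, pi, A, B):
    """Return p(x_1, ..., x_N) by summing the joint p(X, Z) over all latent paths."""
    alpha = pi * B[:, obs[0]]              # alpha_1(k) = pi_k * p(x_1 | z_1 = k)
    for x in obs[1:]:
        alpha = B[:, x] * (alpha @ A)      # recursion over the trellis columns
    return alpha.sum()

print(forward_likelihood([0, 1, 1, 0], pi, A, B))
```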

Hidden Markov Models (9)
Lattice diagram for a three-state left-to-right hidden Markov model, in which the state index k is allowed to increase by at most 1 at each transition.

Hidden Markov Models (10)
Example of a left-to-right HMM: handwritten digits. In on-line data a digit is represented as the trajectory of a pen as a function of time. An HMM is trained on 45 examples of the digit 2, with 16 states corresponding to line segments having one of 16 possible angles; the model parameters are optimized using 25 iterations of EM.

Hidden Markov Models (11)
Top row: examples of on-line handwritten digits. Bottom row: synthetic digits sampled generatively from a left-to-right HMM that has been trained on 45 handwritten digits.

Summary and conclusions
Markov chains assume dependency within a fixed neighbourhood; combined with latent variables they lead to the widely used hidden Markov fields. Hidden Markov models are a powerful method to model sequential data, but require their parameters to be estimated. They are invariant, to some degree, to local warping (compression and stretching) of the time axis: in speech recognition, variations in the speed of speech warp the time axis, and an HMM can accommodate such a distortion without penalizing it too heavily.

References
Christopher M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006; Chapters 8 and 13.

Supervised learning
Categorized / labeled data: objects in a picture (chair, desk, person, ...), handwritten digits (3, 6, 5), medical diagnosis (OCT image, T2 colon cancer). Goal: identify the class of a new data point using statistical modeling and machine learning, based on distinctive properties (features).

Supervised learning: example (1/5)
Separate lemons from oranges. An orange: color orange, spherical shape, diameter ± 8 cm, weight ± 0.1 kg. A lemon: color yellow, ellipsoid shape, diameter ± 8 cm, weight ± 0.1 kg.

Supervised learning: example (2/5)
Use color and shape as features. [Figure: the fruits plotted in the color versus shape feature space.]

Supervised learning: example (3/5)
Model the given (training) data. [Figure: a decision boundary separating the orange and lemon clusters in the color versus shape plane.]

Supervised learning: example (4/5)
A new data point arrives and falls on the orange side of the boundary; the classifier outputs: it's an orange! [Figure: the new point in the color versus shape plane.]

Supervised learning: example (5/5)
What if we had chosen the wrong features? [Figure: with diameter and weight as features the two classes overlap, and the class of a new data point cannot be determined.]

Supervised learning: summary
Choose distinctive features. Make a model based on labeled data (a.k.a. supervised learning). Use the learned model to predict the class of new, unseen data points.

Models for classification
Support Vector Machines (SVM), Random Forests, k-Nearest Neighbours (k-NN), Boosting, Neural Networks, Convolutional Neural Networks (CNN) and deep learning. For the latter, find the Stanford CS231n lectures on YouTube (19 videos)!

Support Vector Machines (1)
Find a hyperplane that separates the classes with a maximum margin. [Figure: two classes separated by a hyperplane, with the margin indicated.]

Support Vector Machines (2)
Based on empirical risk minimization (1960s); non-linearity was added in 1992 (Boser, Guyon & Vapnik); the soft-margin SVM was introduced in 1995 (Cortes & Vapnik) and has become very popular since then. Easy to use, with many open libraries available; fast learning and very fast classification; good generalization properties.

Support Vector Machines (3)
How to find the optimal hyperplane? [Figure: a separating hyperplane that lies close to the classes; optimal? No!]

Support Vector Machines (4)
Consider the hyperplanes w^T x + b = -1, w^T x + b = 0 and w^T x + b = 1. The width of the margin is the distance between the planes w^T x + b = 1 and w^T x + b = -1,
m = \frac{2}{\|w\|},
and we maximize this margin. Support vectors are the vectors for which the constraint is exactly met; these vectors support the dividing hyperplane, and removal of one of them typically leads to a different optimal hyperplane.
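The margin width can be read back from a trained linear SVM. The sketch below (toy data, not from the slides) fits a linear SVM with scikit-learn and computes 2/||w|| from the learned hyperplane; the large C value is an assumption made to approximate a hard margin.

```python
import numpy as np
from sklearn.svm import SVC

# Two well-separated Gaussian blobs as illustrative toy data.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=[-2, -2], size=(50, 2)),
               rng.normal(loc=[+2, +2], size=(50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="linear", C=1e3).fit(X, y)   # large C approximates a hard margin
w, b = clf.coef_[0], clf.intercept_[0]

print("margin width 2/||w|| =", 2.0 / np.linalg.norm(w))
print("number of support vectors:", len(clf.support_vectors_))
```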

Support Vector Machines (5)
We can rewrite the optimization problem as a Quadratic Programming problem: minimize \frac{1}{2}\|w\|^2 subject to y_i (w^T x_i + b) \ge 1 for all training samples. Efficient methods are available to solve this problem.

Support Vector Machines (6)
The data is usually not linearly separable, so we introduce slack variables \xi_i and put a cost C on crossing the margin. The optimization problem then becomes: minimize \frac{1}{2}\|w\|^2 + C \sum_i \xi_i subject to y_i (w^T x_i + b) \ge 1 - \xi_i and \xi_i \ge 0.

Non-linear SVMs (1)
A more complex extension: non-linear SVMs. Basic idea: map the data to a higher-dimensional space, in which we can apply a linear SVM.

Non-linear SVMs (2)
Map the data to a higher dimension. Example: add a new dimension as a function of the data. [Figure: data that is not linearly separable in the original space becomes separable after the mapping.]

Non-linear SVMs (3)
Problems with the mapping \phi: how to find a good mapping? The number of dimensions can blow up, it is computationally expensive, and the data becomes sparser in a higher-dimensional space. Solution: kernel functions! The mapping only occurs in the dual problem as an inner product.

Non-linear SVMs (4)
Kernel functions: do not define \phi explicitly, only define the inner product K(x_i, x_j) = \phi(x_i)^T \phi(x_j) (the kernel function). Note that \phi can be infinite-dimensional, but since we only use the inner product, it is not more computationally expensive. Can we choose any kernel function we like? No, it needs to satisfy Mercer's condition; typically you pick your kernel from a set of commonly used options.

[Presenter's note on slide 45 (F. van der Sommen, 24 Aug 2016): if possible, add a more intuitive example, e.g. circular data and a cone to raise the surrounding ring-shaped cluster.]
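In the spirit of the note above, here is a minimal sketch (made-up toy data, not part of the slides) of exactly that idea: a ring around a central cluster is not linearly separable in 2-D, but appending the "cone" feature r^2 = x_1^2 + x_2^2 lifts the ring above the cluster, so a linear SVM can separate them in the lifted space.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
inner = rng.normal(scale=0.5, size=(100, 2))                     # central cluster
angles = rng.uniform(0, 2 * np.pi, 100)
outer = np.c_[3 * np.cos(angles), 3 * np.sin(angles)] \
        + rng.normal(scale=0.2, size=(100, 2))                   # surrounding ring
X = np.vstack([inner, outer])
y = np.array([0] * 100 + [1] * 100)

X_lifted = np.c_[X, (X ** 2).sum(axis=1)]                        # explicit mapping phi(x)

print(SVC(kernel="linear").fit(X, y).score(X, y))                # noticeably worse
print(SVC(kernel="linear").fit(X_lifted, y).score(X_lifted, y))  # close to 1.0 after lifting
```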

Non-linear SVMs (5)
Popular kernel functions:
Polynomial: K(x_i, x_j) = (x_i^T x_j + 1)^d
Radial Basis Functions (RBF): K(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / (2\sigma^2))
Sigmoid: K(x_i, x_j) = \tanh(\kappa\, x_i^T x_j + c)
Find the optimal parameters using cross-validation on the training data.

Classification of new data: linear SVM
Straightforward: we have the function f(x) = w^T x + b, and from training we know w and b. The constraints used during training enforce that f(x_i) \ge 1 for positive samples and f(x_i) \le -1 for negative samples; for a soft-margin SVM these constraints can be violated. Hence, we can use the sign to predict the label:
\hat{y} = \operatorname{sign}(w^T x + b).

Classification of new data: non-linear SVM
Classification function: f(x) = w^T \phi(x) + b. Problem: w exists in some high-dimensional space, and typically the mapping \phi(x) to this space is unknown, since we use a kernel function K(x_i, x_j) = \phi(x_i)^T \phi(x_j). Solution: by the representer theorem, w can be written as w = \sum_i \alpha_i y_i \phi(x_i), hence
f(x) = \sum_i \alpha_i y_i \phi(x_i)^T \phi(x) + b = \sum_i \alpha_i y_i K(x_i, x) + b.
We classify a new point using the Lagrange multipliers \alpha_i, which are only non-zero for the support vectors; required at test time are only the \alpha_i and the support vectors:
\hat{y} = \operatorname{sign}\left( \sum_i \alpha_i y_i K(x_i, x) + b \right).

Cost parameter & generalization (1), (2)
[Figures: the optimal hyperplane and the SVM decision for C = 100 and for C = 10.]
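The non-linear decision function above can be checked numerically. In the sketch below (assumed toy data), the kernel expansion over the support vectors reproduces scikit-learn's own prediction; note that dual_coef_ stores the products alpha_i * y_i.

```python
import numpy as np
from sklearn.svm import SVC

# Ring-shaped toy classes, chosen only to exercise the RBF kernel.
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0, 1, -1)

gamma = 0.5
clf = SVC(kernel="rbf", gamma=gamma, C=10.0).fit(X, y)

def rbf(a, b, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2), scikit-learn's RBF parameterization."""
    return np.exp(-gamma * np.sum((a - b) ** 2, axis=-1))

x_new = np.array([0.2, -0.3])
# f(x) = sum_i alpha_i y_i K(x_i, x) + b, summed over the support vectors only.
f = np.sum(clf.dual_coef_[0] * rbf(clf.support_vectors_, x_new, gamma)) + clf.intercept_[0]
print(np.sign(f), clf.predict(x_new[None, :])[0])   # same sign as the predicted label
```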

Cost parameter & generalization (3), (4)
[Figures: the optimal hyperplane and the SVM decision for C = 1 and for C = 0.1.]

Non-linear SVM examples
[Figures: non-linear SVM decision boundaries on more complex class distributions.]

Summary (SVM)
A fast and efficient method for binary classification. It splits the classes by maximizing the margin; the optimal hyperplane can be computed using Quadratic Programming. A cost parameter penalizes points crossing the margin. A non-linear SVM can also handle more complex class distributions by mapping the data to another space using kernel functions, which typically increases complexity.

Random Forest (1)
Build decision trees on subsets of the data and let the trees vote on the class of a new sample. [Figure: five trees voting, e.g. Orange (60%) / Lemon (40%), Orange (95%) / Lemon (5%), Orange (72%) / Lemon (28%), Orange (35%) / Lemon (65%), Orange (84%) / Lemon (16%).]

Random Forest (2)
A general model for machine learning: density estimation, regression, classification, ... Robustness is obtained through randomness: a random subset is used to train each tree, and for training a tree each node receives a random set of split options. Probabilistic output: the model expresses uncertainty. Automatic feature selection, naturally multi-class, and it runs efficiently since the trees can run in parallel.

A general tree structure
A forest consists of trees. Start at the root node, answer a true/false question at each split node, and stop when a leaf node is reached: the prediction. Terminology: root node, internal (split) node, terminal (leaf) node.

Example: GUESS WHO (credits to Mark Janse)
"Is it a male?", "Does he have a beard?", "Does he wear glasses?" lead to Jake, Joshua, Mike or Justin.

How to build a tree?
A tree is a special type of graph: a collection of nodes and edges forming a Directed Acyclic Graph (DAG), with internal (split) nodes and terminal (leaf) nodes. The upper/start node is called the root. Each internal node has one incoming edge and two outgoing edges, and all nodes (except the root) have exactly one incoming edge.

Mathematical notation
Data point: v = (x_1, \ldots, x_d) \in R^d, with label c; features x_i, dimensionality d. Binary split function: h(v, \theta): R^d \to \{0, 1\}. Split parameters of node j: \theta_j \in T, with T the set of possible parameters. Training points reaching node j: S_j, split into a left subset S_j^L and a right subset S_j^R. Complete training set: S_0.

How to split the data?
Axis-aligned hyperplane: h(v, \theta) = [\tau_1 > \phi(v) \cdot \psi > \tau_2]; for a 2-D example \phi(v) = (x_1, x_2, 1)^T and e.g. \psi = (1, 0, 0)^T, so the test applies two thresholds to a single feature. The split function h(v, \theta) depends on the parameters \theta = (\phi, \psi, \tau): the feature selection function \phi(v), a geometric primitive \psi (e.g. a line), and the thresholds \tau = (\tau_1, \tau_2). Note that setting either \tau_1 = \infty or \tau_2 = -\infty corresponds to using only one threshold. The parameter space T contains the options that we have for \phi(v), \psi and \tau.

Other weak learners: an oriented hyperplane, h(v, \theta) = [\tau_1 > \phi(v) \cdot \psi > \tau_2] with \phi(v) = (x_1, x_2, 1)^T and a general \psi \in R^3; and a quadratic surface, h(v, \theta) = [\tau_1 > \phi(v)^T \psi\, \phi(v) > \tau_2] with \psi \in R^{3 \times 3} representing a conic.
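A minimal sketch of the axis-aligned weak learner with two thresholds; the feature index, threshold values and data below are hypothetical, chosen only to show how a node partitions S_j into S_j^L and S_j^R.

```python
import numpy as np

def axis_aligned_split(V, feature, tau_lo, tau_hi):
    """Boolean mask: True if tau_hi > v[feature] > tau_lo (goes right), else left."""
    x = V[:, feature]
    return (x > tau_lo) & (x < tau_hi)

S_j = np.array([[0.2, 1.5],
                [0.9, 0.3],
                [1.7, 2.2],
                [2.5, 0.1]])
goes_right = axis_aligned_split(S_j, feature=0, tau_lo=0.5, tau_hi=2.0)
S_L, S_R = S_j[~goes_right], S_j[goes_right]    # left and right child sets
print(S_L, S_R, sep="\n")
# Setting tau_hi = np.inf (or tau_lo = -np.inf) reduces this to a single threshold.
```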

How to determine the best split?
Maximize the information gain*:
I(S, \theta) = H(S) - \sum_{i \in \{L, R\}} \frac{|S^i|}{|S|} H(S^i),
where the entropy is computed over the class labels (here: oranges vs. lemons). Shannon's entropy:
H(S) = -\sum_{c} p(c) \log p(c).
Node training: \theta_j = \arg\max_{\theta \in T_j} I(S_j, \theta).
*This is one of many options; other popular choices are (1) Gini's diversity index and (2) the misclassification error.

What is the best split?
[Figures: two candidate splits of the same node, one producing |S^L| = 48 and |S^R| = 52 and the other |S^L| = 50 and |S^R| = 50; their child entropies H(S^L), H(S^R) and information gains I(S, \theta) are compared.] The best split is the one that yields the highest information gain from a given set of candidate splits: node training selects \theta_j = \arg\max_{\theta \in T_j} I(S_j, \theta), where the split-function parameters \theta are drawn from a limited set of parameter settings T_j.

How to train a decision tree? (Bagging and Randomized Node Optimization)
Start with a random subset of all the data at the root node. Find the split parameters \theta, from a set of randomly chosen options T_j \subset T, that maximize some split metric. Repeat this for the outgoing nodes and stop growing a branch when one of the following criteria holds: a pre-defined tree depth D is reached (the number of node levels along a branch), or alternatively a pre-defined total number of nodes is reached, or all training samples in the node are from the same class.
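The entropy and information-gain computation can be written in a few lines of numpy. The label arrays below are invented examples: one candidate split produces fairly pure children, the other leaves them as mixed as the parent.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy H(S) = -sum_c p(c) log2 p(c) of a label set."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, left, right):
    """I(S, theta) = H(S) - |S_L|/|S| H(S_L) - |S_R|/|S| H(S_R)."""
    n = len(parent)
    return entropy(parent) - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)

S = np.array([0] * 50 + [1] * 50)                                        # 50 oranges, 50 lemons
split_a = (np.array([0] * 40 + [1] * 8), np.array([0] * 10 + [1] * 42))  # |S_L|=48, |S_R|=52
split_b = (np.array([0] * 25 + [1] * 25), np.array([0] * 25 + [1] * 25)) # |S_L|=50, |S_R|=50
print(information_gain(S, *split_a))   # fairly pure children -> higher gain
print(information_gain(S, *split_b))   # children as mixed as the parent -> zero gain
```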

Example: growing a tree (1)-(5)
Let's grow a tree with depth D = 2. Take a subset of all the available data and start at the root node. At the root, a few candidate splits (Option 1, Option 2, Option 3) are evaluated and the best one is kept; the same is then done for each of the two child nodes with their own candidate options, which yields the resulting tree. [Figures: the candidate splits at the root and at the left and right child nodes, and the resulting depth-2 tree.] A compact code sketch of this greedy procedure follows below.

Example: classify a new data point (1)
A new data point v is passed down the trained tree: at every split node the split function sends it left or right, until it reaches a leaf. [Figure: the left/right path of v through the tree.]
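A compact sketch of the greedy growing procedure, under stated assumptions: axis-aligned splits, an entropy criterion, and a handful of random candidate splits per node. All function and variable names are mine, not from the slides or the Sherwood library.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y, n_candidates=3):
    """Try a few random (feature, threshold) candidates and keep the best one (RNO)."""
    best, best_gain = None, -1.0
    for _ in range(n_candidates):
        f = rng.integers(X.shape[1])
        t = rng.uniform(X[:, f].min(), X[:, f].max())
        left, right = y[X[:, f] <= t], y[X[:, f] > t]
        if len(left) == 0 or len(right) == 0:
            continue
        gain = entropy(y) - (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if gain > best_gain:
            best, best_gain = (f, t), gain
    return best

def grow(X, y, depth=2):
    """Greedily grow a tree of at most `depth` levels; leaves store class posteriors."""
    if depth == 0 or len(np.unique(y)) == 1 or (split := best_split(X, y)) is None:
        classes, counts = np.unique(y, return_counts=True)
        return {"leaf": dict(zip(classes.tolist(), (counts / counts.sum()).tolist()))}
    f, t = split
    mask = X[:, f] <= t
    return {"feature": int(f), "threshold": float(t),
            "left": grow(X[mask], y[mask], depth - 1),
            "right": grow(X[~mask], y[~mask], depth - 1)}

X = np.vstack([rng.normal([0, 0], 1, (30, 2)), rng.normal([3, 3], 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)
print(grow(X, y, depth=2))
```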

Example: classify a new data point (2), (3)
[Figures: the new data point v is routed through each tree of the forest, following a left/right decision at every split node, and ends up in one leaf per tree.]

Decision forest model
The ingredients of the model:
Node test parameters \theta \in T: features / split function / thresholds.
Node objective function, e.g. the information gain I(S, \theta): the (energy) function optimized at each node.
Node weak learner, e.g. h(v, \theta) \in \{false, true\}: the split-node test function.
Leaf predictor model, e.g. p(c \mid v): a point estimate or the full distribution.
Randomness model, e.g. bagging, RNO: the methods for inserting randomness.
Stopping criteria, e.g. maximum tree depth D: when to stop splitting the data.
Forest size T: the number of trees in the forest; a collection of trees is a forest!
Ensemble model, e.g. p(c \mid v) = \frac{1}{T} \sum_{t=1}^{T} p_t(c \mid v): how to combine the output of all the trees in the forest.

How to add randomness? (Randomness model)
1. Bagging (randomized training set): a subset of all data points per tree.
2. Randomized Node Optimization (RNO): the features chosen with the selection function \phi, the split function depending on the weak-learner orientation \psi, and the thresholds given in \tau.

How to add randomness? (1) Bagging
S_0 is the full training set and S_0^t \subset S_0 is the randomly sampled subset used for training tree t; forest training uses the subsets S_0^1, \ldots, S_0^T.
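The following bagging sketch is illustrative only (it is not the Sherwood implementation referenced later in the slides): each tree is trained on a random subset S_0^t of the full training set, and the forest averages the per-tree class posteriors.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(0)

T, subset_size = 25, 100                      # forest size and |S_0^t|
forest = []
for t in range(T):
    idx = rng.choice(len(X), size=subset_size, replace=True)   # bagging: sample S_0^t
    tree = DecisionTreeClassifier(max_depth=3, max_features=2, random_state=t)
    forest.append(tree.fit(X[idx], y[idx]))   # max_features limits the split options per node

posterior = np.mean([tree.predict_proba(X) for tree in forest], axis=0)
print("forest accuracy:", np.mean(posterior.argmax(axis=1) == y))
```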

How to add randomness? (2) Randomized Node Optimization (RNO)
T is the full set of all possible node-test parameter values, and T_j \subset T is the set of randomly sampled parameter values used to train node j. The randomness control parameter is \rho = |T_j|: \rho = |T| means low randomness, \rho = 1 means high randomness. The node test parameter is \theta = (\phi, \psi, \tau).

How to compute a prediction from a trained tree?
The probability distribution at the leaf is p(c \mid v); a point estimate is obtained, e.g., as the MAP solution c^* = \arg\max_c p(c \mid v). Generally the full distribution is preserved until the decision moment, to incorporate uncertainty.

How to combine the tree outputs?
Averaging: p(c \mid v) = \frac{1}{T} \sum_{t=1}^{T} p_t(c \mid v).
Multiplication: p(c \mid v) = \frac{1}{Z} \prod_{t=1}^{T} p_t(c \mid v), where Z is a partition function that ensures probabilistic normalization; the product is overconfident and less robust to noise.

Example: generalization / the effect of randomness
[Figures: training points and the forest decision boundaries for the weak learners (axis aligned, oriented line, conic section) at tree depths D = 13 and D = 5, with the randomness parameter set to \rho = |T|.]
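A tiny numerical illustration of the two ensemble rules above, with made-up per-tree posteriors over three classes (one row per tree):

```python
import numpy as np

P = np.array([[0.6, 0.3, 0.1],
              [0.5, 0.4, 0.1],
              [0.2, 0.7, 0.1]])

averaged = P.mean(axis=0)
product = P.prod(axis=0)
product /= product.sum()          # Z: renormalize the product of posteriors

print("averaging:     ", averaged)    # stays moderate
print("multiplication:", product)     # sharper, i.e. more (over)confident
```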

Example: the effect of randomness
[Figures: forest decision boundaries for the three weak learners (axis aligned, oriented line, conic section) at tree depths D = 13 and D = 5, for different amounts of randomness \rho.]

Random Forests Classification example (1)
[Figure: 2 classes in feature space and the random forest decision.] N = 100 trees, maximum number of nodes = 5, 3 candidate splits per node.

Random Forests Classification example (2)
[Figure: 4 classes in feature space and the random forest decision.] N = 100 trees, maximum number of nodes = 4, 3 candidate splits per node.

Random Forests Classification example (3)
[Figure: 4 classes in feature space and the random forest decision.] N = 100 trees, maximum number of nodes = 10, 8 candidate splits per node.

Example: handwritten digit classification
[Figure: sample images from the MNIST database.] Reference: The MNIST Database of Handwritten Digit Images for Machine Learning Research, DOI: /MSP.

Example: handwritten digit classification
Task: classify the handwritten digits 1 to 5, with 250 samples per digit for training and 250 for testing. HOG* features are used as data points v \in R^d with d = 144. Forest parameters: forest size T = 300 trees, an axis-aligned weak learner \psi, randomness parameter \rho = 5 (out of |T_j| = 10 candidate settings per node), and a selection function \phi(v) that randomly samples a subset of the dimensions of v.
*Histogram of Oriented Gradients, Dalal & Triggs, CVPR 2005.

Results, with forest predictions per class sorted by prediction confidence (the inverted digits in the figures are the wrongly classified ones):
Class 1: precision = 0.96, recall = 0.97.
Class 2: precision = 0.99, recall = 0.92.
Class 3: precision = 0.96, recall = 0.97.
Class 4: precision = 0.97, recall = 1.00.
Class 5: precision = 0.95, recall = 0.97.
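A rough analogue of this experiment can be run in a few lines; the setup below is an assumption for illustration (scikit-learn's 8x8 digits with raw pixels instead of the HOG features used on the slides), not a reproduction of the original results.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
mask = np.isin(y, [1, 2, 3, 4, 5])                  # restrict to digits 1-5 as in the example
X_tr, X_te, y_tr, y_te = train_test_split(X[mask], y[mask], test_size=0.5, random_state=0)

forest = RandomForestClassifier(n_estimators=300, max_features="sqrt", random_state=0)
forest.fit(X_tr, y_tr)

print(classification_report(y_te, forest.predict(X_te)))   # per-class precision / recall
confidence = forest.predict_proba(X_te).max(axis=1)        # prediction confidence per sample
print("mean prediction confidence:", confidence.mean())
```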

Recommended literature
A. Criminisi, Decision Forests for Computer Vision and Medical Image Analysis, 2013; Ch. 3: Introduction, Ch. 4: Classification Forests; C++ library: Sherwood.
Breiman, L., "Random forests", Mach. Learn. 45(1).

Conclusions (Random Forests)
Random Forests offer an attractive method for classification: inherently multi-class, probabilistic output, and efficient implementations are available. A forest is a collection of decision trees; each tree t is trained with a different subset S_0^t of the training data (bagging). A tree is a collection of nodes and edges; each internal node splits the incoming data using the node split function h(v, \theta), where \theta encompasses the selection function \phi(v), the geometric primitive \psi and the thresholds \tau. Each node j receives a random subset T_j of the parameter space T for training (RNO). Randomness increases robustness: the randomness control parameter \rho determines the amount of randomness, with maximum randomness when \rho = 1 and minimum randomness when \rho = |T|. The tree depth D controls the forest confidence, hence a high D can lead to overfitting.

So, now we have a model; how good is it?
We have labeled data (ground truth), so we can validate. Model validation: use separate sets for training and testing the model, train the model using the training set, use the test set to evaluate the performance, and compute figures of merit that indicate the performance. What is a good performance metric, and how should we split the data?

Some popular figures of merit
Accuracy = (#TP + #TN) / (#TP + #FN + #TN + #FP)
Sensitivity = #TP / (#TP + #FN), a.k.a. True Positive Rate
Specificity = #TN / (#TN + #FP), a.k.a. True Negative Rate
where the numbers of samples are: True Positive (TP), a positive sample classified as positive; True Negative (TN), a negative sample classified as negative; False Positive (FP), a negative sample classified as positive; False Negative (FN), a positive sample classified as negative.

Receiver Operating Characteristic (ROC)
Sensitivity and specificity give the performance for just one possible setting (i.e. decision threshold) of the model. We can vary this threshold and recompute these performance metrics, which yields a curve of possible combinations of sensitivity and specificity, called the ROC curve. Generally, increasing the sensitivity decreases the specificity and vice versa.

How to compute the ROC curve?
For each sample we have a predicted class and a score. Sort the samples according to their score and move the threshold. [Figure: ten samples sorted by predicted score, with the model predicting negative below and positive above the threshold; at this threshold Sensitivity = 5 / (5+0) = 1.00 and Specificity = 3 / (3+2) = 0.60, giving one point in the sensitivity versus 1-specificity plot.]
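The figures of merit above follow directly from the confusion counts; a minimal sketch with made-up labels (1 = positive, 0 = negative):

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 1, 0, 0, 0, 0, 1, 1])

TP = np.sum((y_pred == 1) & (y_true == 1))
TN = np.sum((y_pred == 0) & (y_true == 0))
FP = np.sum((y_pred == 1) & (y_true == 0))
FN = np.sum((y_pred == 0) & (y_true == 1))

accuracy    = (TP + TN) / (TP + FN + TN + FP)
sensitivity = TP / (TP + FN)                  # true positive rate
specificity = TN / (TN + FP)                  # true negative rate
print(accuracy, sensitivity, specificity)     # 0.7, 0.8, 0.6
```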

How to compute the ROC curve? (continued)
Moving the threshold to other positions yields further points, e.g. Sensitivity = 4 / (4+1) = 0.80 with Specificity = 3 / (3+2) = 0.60, Sensitivity = 4 / (4+1) = 0.80 with Specificity = 4 / (4+1) = 0.80, and finally Sensitivity = 0 / (0+5) = 0.00 with Specificity = 5 / (5+0) = 1.00. The Area Under the Curve (AUC) summarizes the ROC curve in a single number; for this example AUC = 0.84. [Figures: the sorted samples at several threshold positions and the resulting ROC curve.]

How should we split the data?
For a large data set: randomly sample half of the samples for training and half for testing. Training and testing is time consuming for large datasets, and the test set is probably a good reflection of the training set. [Figure: evaluation pipeline: the labeled data set is split into training data and test data; the model is trained on the training data, its predicted labels for the test data are compared with the ground-truth labels, and this gives the performance.]

Different choices of split might lead to different results. K-fold cross-validation: split the data into K equally sized parts, use K-1 parts for training and the left-out part for testing, repeat this for each part and average the performance. [Figure: the data set divided into K equal parts, each part in turn used for testing while the others are used for training.]

Leave-One-Out Cross-Validation: leave one sample out of the complete set and use the remaining set to train the model; test the model on the left-out sample and repeat this for all samples. This gives the best performance indication for a small data set, because you want to use as much of the little data you have for training the model.
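The threshold sweep can be written explicitly; the scores and labels below are made up (5 positives, 5 negatives) and only loosely mirror the example above.

```python
import numpy as np

scores = np.array([0.95, 0.85, 0.80, 0.70, 0.55, 0.45, 0.40, 0.30, 0.20, 0.10])
labels = np.array([1,    1,    1,    0,    1,    0,    1,    0,    0,    0])

tpr, fpr = [], []
for t in np.concatenate(([np.inf], np.sort(scores)[::-1])):   # move the threshold down
    pred = scores >= t
    tpr.append(np.sum(pred & (labels == 1)) / np.sum(labels == 1))  # sensitivity
    fpr.append(np.sum(pred & (labels == 0)) / np.sum(labels == 0))  # 1 - specificity

tpr, fpr = np.array(tpr), np.array(fpr)
auc = np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2)   # trapezoidal area under the curve
print(list(zip(fpr.round(2), tpr.round(2))))
print("AUC =", auc)
```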

EXAMPLE: 4-fold cross-validation
Split the data into 4 equally sized partitions. Fold 1: train on the other three partitions and test on the held-out one, accuracy = 0.86. Fold 2: accuracy = 0.86. Fold 3: accuracy = 0.84. Fold 4: accuracy = 0.88. The 4-fold cross-validation accuracy is then approximately 0.86 ± 0.02 (mean ± stdev). [Figures: the four train/test splits and the per-fold accuracies.]
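A minimal 4-fold cross-validation sketch; the dataset and model are chosen only for illustration, not to reproduce the numbers above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize the features

# Train on 3 folds, test on the held-out fold, repeat for each fold and average.
scores = cross_val_score(SVC(kernel="linear", C=1.0), X, y, cv=4, scoring="accuracy")

print("per-fold accuracy:", np.round(scores, 3))
print("4-fold CV accuracy: %.2f +/- %.2f (mean +/- stdev)" % (scores.mean(), scores.std()))
```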

Generalization: under- and overfitting
Why don't we evaluate on the training set? Example: is this a good classifier? It makes no errors on the training set, i.e. 100% accuracy, but NO: it has very poor generalization; on new, identically distributed data it reaches only 81% accuracy. This is overfitting!

A second example makes many errors on the training set (86% accuracy). Also NO: the model complexity is too low, which is underfitting; on new, identically distributed data it reaches 84% accuracy (approximately the training accuracy).

A third example has 94% accuracy on the training set and 95% on the test set: approximately equal train and test error means good generalization. YES!

Model complexity: what is a good model?
A model with good generalization, i.e. good prediction accuracy on both the training set and the test set. [Figure: prediction error versus model complexity; the training error keeps decreasing with complexity, while the prediction (test) error reaches a minimum at sufficient complexity and then rises again.]

Example: a non-linear SVM with a fixed cost parameter C, where the complexity increases when the size of the kernel scale is reduced (more flexibility). Use 10-fold cross-validation to estimate the test error and validate on the training set to compute the train error. [Figure: prediction error (%) versus model complexity, from low to high complexity.]
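A sketch of this complexity experiment under assumed choices of dataset and parameter grid: for an RBF SVM, a larger gamma corresponds to a smaller kernel scale, i.e. a more flexible model, and the training accuracy can be compared with a cross-validated estimate of the test accuracy.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X = (X - X.mean(axis=0)) / X.std(axis=0)          # standardize the features

for gamma in [1e-4, 1e-3, 1e-2, 1e-1, 1.0, 10.0]:
    model = SVC(kernel="rbf", C=1.0, gamma=gamma)
    train_acc = model.fit(X, y).score(X, y)       # error on the training set itself
    cv_acc = cross_val_score(model, X, y, cv=10).mean()   # 10-fold estimate of test accuracy
    print(f"gamma={gamma:g}  train={train_acc:.3f}  cv={cv_acc:.3f}")
# Small gamma gives a smoother, less flexible model; at large gamma the training
# accuracy approaches 1.0 while the cross-validated accuracy degrades: overfitting.
```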

Summary
In supervised learning the ground truth is available, so we can evaluate the prediction performance of the model. Split the data into two sets (training set and test set). Use figures of merit for measuring the performance: accuracy, sensitivity, specificity, AUC, ... Use K-fold cross-validation for reliable evaluation. Increasing the model complexity may lead to overfitting, i.e. poor generalization: low training-set error but high test-set error.
