Supervised Learning for Image Segmentation


1 Supervised Learning for Image Segmentation Raphael Meier

2 References
A. Ng, Machine Learning lecture, Stanford University.
A. Criminisi, J. Shotton, E. Konukoglu, Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning, Foundations and Trends in Computer Graphics and Vision, 2012.
A. Criminisi, Decision Forests for Computer Vision and Medical Image Analysis, Tutorial, projects/decisionforests/.
S. J. D. Prince, Computer Vision: Models, Learning, and Inference, Cambridge University Press, 2012.
D. Barber, Bayesian Reasoning and Machine Learning, Cambridge University Press, 2012.
T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction, Springer, 2009.

3 Part I: Supervised Learning. (Pipeline figure: expert knowledge in the form of manual segmentations yields training data; training produces a general rule H(x); at testing time, H(x) performs fully automatic segmentation.)

4 Brain Tumor Segmentation. Brain tumors: glioma (glioblastoma). Clinical guidelines prescribe bidimensional measures (RANO/AVAglio). Desired: tumor volumetry (manual segmentation takes hours). Future: fully automatic segmentation.

5 Bidimensional measures fail (Reuter et al., 2014)

6 Motivation (Menze et al., 2014)

7 The Learning Problem. (Figure: training data → hypothesis H(x); new data x → prediction y.) Training set: $S$. Input: $x$. Output: $y$. Hypothesis: $H : x \mapsto y$.

8 Application: Image segmentation. Aim: partition an image into disjoint, semantically meaningful image regions. This can be seen as a learning (classification) problem. Input: image(s) consisting of voxels. Output: regions, indicated by voxel-wise labels (usually integers: 1, 2, 3, ...).

9 Image representation - Features. Definition: measurable attributes of the image data. Features can be either hand-crafted or automatically learned (e.g., via a Restricted Boltzmann Machine).

10 Taxonomy of Learning Scenarios. Defined by the nature of the training data. Unsupervised learning: given a set of unlabeled feature vectors $S_u = \{ x^{(i)} : i = 1, \dots, m \}$. Supervised learning: given a set of fully labeled feature vectors $S_l = \{ (x^{(i)}, y^{(i)}) : i = 1, \dots, m \}$. Semi-supervised learning: given a set of partially labeled feature vectors $S = S_u \cup S_l$.

11 Taxonomy of Learning Problems. Defined by the learning scenario and the nature of the output. Unsupervised learning: given $S_u$, find interesting structure (clustering, density estimation); or, given $S_u$ with $x \in \mathbb{R}^n$, find $H(x) = \tilde{x}$ with $\tilde{x} \in \mathbb{R}^{\tilde{n}}$ such that $\tilde{n} < n$ (dimensionality reduction, manifold learning). Supervised learning: given $S_l$, find $H : x \mapsto y$ with $x \in \mathbb{R}^n$ and $y \in \{1, 2, 3, \dots\}$ (classification); or, given $S_l$, find $H : x \mapsto y$ with $x \in \mathbb{R}^n$ and $y \in \mathbb{R}$ (regression).

12 Image segmentation via Classification. (Figure: expert knowledge (manual segmentation) provides training data, from which the general rule H(x) is learned and applied for fully automatic segmentation.)

13 Training and Testing phase. (Figure: in the training phase, expert knowledge (manual segmentation) supplies training data from which the general rule H(x) is learned; in the testing phase, H(x) performs fully automatic segmentation.)

14 Learning (Training) Algorithm. Aim: construct a hypothesis $H$ which relates a feature vector $x$ to its most probable label $y$. Output: a hypothesis (model) parametrized by a set of parameters $\theta$. Assume we know $p(y \mid x, \theta)$; then the mapping $H : x \mapsto y$ can be realized via the MAP rule:
$$\hat{y} = \arg\max_y \, p(y \mid x, \theta). \quad (1)$$
How do we obtain $p(y \mid x, \theta)$?

15 Generative vs. Discriminative Models. Bayes' rule:
$$p(y \mid x, \theta) = \frac{p(x, y \mid \theta)}{p(x \mid \theta)} = \frac{p(x \mid y, \theta)\, p(y \mid \theta)}{p(x \mid \theta)}. \quad (2)$$
Generative models: estimate $p(y \mid x)$ via the likelihood $p(x \mid y)$ and the prior distribution $p(y)$. Discriminative models: estimate the posterior distribution $p(y \mid x)$ directly; can also be non-probabilistic (e.g., Support Vector Machines).

16 Logistic regression. A classic (1940s), used extensively (1415 hits on PubMed). Supervised learning; solves binary classification problems ($y \in \{0, 1\}$). Discriminative approach: we model $p(y \mid x)$ directly: $p(y = 1 \mid x; \theta) = h_\theta(x)$ and $p(y = 0 \mid x; \theta) = 1 - h_\theta(x)$ (Bernoulli). More compactly:
$$p(y \mid x; \theta) = (h_\theta(x))^y (1 - h_\theta(x))^{1 - y}. \quad (3)$$
Linear model, hence:
$$h_\theta(x) = g(\theta^T x), \qquad y \mid x, \theta \sim \mathrm{Bernoulli}(h_\theta(x)). \quad (4)$$

17 Logistic regression: Sigmoid Function. Logistic (sigmoid) function:
$$g(z) = \frac{e^z}{1 + e^z} = \frac{1}{1 + e^{-z}}, \quad (5)$$
where, as before, $z = \theta^T x$. Motivation: restrict the values of our hypothesis to lie between zero and one (a probability).

18 Logistic regression: Decision Boundary. The set of points $x$ for which $p(y = 1 \mid x; \theta) = p(y = 0 \mid x; \theta) = 0.5$ holds. Given by the hyperplane:
$$\theta^T x = 0. \quad (6)$$
For $\theta^T x > 0$, feature vectors are classified as 1s; for $\theta^T x < 0$, as 0s.

19 Learning $\theta$: Maximum Likelihood. Given a set of i.i.d. training pairs $S = \{ (x^{(i)}, y^{(i)}) : i = 1, \dots, m \}$:
$$\theta_{ML} = \arg\max_\theta L(\theta) = \arg\max_\theta \prod_{i=1}^m p(y^{(i)} \mid x^{(i)}, \theta) \quad (7)$$
$$= \arg\max_\theta \prod_{i=1}^m (h_\theta(x^{(i)}))^{y^{(i)}} (1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}. \quad (8)$$
For simplification, we maximize $\log L(\theta)$:
$$\ell(\theta) = \log L(\theta) = \sum_{i=1}^m y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log(1 - h(x^{(i)})). \quad (9)$$

20 Learning $\theta$: Maximum Likelihood II. There is no closed-form solution for maximizing the log-likelihood $\ell(\theta)$. However, $\ell(\theta)$ is concave, so it has a global maximum, which allows optimization via gradient ascent. Ascent method: $\theta^{(t+1)} := \theta^{(t)} + \alpha \nabla_\theta \ell(\theta^{(t)})$ with $\ell(\theta^{(t+1)}) > \ell(\theta^{(t)})$. Derivative w.r.t. $\theta_j$:
$$\frac{\partial \ell(\theta)}{\partial \theta_j} = \sum_{i=1}^m \left( y^{(i)} - h(x^{(i)}) \right) x_j^{(i)}.$$

21 Learning algorithm: Gradient ascent.
Algorithm 1: Gradient ascent
  initialization;
  while convergence criteria not satisfied do
    for j = 0 to n do
      $\theta_j := \theta_j + \alpha \sum_{i=1}^m (y^{(i)} - h_\theta(x^{(i)})) x_j^{(i)}$;
    end
  end
Convergence: $\nabla_\theta \ell(\theta) \approx 0$. The magnitude of the update is proportional to the error in the prediction, $(y^{(i)} - h_\theta(x^{(i)}))$.

22 Multiple classes. Logistic regression can be generalized to situations with $y \in \{1, \dots, K\}$. The hypothesis changes (softmax function):
$$p(y = k \mid x) = \frac{\exp(\theta_k^T x)}{\sum_{i=1}^K \exp(\theta_i^T x)}. \quad (10)$$

23 Binary Image Segmentation using Logistic Regression. Pipeline: preprocessing → feature extraction → logistic regression → spatial regularization.

24 Generalization: Model complexity. Errors in prediction are due to: bias (wrong assumptions in our model) and variance (limited sample size; sensitivity of the model to changes in the training data).

25 Generalization: Bias-Variance trade-off. Generalization error = bias² + variance + irreducible error. How can we minimize the generalization error? First: employ an appropriate error measure. Second: vary the complexity of the model and choose the one with minimum error.

26 Generalization: Number of samples. The generalization error decreases with an increasing number of training samples m. Dilemma: acquisition of training data (ground truth) is usually expensive.

27 Model evaluation: Strategies. Always use separate training (2/3) and testing (1/3) sets. K-fold cross-validation on the full data set: popular choices for K are 5 or 10. Alternative: leave-one-out cross-validation (LOOCV). CV is often used for tuning hyperparameters.

28 Model evaluation: Real-world example, BRATS 2013. Overfitted on training data.

29 Part II: Decision Forests for Image Classification

30 Linear vs. Non-linear. Logistic regression is a linear classifier. Real problems are very often non-linear!

31 Transitioning from a linear to a non-linear classifier. Idea: combine simple classifiers into more complex ones. (Figure: several linear hypotheses of the form $h(x) = g(\theta_0^T x)$, $h(x) = g(\theta_1^T x)$, $h(x) = g(\theta_2^T x)$, with $g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}}$, are combined into $p(y = \text{'red'} \mid x) = h(x)$ and $p(y = \text{'blue'} \mid x) = 1 - h(x)$.) The final decision boundary is non-linear!

32 Decision tree

33 How to decide? Weak Learner. A simple model which performs only slightly better than flipping a coin. It can be represented as ($\mathbb{1}\{\cdot\}$ is the indicator function):
$$h_\theta(x) = \mathbb{1}\{ g(x, \theta) > \tau \}. \quad (11)$$
Linear model: $g(x, \theta) = \phi(x)^T \theta$ (homogeneous coordinates). $\phi(x)$ selects a random subset of features (Randomized Node Optimization); $\theta$ defines a geometric primitive.

34 Examples of weak learners. Weak learner: axis-aligned split. Weak learner: oriented line. Weak learner: conic section.

35 How to predict? Leaf prediction model. A feature vector is passed down a tree and ends up in a leaf. The leaf stores $p(y \mid x)$ (a class label histogram). Apply the MAP rule to $p(y \mid x)$.

36 How to predict? Leaf prediction model

37 Testing phase. (Figure: the feature vector x is pushed down each of the T trees.) The final prediction is given by:
$$p(y \mid x) = \frac{1}{T} \sum_{t=1}^{T} p_t(y \mid x). \quad (12)$$

38 How to train? Information gain. (Figure: a split yielding high information gain vs. a split yielding low information gain.)

39 How to train? Information gain. Optimization of the information gain:
$$IG = H(S) - \sum_{i \in \{L, R\}} \frac{|S_i|}{|S|} H(S_i), \quad (13)$$
where
$$H(S) = -\sum_{y \in Y} p(y) \log p(y). \quad (14)$$
Maximizing IG minimizes the impurity of the child distributions:
$$\theta_j^* = \arg\max_{\theta_j \in \Theta} IG_j. \quad (15)$$
Optimization procedure: exhaustive search over $\Theta$.

40 How does all of this make sense? Bias-Variance trade-off. A decision tree is a low-bias, high-variance model. Two key aspects (Breiman, 2001): Randomized Node Optimization (and bagging) de-correlates the trees; averaging of tree predictions reduces variance. The variance of the average prediction is given by:
$$\rho \sigma^2 + \frac{1 - \rho}{T} \sigma^2, \quad (16)$$
where $\rho$ is the pairwise correlation between trees and $\sigma^2$ the variance of a single tree. Hence: grow randomized trees sufficiently deep and combine them into an ensemble.

41 Forest hyperparameters. Number of trees T; depth of trees D; number of candidate weak learners H; number of candidate thresholds. How to tune them? Grid search (cross-validation).

42 Forest hyperparameters

43 Forest hyperparameters

44 Toy example

45 (Binary) Image Segmentation using Decision Forest. Pipeline: preprocessing → feature extraction → decision forest → spatial regularization.

46 Real-world examples: MICCAI 2014

47 Real-world examples: Brain Tumor Segmentation

48 Real-world examples: Feature importance. $I_M$ denotes the voxel-wise intensity value extracted from modality M (FLAIR, T1, T1c, T2). (Table: the most important intensity feature at each tree depth (1-18) for three tasks: tumor vs. healthy, healthy tissues, and tumor core. FLAIR, T1, and T2 intensities dominate at shallow depths, while $I_{T1c}$ dominates at deeper levels.)

49 Summary: Decision Forest. A discriminative model. The decision forest has two main degrees of freedom: the weak learner and the objective function (information gain). Training: generation of de-correlated trees based on maximizing the information gain. Testing: a new input is pushed down each tree, and the prediction is based on the model stored in the reached leaf.

50 A last note... Decision forests are a flexible, multi-purpose framework. They can also solve regression, density estimation, and manifold learning problems.

51 Connection to deep learning

52 Thank you!
