Understanding multivariate pattern analysis for neuroimaging applications


1 Understanding multivariate pattern analysis for neuroimaging applications Dimitri Van De Ville Medical Image Processing Lab, Institute of Bioengineering, Ecole Polytechnique Fédérale de Lausanne Department of Radiology & Medical Informatics, Université de Genève April 23, 2015

2 Overview
- Brain decoding: motivation and applications
- Limitations of regular SPM (GLM)
- Hall of fame of brain decoding
- Pattern recognition principles for brain decoding
  - Basics: linear classifier; discriminative versus generative models
  - Cross-validation: training phase, evaluation phase, testing phase
  - Support vector machine (SVM): primal and dual form; kernel trick

3 [Figure: GLM pipeline: a whole-brain scan (2x2x2 mm³, slices every 2-4 s) yields an intracranial-voxels x timepoints data matrix; voxel-wise GLM statistics produce a t-value map.]

4 Limitations of the GLM approach
- Massively univariate approach: hypothesis testing is applied voxel-wise; all voxels are treated independently, but they are obviously not independent
- Textbook multivariate approaches are not practical; e.g., a spatial covariance matrix (#voxels x #voxels) estimated from far fewer timepoints will be rank-deficient
- fMRI data are complicated; data-driven: let the data speak
- Group statistics are great, but the ultimate goal is often to know how well the results generalize (prediction): learn from training data, apply to the (unseen) test data
[Figure: examples of condition 1 and condition 2 plotted in a voxel 1 / voxel 2 feature space.]

5 Forward versus backward inference
- Encoding (forward): from stimuli, conditions, disease, disorder, ... to data; a generative model (caricature: phrenology)
- Decoding (backward): from data back to stimuli, conditions, disease, disorder, ...; a discriminative model (caricature: brain reading)

6 Exploiting subtle spatial patterns [Figure: example in 2D; extends to a 9D feature space] [Cox et al., 2003; Haynes and Rees, 2006]

7 Emotions in voice-sensitive cortices
How is emotional information treated in auditory cortex?
- 22 healthy right-handed subjects
- 10 actors pronounced "ne kalibam sout molem" (anger, sadness, neutrality, relief, joy); normalized for mean sound intensity; rated by 24 (other) subjects
- Classifier approach on auditory-cortex localizer: human voices > animal sounds
- 20 stimuli for each of the 5 categories; 5 classifiers (one-versus-others); p<0.05 (FWE)
[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009; collaboration with LABNIC/CISA, UniGE]

8 [Figure: decoding maps for anger, sadness, neutral, relief, and joy at axial slices z=0, z=+3, z=+6; stimulus: "Ne kalibam sout molem".] [Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009]

9 Decoding from the emotional brain
Results for bilateral voice-sensitive cortices (solid: smoothed, 10 mm FWHM; dashed: non-smoothed; dotted: spatial average); chance level: 20%; classifier: linear SVM
- Evidence that auditory cues for emotional processing are spatially encoded; bilateral representation
- Comprehension of emotional prosody is crucial for social functioning; relevant to psychiatric disorders (schizophrenia, bipolar affective disorder, ...)
[Ethofer, VDV, Scherer, Vuilleumier, Current Biology, 2009]

10 Brain decoding of visual processing
- distributed patterns [Haxby et al., 2001]
- subtle patterns in V1 [Kamitani and Tong, 2005, 2006; Haynes and Rees, 2006]
- advanced models for receptive fields [Miyawaki et al., 2008]

11 Decoding video sequences: reverse statistical inference (decode the stimulus from the data) [Nishimoto et al., 2011]

12 Basics of pattern recognition
[Diagram: a feature vector x_i (D features) enters the model (function / machine / classifier) and receives a label y_i (e.g., -1/+1, control/patient).]
Machine learning aims at getting the model right based on a series of examples (supervised learning):
- learning/training phase: from examples {(x_1, y_1), ..., (x_N, y_N)}, learn a model f such that f(x_i) = y_i
- testing phase: for unseen data x_s, predict the label f(x_s)
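
A minimal sketch of this training/testing loop, assuming Python with scikit-learn; the features and labels below are synthetic stand-ins for voxel patterns and condition labels:

```python
# Supervised learning in two phases: fit a model on labeled examples,
# then predict labels for unseen data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
N, D = 40, 10                                   # N instances, D features (e.g., voxels)
X = rng.normal(size=(N, D))                     # feature vectors x_i
y = np.where(X[:, 0] > 0, 1, -1)                # labels y_i in {-1, +1}

clf = LogisticRegression().fit(X[:30], y[:30])  # training phase: learn f from {(x_i, y_i)}
print(clf.predict(X[30:]))                      # testing phase: predict f(x_s) for unseen x_s
```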

13 Basics of pattern recognition
[Diagram: feature matrix X (D features x N instances), labels y, model.]
Classification as a regression problem:
- Linear model: y = w^T X + b + noise, where the vector (w, b) characterizes the full model
- Learning the model by least-squares fitting: (w, b) = argmin_{(w,b)} ||y - w^T X - b||^2
- Underdetermined when D > N; regularize: (w, b) = argmin_{(w,b)} ||y - w^T X - b||^2 + λ ||w||^2; solution: w = pinv(X^T) y^T
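
A hedged NumPy sketch of this regression view, with made-up dimensions so that D > N; the pseudoinverse gives the minimum-norm solution, and the ridge solution is computed through the N x N dual system:

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 100, 20                          # more features than instances: underdetermined
X = rng.normal(size=(D, N))             # feature matrix, D features x N instances
y = np.where(rng.normal(size=N) > 0, 1.0, -1.0)   # labels in {-1, +1}

w = np.linalg.pinv(X.T) @ y             # minimum-norm solution w = pinv(X^T) y (bias omitted)

lam = 1e-2                              # ridge: w = X (X^T X + lam I)^{-1} y (push-through identity)
w_ridge = X @ np.linalg.solve(X.T @ X + lam * np.eye(N), y)

print(np.allclose(X.T @ w, y))          # the pinv solution fits the training labels exactly
```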

14 Basics of pattern recognition
[Diagram: feature matrix X (D features x N instances), labels y, model.]
- Evaluate the classifier on unseen data
- Avoid overfitting: increasingly complex models will succeed better at fitting (any) training data!
- Classify a new data point x′ by comparing w^T x′ + b against 0?

15 Interpretation of the weight vector
[Figure: examples of class 1 and class 2 in a voxel 1 / voxel 2 feature space; training yields the weight vector w.]
- Training: w_1 = +5, w_2 = -3
- The weight vector is a spatial representation of the decision function; it is a multivariate pattern: no local inference!
- Testing: new example x_1 = 0.5, x_2 = 0.8; predicted label: (w_1 x_1 + w_2 x_2) + b = (2.5 - 2.4) + 0 = 0.1; positive value -> class 1
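
The worked example above as a few lines of Python, with the same hypothetical weights and test point:

```python
import numpy as np

w, b = np.array([5.0, -3.0]), 0.0              # trained weight vector (w_1, w_2) and bias
x = np.array([0.5, 0.8])                       # new example
score = w @ x + b                              # (5 * 0.5) + (-3 * 0.8) + 0 = 2.5 - 2.4 = 0.1
print("class 1" if score > 0 else "class 2")   # positive value -> class 1
```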

16 Basics of pattern recognition
[Figure: separating hyperplane.]
- The linear model projects an instance onto the line parallel to w
- The decision boundary is a hyperplane in the feature space
- If feature dimensions correspond to voxels, then w corresponds to a map
- The bias b shifts the decision boundary along w
[adapted from Jaakkola]

17 Basics of pattern recognition
[Diagram: feature matrix X (D features x N instances), labels y; class-conditional density p(x|y) and posterior P(y|x).]
Classification and decision theory:
- Assume the class-conditional densities p(x|y) and class probabilities P(y) are known
- Minimizing the overall probability of error corresponds to taking the maximum posterior:
  y′ = argmax_y P(y|x′) = argmax_y p(x′|y) P(y) / p(x′)  (Bayes' rule)  = argmax_y p(x′|y) P(y), where p(x′|y) is the likelihood

18 Basics of pattern recognition
Wonderful link between the discriminative model (posterior) and the generative model (likelihood):
P(y|x) ∝ p(x|y) P(y)
- Discriminative model: e.g., linear classifier, SVM
- Generative model: e.g., naïve Bayes, p(x|y) = ∏_{i=1}^{D} p(x_i|y), where the features are modeled as independent random variables
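
A sketch of the generative route, assuming scikit-learn's Gaussian naïve Bayes on synthetic two-class data: each feature is modeled independently per class, and classification goes through the posterior via Bayes' rule.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(-1, 1, size=(30, 5)),   # class -1 examples
               rng.normal(+1, 1, size=(30, 5))])  # class +1 examples
y = np.r_[-np.ones(30), np.ones(30)]

gnb = GaussianNB().fit(X, y)             # fits p(x_i | y) per feature and P(y)
print(gnb.predict_proba(X[:2]))          # posteriors P(y | x), via Bayes' rule
```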

19 Basics of pattern recognition
[Diagram: the full dataset is split into training (TR), evaluation (EV), and test (TE) sets; train on TR, evaluate on EV, tune until OK, then report the final performance measure on TE.]
Confusion matrix:
truth \ estimated | class 1 | class 2
class 1           |    A    |    B
class 2           |    C    |    D
Leave-one-out cross-validation: repeatedly train on all instances except one, evaluate on the left-out instance
Accuracy statistics: class 1 = A/(A+B); class 2 = D/(C+D); average = (A+D)/(A+B+C+D)
Sensitivity = D/(C+D) = TP/(FN+TP); specificity = A/(A+B) = TN/(FP+TN)
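
Leave-one-out cross-validation sketched with scikit-learn's LeaveOneOut splitter and a linear SVM; the setup and data below are assumptions for illustration:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.svm import SVC

rng = np.random.default_rng(3)
X = rng.normal(size=(20, 8))
y = np.where(X[:, 0] + 0.3 * rng.normal(size=20) > 0, 1, -1)

hits = []
for train_idx, test_idx in LeaveOneOut().split(X):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])   # train on all but one
    hits.append(clf.predict(X[test_idx])[0] == y[test_idx][0])   # evaluate on the left-out one

print("LOO accuracy:", np.mean(hits))    # corresponds to (A + D) / (A + B + C + D)
```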

20 Testing classifiers: the ROC curve
- Receiver Operating Characteristic (ROC) curve: allows one to see the compromise over the operating range of a classifier
- Can be used for classifier comparison; e.g., compute the Area Under the Curve (AUC) as a summary measure of performance
[Figure: ROC curves of two classifiers plotted as sensitivity against specificity; classifier 1: AUC 0.79.]
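
An ROC/AUC sketch, assuming scikit-learn and synthetic data: the curve is computed from continuous decision scores, here the signed distances w^T x + b of a linear SVM.

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + rng.normal(size=100) > 0).astype(int)

scores = SVC(kernel="linear").fit(X[:60], y[:60]).decision_function(X[60:])
fpr, tpr, _ = roc_curve(y[60:], scores)   # tpr = sensitivity, fpr = 1 - specificity
print("AUC:", roc_auc_score(y[60:], scores))
```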

21 Learning in high dimensions
The dimensionality of the feature vector (e.g., #voxels) often exceeds the number of instances: the model is underdetermined and requires additional assumptions (e.g., regularization). Other solutions:
- feature selection & extraction: gray-matter mask; region-of-interest (ROI); atlas regions; only the most-active voxels (selected inside the training phase!); see the pipeline sketch after this list
- searchlight approach [Kriegeskorte et al., 2006]: an ROI sliding around the brain; multivariate capabilities (long-range) are limited
- dimensionality reduction: PCA, ICA, ...
- kernel methods
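
The "inside the training phase" caveat can be enforced mechanically. A sketch with a scikit-learn Pipeline (assumed setup, synthetic data): voxel selection is refit on every training fold, so the test fold never influences which features are kept.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 500))            # D >> N, as in fMRI
y = np.r_[np.zeros(20), np.ones(20)]
X[y == 1, :10] += 1.0                     # a few informative "voxels"

# Selection of the 10 most-active features happens inside each training fold only.
pipe = make_pipeline(SelectKBest(f_classif, k=10), SVC(kernel="linear"))
print(cross_val_score(pipe, X, y, cv=5).mean())
```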

22 Kernel matrix
- Kernel function K: a similarity measure (scalar) between two feature vectors
- Simplest example: the dot product (linear kernel); e.g., for (4, 1) and (-2, 3): (4 · -2) + (1 · 3) = -5
- For N brain scans (feature vectors), the pairwise kernel values form an N x N matrix over instances
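
The linear-kernel matrix in a few lines of NumPy, using the two instances from the example above:

```python
import numpy as np

X = np.array([[4.0, 1.0],     # instance 1
              [-2.0, 3.0]])   # instance 2
K = X @ X.T                   # N x N kernel (Gram) matrix of pairwise dot products
print(K[0, 1])                # (4 * -2) + (1 * 3) = -5
```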

23 Support vector machine
Maximum-margin classifier [Vapnik, 1963, 1992]
- Assume the data are linearly separable: ∃w ∈ R^D, ∃b ∈ R : y_i (w^T x_i + b) > 0, i = 1, ..., N
- w and b can be rescaled such that the points closest to the hyperplane satisfy |w^T x_i + b| = 1
- The distance of a point x to the hyperplane is d(x) = |w^T x + b| / ||w||
- Margin of the hyperplane = 1/||w||
[Figure: hard margin; hyperplanes w^T x + b = 1, w^T x + b = 0, w^T x + b = -1.]

24 Support vector machine (primal form)
Quadratic programming optimization problem:
argmin_{(w,b)} (1/2) w^T w subject to y_i (w^T x_i + b) ≥ 1, i = 1, ..., N
The constrained problem can be formulated as an unconstrained one using Lagrange multipliers α:
argmin_{(w,b)} max_{α ≥ 0} L, with L = (1/2) ||w||^2 - Σ_{i=1}^N α_i [y_i (w^T x_i + b) - 1]
- Points for which y_i (w^T x_i + b) > 1 do not matter and should have α_i = 0
- Points on the margin, with y_i (w^T x_i + b) = 1, are called support vectors; these have α_i > 0, and y_i (w^T x_i + b) - 1 = 0 gives b = y_i - w^T x_i
- Convex criterion, so no local minima

25 Support vector machine (primal form)
Lagrangian functional:
argmin_{(w,b)} max_{α ≥ 0} L = (1/2) ||w||^2 - Σ_{i=1}^N α_i [y_i (w^T x_i + b) - 1]
Solving L for α_i, b, w leads to:
- ∂L/∂α_i = 0: y_i (w^T x_i + b) - 1 = 0, so b = y_i - w^T x_i (for a support vector)
- ∂L/∂b = 0: Σ_{i=1}^N y_i α_i = 0
- ∂L/∂w = 0: w = Σ_{i=1}^N α_i y_i x_i

26 Support vector machine (dual form)
The Lagrangian functional can be rewritten as
L = (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i y_i (x_i^T x_j) y_j α_j - Σ_{i=1}^N α_i, s.t. Σ_{i=1}^N y_i α_i = 0,
where x_i^T x_j = K(x_i, x_j).
- Minimization of this dual form requires finding α_i ≥ 0 in an N-dimensional space instead of a D-dimensional one!
- The weight vector w can be found as a linear combination of the instances: w = Σ_{i'=1}^{N_SV} α_{i'} y_{i'} x_{i'}
- Efficient: a sparse solution where only the N_SV support vectors (with α_{i'} > 0) contribute; the rest of the data can be ignored
- Prediction for an unseen data point x: f(x) = sign( Σ_{i'=1}^{N_SV} y_{i'} α_{i'} x_{i'}^T x + b )
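
The dual form in practice: scikit-learn's SVC stores the support vectors and the products α_i y_i (as dual_coef_), from which the primal weight vector of a linear kernel can be reassembled. A sketch on synthetic data (assumed setup):

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(-2, 1, size=(20, 2)),
               rng.normal(+2, 1, size=(20, 2))])
y = np.r_[-np.ones(20), np.ones(20)]

svm = SVC(kernel="linear", C=1.0).fit(X, y)
print("support vectors:", len(svm.support_))    # sparse: only a few alpha_i > 0

w = svm.dual_coef_ @ svm.support_vectors_       # w = sum over SVs of alpha_i y_i x_i
print(np.allclose(w, svm.coef_))                # matches the primal weight vector
```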

27 Support vector machine
Soft-margin classifier [Shawe-Taylor and Cristianini, 2004]
- Introduce slack variables ξ_i ≥ 0 to relax separation: ξ_i > 0 means x_i has margin less than 1
- New optimization problem (primal): argmin_{(w,b),ξ} (1/2) w^T w + C Σ_{i=1}^N ξ_i subject to y_i (w^T x_i + b) ≥ 1 - ξ_i
- Which leads to the dual form: argmin_α (1/2) Σ_{i=1}^N Σ_{j=1}^N α_i y_i (x_i^T x_j) y_j α_j - Σ_{i=1}^N α_i subject to Σ_{i=1}^N y_i α_i = 0 and 0 ≤ α_i ≤ C
- Box constraint: allows for capacity control; C = 0: very large safety margin; C = ∞: close to the hard-margin constraint
[Figure: soft margin; hyperplanes w^T x + b = 1, 0, -1, with a slack ξ for a point inside the margin.]
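
A small sketch of the box constraint in action (scikit-learn assumed, synthetic overlapping classes): small C tolerates margin violations and keeps a wide margin with many support vectors; large C approaches the hard margin.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-1, 1, size=(30, 2)),
               rng.normal(+1, 1, size=(30, 2))])
y = np.r_[-np.ones(30), np.ones(30)]

for C in (0.01, 1.0, 100.0):
    svm = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {len(svm.support_)} support vectors, "
          f"margin 1/||w|| = {1.0 / np.linalg.norm(svm.coef_):.2f}")
```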

28 Kernel trick
Linear classification (by a hyperplane) can be limiting. Project into an (even) higher-dimensional feature space V, where linear separation corresponds to non-linear separation in the original space.

29 Kernel trick
Linear classification (by a hyperplane) can be limiting. Project into an (even) higher-dimensional feature space V, where linear separation corresponds to non-linear separation in the original space.
However, this new feature space V is implicit, because in practice we use generalized kernel functions:
K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩_V,
where ⟨·,·⟩_V should be a proper inner product; that is, K(·,·) should satisfy Mercer's condition:
Σ_{i=1}^N Σ_{j=1}^N K(x_i, x_j) c_i c_j ≥ 0
for all finite sequences of points (x_1, ..., x_N) and real-valued coefficients (c_1, ..., c_N). This also means that the kernel matrix [K(x_i, x_j)]_{i,j} should be positive semi-definite (i.e., all eigenvalues ≥ 0).

30 Kernel trick
Commonly used kernel functions K(x_1, x_2) = ⟨φ(x_1), φ(x_2)⟩:
- Polynomial (homogeneous): K(x_1, x_2) = (x_1^T x_2)^p
- Polynomial (inhomogeneous): K(x_1, x_2) = (x_1^T x_2 + 1)^p
- Gaussian radial basis kernel: K(x_1, x_2) = exp(-γ ||x_1 - x_2||^2), γ > 0
Example: polynomial kernel K(x_1, x_2) = (x_1^T x_2)^2. For D = 2, developing the kernel function leads to (binomial theorem):
K(x_1, x_2) = (x_{1,1} x_{2,1} + x_{1,2} x_{2,2})^2
            = (x_{1,1} x_{2,1})^2 + 2 x_{1,1} x_{2,1} x_{1,2} x_{2,2} + (x_{1,2} x_{2,2})^2
            = ⟨[x_{1,1}^2, √2 x_{1,1} x_{1,2}, x_{1,2}^2], [x_{2,1}^2, √2 x_{2,1} x_{2,2}, x_{2,2}^2]⟩
From which we can identify φ : (x_1, x_2) ↦ (x_1^2, √2 x_1 x_2, x_2^2).
For general D, this kernel maps to C(D+1, 2) = D(D+1)/2 dimensions (multinomial theorem).
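
A quick numerical check of the identity derived above for D = 2, p = 2 (a NumPy sketch; the test points are arbitrary): the kernel value equals the dot product after the explicit map φ.

```python
import numpy as np

def phi(x):
    # phi : (x_1, x_2) -> (x_1^2, sqrt(2) x_1 x_2, x_2^2)
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

x1, x2 = np.array([0.3, -1.2]), np.array([2.0, 0.7])
print((x1 @ x2)**2, phi(x1) @ phi(x2))   # identical values (0.0576)
```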

31 Kernel trick
The transform φ is implicit, but the weight vector is defined as w = Σ_i α_i y_i φ(x_i).
Dot products with w can be computed using the kernel trick: w^T φ(x) = Σ_i α_i y_i K(x_i, x).
However, in general there exists no w′ such that w^T φ(x) = K(w′, x), which means there is no equivalent weight map.

32 Linear and non-linear kernels
Neuroimaging applications often work best with linear kernels:
- Very high-dimensional data with small sample sizes (D >> N); the increased complexity of non-linear kernels (more parameters) cannot be turned into an advantage
- Linear models have less risk of overfitting
- The weight vector of a linear model has a direct interpretation as an image, i.e., a discrimination map

33 What's the best classifier?
Model complexity (no free lunch): simple models give high reproducibility but low predictability; complex models give low reproducibility but high predictability [Hastie et al., 2009]
Ensemble learning: combine many (weak) classifiers to get better results (see the sketch after this list)
- bagging: classifiers trained on random training subsets, equal vote
- boosting: adding classifiers that deal with previously misclassified data
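
A bagging sketch, assuming scikit-learn and synthetic data: many shallow trees are trained on bootstrap samples of the training set and vote with equal weight.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(8)
X = rng.normal(size=(100, 10))
y = (X[:, :2].sum(axis=1) > 0).astype(int)

bag = BaggingClassifier(DecisionTreeClassifier(max_depth=2), n_estimators=50)
print(cross_val_score(bag, X, y, cv=5).mean())   # ensemble accuracy across folds
```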

34 Recommended reading
- Pereira et al., "Machine learning classifiers and fMRI: A tutorial overview," NeuroImage 45, 2009
- Richiardi et al., "Machine learning with brain graphs," IEEE Signal Processing Magazine 30, 2013
- Haufe et al., "On the interpretation of weight vectors of linear models in multivariate neuroimaging," NeuroImage, in press

35 Classifier selection
Is classifier A better than classifier B?
- Hypothesis testing framework: do the two sets of decisions of classifiers A and B represent two different populations?
- Careful: the samples (sets of decisions) are dependent (same evaluation/testing dataset), and the measurement type is categorical binary (for a two-class problem)
- Use the McNemar test; H0: there is no difference between the accuracies of the two classifiers

36 Selection with the McNemar test
Compute the relationships between classifier decisions:
            B correct   B wrong
A correct      N11        N10
A wrong        N01        N00
If H0 is true, we should have N01 = N10 = 0.5 (N01 + N10).
We can compute the following test statistic based on the observed counts:
x^2 = (|N01 - N10| - 1)^2 / (N01 + N10)
Then compare x^2 to the critical value of the χ^2 distribution (1 DOF) at a level of significance α (requires N01 + N10 > 25).
After [Dietterich, 1998]
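
The test statistic above in a few lines of Python (SciPy assumed); the disagreement counts n01 and n10 are hypothetical:

```python
from scipy.stats import chi2

n01, n10 = 12, 25                            # hypothetical disagreement counts
x2 = (abs(n01 - n10) - 1)**2 / (n01 + n10)   # continuity-corrected McNemar statistic
p = chi2.sf(x2, df=1)                        # compare against chi^2 with 1 DOF
print(f"x^2 = {x2:.2f}, p = {p:.3f}")        # reject H0 if p < alpha (valid for n01 + n10 > 25)
```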
