Can we interpret weight maps in terms of cognitive/clinical neuroscience? Jessica Schrouff

Size: px

Start display at page:

Download "Can we interpret weight maps in terms of cognitive/clinical neuroscience? Jessica Schrouff"

Austen Barker
5 years ago
Views:

1 Can we interpret weight maps in terms of cognitive/clinical neuroscience? Jessica Schrouff Computer Science Department, University College London, UK Laboratory of Behavioral and Cognitive Neuroscience, Stanford University, US

2 Pattern Recognition Models v 1 Linear predictive models (classifier or regression) are parameterized by a weight vector w and a bias term b. w t 1 t 2 t 4 t 3 The weight vector w can be expressed as a linear combination of training examples x i. The general equation for making predictions for a test example x * is: w = N å i=1 a i x i v 2 f ( x*) w x* b 25/06/2017 J. Schrouff - OHBM 2017 PR4NI 2

3 How is w computed? w is estimated by solving an optimization problem consisting of a data fit term E and a penalty/regularization term J. min wîr p { E(w)+ lj(w) } Regularization parameter Data fit term or loss function (E): denotes the price we pay when we make mistakes in the predictions (e.g. squared loss, Hinge loss). Regularization term (J): favours certain properties (e.g. sparsity) and improves the generalisation over unseen examples (e.g. L2-norm, L1-norm). Many machine learning algorithms are particular choices of E and J.

4 Weight map or predictive pattern w has the same dimensionality of the input data and can be plotted as an image. Might provide potential insights into brain function or structure that drives the prediction, but interpretation should be done with care!

5 BOLD signal Pattern recognition analysis labels labels labels labels labels Training Predictive function or Predictive pattern: f New subject Testing Predicted label Very different meaning! Mass-univariate analysis (e.g. General Linear Model, GLM) Time Voxel-wise GLM model estimation Independent statistical test at each Voxel Correction for Multiple comparisons Statistical Parametric Map

Using the weights for prediction Weight map (w) Predictive function f (x * ) = w x * +b 5 2-6 -1 New example (x*) 2 1 2-1 f ( x * f ( x * ) (5 2)

6 Using the weights for prediction Weight map (w) Predictive function f (x * ) = w x * +b New example (x*) f ( x * f ( x * ) (5 2) (2 1) ( 6 2) ( 1 1) 0 ) f(x * ) is the predicted score for regression or the distance to the decision boundary for classification models.

7 Thresholding the weight map Weight map (w) Predictive function f (x * ) = w x * +b New example (x*) f ( x * f ( x * ) (5 2) (0 1) ( 6 2) (0 1) 0 ) f(x * ) truncated does not correspond to f(x * )!

8 What do (voxel) weights represent? Signal: - Voxel 1: s n + d(n) - Voxel 2: d(n) Weights: - Voxel 1: w = 1 - Voxel 2: w = -1 s n d(n) c 1 c 2 Haufe et al.,

9 What can affect the magnitude of w? High absolute coefficient/weight in a voxels might be due to: an association between the voxel s signal and the labels multivariate properties (e.g. correlated signal or correlated noise) Zero coefficient/weight in a voxels might be due to: no association between the voxel s signal and the labels a particular type of regularization If the goal is to make local inference about the association between underlying brain signal in a voxel (e.g. activity) and the labels (e.g. experimental condition) further analysis are recommended (e.g. mass univariate GLM) Sato et al and Haufe et al 2014

10 How to interpret the weight vector w? Spatial representation of the predictive function. Shows the contribution of each feature/voxel to the prediction. Multivariate pattern -> All voxels with weights different from zero contribute to the final prediction (no arbitrary threshold should be applied). Mixture of signal of interest and noise

11 Potential strategies to improve interpretability of whole brain predictive patterns A priori 1. Masking 2. Searchlight mapping During model estimation 3. Feature selection 4. Sparse algorithms 5. Atlas based Multiple Kernel Learning (MKL) 6. Using weight stability in model selection A posteriori 7. Atlas based weight summarization 8. Permutation test 9. Transforming weights into activation patterns

12 1. Masking: ROI based on a priori knowledge e.g. Choosing voxels/features based on literature or on previous univariate results (independent dataset). 12

13 1. Masking: ROI based on a priori knowledge Advantages Good accuracy? Anatomically/ functionally relevant Limitations Need an anatomical hypothesis Cannot threshold obtained map Arbitrary choice of region number and size Multiple models = multiple comparisons 13

14 2. Searchlight Mapping Scans the imaged volume with a "searchlight" (e.g. sphere centered at each voxel) and performs a multivariate analysis at each location in the brain (Kriegeskorte et al, 2006) local multivariate pattern

15 2. Searchlight Mapping Advantages Mapping approach -> identifies voxels/neighborhoods that contain predictive or multivariate information Limitations Locally multivariate (does not account for the whole brain network) Arbitrary choice of neighborhood size (e.g. radios of the sphere) Multiple models -> multiple comparison issue Computationally expensive

16 3. Feature selection Selects a subset of informative features according to a specific criterion. Example of feature selection approaches: Univariate t-test Recursive Feature Elimination (RFE, De Martino et al, 2008) RONDINA et al.: SCORS A METHOD BASED ON STABILITY FOR FEATURE SELECTION AND APPING IN NEUROIMAGING 95 Stability selection (SCoRS, Rondina et al, 2014) TABLE V BRAIN REGIONS IMPORTANT TO DISCRIMINATE THE GROUPS FOUND BY SCORS Voxels in white were selected by SCoRS, RFE and t-test (Rondina et al, 2014)

17 3. Feature selection Data driven Advantages Identifies the most relevant voxels for prediction Limitations Computationally expensive (nested cross-validation needs to be implemented to avoid double dipping!)

18 4. Sparse algorithms Choose a regularization term that enforces sparsity or structured sparsity. Examples of sparse and structured sparse approaches: LASSO (Tibshirani, 1996) Elastic-net (Zou and Hastie, 2005) Total Variation (Michel et al, 2011) Graph Laplacian Elastic Net (GraphNET, Grosenick et al, 2011) Sparse Total Variation (Baldassare et al, 2011) Grosenick et al, 2011

19 How regularization affects the weight map? LASSO 86.31% Elastic Net 88.02% Total Variation (TV) 85.79% Weight maps for classifying fmri images during visualization of pleasant vs. unpleasant pictures. All models used a square loss + regularization. Laplacian (LAP) 83.71% Sparse TV 85.86% Sparse LAP 87.05% Baldassarre et al., 2017

20 4. Sparse algorithms Data-driven Advantages Selection embedded in model Limitations Potential instability of the selected features (Baldassarre et al., 2012): different regularization terms = different patterns and similar accuracies. Identifies a subset of predictive voxels

5. Atlas based Multiple Kernel Learning (MKL) Parcellation using anatomical atlas Multiple Kernel Learning 116 å f (x * ) = d m f m (x * ) m=1 Region 1 Region 116 The MKL learns and combines

21 5. Atlas based Multiple Kernel Learning (MKL) Parcellation using anatomical atlas Multiple Kernel Learning 116 å f (x * ) = d m f m (x * ) m=1 Region 1 Region 116 The MKL learns and combines different models, represented by different kernels (regions). Sparsity on the kernel combination (e.g. SimpleMKL, Rakotomamonjy, et al. 2008) -> selects a subset of regions. Schrouff, Monteiro et al.(submitted)

22 5. Atlas based Multiple Kernel Learning (MKL) PRoNTo (

23 5. Atlas based Multiple Kernel Learning (MKL) PRoNTo (

24 5. Atlas based Multiple Kernel Learning (MKL) Advantages Limitations Anatomically/ functionally driven L1 regularization -> might not select regions with correlated information Selects a subset of predictive regions Shows the contribution of each region (region weight) and each voxel (voxel weight) for the predictive function Potential instability of the selected regions Atlas dependent

25 6. Using weight stability in model selection Including the weight stability (Baldassarre et al., 2017) or interpretability (reproducibility x representativeness, Kia et al, 2017) in the optimization of hyper-parameters: w 1 w 1 w w 2 2 ~ ς: model performance to optimize δ: model generalization (e.g. accuracy) η: stability or interpretability of weight map 25

26 6. Using weight stability in model selection Advantages Selects model based on both generalization and stability Embedded in model selection Limitations Applies mostly to sparse methods Various parameters to set: how to define η, set w 1 and w 2 Computationally expensive Needs more work to define in which cases it is interesting (e.g. when no trade-off between η and δ)

Average the absolute value of the weights within each regions to create a rank

27 7. Atlas based weight summarization Summarize whole brain weights using a pre-defined atlas (Schrouff et al, 2013). Average the absolute value of the weights within each regions to create a rank according to their contribution to the decision function. NW ROI = å vîroi w v #{v Î ROI} AAL atlas (Tzourio-Mazoyer, 2002)

28 7. Atlas based weight summarization PRoNTo (

29 7. Atlas based weight summarization Advantages Anatomically/ functionally driven Ranks the regions according to their proportional contribution to the predictive function Limitations Not embedded in the model (a posteriori summarization) All regions with weight different from zero contribute to the prediction -> no arbitrary threshold should be applied Atlas dependent

8. Permutation test Use permutation to estimate the null distribution of the weights and test the null hypothesis at each voxel to threshold the weights (Mourao-Miranda et al, 2005).

30 8. Permutation test Use permutation to estimate the null distribution of the weights and test the null hypothesis at each voxel to threshold the weights (Mourao-Miranda et al, 2005). Use analytic approximation of the null distribution (Gaonkar and Davatzikos, 2013). p-maps obtained from experimental and analytical permutation tests (Gaonkar and Davatzikos, 2013)

31 8. Permutation test Data driven Advantages Limitations Needs to be corrected for multiple comparison Provides a thresholded weight map -> identifies the most relevant voxels for prediction Computationally expensive

32 9. Deriving activation patterns from weights Derive activation patterns from weights (extraction filters) by transforming backward models into forward models (Haufe et al., 2014) Backward model Forward model (discriminative model) (generative model) W T x(n) = ŝ(n) x(n) = Aŝ(n)+e(n) W : extraction filters A : activation patterns (columns of W) (columns of A) x(n) : data e(n) : noise ŝ(n) : latent factors The activation patterns show in which voxels (channels) the latent factors are reflected.

33 9. Deriving activation patterns from weights Advantages Like in the GLM activation patterns enable neurophysiological interpretation of the coefficients Limitations Depends on the assumptions used to estimate the forward and backward models. The activation patterns do not represent the predictive function (they don t show which voxels/regions are relevant for prediction)

34 Summary Interpretation of machine learning based models can be complex. There are different strategies to improve interpretation, each one has advantages and limitations: Data driven vs. anatomically/functionally driven Multivariate power (local vs. global, sensitivity vs. specificity) Stability of the patterns Computational time

35 Acknowledgements Janaina Mourão-Miranda, UCL, UK Joao M. Monteiro, UCL, UK Maria Joao Rosa, UCL, UK Josef Parvizi and team, Stanford, US PRoNTo team Marie Sklodowska-Curie Actions

36 Useful references Baldassarre, L., Pontil, M. & Mourão-Miranda, J.. «Sparsity is better with stability: combining accuracy and stability for model selection in brain decoding.» Frontiers in Neuroscience: Brain Imaging Methods. (2017) De Martino, F., Valente, G., Staeren, N., Ashburner, J., Goebel, R., & Formisano, E. «Combining multivariate voxel selection and support vector machines for mapping and classification of fmri spatial patterns.» NeuroImage. (2008) 43(1), Gaonkar, B. & Davatzikos, C. «Analytic estimation of statistical significance maps for support vector machine based multivariate image analysis and classification.» NeuroImage. (2013) 78: Grosenick, L., Klingenberg, B., Katovich, K., Knutson, B. & Taylor, J.E. «Interpretable whole-brain prediction analysis with GraphNet.» NeuroImage. (2013) 72, Haufe, S., Meinecke, F., Go rgen, K., Daḧne, S., Haynes, J. D., Blankertz, B. & Bießmann, F. «On the interpretation of weight vectors of linear models in multivariate neuroimaging.» NeuroImage. (2014) 8 7, Kia, S.M., Vega-Pons, S., Weisz, N. & Passerini, A. «Interpretability of Multivariate Brain Maps in Linear Brain Decoding: Definition, and Heuristic Quantification in Multivariate Analysis of MEG Time-Locked Effects.» Frontiers in Neuroscience. (2017) /fnins Kriegeskorte, N., Rainer, G. & Bandettini, P. «Information-based functional brain mapping.» PNAS 103 (2006):

37 Useful references Michel, V., Gramfort, A., Varoquaux, G., Eger, E. & Thirion, B. «Total variation regularization for fmri-based prediction of behavior.» IEEE Trans Med Imaging. (2011) 30, Rakotomamonjy, A., Bach, F., Canu, S. & Grandvalet, Y. «SimpleMKL» Journal of Machine Learning 9 (2008): Rondina J., Hahn T., de Oliveira L., Marquand A., Dresler T., Leitner T., Fallgatter A., Shawe-Taylor J. & Mourao-Miranda J. «SCoRS - a method based on stability for feature selection and mapping in neuroimaging.» IEEE Trans Med Imaging. (2014) Jan:33(1). Sato, J.R., Mourao-Miranda, J., Morais Martin Mda G., Amaro E. Jr., Morettin P.A. & Brammer M.J. «The impact of functional connectivity changes on support vector machines mapping of fmri data.» J Neurosci Methods. (2008) 172(1): Schrouff, J., Cremers, J., Garraux, G., Baldassarre, L., Mourão-Miranda, J. & Phillips, C. «Localizing and comparing weight maps generated from linear kernel machine learning models.» Proceedings of the 3rd workshop on Pattern Recognition in NeuroImaging Tibshirani, R. «Regression shrinkage and selection via the lasso.» Journal of the Royal Statistical Society. 58 (1996): Tzourio-Mazoyer, N., et al. «Automated Anatomical Labeling of activations in SPM using a Macroscopic Anatomical Parcellation of the MNI MRI single-subject brain.» NeuroImage 15 (2002): Zou, H., & Hastie, T. «Regularization and variable selection via the elastic net.» J. R. Statist. Soc. B 67 (2005):

Interpreting predictive models in terms of anatomically labelled regions

Interpreting predictive models in terms of anatomically labelled regions Data Feature extraction/selection Model specification Accuracy and significance Model interpretation 2 Data Feature extraction/selection