MultiVariate Bayesian (MVB) decoding of brain images

MultiVariate Bayesian (MVB) decoding of brain images Alexa Morcom Edinburgh SPM course 2015 With thanks to J. Daunizeau, K. Brodersen for slides

stimulus behaviour encoding of sensorial or cognitive state? experimental variables (design matrix) encoding neuronal measures (fmri data) X v decoding Y n, n v What if neuronal responses are distributed (over space)? Haxby et al., 2001

1 Introduction Overview of the talk 1.1 Lexicon 1.2 Decoding : so what? 1.3 Multivariate: so what? 1.4 Preliminary statistical considerations 2 Multivariate Bayesian decoding 2.1 From classical encoding to Bayesian decoding 2.2 Hierarchical priors on patterns 2.3 Probabilistic inference 3 Example 4 Summary

Lexicon the jargon to swallow 1 Encoding or decoding? An encoding model (or generative model) relates context (independent variable) to brain activity (dependent variable). X Y A decoding model (or recognition model) relates brain activity (independent variable) to context (dependent variable). Y X 2 Univariate or multivariate? In a univariate model, brain activity is the signal measured in one voxel. Y In a multivariate model, brain activity is the signal measured in many voxels (NB: decoding ill-posed problem). Y n, n v 3 Regression or classification? In a regression model, the dependent variable is continuous. In a classification model, the dependent variable is categorical (typically binary). X or X { 1, 1} n Y

Decoding : so what? The seminal approach: classification fmri time series 1 feature extraction e.g., voxels voxels trials classification accuracy e.g., leave-one-out 2 train: learn mapping Y X test: cross-validate Y train Y test X train A A B A B A A B A A A B A A A X test?

Decoding : so what? Reversing the X-Y mapping: target questions (1) X-Y mapping overall reliability (2) X-Y mapping spatial deployment Accuracy [%] 80% 100 % 50 % Left or right button? Truth or lie? Healthy or diseased? Classification task 55% (3) X-Y mapping temporal evolution (4) X-Y mapping: subtle issues Accuracy [%] 100 % Participant indicates decision functionally selective vs segregated representations degenerate (many-to-one) structure-function mappings 50 % Accuracy rises above chance Intra-trial time

Multivariate: so what? Well, we might need it. Multivariate approaches can reveal information jointly encoded by several voxels. This is because multivariate distance measures can take into account correlations among voxels

Multivariate: so what? Why we might need it: subvoxel processing. Multivariate approaches can exploit a sampling bias in voxelized images. Such subvoxel processing is unlikely to be detected by univariate methods. Boynton 2005 Nature Neuroscience

Preliminary statistical considerations lessons from the Neyman-Pearson lemma Do neuronal responses encode some sensorial or cognitive state of the subject? Null assumption: there is no dependency between Y and X So what? Well 0 : H p Y X p Y Neyman-Pearson lemma: the likelihood ratio (or Bayes factor) p Y X p X Y u p Y p X is the most powerful test of size a to test the null choose threshold u such that 1 2 All we have to do is compare a model that links Y to X with a model that does not. The link can be from X to Y or from Y to X. From the point of view of inferring a link exists, its direction is not important (but ).

Preliminary statistical considerations prediction and inference Some confusion about the roles of prediction and inference may arise from the use of classification accuracy to infer a significant relationship between X and Y. This is because «cross-validation» relies on the predictive density: Note: new new,, new new,, p X Y X Y p X Y p X Y d where are unknown parameters of the mapping Y X to check the «generalization error» of the inferred mapping. 1 2 The only situation that legitimately requires us to predict a new target is when we do not know it, e.g.: - brain-computer interface - automated diagnostic classification When used in the context of experimental neuroscience, standard classifiers provide suboptimal inference on the mapping Y X

Preliminary statistical considerations prediction and inference 1 2 * * Although MVB alone does not provide this

From classical encoding to Bayesian decoding MVB: inferring on the multivariate X-Y mapping Multivariate analyses in SPM are not implemented in terms of the classification schemes outlined in the previous section. Instead, SPM brings decoding into the conventional inference framework of hierarchical models and their inversion (c.f. Neyman-Pearson lemma). MVB can be used to address two questions: Overall significance of the X-Y mapping (as with classical SPM or classifiers) using probabilistic inference (model comparison, cross-validation) Inference on the form of the X-Y mapping (no other alternative), e.g. 1 1. Identify the spatial structure of the X-Y mapping (smooth, sparse, etc ) 2 2. Tell whether the X-Y mapping is degenerate (many-to-one regions-to-function). 3. Disambiguate between category-specific representations that are functionally selective (with overlap) and functionally segregated (without). 3

From classical encoding to Bayesian decoding reversing the standard GLM Encoding models X as a cause X: scalar psychological target variable Decoding models X as a consequence X Y: noisy measurements of A A: underlying neuronal activity in n voxels T: BOLD convolution of A/X A X A A X G: confounds (null space of c) scaled by unknown Y TA G Y TA G Y g ( ) : X Y TX G TX g ( ) :Y X X A Y G

Hierarchical priors on patterns spatial deployment of the X-Y mapping Decoding models are typically ill-posed: there is an infinite number of equally likely solutions. We therefore require constraints or priors to estimate the voxel weights. MVB specifies several alternative coding hypotheses in terms of empirical spatial priors on voxel weights. project onto spatial basis function set: U patterns cov( ) Ucov( ) U T null: sparse vectors: smooth vectors: compact vectors: support vectors: U U I U( x, x i UDV T j U RY ) T exp( RY T 1 2 xi x 2 2 ( j) ) (compact: previously singular, now SVD of the smooth vectors)

Hierarchical priors on patterns Expectation-Maximization and the greedy search In pattern space: A nested set of pattern weights with initially equal variance Select partitions with largest weights until no increase in free energy F Simplified EM algorithm (see Friston et al., 2008)

Probabilistic inference classical inference with cross-validation p-values from a standard leave-one-out scheme can t be used for inference (train and test data are not independent) Recall compact form for the decoding model: WX RY W RT R I GG target variable weighting matrix: temporal convolution + confounds removal residual forming matrix: confounds removal Use train/test k-fold data features that are linearly independent: ˆ ( k ) Y( k ) train (identify mapping) test (measure generalization error) Y R Y ( k) ( k) R( k ) I G( k ) G( k ) ( k ) ( k ) R I G G ( k ) ( k ) ( k ) G G I ( k ) Xˆ WX? Xˆ ( k ) ˆ ( k ) R( k ) Y ( k ) G( k ) G I I

Example finger tapping dataset SPM{T} : left > right fixation cross + pace (right) >> record response SPM{T} : right > left 400 events (100 left, 100 right, 100 left & right, 100 null) average ITI = 2 sec block design (10 trials/block) TR = 1.3 sec

Example MVB in SPM: decoding within a search volume target: TX Xc design matrix contrast confounds: G X I cc Gc 0 (confounds = null space of the contrast)

Example predicted responses from left & right motor cortices MVB-based predictions closely match the observed responses. But crucially, they don t perfectly match them. Perfect match would indicate overfitting. target and (MVB) predicted responses over scans target versus (MVB) predicted responses

Example pattern sparsity The highest model evidence is achieved by a model that recruits 7 partitions. The weights attributed to each voxel in the sphere are sparse and bidirectional. This suggests sparse coding (in pattern space) log-evidence of partitions (relative to null) distribution of estimated weights

Example model comparison illustration The best model corresponds to a sparse representation of motion ; as one would expect from functional segregation. Better evidence for spatially sparse than smooth (clustered) coding log-evidence of X-Y bilateral mappings: effect of spatial deployment Sparse spatial model also better than compact (SVD on smooth model) or support models see also Morcom & Friston (2012), Neuroimage

Example model comparison illustration Model comparison between regions (given optimal sparse model of spatial deployment) log-evidence of X-Y sparse mappings: effect of lateralization The right-lateralized model is better than the left-lateralized model The bilateral model is better than the right-lateralized model Consistent with non-redundant (joint) coding of finger tapping

Example cross-validation : k-fold scheme k = 8 p-value < 0.0001 classification accuracy = 65.8% R-squared = 20.7% absolute correlation among k-fold feature weights test predictions versus test k-fold features maximum intensity projection: k P 0 Y( k )

Example episodic memory, MTL Distinct spatial patterns for overlapping memories only in hippocampus CA3, consistent w/ pattern separation Overlapping episodic memories Same events, different contexts (scenes) Use MVB to fit sparse patterns, then measure their correlation over voxels in MTL sub-regions A form of representational similarity analysis Chadwick, Bonnici, Maguire (2014, PNAS)

Summary 11. Inference on the form of the X-Y mapping rests on model comparison, using the marginal likelihood of competing models. The marginal likelihood derives from the specification of a generative model prescribing the form of the joint density over observations (X,Y) and model parameters (). 22. Multivariate models can map from experimental variables (X) to brain responses (Y) or from Y to X. In the latter case (i.e., decoding), identifying the mapping is an ill-posed problem, which is resolved with appropriate constraints or priors on model parameters. These constraints are part of the model and can be evaluated using model comparison. 33. Cross-validation is not necessary for decoding brain activity but generalization error is a proxy for testing whether the observed X-Y mapping is unlikely to have occured by chance. This can be useful when the null distribution of the likelihood ratio (i.e. Bayes factor) is not evaluated easily.

References Chadwick, M. J., Bonnici, H. M. and Maguire, E. a. (2014) CA3 size predicts the precision of memory recall, Proceedings of the National Academy of Sciences, 111(29), pp. 10720 5. doi: 10.1073/pnas.1319641111. Friston, K., Chu, C., Mouruo-Miranda, J., Hulme, O., Rees, G., Penny, W. and Ashburner, J. (2008) Bayesian decoding of brain images, NeuroImage, 39(1), pp. 181 205. doi: 10.1016/j.neuroimage.2007.08.013. Friston, K., Mattout, J., Trujillo-Barreto, N., Ashburner, J. and Penny, W. (2007) Variational free energy and the Laplace approximation, NeuroImage, 34(1), pp. 220 234. doi: 10.1016/j.neuroimage.2006.08.035. Morcom, A. M. and Friston, K. J. (2012) Decoding episodic memory in ageing: A Bayesian analysis of activity patterns predicting memory, NeuroImage, 59(2). doi: 10.1016/j.neuroimage.2011.08.071.