
1 SpaceNet: Multivariate brain decoding and segmentation. Elvis DOHMATOB (Joint work with: M. EICKENBERG, B. THIRION, & G. VAROQUAUX). [Figure: example weights map, axial slice y=20.]

2 1 Introducing the model

3 1 Brain decoding
We are given: $n$ = number of scans; $p$ = number of voxels in the mask; a design matrix $X \in \mathbb{R}^{n \times p}$ (brain images); a response vector $y \in \mathbb{R}^n$ (external covariates). We need to predict $y$ on new data. Linear model assumption: $y \approx Xw$. We seek to estimate the weights map $w$ that ensures the best prediction / classification scores.
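To make the setup concrete, here is a hedged toy version in numpy (all sizes and values below are illustrative, not from the talk; the shapes are the point):

```python
import numpy as np

rng = np.random.RandomState(42)
n, p = 50, 1000                      # n << p, as is typical in fMRI decoding
X = rng.randn(n, p)                  # "brain images": n scans x p voxels
w_true = np.zeros(p)
w_true[:20] = 1.                     # a sparse ground-truth weights map
y = X @ w_true + 0.1 * rng.randn(n)  # linear model: y ~ X w + noise

# Plain least-squares is hopeless here (underdetermined system), which is
# exactly why the next slide introduces regularization.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
```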

4 1 The need for regularization
Ill-posed problem: high-dimensional ($n \ll p$); typically $n$ is small while $p$ is very large. We need regularization to reduce dimensions and to encode the practitioner's priors on the weights $w$.

5 1 Why spatial priors?
3D spatial gradient (a linear operator): $\nabla : w \in \mathbb{R}^p \mapsto (\nabla_x w, \nabla_y w, \nabla_z w) \in \mathbb{R}^{p \times 3}$. Penalizing the image gradient $\nabla w$ encourages piecewise-constant regions. Such priors are reasonable since brain activity is spatially correlated; they yield more stable weights maps that are also more predictive than unstructured priors (e.g. SVM) [Hebiri 2011, Michel 2011, Baldassare 2012, Grosenick 2013, Gramfort 2013].
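A minimal sketch of this gradient operator, assuming simple forward differences on a 3D volume (the exact discretization used in SpaceNet may differ):

```python
import numpy as np

def spatial_gradient(w_vol):
    """Forward-difference 3D gradient: maps a volume to its (x, y, z) derivatives."""
    grad = np.zeros((3,) + w_vol.shape)
    grad[0, :-1, :, :] = np.diff(w_vol, axis=0)
    grad[1, :, :-1, :] = np.diff(w_vol, axis=1)
    grad[2, :, :, :-1] = np.diff(w_vol, axis=2)
    return grad

w_vol = np.random.randn(10, 12, 8)   # toy 3D weights map
# Isotropic total variation: sum over voxels of the gradient magnitude.
tv = np.sqrt((spatial_gradient(w_vol) ** 2).sum(axis=0)).sum()
```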

6 1 SpaceNet
SpaceNet is a family of structure + sparsity priors for regularizing brain-decoding models. SpaceNet generalizes TV [Michel 2011], Smooth-Lasso / GraphNet [Hebiri 2011, Grosenick 2013], and TV-L1 [Baldassare 2012, Gramfort 2013].

7 2 Methods

8 2 The SpaceNet regularized model
$y = Xw + \text{error}$. Optimization problem (regularized model): minimize $\frac{1}{2}\|y - Xw\|_2^2 + \text{penalty}(w)$. Here $\frac{1}{2}\|y - Xw\|_2^2$ is the loss term, and will be different for squared loss, logistic loss, ...

9 2 The SpaceNet regularized model
$\text{penalty}(w) = \alpha\,\Omega_\rho(w)$, where
$\Omega_\rho(w) := \rho\,\|w\|_1 + (1 - \rho)\,\frac{1}{2}\|\nabla w\|_2^2$ for GraphNet,
$\Omega_\rho(w) := \rho\,\|w\|_1 + (1 - \rho)\,\|w\|_{TV}$ for TV-L1, ...
$\alpha$ ($0 < \alpha < +\infty$) is the total amount of regularization; $\rho$ ($0 < \rho \le 1$) is a mixing constant called the $\ell_1$-ratio; $\rho = 1$ gives the Lasso.

10 2 The SpaceNet regularized model (cont.)
The problem is convex, non-smooth, and heavily ill-conditioned.
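As an illustration, both penalties can be written down directly. The sketch below uses numpy's central-difference `np.gradient` as a stand-in for the operator $\nabla$ above; it is an assumption for readability, not the package's exact code:

```python
import numpy as np

def omega_graphnet(w_vol, rho):
    """rho * ||w||_1 + (1 - rho) * 0.5 * ||grad w||^2 (GraphNet)."""
    gx, gy, gz = np.gradient(w_vol)          # central-difference gradient
    smooth = 0.5 * (gx ** 2 + gy ** 2 + gz ** 2).sum()
    return rho * np.abs(w_vol).sum() + (1. - rho) * smooth

def omega_tv_l1(w_vol, rho):
    """rho * ||w||_1 + (1 - rho) * ||w||_TV (TV-L1)."""
    gx, gy, gz = np.gradient(w_vol)
    tv = np.sqrt(gx ** 2 + gy ** 2 + gz ** 2).sum()   # isotropic TV
    return rho * np.abs(w_vol).sum() + (1. - rho) * tv

w_vol = np.random.randn(10, 12, 8)
penalty = 0.5 * omega_tv_l1(w_vol, rho=0.3)  # alpha * Omega_rho(w), with alpha = 0.5
```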

11 2 Interlude: zoom on ISTA-based algorithms
Setting: $\min f + g$, with $f$ smooth and $g$ non-smooth, both convex, and $\nabla f$ $L$-Lipschitz.
ISTA: $O(L_f/\epsilon)$ iterations [Daubechies 2004]. Step 1: gradient descent on $f$. Step 2: proximal operator of $g$.
FISTA: $O(\sqrt{L_f/\epsilon})$ iterations [Beck & Teboulle 2009] = ISTA with a Nesterov acceleration trick!
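For concreteness, here is a minimal FISTA sketch for the lasso instance, $f(w) = \frac{1}{2}\|y - Xw\|_2^2$ and $g(w) = \lambda\|w\|_1$, whose prox is the soft-thresholding operator. This is a generic textbook version, not the SpaceNet solver itself:

```python
import numpy as np

def soft_threshold(w, t):
    """Prox of t * ||.||_1: shrink each coordinate toward zero by t."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.)

def fista(X, y, lam, n_iter=200):
    L = np.linalg.norm(X, 2) ** 2        # Lipschitz constant of grad f
    w = z = np.zeros(X.shape[1])
    t = 1.
    for _ in range(n_iter):
        # ISTA step taken at the auxiliary point z: gradient descent, then prox.
        w_new = soft_threshold(z - X.T @ (X @ z - y) / L, lam / L)
        # Nesterov acceleration: extrapolate using the previous iterate.
        t_new = (1. + np.sqrt(1. + 4. * t ** 2)) / 2.
        z = w_new + ((t - 1.) / t_new) * (w_new - w)
        w, t = w_new, t_new
    return w

# Usage: w_hat = fista(X, y, lam=0.1)
```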

12 2 FISTA: Implementation for TV-L1
[Table: convergence times in seconds as a function of $\alpha$ and the $\ell_1$-ratio; see DOHMATOB 2014 (PRNI).]

13 2 FISTA: Implementation for GraphNet
Augment $X$: $\tilde{X} := [X;\, c_{\alpha,\rho}\nabla] \in \mathbb{R}^{(n+3p) \times p}$, so that $\tilde{X}z^{(t)}$ stacks $Xz^{(t)}$ on top of $c_{\alpha,\rho}\nabla z^{(t)}$; set $\tilde{y} := [y; 0]$.
1. Gradient-descent step (datafit term): $w^{(t+1)} \leftarrow z^{(t)} - \gamma\,\tilde{X}^T(\tilde{X}z^{(t)} - \tilde{y})$
2. Prox step (penalty term): $w^{(t+1)} \leftarrow \text{soft}_{\alpha\rho\gamma}(w^{(t+1)})$
3. Nesterov acceleration: $z^{(t+1)} \leftarrow (1 + \theta^{(t)})\,w^{(t+1)} - \theta^{(t)}\,w^{(t)}$
Bottleneck: 80% of the runtime is spent computing $\tilde{X}z^{(t)}$! We badly need a speedup!
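A hedged sketch of this augmentation trick on a toy grid, reusing `fista()` and `soft_threshold()` from the previous snippet. The slide does not spell out $c_{\alpha,\rho}$, so the form $c = \sqrt{\alpha(1-\rho)}$ below is an assumption, chosen so the augmented data-fit reproduces the smooth $\alpha(1-\rho)\frac{1}{2}\|\nabla w\|^2$ part of the GraphNet penalty:

```python
import numpy as np

def grad_op(v):
    """Forward-difference 3D gradient of a volume, shape (3, *v.shape)."""
    g = np.zeros((3,) + v.shape)
    g[0, :-1] = np.diff(v, axis=0)
    g[1, :, :-1] = np.diff(v, axis=1)
    g[2, :, :, :-1] = np.diff(v, axis=2)
    return g

def operator_to_matrix(op, shape):
    """Materialize a linear operator on volumes as a dense matrix (toy sizes only;
    on real data X~z is applied implicitly, hence the 80% runtime bottleneck)."""
    p = int(np.prod(shape))
    return np.array([op(e.reshape(shape)).ravel() for e in np.eye(p)]).T

rng = np.random.RandomState(0)
shape, n = (4, 4, 3), 30
p = int(np.prod(shape))
X, y = rng.randn(n, p), rng.randn(n)
alpha, rho = 0.1, 0.5

c = np.sqrt(alpha * (1. - rho))          # assumed form of c_{alpha, rho}
X_aug = np.vstack([X, c * operator_to_matrix(grad_op, shape)])  # (n + 3p, p)
y_aug = np.concatenate([y, np.zeros(3 * p)])
# 0.5*||y_aug - X_aug w||^2 = 0.5*||y - Xw||^2 + alpha*(1-rho)*0.5*||grad w||^2,
# so only the l1 part remains in the prox: run plain FISTA with lam = alpha*rho.
w_hat = fista(X_aug, y_aug, lam=alpha * rho)
```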

14 2 Speedup via univariate screening
Whereby we detect and remove irrelevant voxels before the optimization problem is even entered!

15–18 2 $X^Ty$ maps: relevant voxels stick out
[Figure: $X^Ty$ maps (axial slice y=20, L/R marked), restricted to 100%, 50%, and 20% of the brain volume.]
The 20% mask has the 3 bright blobs we would expect to get... but contains far fewer voxels, hence less run-time.

19 2 Our screening heuristic
Let $t_p$ := $p$th percentile of the vector $|X^Ty|$. Discard the $j$th voxel if $|X_j^Ty| < t_p$.
[Figure: screened maps at y=20, keeping k = 100%, 50%, and 20% of the voxels.]
This is marginal screening [Lee 2014], but without the (invertibility) restriction $k \le \min(n, p)$. The regularization will do the rest...
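In code, the heuristic is a few lines of numpy (a sketch, not the package implementation):

```python
import numpy as np

def screen_voxels(X, y, percentile=80.):
    """Return indices of voxels surviving univariate (marginal) screening."""
    relevance = np.abs(X.T @ y)                 # |X_j^T y|, one value per voxel
    t_p = np.percentile(relevance, percentile)  # keep the top (100 - percentile)%
    return np.where(relevance >= t_p)[0]

# e.g. keep the top 20% of voxels, then solve the (much smaller) problem:
# support = screen_voxels(X, y, percentile=80.); X_small = X[:, support]
```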

20 2 Our screening heuristic (cont.)
See [DOHMATOB 2015 (PRNI)] for a more detailed exposition of the speedup heuristics developed.

21 2 Automatic model selection via cross-validation
Regularization parameters: $0 < \alpha_L < \dots < \alpha_3 < \alpha_2 < \alpha_1 = \alpha_{\max}$; mixing constants: $0 < \rho_M < \dots < \rho_3 < \rho_2 < \rho_1 \le 1$. Thus an $L \times M$ grid to search over for the best parameters:
$(\alpha_1, \rho_1)$ $(\alpha_1, \rho_2)$ $(\alpha_1, \rho_3)$ ... $(\alpha_1, \rho_M)$
$(\alpha_2, \rho_1)$ $(\alpha_2, \rho_2)$ $(\alpha_2, \rho_3)$ ... $(\alpha_2, \rho_M)$
$(\alpha_3, \rho_1)$ $(\alpha_3, \rho_2)$ $(\alpha_3, \rho_3)$ ... $(\alpha_3, \rho_M)$
...
$(\alpha_L, \rho_1)$ $(\alpha_L, \rho_2)$ $(\alpha_L, \rho_3)$ ... $(\alpha_L, \rho_M)$
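Building such a grid explicitly (illustrative values; in practice $\alpha_{\max}$ is computed from the data):

```python
import itertools
import numpy as np

alphas = np.logspace(0, -3, 4)      # alpha_1 = alpha_max > ... > alpha_L
rhos = [0.25, 0.5, 0.75, 1.0]       # rho = 1 recovers the Lasso
grid = list(itertools.product(alphas, rhos))  # L*M candidate (alpha, rho) pairs
```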

22 2 Automatic model selection via cross-validation (cont.)
The final model uses the average of the per-fold best weights maps (bagging). This bagging strategy yields more stable and robust weights maps.
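A minimal stand-in for this select-then-average strategy, using scikit-learn's `Lasso` in place of the SpaceNet estimator (the real implementation lives in Nilearn):

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.model_selection import KFold

rng = np.random.RandomState(0)
X = rng.randn(100, 500)                 # n scans x p voxels (synthetic)
w_true = np.zeros(500)
w_true[:10] = 1.
y = X @ w_true + .1 * rng.randn(100)

alphas = np.logspace(-3, 0, 10)         # grid of regularization strengths
fold_weights = []
for train, test in KFold(n_splits=5).split(X):
    best_score, best_w = -np.inf, None
    for alpha in alphas:                # grid-search the penalty on this fold
        model = Lasso(alpha=alpha).fit(X[train], y[train])
        score = model.score(X[test], y[test])
        if score > best_score:
            best_score, best_w = score, model.coef_
    fold_weights.append(best_w)

# Final map: average of the per-fold winners (bagging).
w_bagged = np.mean(fold_weights, axis=0)
```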

23 3 Some experimental results

24 3 Weights: SpaceNet versus SVM
Faces vs objects classification on [Haxby 2001]. [Figure: weights maps for Smooth-Lasso, TV-L1, and SVM.]

25 3 Classification scores: SpaceNet versus SVM

26 3 Concluding remarks
SpaceNet enforces both sparsity and structure, leading to better prediction / classification scores and more interpretable brain maps. The code runs in 15 minutes on simple datasets, and 30 minutes on very difficult ones.
27 3 Concluding remarks (cont.)
In the next release, SpaceNet will feature as part of Nilearn [Abraham et al. 2014].

28 3 Why do $X^Ty$ maps give a good relevance measure?
In an orthogonal design, the least-squares solution is $\hat{w}_{LS} = (X^TX)^{-1}X^Ty = I^{-1}X^Ty = X^Ty$. (Intuition) $X^Ty$ bears some information on the optimal solution even for general $X$.
29 3 (cont.) Marginal screening: set $S$ := indices of the top $k$ voxels $j$ in terms of $|X_j^Ty|$. In [Lee 2014], $k \le \min(n, p)$, so that $\hat{w}_{LS} \approx (X_S^TX_S)^{-1}X_S^Ty$. We don't require the invertibility condition $k \le \min(n, p)$; our spatial regularization will do the rest!
30 3 (cont.) Lots of screening rules out there: [El Ghaoui 2010, Liu 2014, Wang 2015, Tibshirani 2010, Fercoq 2015].
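A quick numerical check of the orthogonal-design identity, on synthetic data:

```python
import numpy as np

rng = np.random.RandomState(0)
Q, _ = np.linalg.qr(rng.randn(50, 20))   # design with orthonormal columns: Q^T Q = I
w_true = rng.randn(20)
y = Q @ w_true                            # noiseless, for an exact check

w_ls = np.linalg.lstsq(Q, y, rcond=None)[0]
assert np.allclose(w_ls, Q.T @ y)         # X^T y equals the least-squares solution
```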
