Applied Statistics and Machine Learning


1 Applied Statistics and Machine Learning: Logistic Regression and Generalized Linear Models; Model Selection, Lasso, and Structured Sparsity. Bin Yu, IMA, June 19, 2013

2 Classification is supervised learning. The Y's are 0's and 1's: Challenger data, MISR data. Task: relate predictors x's with Y for (1) prediction (IT sector, banking, etc.) and (2) interpretation: what are the important predictors? They are suggestive of interventions, for causal inference later.

3 Challenger data. These data are from Table 1 of the article "Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure" by Dalal, Fowlkes and Hoadley, Journal of the American Statistical Association, Vol. 84, No. 408 (Dec. 1989). I got them from Professor Stacey Hancock's website at Clark University. She has this Tukey quote at her site: "Far better an approximate answer to the right question, than the exact answer to the wrong question, which can always be made precise." - John Tukey

4 Challenger data. Variables:
Temp - temperature at launch
Failure - number of O-rings that failed
Failure1 - indicator of O-ring failure or not

5 Challenger data (jittered) [figure]

6 Suppose you are the first person who ever thought about the classification problem. What would you do? A method to fit the data, to come up with a prediction rule for the next launch at temp*. Postulate a statistical model (e.g., the normal regression model) for uncertainty statements.

7 Suppose you are the first person who ever thought about the classification problem. Fitting methods: nearest neighbor (NN), LS, logistic regression. How to fit? What is the criterion to fit? MAXIMUM LIKELIHOOD. In the derivations to come, we use notation and some materials from Dobson (2001).

8 Logistic regression model. What does this model mean for the Challenger data? What assumptions might be violated? Which are reasonable?
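The model display did not survive transcription; the standard binary logistic regression model, presumably what was shown, is

$$Y_i \sim \text{Bernoulli}(p_i) \text{ independently}, \qquad \log\frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_i,$$

equivalently $p_i = e^{\beta_0 + \beta_1 x_i}/(1 + e^{\beta_0 + \beta_1 x_i})$, where for the Challenger data $x_i$ is the launch temperature and $Y_i$ the O-ring failure indicator.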

9 Logistic regression model

10 Logistic regression model

11 Logistic regression model. Note that a linear approximation to the score function $U(\beta)$ is equivalent to a quadratic approximation to the log-likelihood $\ell(\beta)$.

12 Logistic regression model

13 Logistic regression model

14 Logistic regression model vs LS for Challenger data [figure]

15 Logistic regression vs LS for Challenger data [figure]

16 Generalized Linear Models (GLMs). GLMs form a statistical framework that unifies normal regression models (for continuous data), logistic (probit) regression models (for binary data), and log-linear models (for count data). Logistic (probit) regression models can also be used for multi-class classification. Original paper on GLMs: Nelder & Wedderburn (1972), GLMs, Journal of the Royal Statistical Society, Series A (JRSS-A). Books: A. J. Dobson (2001), An Introduction to GLMs (2nd ed.); P. McCullagh & J. A. Nelder (1989), Generalized Linear Models (2nd ed.).

17 Exponential families:
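The defining formula did not survive transcription; in Dobson's (2001) notation, which the lecture follows, a one-parameter exponential family has densities of the form

$$f(y;\theta) = \exp\{a(y)\,b(\theta) + c(\theta) + d(y)\},$$

and differentiating the identity $\int f(y;\theta)\,dy = 1$ gives

$$E[a(Y)] = -\frac{c'(\theta)}{b'(\theta)}, \qquad V[a(Y)] = \frac{b''(\theta)\,c'(\theta) - c''(\theta)\,b'(\theta)}{b'(\theta)^3}.$$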

18 A sketch of proof. HW: derive the formula for V(a(Y)).

19 GLMs: the link function is the key; it relates the response variable to $\beta$ in a linear way.
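As a hedged reconstruction of the missing display: a GLM assumes independent responses $Y_i$ from an exponential family with means $\mu_i = E(Y_i)$, together with a monotone, differentiable link function $g$ such that

$$g(\mu_i) = x_i^T \beta,$$

so the mean is related to the predictors linearly on the $g$-scale; the logit link $g(\mu) = \log\{\mu/(1-\mu)\}$ recovers logistic regression, and $g(\mu) = \log\mu$ gives the log-linear model for counts.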

20 Likelihood function of a GLM

21 Maximum Likelihood Estimation (MLE)

22 Parametric bootstrap for GLMs. Fit a GLM by MLE and then draw n samples from the fitted distribution. Note that this is not the same as the bootstrap described for the regression model, where we sample from the residuals rather than from a normal distribution fitted to the residuals. The parametric bootstrap works for nice parametric families, typically when asymptotic normality holds; for example, the bootstrap does not work for Unif(0, a) with a as the parameter. The non-parametric bootstrap samples directly from the observed data.
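A minimal sketch of this procedure for a logistic GLM, using the standard statsmodels GLM API; the function and variable names and B = 200 replicates are illustrative, and X is assumed to already contain an intercept column.

    import numpy as np
    import statsmodels.api as sm

    def parametric_bootstrap_logistic(X, y, B=200, seed=0):
        # Fit the GLM by MLE, then resample responses from the *fitted*
        # Bernoulli distribution (not from residuals) and refit B times.
        rng = np.random.default_rng(seed)
        fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
        p_hat = fit.predict(X)                 # fitted success probabilities
        boot = np.empty((B, X.shape[1]))
        for b in range(B):
            y_star = rng.binomial(1, p_hat)    # sample from the fitted model
            boot[b] = sm.GLM(y_star, X, family=sm.families.Binomial()).fit().params
        return fit.params, boot                # boot.std(axis=0) gives bootstrap SEs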

23 How to compute the MLE in GLMs: IRWLS

24 IRWLS algorithm for MLE in GLMs

25 Statistical interpretation of the IRWLS algorithm. Each update is the solution to a weighted LS problem, with a weight vector computed from the previous iterate.

26 Statistical interpretation of IRWLS

27 IRWLS in words. IRWLS is an iterative algorithm. At each iteration, IRWLS solves a WLS problem that maximizes the log-likelihood of a heteroscedastic linear regression model in the g-domain (where g(mu_i) is approximately linear in $\beta$), with the variances of the g(y_i) treated as known from the previous iteration.
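A minimal numpy sketch of IRWLS for the logistic GLM with canonical (logit) link, written from the description above rather than from the lecture's own materials; the clipping of the weights is a numerical safeguard I added.

    import numpy as np

    def irwls_logistic(X, y, n_iter=25, tol=1e-10):
        # IRWLS (Fisher scoring) for logistic regression, canonical link.
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ beta                             # linear predictor (g-domain)
            mu = 1.0 / (1.0 + np.exp(-eta))            # inverse logit
            w = np.clip(mu * (1.0 - mu), 1e-10, None)  # iterative weights Var(Y_i)
            z = eta + (y - mu) / w                     # working response in g-domain
            # WLS step: solve (X' W X) beta_new = X' W z
            XtW = X.T * w
            beta_new = np.linalg.solve(XtW @ X, XtW @ z)
            if np.max(np.abs(beta_new - beta)) < tol:
                return beta_new
            beta = beta_new
        return beta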

28 Back to the Movie-fMRI Data (Nishimoto et al., 2011). 7200s training (1 replicate) and 5400s test (10 replicates).

29 How do stimuli evoke brain signals? Quantitative models: both stimulus and response are high-dimensional. [Diagram: an encoding model maps natural input (image or video/movie) to the fMRI of the brain; a decoding model maps back. Dayan and Abbott (2005)]

30 Movie reconstruction results for 3 subjects [figure]

31 Mind-Reading Computers in the Media: one of the 50 best inventions of 2011 by Time Magazine. Others: Economist, NPR, ...

32 What model is behind the movie reconstruction algorithm? Is the model interpretable and reliable?

33 Domain knowledge: key for big-data discovery. Hubel and Wiesel (1959) discovered, in neurons of the primary visual cortex (area V1), orientation and location selectivity, and excitatory and inhibitory regions.

34 Modern description of the Hubel-Wiesel work: early visual area V1. Preprocessing an image: Gabor filters corresponding to particular spatial frequencies, locations, and orientations (Hubel and Wiesel, 1959, ...). Sparse representation after Gabor filters, static or dynamic.

35 2D Gabor Features [figure]

36 3D Gabor Features. Data split into 1-second movie clips; 3D Gabor filters applied to get features of a movie clip in a 26K-dimensional space.

37 Regularization: key for big-data discovery. Two regularization methods are behind the movie reconstruction algorithm, after tons of work on feature construction based on domain knowledge by humans:
- Encoding through L1-penalized least squares (LS), or Lasso: a separate sparse linear model is fitted to the features for each voxel via Lasso (Chen and Donoho, 94; Tibshirani, 96) (cf. e-L2Boosting).
- Decoding through L2 or Tikhonov regularization of the sample covariance matrix of residuals across voxels (cf. kernel machines).

38 Given a voxel: n = 7K, p = 26K 3D image wavelet features (each of which has a location). For each frame, the response is the fMRI signal at the voxel. An underdetermined problem, since p is much larger than n.

39 Movie-fMRI: linear encoding model for a voxel. For each voxel and the i-th movie clip, we postulate a linear encoding (regression) model:

$$Y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i = X_i^T \beta + \epsilon_i,$$

where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ is the feature vector of the movie clip, $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the weight vector that combines feature strengths into the mean fMRI response, $\epsilon_i$ is the disturbance or noise term, and $Y_i$ is the fMRI response.

40 Movie-fMRI: finding the weight vector. Least squares: find $\beta_1, \beta_2, \ldots, \beta_p$ to minimize

$$\sum_{i=1}^n (Y_i - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip})^2 = \|Y - X\beta\|^2.$$

Since p = 26,000 >> n = 7,200, this LS problem has many solutions.

41 Why doesn't LS work when p >> n? Reason: collinearity of the columns of X; it also happens in low dimensions. [Figure: least squares function surfaces as a function of $(\beta_1, \beta_2)$]

42 How to fix this problem? In general, impossible. However, in our case, for each voxel, Hubel and Wiesel's work suggests that only a small number of the 26,000 predictors are active: sparsity! This prior information motivates a sparsity-enforcing revision of LS: Lasso = Least Absolute Shrinkage and Selection Operator.

43 Modeling history at the Gallant Lab. Prediction on a validation set is the benchmark. Methods tried: neural nets, SVMs, Lasso, ... Among models with similar predictions, the simpler (sparser) models from Lasso are preferred for interpretation. This practice reflects a general trend in statistical machine learning: moving from prediction alone to simpler/sparser models for interpretation, faster computation, or data transmission.

44 Occam's Razor. 14th-century English logician and Franciscan friar, William of Ockham. Principle of Parsimony: "Entities must not be multiplied beyond necessity." (Wikipedia)

45 Occam's Razor via model selection in linear regression. Maximum likelihood (ML) is LS under the Gaussian assumption. With p predictors there are 2^p submodels. ML goes for the largest submodel, with all predictors. The largest model often gives bad predictions when p is large.

46 Model Selection Criteria. Akaike (73, 74) and Mallows' (1973) Cp used estimated prediction errors to choose a model (assuming $\sigma^2$ is known). Schwarz (1978) used asymptotic approximations to negative log posterior probabilities to choose a model (assuming $\sigma^2$ is known). Both are penalized LS, with a penalty proportional to the number of parameters. Rissanen's Minimum Description Length (MDL) principle gives rise to many different criteria; the two-part code leads to BIC (see, e.g., Rissanen (1978) and the review article by Hansen and Yu (2000)).
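The penalty formulas on this slide were lost in transcription; for Gaussian linear regression with known $\sigma^2$, the standard forms (supplied here, not quoted from the slides) are, for a submodel $S$ with $|S|$ predictors,

$$C_p/\text{AIC}:\ \text{RSS}(S) + 2\sigma^2 |S|, \qquad \text{BIC}:\ \text{RSS}(S) + (\log n)\,\sigma^2 |S|.$$

Both trade goodness of fit against model size; BIC's heavier $\log n$ penalty selects sparser models.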

47 More details on AIC

48 More details on AIC

49 More details on AIC. PE = expected prediction error.

50 More details on AIC. Assume the linear model is correct, with known noise variance $\sigma^2$ and p fitted parameters. Then $\text{PE} = E\|Y^{\text{new}} - X\hat\beta\|^2 = \sigma^2(n + p)$. Hence when p increases, the prediction error increases, because a more complex model is being estimated, with an associated larger variance. How can we use RSS to estimate PE?

51 More details on AIC. Since $E(\text{RSS}) = \sigma^2(n - p)$, RSS underestimates PE by $2\sigma^2 p$; correcting this bias gives the unbiased estimate $\text{RSS} + 2\sigma^2 p$, the Cp/AIC criterion.

52 More details on BIC

53 Model selection for the movie-fMRI problem. For the linear encoding model, the number of submodels is 2^26,000. Combinatorial search: too expensive, and often not necessary. A recent alternative: continuous embedding into a convex optimization problem through L1-penalized LS (Lasso).

54 Lasso: the L1 norm as a penalty on the L2 loss. The L1 penalty is defined for coefficients $\beta$ as $\|\beta\|_1 = \sum_j |\beta_j|$. Used initially with the L2 loss: signal processing, Basis Pursuit (Chen & Donoho, 1994); statistics, Non-Negative Garrote (Breiman, 1995); statistics, LASSO (Tibshirani, 1996):

$$\hat\beta(\lambda) = \arg\min_\beta \{\, \|Y - X\beta\|^2 + \lambda \|\beta\|_1 \,\}.$$

The smoothing parameter $\lambda$ is often selected by cross-validation (CV).
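A small illustration of the Lasso with CV-selected smoothing parameter, using scikit-learn's LassoCV (real API); the simulated data and dimensions are illustrative stand-ins for one voxel's encoding problem.

    import numpy as np
    from sklearn.linear_model import LassoCV

    # Illustrative stand-ins: n clips, p Gabor features, sparse truth.
    rng = np.random.default_rng(0)
    n, p = 200, 1000
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:5] = 2.0
    y = X @ beta_true + rng.standard_normal(n)

    # Lasso path with lambda (sklearn's alpha) chosen by 5-fold CV.
    fit = LassoCV(cv=5).fit(X, y)
    print("chosen alpha:", fit.alpha_)
    print("nonzero coefficients:", np.sum(fit.coef_ != 0))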

55 Lasso eases the instability problem of LS. [Figure: Lasso function surfaces as a function of $(\beta_1, \beta_2)$]

56 Recall: why doesn't LS work when p >> n? Reason: collinearity of the columns of X; it also happens in low dimensions. [Figure: least squares function surfaces as a function of $(\beta_1, \beta_2)$]

57 Lasso: computation. Initially: a quadratic program (QP), called once for each $\lambda$ on a grid. Later: path-following algorithms, such as homotopy by Osborne et al. (2000) and LARS by Efron et al. (2004). Current: first-order or gradient-based algorithms for large p (see Mairal's lecture).
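To illustrate the first-order methods mentioned above, here is a sketch of ISTA (proximal gradient descent) for the Lasso objective $\|Y - X\beta\|^2 + \lambda\|\beta\|_1$; this is a generic textbook algorithm, not necessarily the one used in the lecture.

    import numpy as np

    def soft_threshold(v, t):
        # Soft-thresholding: the proximal operator of t * ||.||_1.
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(X, y, lam, n_iter=500):
        # ISTA for min_beta ||y - X beta||^2 + lam * ||beta||_1.
        L = 2.0 * np.linalg.norm(X, 2) ** 2    # Lipschitz constant of the gradient
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            grad = 2.0 * X.T @ (X @ beta - y)  # gradient of the squared-error term
            beta = soft_threshold(beta - grad / L, lam / L)
        return beta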

58 Recent theory on Lasso (more in my last lecture). Under a sparse high-dimensional linear regression model and appropriate conditions: Lasso is model-selection consistent (irrepresentable condition); Lasso has the optimal L2 estimation rate (restricted eigenvalue condition). Selective references: Freund and Schapire (1996), Chen and Donoho (1994), Tibshirani (1996), Friedman (2001), Efron et al. (2004), Zhao and Yu (2006), Meinshausen and Buhlmann (2006), Wainwright (2009), Candes and Tao (2005), Meinshausen and Yu (2009), Huang and Zhang (2009), Bickel, Ritov and Tsybakov (2009), Raskutti, Wainwright and Yu (2010), Negahban et al. (2012).

59 Encoding: energy-motion model necessary [figure]

60 Encoding: sparsity necessary. Sparse regression improves prediction over OLS. [Figure: prediction accuracy on full-brain data, linear regression vs. sparse regression]

61 Knowledge discovery: interpreting encoding models. Spatial locations of the selected features are suggestive of the driving factors for brain activity at a voxel. [Figure: features selected by Lasso+CV (CV = cross-validation) for Voxels A, B, C] Prediction scores on Voxels A-C are 0.72 (CV).

62 ES-CV: Estimation Stability (ES) (Lim & Yu, 2013). Given a smoothing parameter $\lambda$, divide the data units into M blocks. Get the Lasso estimate $\hat\beta_m(\lambda)$ for the data with the m-th block deleted, and $X\hat\beta_m(\lambda)$ for the mean regression function; form the average

$$\hat\beta(\lambda) = \frac{1}{M}\sum_m \hat\beta_m(\lambda).$$

Define the estimation stability (ES) measure as

$$ES(\lambda) = \frac{\frac{1}{M}\sum_m \|X\hat\beta_m(\lambda) - X\hat\beta(\lambda)\|^2}{\|X\hat\beta(\lambda)\|^2},$$

which is the reciprocal of a test statistic for testing $H_0: X\beta = 0$.

63 ES-CV (or SSCV): Estimation Stability (ES) + CV (continued). ES-CV selection criterion for the smoothing parameter $\lambda$: choose the $\lambda$ that minimizes ES($\lambda$) and is not smaller than the CV selection. Related works: Shao (95), Breiman (96), Bach (08), Meinshausen and Buhlmann (2008), ...
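A hedged numpy/scikit-learn sketch of the ES measure under the block-deletion scheme above; the helper names are mine, not from Lim & Yu's code, and note that scikit-learn's alpha is a rescaled version of the $\lambda$ in the earlier display (it divides the squared error by 2n), so the two should not be equated numerically.

    import numpy as np
    from sklearn.linear_model import Lasso

    def es_measure(X, y, lam, M=5):
        # Estimation stability ES(lambda) from M delete-one-block Lasso fits.
        n = len(y)
        blocks = np.array_split(np.arange(n), M)
        fits = []
        for blk in blocks:
            keep = np.setdiff1d(np.arange(n), blk)   # delete the m-th block
            beta_m = Lasso(alpha=lam, fit_intercept=False).fit(X[keep], y[keep]).coef_
            fits.append(X @ beta_m)                  # estimated mean function
        fits = np.array(fits)                        # M x n matrix
        mean_fit = fits.mean(axis=0)                 # X beta_bar(lambda)
        return np.mean(np.sum((fits - mean_fit) ** 2, axis=1)) / np.sum(mean_fit ** 2)

ES-CV then scans a grid of lambdas, keeping only those at least as large as the CV choice, and picks the minimizer of es_measure.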

64 Back to the fMRI problem: spatial locations of selected features. [Figure: locations of features selected under CV and ES-CV for Voxels A, B, C] Prediction on Voxels A-C: CV 0.72, ES-CV 0.7.

65 ESCV: sparsity gain (60%) with no prediction loss (-1.3%). [Table: prediction (correlation) and model size for SSCV vs. CV, with % change; based on validation data for 2088 voxels]

66 ES-CV: desirable properties. CV (cross-validation) is widely used in practice. ES-CV is an effective improvement over CV in stability, and hence in the interpretability and reliability of results. Computational cost similar to CV; as easily parallelizable as CV; empirically sensible and nonparametric. Other forms of perturbation include: sub-sampling, bootstrap, variable permutation, ...

67 Structured sparsity: Composite Absolute Penalties (CAP) (Zhao, Rocha and Yu, 09).
Motivations: side information available on predictors and/or sparsity at the group level; extra regularization needed (p >> n) beyond what Lasso provides.
Examples of groups: genes belonging to the same pathway; categorical variables represented by dummies; noisy measurements of the same variable.
Examples of hierarchy: multi-resolution/wavelet models; interaction terms in factorial analysis (ANOVA); order selection in Markov chain models.
Existing works can be seen as special cases of CAP: Elastic Net (Zou & Hastie, 05), Group Lasso (Yuan & Lin, 06), Blockwise Sparse Regression (Kim, Kim & Kim, 2006).

68 Norm $L_\gamma$, bridge parameter $\gamma \geq 1$. [Figure: unit balls of $L_\gamma$ norms for various $\gamma$]

69 Composite Absolute Penalties (CAP). The CAP parameter estimate is given by penalized LS with penalty $T(\beta)$:
- $G_k$'s, $k = 1, \ldots, K$: indices of the k-th pre-defined group; $\beta_{G_k}$ the corresponding vector of coefficients.
- $\gamma_k$: group $L_{\gamma_k}$ norm $N_k = \|\beta_{G_k}\|_{\gamma_k}$.
- $\gamma_0$: overall norm $T(\beta) = \|(N_1, \ldots, N_K)\|_{\gamma_0}^{\gamma_0}$.
- Groups may overlap (hierarchical selection).

70 CAP: group selection. Tailoring $T(\beta)$ for group selection:
- Define non-overlapping groups.
- Set $\gamma_k > 1$: the group norm $\gamma_k$ tunes similarity within its group; $\gamma_k > 1$ encourages all variables in group k to be included or excluded together.
- Set $\gamma_0 = 1$: this yields grouped sparsity.
$\gamma_k = 2$ has been studied by Yuan and Lin (Group Lasso, 2006). A small numeric sketch follows.
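A small numpy sketch of the CAP penalty for non-overlapping groups, coded directly from the definition on the previous slide; setting gamma_k = 2 and gamma_0 = 1 recovers the group-Lasso penalty. The code is my own illustration, not from the CAP paper.

    import numpy as np

    def cap_penalty(beta, groups, gamma_k=2.0, gamma_0=1.0):
        # T(beta) = sum_k ||beta_{G_k}||_{gamma_k}^{gamma_0}
        # for non-overlapping groups (a list of index arrays).
        norms = np.array([np.linalg.norm(beta[g], ord=gamma_k) for g in groups])
        return np.sum(norms ** gamma_0)

    # Example: gamma_k = 2, gamma_0 = 1 gives the group-Lasso penalty.
    beta = np.array([1.0, -2.0, 0.0, 3.0])
    groups = [np.array([0, 1]), np.array([2, 3])]
    print(cap_penalty(beta, groups))   # ||(1,-2)||_2 + ||(0,3)||_2 = sqrt(5) + 3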

71 CAP: hierarchical structures. Tailoring $T(\beta)$ for a hierarchical structure:
- Set $\gamma_0 = 1$ and $\gamma_k > 1$ for all k.
- Let the groups overlap: if $\beta_2$ appears in all groups where $\beta_1$ is included, then $X_2$ is encouraged to enter the model after $X_1$.
As an example: [figure]

72 CAP: a Bayesian interpretation. For non-overlapping groups: a prior on the group norms, and a prior on the individual coefficients.
