Applied Statistics and Machine Learning


1 Applied Statistics and Machine Learning: Logistic Regression and Generalized Linear Models; Model Selection, Lasso, and Structured Sparsity. Bin Yu, IMA, June 19, 2013

2 Classification is supervised learning. The Y's are 0's and 1's: Challenger data, MISR data. Task: relate predictors x's with Y for (1) prediction (IT sector, banking, etc.) and (2) interpretation: what are the important predictors? They are suggestive of interventions, for causal inference later.

3 Challenger data. These data are from Table 1 of the article "Risk Analysis of the Space Shuttle: Pre-Challenger Prediction of Failure" by Dalal, Fowlkes and Hoadley, Journal of the American Statistical Association, Vol. 84, No. 408 (Dec. 1989). I got them from Professor Stacey Hancock's website at Clark University. She has this Tukey quote at her site: "Far better an approximate answer to the right question, than the exact answer to the wrong question, which can always be made precise." - John Tukey

4 Challenger data. Variables:
Temp - temperature at launch
Failure - number of O-rings that failed
Failure1 - indicator of O-ring failure or not

5 Challenger data (jittered) [figure]

6 Suppose you are the first person who ever thought about the classification problem. What would you do? A method to fit the data, to come up with a prediction rule for the next launch at temp*. Postulate a statistical model (e.g., the normal regression model) for uncertainty statements.

7 Suppose you are the first person who ever thought about the classification problem. Fitting methods: nearest neighbor (NN), LS, logistic regression. How to fit? What is the criterion to fit? MAXIMUM LIKELIHOOD. In the derivations to come, we use notation and some materials from Dobson (2001).

8 Logistic regression model. What does this model mean for the Challenger data? What assumptions might be violated? Which are reasonable?
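The model display did not survive transcription; the standard binary logistic regression model, presumably what was shown, is

$$Y_i \sim \text{Bernoulli}(p_i) \text{ independently}, \qquad \log\frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_i,$$

equivalently $p_i = e^{\beta_0 + \beta_1 x_i}/(1 + e^{\beta_0 + \beta_1 x_i})$, where for the Challenger data $x_i$ is the launch temperature and $Y_i$ the O-ring failure indicator.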

9 Logistic regression model

10 Logistic regression model

11 Logistic regression model. Note that a linear approximation to the score function $U(\beta)$ is equivalent to a quadratic approximation to the log-likelihood $\ell(\beta)$.

12 Logistic regression model

13 Logistic regression model

14 Logistic regression model vs LS for Challenger data [figure]

15 Logistic regression vs LS for Challenger data [figure]

16 Generalized Linear Models (GLMs). GLMs form a statistical framework that unifies normal regression models (for continuous data), logistic (probit) regression models (for binary data), and log-linear models (for count data). Logistic (probit) regression models can also be used for multi-class classification. Original paper on GLMs: Nelder & Wedderburn (1972), GLMs, Journal of the Royal Statistical Society, Series A (JRSS-A). Books: A. J. Dobson (2001), An Introduction to GLMs (2nd ed.); P. McCullagh & J. A. Nelder (1989), Generalized Linear Models (2nd ed.).

17 Exponential families:
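The defining formula did not survive transcription; in Dobson's (2001) notation, which the lecture follows, a one-parameter exponential family has densities of the form

$$f(y;\theta) = \exp\{a(y)\,b(\theta) + c(\theta) + d(y)\},$$

and differentiating the identity $\int f(y;\theta)\,dy = 1$ gives

$$E[a(Y)] = -\frac{c'(\theta)}{b'(\theta)}, \qquad V[a(Y)] = \frac{b''(\theta)\,c'(\theta) - c''(\theta)\,b'(\theta)}{b'(\theta)^3}.$$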

18 A sketch of proof. HW: derive the formula for V(a(Y)).

19 GLMs: the link function is the key; it relates the response variable to $\beta$ in a linear way.
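As a hedged reconstruction of the missing display: a GLM assumes independent responses $Y_i$ from an exponential family with means $\mu_i = E(Y_i)$, together with a monotone, differentiable link function $g$ such that

$$g(\mu_i) = x_i^T \beta,$$

so the mean is related to the predictors linearly on the $g$-scale; the logit link $g(\mu) = \log\{\mu/(1-\mu)\}$ recovers logistic regression, and $g(\mu) = \log\mu$ gives the log-linear model for counts.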

20 Likelihood function of a GLM

21 Maximum Likelihood Estimation (MLE)

22 Parametric bootstrap for GLMs. Fit a GLM by MLE and then draw n samples from the fitted distribution. Note that this is not the same as the bootstrap described for the regression model, where we sample from the residuals rather than from a normal distribution fitted to the residuals. The parametric bootstrap works for nice parametric families, typically when asymptotic normality holds; for example, the bootstrap does not work for Unif(0, a) with a as the parameter. The non-parametric bootstrap samples directly from the observed data.
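A minimal sketch of this procedure for a logistic GLM, using the standard statsmodels GLM API; the function and variable names and B = 200 replicates are illustrative, and X is assumed to already contain an intercept column.

    import numpy as np
    import statsmodels.api as sm

    def parametric_bootstrap_logistic(X, y, B=200, seed=0):
        # Fit the GLM by MLE, then resample responses from the *fitted*
        # Bernoulli distribution (not from residuals) and refit B times.
        rng = np.random.default_rng(seed)
        fit = sm.GLM(y, X, family=sm.families.Binomial()).fit()
        p_hat = fit.predict(X)                 # fitted success probabilities
        boot = np.empty((B, X.shape[1]))
        for b in range(B):
            y_star = rng.binomial(1, p_hat)    # sample from the fitted model
            boot[b] = sm.GLM(y_star, X, family=sm.families.Binomial()).fit().params
        return fit.params, boot                # boot.std(axis=0) gives bootstrap SEs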

23 How to compute the MLE in GLMs: IRWLS

24 IRWLS algorithm for MLE in GLMs

25 Statistical interpretation of the IRWLS algorithm. Each update is the solution to a weighted LS problem, with a weight vector computed from the previous iterate.

26 Statistical interpretation of IRWLS

27 IRWLS in words. IRWLS is an iterative algorithm. At each iteration, IRWLS solves a WLS problem that maximizes the log-likelihood of a heteroscedastic linear regression model in the g-domain (where g(mu_i) is approximately linear in $\beta$), with the variances of the g(y_i) treated as known from the previous iteration.
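A minimal numpy sketch of IRWLS for the logistic GLM with canonical (logit) link, written from the description above rather than from the lecture's own materials; the clipping of the weights is a numerical safeguard I added.

    import numpy as np

    def irwls_logistic(X, y, n_iter=25, tol=1e-10):
        # IRWLS (Fisher scoring) for logistic regression, canonical link.
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            eta = X @ beta                             # linear predictor (g-domain)
            mu = 1.0 / (1.0 + np.exp(-eta))            # inverse logit
            w = np.clip(mu * (1.0 - mu), 1e-10, None)  # iterative weights Var(Y_i)
            z = eta + (y - mu) / w                     # working response in g-domain
            # WLS step: solve (X' W X) beta_new = X' W z
            XtW = X.T * w
            beta_new = np.linalg.solve(XtW @ X, XtW @ z)
            if np.max(np.abs(beta_new - beta)) < tol:
                return beta_new
            beta = beta_new
        return beta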

28 Back to the Movie-fMRI Data (Nishimoto et al., 2011). 7200s training (1 replicate) and 5400s test (10 replicates).

29 How do stimuli evoke brain signals? Quantitative models: both stimulus and response are high-dimensional. [Diagram: an encoding model maps natural input (image or video/movie) to the fMRI of the brain; a decoding model maps back. Dayan and Abbott (2005)]

30 Movie reconstruction results for 3 subjects [figure]

31 Mind-Reading Computers in the Media: one of the 50 best inventions of 2011 by Time Magazine. Others: Economist, NPR, ...

32 What model is behind the movie reconstruction algorithm? Is the model interpretable and reliable?

33 Domain knowledge: key for big-data discovery. Hubel and Wiesel (1959) discovered, in neurons of the primary visual cortex (area V1), orientation and location selectivity, and excitatory and inhibitory regions.

34 Modern description of the Hubel-Wiesel work: early visual area V1. Preprocessing an image: Gabor filters corresponding to particular spatial frequencies, locations, and orientations (Hubel and Wiesel, 1959, ...). Sparse representation after Gabor filters, static or dynamic.

35 2D Gabor Features [figure]

36 3D Gabor Features. Data split into 1-second movie clips; 3D Gabor filters applied to get features of a movie clip in a 26K-dimensional space.

37 Regularization: key for big-data discovery. Two regularization methods are behind the movie reconstruction algorithm, after tons of work on feature construction based on domain knowledge by humans:
- Encoding through L1-penalized least squares (LS), or Lasso: a separate sparse linear model is fitted to the features for each voxel via Lasso (Chen and Donoho, 94; Tibshirani, 96) (cf. e-L2Boosting).
- Decoding through L2 or Tikhonov regularization of the sample covariance matrix of residuals across voxels (cf. kernel machines).

38 Given a voxel: n = 7K, p = 26K 3D image wavelet features (each of which has a location). For each frame, the response is the fMRI signal at the voxel. An underdetermined problem, since p is much larger than n.

39 Movie-fMRI: linear encoding model for a voxel. For each voxel and the i-th movie clip, we postulate a linear encoding (regression) model:

$$Y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i = X_i^T \beta + \epsilon_i,$$

where $X_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ is the feature vector of the movie clip, $\beta = (\beta_1, \beta_2, \ldots, \beta_p)^T$ is the weight vector that combines feature strengths into the mean fMRI response, $\epsilon_i$ is the disturbance or noise term, and $Y_i$ is the fMRI response.

40 Movie-fMRI: finding the weight vector. Least squares: find $\beta_1, \beta_2, \ldots, \beta_p$ to minimize

$$\sum_{i=1}^n (Y_i - \beta_1 x_{i1} - \beta_2 x_{i2} - \cdots - \beta_p x_{ip})^2 = \|Y - X\beta\|^2.$$

Since p = 26,000 >> n = 7,200, this LS problem has many solutions.

41 Why doesn't LS work when p >> n? Reason: collinearity of the columns of X; it also happens in low dimensions. [Figure: least squares function surfaces as a function of $(\beta_1, \beta_2)$]

42 How to fix this problem? In general, impossible. However, in our case, for each voxel, Hubel and Wiesel's work suggests that only a small number of the 26,000 predictors are active: sparsity! This prior information motivates a sparsity-enforcing revision of LS: Lasso = Least Absolute Shrinkage and Selection Operator.

43 Modeling history at the Gallant Lab. Prediction on a validation set is the benchmark. Methods tried: neural nets, SVMs, Lasso, ... Among models with similar predictions, the simpler (sparser) models from Lasso are preferred for interpretation. This practice reflects a general trend in statistical machine learning: moving from prediction alone to simpler/sparser models for interpretation, faster computation, or data transmission.

44 Occam's Razor. 14th-century English logician and Franciscan friar, William of Ockham. Principle of Parsimony: "Entities must not be multiplied beyond necessity." (Wikipedia)

45 Occam's Razor via model selection in linear regression. Maximum likelihood (ML) is LS under the Gaussian assumption. With p predictors there are 2^p submodels. ML goes for the largest submodel, with all predictors. The largest model often gives bad predictions when p is large.

46 Model Selection Criteria. Akaike (73, 74) and Mallows' (1973) Cp used estimated prediction errors to choose a model (assuming $\sigma^2$ is known). Schwarz (1978) used asymptotic approximations to negative log posterior probabilities to choose a model (assuming $\sigma^2$ is known). Both are penalized LS, with a penalty proportional to the number of parameters. Rissanen's Minimum Description Length (MDL) principle gives rise to many different criteria; the two-part code leads to BIC (see, e.g., Rissanen (1978) and the review article by Hansen and Yu (2000)).
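The penalty formulas on this slide were lost in transcription; for Gaussian linear regression with known $\sigma^2$, the standard forms (supplied here, not quoted from the slides) are, for a submodel $S$ with $|S|$ predictors,

$$C_p/\text{AIC}:\ \text{RSS}(S) + 2\sigma^2 |S|, \qquad \text{BIC}:\ \text{RSS}(S) + (\log n)\,\sigma^2 |S|.$$

Both trade goodness of fit against model size; BIC's heavier $\log n$ penalty selects sparser models.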

47 More details on AIC

48 More details on AIC

49 More details on AIC. PE = expected prediction error.

50 More details on AIC. Assume the linear model is correct, with known noise variance $\sigma^2$ and p fitted parameters. Then $\text{PE} = E\|Y^{\text{new}} - X\hat\beta\|^2 = \sigma^2(n + p)$. Hence when p increases, the prediction error increases, because a more complex model is being estimated, with an associated larger variance. How can we use RSS to estimate PE?

51 More details on AIC. Since $E(\text{RSS}) = \sigma^2(n - p)$, RSS underestimates PE by $2\sigma^2 p$; correcting this bias gives the unbiased estimate $\text{RSS} + 2\sigma^2 p$, the Cp/AIC criterion.

52 More details on BIC

53 Model selection for the movie-fMRI problem. For the linear encoding model, the number of submodels is 2^26,000. Combinatorial search: too expensive, and often not necessary. A recent alternative: continuous embedding into a convex optimization problem through L1-penalized LS (Lasso).

54 Lasso: the L1 norm as a penalty on the L2 loss. The L1 penalty is defined for coefficients $\beta$ as $\|\beta\|_1 = \sum_j |\beta_j|$. Used initially with the L2 loss: signal processing, Basis Pursuit (Chen & Donoho, 1994); statistics, Non-Negative Garrote (Breiman, 1995); statistics, LASSO (Tibshirani, 1996):

$$\hat\beta(\lambda) = \arg\min_\beta \{\, \|Y - X\beta\|^2 + \lambda \|\beta\|_1 \,\}.$$

The smoothing parameter $\lambda$ is often selected by cross-validation (CV).
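A small illustration of the Lasso with CV-selected smoothing parameter, using scikit-learn's LassoCV (real API); the simulated data and dimensions are illustrative stand-ins for one voxel's encoding problem.

    import numpy as np
    from sklearn.linear_model import LassoCV

    # Illustrative stand-ins: n clips, p Gabor features, sparse truth.
    rng = np.random.default_rng(0)
    n, p = 200, 1000
    X = rng.standard_normal((n, p))
    beta_true = np.zeros(p)
    beta_true[:5] = 2.0
    y = X @ beta_true + rng.standard_normal(n)

    # Lasso path with lambda (sklearn's alpha) chosen by 5-fold CV.
    fit = LassoCV(cv=5).fit(X, y)
    print("chosen alpha:", fit.alpha_)
    print("nonzero coefficients:", np.sum(fit.coef_ != 0))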

55 Lasso eases the instability problem of LS. [Figure: Lasso function surfaces as a function of $(\beta_1, \beta_2)$]

56 Recall: why doesn't LS work when p >> n? Reason: collinearity of the columns of X; it also happens in low dimensions. [Figure: least squares function surfaces as a function of $(\beta_1, \beta_2)$]

57 Lasso: computation. Initially: a quadratic program (QP), called once for each $\lambda$ on a grid. Later: path-following algorithms, such as homotopy by Osborne et al. (2000) and LARS by Efron et al. (2004). Current: first-order or gradient-based algorithms for large p (see Mairal's lecture).
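To illustrate the first-order methods mentioned above, here is a sketch of ISTA (proximal gradient descent) for the Lasso objective $\|Y - X\beta\|^2 + \lambda\|\beta\|_1$; this is a generic textbook algorithm, not necessarily the one used in the lecture.

    import numpy as np

    def soft_threshold(v, t):
        # Soft-thresholding: the proximal operator of t * ||.||_1.
        return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

    def lasso_ista(X, y, lam, n_iter=500):
        # ISTA for min_beta ||y - X beta||^2 + lam * ||beta||_1.
        L = 2.0 * np.linalg.norm(X, 2) ** 2    # Lipschitz constant of the gradient
        beta = np.zeros(X.shape[1])
        for _ in range(n_iter):
            grad = 2.0 * X.T @ (X @ beta - y)  # gradient of the squared-error term
            beta = soft_threshold(beta - grad / L, lam / L)
        return beta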

58 Recent theory on Lasso (more in my last lecture). Under a sparse high-dimensional linear regression model and appropriate conditions: Lasso is model-selection consistent (irrepresentable condition); Lasso has the optimal L2 estimation rate (restricted eigenvalue condition). Selective references: Freund and Schapire (1996), Chen and Donoho (1994), Tibshirani (1996), Friedman (2001), Efron et al. (2004), Zhao and Yu (2006), Meinshausen and Buhlmann (2006), Wainwright (2009), Candes and Tao (2005), Meinshausen and Yu (2009), Huang and Zhang (2009), Bickel, Ritov and Tsybakov (2009), Raskutti, Wainwright and Yu (2010), Negahban et al. (2012).

59 Encoding: energy-motion model necessary [figure]

60 Encoding: sparsity necessary. Sparse regression improves prediction over OLS. [Figure: prediction accuracy on full-brain data, linear regression vs. sparse regression]

61 Knowledge discovery: interpreting encoding models. Spatial locations of the selected features are suggestive of the driving factors for brain activity at a voxel. [Figure: features selected by Lasso+CV (CV = cross-validation) for Voxels A, B, C] Prediction scores on Voxels A-C are 0.72 (CV).

62 ES-CV: Estimation Stability (ES) (Lim & Yu, 2013). Given a smoothing parameter $\lambda$, divide the data units into M blocks. Get the Lasso estimate $\hat\beta_m(\lambda)$ for the data with the m-th block deleted, and $X\hat\beta_m(\lambda)$ for the mean regression function; form the average

$$\hat\beta(\lambda) = \frac{1}{M}\sum_m \hat\beta_m(\lambda).$$

Define the estimation stability (ES) measure as

$$ES(\lambda) = \frac{\frac{1}{M}\sum_m \|X\hat\beta_m(\lambda) - X\hat\beta(\lambda)\|^2}{\|X\hat\beta(\lambda)\|^2},$$

which is the reciprocal of a test statistic for testing $H_0: X\beta = 0$.

63 ES-CV (or SSCV): Estimation Stability (ES) + CV (continued). ES-CV selection criterion for the smoothing parameter $\lambda$: choose the $\lambda$ that minimizes ES($\lambda$) and is not smaller than the CV selection. Related works: Shao (95), Breiman (96), Bach (08), Meinshausen and Buhlmann (2008), ...
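A hedged numpy/scikit-learn sketch of the ES measure under the block-deletion scheme above; the helper names are mine, not from Lim & Yu's code, and note that scikit-learn's alpha is a rescaled version of the $\lambda$ in the earlier display (it divides the squared error by 2n), so the two should not be equated numerically.

    import numpy as np
    from sklearn.linear_model import Lasso

    def es_measure(X, y, lam, M=5):
        # Estimation stability ES(lambda) from M delete-one-block Lasso fits.
        n = len(y)
        blocks = np.array_split(np.arange(n), M)
        fits = []
        for blk in blocks:
            keep = np.setdiff1d(np.arange(n), blk)   # delete the m-th block
            beta_m = Lasso(alpha=lam, fit_intercept=False).fit(X[keep], y[keep]).coef_
            fits.append(X @ beta_m)                  # estimated mean function
        fits = np.array(fits)                        # M x n matrix
        mean_fit = fits.mean(axis=0)                 # X beta_bar(lambda)
        return np.mean(np.sum((fits - mean_fit) ** 2, axis=1)) / np.sum(mean_fit ** 2)

ES-CV then scans a grid of lambdas, keeping only those at least as large as the CV choice, and picks the minimizer of es_measure.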

64 Back to the fMRI problem: spatial locations of selected features. [Figure: locations of features selected under CV and ES-CV for Voxels A, B, C] Prediction on Voxels A-C: CV 0.72, ES-CV 0.7.

65 ESCV: sparsity gain (60%) with no prediction loss (-1.3%). [Table: prediction (correlation) and model size for SSCV vs. CV, with % change; based on validation data for 2088 voxels]

66 ES-CV: desirable properties. CV (cross-validation) is widely used in practice. ES-CV is an effective improvement over CV in stability, and hence in the interpretability and reliability of results. Computational cost similar to CV; as easily parallelizable as CV; empirically sensible and nonparametric. Other forms of perturbation include: sub-sampling, bootstrap, variable permutation, ...

67 Structured sparsity: Composite Absolute Penalties (CAP) (Zhao, Rocha and Yu, 09).
Motivations: side information available on predictors and/or sparsity at the group level; extra regularization needed (p >> n) beyond what Lasso provides.
Examples of groups: genes belonging to the same pathway; categorical variables represented by dummies; noisy measurements of the same variable.
Examples of hierarchy: multi-resolution/wavelet models; interaction terms in factorial analysis (ANOVA); order selection in Markov chain models.
Existing works can be seen as special cases of CAP: Elastic Net (Zou & Hastie, 05), Group Lasso (Yuan & Lin, 06), Blockwise Sparse Regression (Kim, Kim & Kim, 2006).

68 Norm $L_\gamma$, bridge parameter $\gamma \geq 1$. [Figure: unit balls of $L_\gamma$ norms for various $\gamma$]

69 Composite Absolute Penalties (CAP). The CAP parameter estimate is given by penalized LS with penalty $T(\beta)$:
- $G_k$'s, $k = 1, \ldots, K$: indices of the k-th pre-defined group; $\beta_{G_k}$ the corresponding vector of coefficients.
- $\gamma_k$: group $L_{\gamma_k}$ norm $N_k = \|\beta_{G_k}\|_{\gamma_k}$.
- $\gamma_0$: overall norm $T(\beta) = \|(N_1, \ldots, N_K)\|_{\gamma_0}^{\gamma_0}$.
- Groups may overlap (hierarchical selection).

70 CAP: group selection. Tailoring $T(\beta)$ for group selection:
- Define non-overlapping groups.
- Set $\gamma_k > 1$: the group norm $\gamma_k$ tunes similarity within its group; $\gamma_k > 1$ encourages all variables in group k to be included or excluded together.
- Set $\gamma_0 = 1$: this yields grouped sparsity.
$\gamma_k = 2$ has been studied by Yuan and Lin (Group Lasso, 2006). A small numeric sketch follows.
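A small numpy sketch of the CAP penalty for non-overlapping groups, coded directly from the definition on the previous slide; setting gamma_k = 2 and gamma_0 = 1 recovers the group-Lasso penalty. The code is my own illustration, not from the CAP paper.

    import numpy as np

    def cap_penalty(beta, groups, gamma_k=2.0, gamma_0=1.0):
        # T(beta) = sum_k ||beta_{G_k}||_{gamma_k}^{gamma_0}
        # for non-overlapping groups (a list of index arrays).
        norms = np.array([np.linalg.norm(beta[g], ord=gamma_k) for g in groups])
        return np.sum(norms ** gamma_0)

    # Example: gamma_k = 2, gamma_0 = 1 gives the group-Lasso penalty.
    beta = np.array([1.0, -2.0, 0.0, 3.0])
    groups = [np.array([0, 1]), np.array([2, 3])]
    print(cap_penalty(beta, groups))   # ||(1,-2)||_2 + ||(0,3)||_2 = sqrt(5) + 3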

71 CAP: hierarchical structures. Tailoring $T(\beta)$ for a hierarchical structure:
- Set $\gamma_0 = 1$ and $\gamma_k > 1$ for all k.
- Let the groups overlap: if $\beta_2$ appears in all groups where $\beta_1$ is included, then $X_2$ is encouraged to enter the model after $X_1$.
As an example: [figure]

72 CAP: a Bayesian interpretation. For non-overlapping groups: a prior on the group norms, and a prior on the individual coefficients.
