
School of Computer Science
Probabilistic Graphical Models

Structured Sparse Additive Models
Junming Yin and Eric Xing
Lecture 27, April 24, 2013
Reading: See class website

Outline
Nonparametric regression and kernel smoothing
Additive models
Sparse additive models (SpAM)
Structured sparse additive models (GroupSpAM)

Nonparametric Regression and Kernel Smoothing

Non-linear functions: (examples shown as plots)

LR with non-linear basis functions
LR does not mean we can only deal with linear relationships. We are free to design (non-linear) features under LR:
$y = \theta_0 + \sum_{j=1}^{m} \theta_j \phi_j(x) = \theta^T \phi(x)$
where the $\phi_j(x)$ are fixed basis functions (and we define $\phi_0(x) = 1$).
Example: polynomial regression: $\phi(x) := [1, x, x^2, x^3]$.
We will be concerned with estimating (distributions over) the weights $\theta$ and choosing the model order $M$.

Basis functions
There are many basis functions, e.g.:
Polynomial: $\phi_j(x) = x^{j-1}$
Radial basis functions: $\phi_j(x) = \exp\left(-\frac{(x - \mu_j)^2}{2 s^2}\right)$
Sigmoidal: $\phi_j(x) = \sigma\left(\frac{x - \mu_j}{s}\right)$
Splines, Fourier, Wavelets, etc.
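
To make the basis-function view concrete, here is a minimal numpy sketch (not from the slides) of least-squares regression with polynomial and Gaussian RBF features; the degree, the RBF centers, the width s, and the toy data are illustrative assumptions.

```python
# Sketch: linear regression with non-linear basis functions (polynomial and
# Gaussian RBF), fit by ordinary least squares.
import numpy as np

def poly_features(x, degree=3):
    # phi(x) = [1, x, x^2, ..., x^degree]
    return np.vstack([x**j for j in range(degree + 1)]).T

def rbf_features(x, centers, s=1.0):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)), plus a constant phi_0(x) = 1
    phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))
    return np.hstack([np.ones((len(x), 1)), phi])

def fit_ls(Phi, y):
    # theta = argmin_theta ||Phi theta - y||^2, via the pseudo-inverse
    return np.linalg.pinv(Phi) @ y

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(-3, 3, size=50))
y = np.sin(x) + 0.1 * rng.standard_normal(50)

theta_poly = fit_ls(poly_features(x), y)
theta_rbf = fit_ls(rbf_features(x, centers=np.linspace(-3, 3, 10)), y)
print(theta_poly.shape, theta_rbf.shape)   # (4,) (11,)
```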

1D and 2D RBFs: a 1D RBF before and after the fit (plots).

Good and Bad RBFs: a good 2D RBF and two bad 2D RBFs (plots).

Overfitting and underfitting
$y = \theta_0 + \theta_1 x$
$y = \theta_0 + \theta_1 x + \theta_2 x^2$
$y = \sum_{j=0}^{5} \theta_j x^j$
(fits of increasing order to the same data, shown as plots)

Bias and variance
We define the bias of a model to be the expected generalization error even if we were to fit it to a very (say, infinitely) large training set.
By fitting "spurious" patterns in the training set, we might instead obtain a model with large generalization error; in this case, we say the model has large variance.

Locally weighted linear regression
The algorithm: instead of minimizing
$J(\theta) = \sum_{i=1}^{n} (x_i^T \theta - y_i)^2$,
we now fit $\theta$ to minimize
$J(\theta) = \sum_{i=1}^{n} w_i (x_i^T \theta - y_i)^2$.
Where do the $w_i$'s come from?
$w_i = \exp\left(-\frac{(x_i - x)^2}{2\tau^2}\right)$,
where $x$ is the query point for which we'd like to know its corresponding $y$.
Essentially, we put higher weights on (errors on) training examples that are close to the query point than on those that are further away from the query.

Parametric vs. non-parametric
Locally weighted linear regression is another example we are running into of a non-parametric algorithm (what are the others?).
The (unweighted) linear regression algorithm that we saw earlier is known as a parametric learning algorithm, because it has a fixed, finite number of parameters (the $\theta$) which are fit to the data. Once we've fit the $\theta$ and stored them away, we no longer need to keep the training data around to make future predictions.
In contrast, to make predictions using locally weighted linear regression, we need to keep the entire training set around.
The term "non-parametric" (roughly) refers to the fact that the amount of stuff we need to keep in order to represent the hypothesis grows linearly with the size of the training set.
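
A minimal sketch of locally weighted linear regression at a single query point, solving the weighted least-squares problem above; the bandwidth tau, the toy data, and the helper name lwr_predict are illustrative assumptions.

```python
# Sketch of locally weighted linear regression: weight each training example
# by its distance to the query point, then solve weighted least squares.
import numpy as np

def lwr_predict(X, y, x0, tau=0.5):
    # X: (n, d) design matrix with an intercept column; x0: (d,) query point
    # (its first entry is the intercept, so it does not affect the distance).
    w = np.exp(-np.sum((X - x0)**2, axis=1) / (2 * tau**2))   # w_i
    W = np.diag(w)
    # Solve the weighted normal equations (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x0 @ theta

rng = np.random.default_rng(1)
x = np.linspace(0, 4, 100)
y = np.sin(2 * x) + 0.1 * rng.standard_normal(100)
X = np.column_stack([np.ones_like(x), x])          # add intercept column
print(lwr_predict(X, y, x0=np.array([1.0, 2.0])))  # prediction near x = 2
```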

Parametric vs. non-parametric
Parametric model: assumes all data can be represented using a fixed, finite number of parameters. Example: polynomial regression.
Nonparametric model: the number of parameters can grow with the sample size. Example: nonparametric regression.

Robust regression: probabilistic interpretation
What regular regression does: assume $y_k$ was originally generated using the following recipe:
$y_k = \theta^T x_k + \mathcal{N}(0, \sigma^2)$
The computational task is to find the maximum likelihood estimate of $\theta$.

Nonparametric Regression: Formal Definition
Nonparametric regression is concerned with estimating the regression function $m(x) = \mathbb{E}[Y \mid X = x]$ from a training set $\{(X_i, Y_i)\}_{i=1}^{n}$.
The "parameter" to be estimated is the whole function $m(x)$.
No parametric assumption such as linearity is made about the regression function $m(x)$.
More flexible than a parametric model; however, it usually requires keeping the entire training set (memory-based).

Kernel Smoother
The simplest nonparametric regression estimator: a locally weighted (smooth) average of the $Y_i$'s, where the weight depends on the distance of $X_i$ to the query point $x$.
Nadaraya-Watson kernel estimator:
$\hat{m}(x) = \frac{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i=1}^{n} K\left(\frac{x - X_i}{h}\right)}$
$K$ is the smoothing kernel function with $K(x) \ge 0$, and $h$ is the bandwidth.
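
A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel; the bandwidth h and the toy data are assumptions.

```python
# Nadaraya-Watson kernel smoother: a locally weighted average of the Y_i's.
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)

def nadaraya_watson(x_query, x_train, y_train, h=0.1):
    # weights K((x - X_i) / h), one row per query point
    K = gaussian_kernel((x_query[:, None] - x_train[None, :]) / h)
    return (K @ y_train) / K.sum(axis=1)

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(4 * np.pi * x) + 0.2 * rng.standard_normal(200)
grid = np.linspace(0, 1, 5)
print(nadaraya_watson(grid, x, y))   # smoothed estimates on the grid
```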

Kernel Function
It satisfies $K(x) \ge 0$, $\int K(x)\,dx = 1$, and symmetry about zero, $\int x\, K(x)\,dx = 0$.
Different types: e.g. boxcar, Gaussian, Epanechnikov, tricube.

Bandwidth
The choice of bandwidth $h$ is much more important than the type of kernel $K$.
Small $h$: rough estimates. Large $h$: smoother estimates.
In practice: cross-validation or plug-in methods.
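
As a sketch of bandwidth selection by cross-validation, the following leave-one-out procedure scores a grid of candidate bandwidths for the Nadaraya-Watson smoother; the candidate grid and the toy data are assumptions.

```python
# Leave-one-out cross-validation for the bandwidth of a Gaussian-kernel smoother.
import numpy as np

def loocv_score(h, x, y):
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h)**2)
    np.fill_diagonal(K, 0.0)                 # leave each point out of its own fit
    y_hat = (K @ y) / K.sum(axis=1)
    return np.mean((y - y_hat)**2)

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(0, 1, 200))
y = np.sin(4 * np.pi * x) + 0.2 * rng.standard_normal(200)
bandwidths = np.logspace(-1.5, 0, 20)        # candidate grid (an assumption)
scores = [loocv_score(h, x, y) for h in bandwidths]
print("best h:", bandwidths[int(np.argmin(scores))])
```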

Linear Smoothers
Kernel smoothers are examples of linear smoothers: for each $x$, the estimator is a linear combination of the $Y_i$'s, $\hat{m}(x) = \sum_{i=1}^{n} \ell_i(x)\, Y_i$.
Other examples: smoothing splines, locally weighted polynomials, etc.

Linear Smoothers (cont'd)
Define $\hat{Y} = (\hat{m}(X_1), \ldots, \hat{m}(X_n))^T$ to be the fitted values of the training examples; then $\hat{Y} = S Y$.
The $n \times n$ matrix $S$ is called the smoother matrix, with $S_{ij} = \ell_j(X_i)$. The fitted values are the smoothed version of the original values.
Recall that the regression function can be viewed as $m = P Y$, where $P$ is the conditional expectation operator that projects a random variable (here $Y$) onto the linear space of $X$.
It plays the role of the smoother in the population setting.
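
A small sketch showing the kernel smoother written as a linear smoother: each row of S holds the normalized kernel weights, rows sum to one, and the fitted values are S @ y. The bandwidth and toy data are assumptions.

```python
# The kernel smoother as a linear smoother: fitted values are S @ y.
import numpy as np

def smoother_matrix(x, h=0.2):
    K = np.exp(-0.5 * ((x[:, None] - x[None, :]) / h)**2)
    return K / K.sum(axis=1, keepdims=True)   # S_ij = l_j(X_i), rows sum to 1

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 1, 100))
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(100)
S = smoother_matrix(x)
y_hat = S @ y                                 # the smoothed version of y
print(S.shape, np.allclose(S.sum(axis=1), 1.0))   # (100, 100) True
```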

Additive Models

Additive Models
Due to the curse of dimensionality, smoothers break down in the high-dimensional setting.
Hastie & Tibshirani (1990) proposed the additive model
$Y = \alpha + \sum_{j=1}^{p} f_j(X_j) + \epsilon$,
where each $f_j$ is a smooth one-dimensional component function.
However, the model is not identifiable: we can add a constant to one component function and subtract the same constant from another component.
This can be easily fixed by assuming $\mathbb{E}[f_j(X_j)] = 0$ for each $j$.

Backfitting
The optimization problem in the population setting is
$\min_{f_1, \ldots, f_p} \ \mathbb{E}\Big[\Big(Y - \sum_{j=1}^{p} f_j(X_j)\Big)^2\Big]$.
It can be shown that the optimum is achieved at $f_j = P_j(R_j)$, where $P_j = \mathbb{E}[\,\cdot \mid X_j]$ is the conditional expectation operator onto the $j$th input space and $R_j = Y - \sum_{k \ne j} f_k(X_k)$ is the partial residual.

Backfitting (cont'd)
Replacing the conditional expectation operator $P_j$ by the smoother matrix $S_j$ results in the backfitting algorithm:
Initialize: $\hat{f}_j = 0$ for all $j$.
Cycle: for $j = 1, \ldots, p$: $\hat{f}_j \leftarrow S_j \Big(Y - \sum_{k \ne j} \hat{f}_k\Big)$,
where $\hat{f}_j$ denotes the current vector of fitted values of the $j$th component on the $n$ training examples.
It is a coordinate descent algorithm.
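
A minimal sketch of backfitting with Nadaraya-Watson smoother matrices; the bandwidth, the toy data, and running a fixed number of sweeps instead of a convergence test are illustrative assumptions. The third covariate is irrelevant, so its fitted component should stay small.

```python
# Backfitting for an additive model: cycle over coordinates, smoothing the
# partial residual with a per-covariate smoother matrix S_j.
import numpy as np

def smoother_matrix(xj, h=0.3):
    K = np.exp(-0.5 * ((xj[:, None] - xj[None, :]) / h)**2)
    return K / K.sum(axis=1, keepdims=True)

def backfit(X, y, n_sweeps=20, h=0.3):
    n, p = X.shape
    S = [smoother_matrix(X[:, j], h) for j in range(p)]
    F = np.zeros((n, p))                      # column j = fitted values of f_j
    for _ in range(n_sweeps):                 # coordinate-descent sweeps
        for j in range(p):
            R_j = y - y.mean() - F.sum(axis=1) + F[:, j]   # partial residual
            F[:, j] = S[j] @ R_j
            F[:, j] -= F[:, j].mean()         # center for identifiability
    return F

rng = np.random.default_rng(5)
X = rng.uniform(-2, 2, size=(200, 3))
y = np.sin(X[:, 0]) + X[:, 1]**2 + 0.1 * rng.standard_normal(200)
F = backfit(X, y)
print(F.shape)                                # (200, 3)
```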

Example
48 rock samples from a petroleum reservoir.
The response: permeability.
The covariates: the area of the pores, the perimeter in pixels, and shape (perimeter/sqrt(area)).

Sparse Additive Models (SpAM)

SpAM
A sparse version of additive models (Ravikumar et al., 2009).
Can perform component/variable selection for additive models even when n << p.
The optimization problem in the population setting is
$\min_{f_1, \ldots, f_p} \ \mathbb{E}\Big[\Big(Y - \sum_{j=1}^{p} f_j(X_j)\Big)^2\Big] + \lambda \sum_{j=1}^{p} \sqrt{\mathbb{E}\big[f_j^2(X_j)\big]}$.
The penalty behaves like an $\ell_1$ ball across different components to encourage functional sparsity.
If each component function is constrained to have the linear form $f_j(X_j) = \beta_j X_j$, the formulation reduces to the standard lasso (Tibshirani, 1996).

SpAM Backfitting
The optimum is achieved by a soft-thresholding step:
$f_j = \Big[1 - \frac{\lambda}{\sqrt{\mathbb{E}\big[(P_j R_j)^2\big]}}\Big]_{+} P_j R_j$,
where $R_j = Y - \sum_{k \ne j} f_k(X_k)$ is the partial residual and $[\cdot]_{+}$ is the positive part (thresholding condition).
As in standard additive models, replace $P_j$ by the smoother matrix $S_j$, where $\hat{s}_j = \sqrt{\tfrac{1}{n}\sum_{i=1}^{n} (S_j R_j)_i^2}$ is the empirical estimate of $\sqrt{\mathbb{E}[(P_j R_j)^2]}$.
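
A minimal sketch of the SpAM backfitting update, adding the functional soft-thresholding step to the backfitting loop above; lambda, the bandwidth, and the toy data are assumptions. Components of irrelevant covariates should be shrunk to zero.

```python
# SpAM backfitting: smooth the partial residual, then soft-threshold the
# whole component function toward zero.
import numpy as np

def smoother_matrix(xj, h=0.3):
    K = np.exp(-0.5 * ((xj[:, None] - xj[None, :]) / h)**2)
    return K / K.sum(axis=1, keepdims=True)

def spam_backfit(X, y, lam=0.1, n_sweeps=20, h=0.3):
    n, p = X.shape
    S = [smoother_matrix(X[:, j], h) for j in range(p)]
    F = np.zeros((n, p))
    for _ in range(n_sweeps):
        for j in range(p):
            R_j = y - y.mean() - F.sum(axis=1) + F[:, j]   # partial residual
            P_j = S[j] @ R_j                               # smoothed residual
            s_j = np.sqrt(np.mean(P_j**2))                 # empirical norm estimate
            shrink = max(0.0, 1.0 - lam / s_j) if s_j > 0 else 0.0
            F[:, j] = shrink * P_j                         # functional soft-thresholding
            F[:, j] -= F[:, j].mean()
    return F

rng = np.random.default_rng(6)
X = rng.uniform(-2, 2, size=(150, 20))
y = np.sin(X[:, 0]) + X[:, 1]**2 + 0.1 * rng.standard_normal(150)
F = spam_backfit(X, y)
print(np.round(np.sqrt(np.mean(F**2, axis=0)), 3))   # component norms; most should be zero
```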

Example
n = 150, p = 200 (only 4 component functions are non-zero).

Structured Sparse Additive Models (GroupSpAM)

GroupSpAM
Exploits structured sparsity in the nonparametric setting.
The simplest structure is non-overlapping groups (a partition of the original p variables).
The optimization problem in the population setting applies a group penalty to the additive-model objective:
$\min_{f_1, \ldots, f_p} \ \mathbb{E}\Big[\Big(Y - \sum_{j=1}^{p} f_j(X_j)\Big)^2\Big] + \lambda \sum_{g \in \mathcal{G}} \sqrt{\sum_{j \in g} \mathbb{E}\big[f_j^2(X_j)\big]}$
(possibly with group-size weights on each penalty term).
Challenges:
New difficulty in characterizing the thresholding condition at the group level.
No closed-form solution to the stationary condition in the form of a soft-thresholding step.

Thresholding Conditions
Theorem: the whole group $g$ of functions is zero, i.e. $f_j \equiv 0$ for all $j \in g$, if and only if the combined norm of the smoothed group partial residuals, $\sqrt{\sum_{j \in g} \mathbb{E}\big[(P_j R_g)^2\big]}$, falls below the regularization threshold, where $R_g = Y - \sum_{k \notin g} f_k(X_k)$ is the partial residual after removing all functions from group $g$.
Necessity: straightforward to prove.
Sufficiency: more involved (see Yin et al., 2012).
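
A small sketch of the group-level quantity in the thresholding condition, estimated empirically with smoother matrices: the whole group is set to zero when this norm falls below the regularization threshold. The grouping, bandwidth, and toy data are assumptions, and the update for an active group (which has no closed form) is not shown.

```python
# Empirical group-level thresholding quantity: smooth the group partial
# residual with each smoother S_j in the group and combine the norms.
import numpy as np

def smoother_matrix(xj, h=0.3):
    K = np.exp(-0.5 * ((xj[:, None] - xj[None, :]) / h)**2)
    return K / K.sum(axis=1, keepdims=True)

def group_residual_norm(X, F, y, group, h=0.3):
    # partial residual after removing all component functions in the group
    others = [j for j in range(X.shape[1]) if j not in group]
    R_g = y - y.mean() - F[:, others].sum(axis=1)
    # combined empirical norm of the smoothed partial residuals over the group
    return np.sqrt(sum(np.mean((smoother_matrix(X[:, j], h) @ R_g)**2) for j in group))

rng = np.random.default_rng(7)
X = rng.uniform(-2, 2, size=(150, 6))
y = np.sin(X[:, 0]) + X[:, 1]**2 + 0.1 * rng.standard_normal(150)
F = np.zeros((150, 6))                               # current fitted components (all zero here)
print(group_residual_norm(X, F, y, group=[0, 1]))    # group of relevant covariates: larger norm
print(group_residual_norm(X, F, y, group=[4, 5]))    # group of irrelevant covariates: smaller norm
```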

GroupSpAM Backfitting
(algorithm; see Yin et al., 2012)

Experiments
Sample size n = 150 and dimension p = 200, 1000.

Experiments (p = 200)
Performance based on 100 independent simulations (t = 0).
Performance based on 100 independent simulations (t = 2).

Experiments (p = 1000)
Performance based on 100 independent simulations (t = 0).
Performance based on 100 independent simulations (t = 2).

Estimated Component Functions (plots)

GroupSpAM with Overlap
Allows overlap between the different groups (Jacob et al., 2009).
Idea: decompose each original component function into a sum of latent functions, and then apply the functional group penalty to the decomposed latent functions.
The resulting support is a union of pre-defined groups.
Can be reduced to GroupSpAM with disjoint groups and solved by the same backfitting algorithm.
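
One common way to realize this reduction is to duplicate each covariate once per group that contains it, so each latent function belongs to exactly one disjoint group; the sketch below only builds the expanded design and group structure, with an illustrative toy grouping. The fitted latent components of duplicated columns would then be summed back per original variable.

```python
# Reduction from overlapping to disjoint groups by duplicating shared covariates.
import numpy as np

def expand_overlapping_groups(X, groups):
    cols, new_groups = [], []
    for g in groups:
        new_groups.append(list(range(len(cols), len(cols) + len(g))))
        cols.extend(g)                      # one latent copy of X_j per group containing j
    return X[:, cols], new_groups, cols

X = np.arange(12.0).reshape(4, 3)           # toy data, p = 3
groups = [[0, 1], [1, 2]]                   # variable 1 belongs to both groups
X_exp, disjoint_groups, col_map = expand_overlapping_groups(X, groups)
print(X_exp.shape, disjoint_groups, col_map)   # (4, 4) [[0, 1], [2, 3]] [0, 1, 1, 2]
```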

Breast Cancer Data
Sample size n = 295 tumors (metastatic vs. non-metastatic) and dimension p = 3,510 genes.
Goal: identify a few genes that can predict the type of tumor.
Group structure: each group consists of the set of genes in a pathway, and groups are overlapping.

Summary
A novel statistical method for structured functional sparsity in nonparametric additive models:
Functional sparsity at the group level in additive models.
Can easily incorporate prior knowledge of the structure among the covariates.
Highly flexible: no assumptions are made on the design matrices or on the correlation of component functions within each group.
Benefit of group sparsity: better performance in terms of support recovery and prediction accuracy in additive models.

References
Hastie, T. and Tibshirani, R. Generalized Additive Models. Chapman & Hall/CRC, 1990.
Buja, A., Hastie, T., and Tibshirani, R. Linear Smoothers and Additive Models. Annals of Statistics, 17(2):453-510, 1989.
Ravikumar, P., Lafferty, J., Liu, H., and Wasserman, L. Sparse Additive Models. JRSSB, 71(5):1009-1030, 2009.
Yin, J., Chen, X., and Xing, E. Group Sparse Additive Models. ICML, 2012.