Nonparametric Methods Recap

Nonparametric Methods Recap Aarti Singh Machine Learning 10-701/15-781 Oct 4, 2010

Nonparametric Methods. Kernel density estimation (and histograms) - a weighted-frequency estimate of the density. Classification - the K-NN classifier, which takes a majority vote among the k nearest neighbors. Kernel regression - a weighted average of the training responses, with weights given by the kernel.
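The estimators themselves appeared on the slide as equations that are not in the transcription; a standard reconstruction (kernel K, bandwidth h, data (x_1, y_1), ..., (x_n, y_n)) is:

% standard forms (not verbatim from the slide)
\hat{p}_n(x) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)
\qquad \text{(kernel density estimate)}

\hat{f}_n(x) = \sum_{i=1}^{n} w_i(x)\, y_i,
\qquad
w_i(x) = \frac{K\!\left(\frac{x - x_i}{h}\right)}{\sum_{j=1}^{n} K\!\left(\frac{x - x_j}{h}\right)}
\qquad \text{(kernel regression, weighted average)}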

Kernel Regression as Weighted Least Squares. Kernel regression corresponds to the locally constant estimator obtained from (locally) weighted least squares, i.e. set f(x_i) = b (a constant).

Kernel Regression as Weighted Least Squares (continued). With f(x_i) = b (a constant), notice that the minimizer of the locally weighted least squares objective is exactly the kernel-weighted average of the responses from the previous slide.
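The missing algebra is the standard one-line calculation; a sketch, using the same kernel weights as above:

\hat{b}(x)
= \arg\min_{b} \sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)(y_i - b)^2
= \frac{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right) y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right)},

i.e. locally constant weighted least squares recovers the kernel regression (Nadaraya-Watson) estimator.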

Local Linear/Polynomial Regression. Local polynomial regression corresponds to the locally polynomial estimator obtained from (locally) weighted least squares, i.e. set f(x_i) to a local polynomial of degree p around the query point X. More in 10-702 (Statistical Machine Learning).
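The slides contain no code; the following numpy sketch of locally weighted polynomial regression is illustrative (the Gaussian kernel, the bandwidth h, and the default degree p = 1 are assumptions, not choices from the lecture):

import numpy as np

def local_poly_predict(x0, X, Y, p=1, h=0.2):
    # Locally weighted polynomial regression (illustrative sketch):
    # fit a degree-p polynomial around the query point x0 by weighted
    # least squares and return its value at x0.
    # p = 0 recovers kernel (locally constant) regression; p = 1 is local linear.
    w = np.exp(-0.5 * ((X - x0) / h) ** 2)           # Gaussian kernel weights
    B = np.vander(X - x0, N=p + 1, increasing=True)  # columns 1, (x-x0), ..., (x-x0)^p
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(B * sw[:, None], Y * sw, rcond=None)
    return beta[0]                                   # the fitted polynomial at x = x0

# Example usage on synthetic noisy data
rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 100))
Y = np.sin(2 * np.pi * X) + 0.1 * rng.normal(size=100)
print(local_poly_predict(0.5, X, Y, p=1, h=0.1))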

Summary: Parametric vs. Nonparametric approaches. Nonparametric models place very mild assumptions on the data distribution and provide good models for complex data; parametric models rely on very strong (simplistic) distributional assumptions. On the other hand, nonparametric models (histograms aside) require storing and computing with the entire data set, whereas parametric models, once fitted, are much more efficient in terms of storage and computation.

Summary: Instance-based / nonparametric approaches. Four things make a memory-based learner:
1. A distance metric, dist(x, x_i) - Euclidean (and many more).
2. How many nearby neighbors (or what radius) to look at - k, or D/h.
3. A weighting function (optional) - W, based on a kernel K.
4. How to fit with the local points - average, majority vote, weighted average, polynomial fit.
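As an illustration rather than code from the lecture, here is a minimal weighted k-NN regressor with the four ingredients marked; the Euclidean metric, Gaussian weights, and weighted-average fit are one possible set of choices:

import numpy as np

def memory_based_predict(x, X, Y, k=5, h=1.0):
    # Illustrative memory-based learner (weighted k-NN regression)
    # 1. Distance metric: Euclidean distance from x to every stored example
    d = np.linalg.norm(X - x, axis=1)
    # 2. How many nearby neighbors to look at: the k closest points
    nn = np.argsort(d)[:k]
    # 3. Weighting function (optional): Gaussian kernel of the distances
    w = np.exp(-0.5 * (d[nn] / h) ** 2)
    # 4. How to fit with the local points: weighted average of their labels
    return np.sum(w * Y[nn]) / np.sum(w)

# Example usage on synthetic data (for classification, replace step 4 with a majority vote)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
Y = X[:, 0] + 0.1 * rng.normal(size=200)
print(memory_based_predict(np.array([0.5, -0.2]), X, Y, k=10, h=0.5))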

What you should know:
- Histograms and kernel density estimation; the effect of bin width / kernel bandwidth; the bias-variance tradeoff.
- The K-NN classifier and its nonlinear decision boundaries.
- Kernel (local) regression; its interpretation as weighted least squares; local constant/linear/polynomial regression.

Practical Issues in Machine Learning Overfitting and Model selection Aarti Singh Machine Learning 10-701/15-781 Oct 4, 2010

True vs. Empirical Risk. True risk: the target performance measure - for classification, the probability of misclassification; for regression, the mean squared error; in both cases, performance on a random test point (X, Y). Empirical risk: performance on the training data - for classification, the proportion of misclassified training examples; for regression, the average squared error over the training set.
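The definitions appeared on the slide as equations; in standard notation (loss ℓ, training sample (X_1, Y_1), ..., (X_n, Y_n)):

% standard definitions (not verbatim from the slide)
R(f) = \mathbb{E}_{(X,Y)}\big[\ell(f(X), Y)\big],
\qquad
\widehat{R}_n(f) = \frac{1}{n}\sum_{i=1}^{n} \ell\big(f(X_i), Y_i\big),

with \ell(f(X),Y) = \mathbf{1}\{f(X) \neq Y\} for classification (probability / proportion of misclassification) and \ell(f(X),Y) = (f(X) - Y)^2 for regression (mean / average squared error).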

Overfitting. Is the predictor shown on the slide (one that reproduces the training labels exactly) a good one? What is its empirical risk (performance on training data)? Zero! What about its true risk? Greater than zero. It will predict very poorly on a new random test point: large generalization error!
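The exact predictor on the slide is not recoverable here, but a 1-nearest-neighbor rule on noisy labels illustrates the same point; in this sketch (synthetic data, scikit-learn) the training error is typically zero while the test error is clearly larger:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Synthetic data for illustration (not from the lecture): labels depend
# weakly on x and carry label noise
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + 0.8 * rng.normal(size=400) > 0).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# A 1-NN classifier memorizes the training set
clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("empirical risk (train error):", 1 - clf.score(X_tr, y_tr))   # zero for distinct points
print("estimate of true risk (test error):", 1 - clf.score(X_te, y_te))  # noticeably > 0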

Overfitting. If we allow very complicated predictors, we could overfit the training data. Example: classification with a 0-NN classifier on the "Football player?" (Yes/No) task. [Figure: Weight vs. Height scatter plots for this example.]

Overfitting. If we allow very complicated predictors, we could overfit the training data. Example: regression with a polynomial of order k (degree up to k-1). [Figure: polynomial fits for k = 1, 2, 3, and 7 on the same training data.]

Effect of Model Complexity. If we allow very complicated predictors, we could overfit the training data. [Figure: true and empirical risk vs. model complexity, for a fixed number of training data points.] Empirical risk is no longer a good indicator of true risk.

Behavior of True Risk. We want to be as good as the optimal predictor, but the excess risk is nonzero due to the finite sample size and noise. The excess risk splits into estimation error, due to the randomness of the training data, and approximation error, due to the restriction to the model class: excess risk = estimation error + approximation error.
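In symbols (f* the optimal predictor, F the model class, f̂_n the learned predictor; a standard reconstruction of the slide's decomposition):

R(\hat f_n) - R(f^*)
= \underbrace{\Big(R(\hat f_n) - \inf_{f\in\mathcal{F}} R(f)\Big)}_{\text{estimation error (randomness of training data)}}
+ \underbrace{\Big(\inf_{f\in\mathcal{F}} R(f) - R(f^*)\Big)}_{\text{approximation error (restriction to model class)}}.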

Behavior of True Risk. [Figure-only slide.]

Bias-Variance Tradeoff. Regression: the response is the optimal predictor plus zero-mean noise, so notice that even the optimal predictor does not have zero error. The expected error splits into variance + bias^2 + noise variance; the noise variance is the irreducible random component, and the excess risk = variance + bias^2, where the variance plays the role of the estimation error and the bias^2 plays the role of the approximation error.

Bias-Variance Tradeoff: Derivation. Regression setting; again, the optimal predictor does not have zero error. The expected squared error is expanded on the slide, with one cross term marked as 0.

Bias-Variance Tradeoff: Derivation (continued). Now let's look at the second term, the variance: how much does the predictor vary about its mean over different training datasets? Note: the mean predictor, being an average over training sets, does not depend on the particular training set D_n.

Bias-Variance Tradeoff: Derivation (continued). The cross term is 0 since the noise is independent and zero-mean. The remaining terms are the bias^2, which measures how much the mean of the predictor differs from the optimal predictor, and the noise variance. (A reconstruction of the full decomposition follows below.)
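The equations on these derivation slides appear only as scattered annotations; a standard reconstruction, assuming the usual regression model Y = f*(X) + ε with E[ε] = 0, Var(ε) = σ², and a predictor f̂_n trained on D_n:

% standard bias-variance decomposition (reconstruction, not verbatim from the slides)
\mathbb{E}_{D_n,\,(X,Y)}\big[(Y - \hat f_n(X))^2\big]
= \underbrace{\sigma^2}_{\text{noise variance}}
+ \underbrace{\mathbb{E}\big[(\hat f_n(X) - \mathbb{E}_{D_n}[\hat f_n(X)])^2\big]}_{\text{variance (estimation error)}}
+ \underbrace{\mathbb{E}\big[(\mathbb{E}_{D_n}[\hat f_n(X)] - f^*(X))^2\big]}_{\text{bias}^2 \text{ (approximation error)}},

where the cross terms vanish because the noise is independent and zero-mean.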

Bias-Variance Tradeoff. Three independent training datasets. Large bias, small variance: poor approximation, but robust/stable across training sets. Small bias, large variance: good approximation, but unstable across training sets. [Figure: one fitted curve per training dataset for each of the two regimes.]

Examples of Model Spaces. Model spaces with increasing complexity:
- Nearest-neighbor classifiers with varying neighborhood sizes k = 1, 2, 3, ... (smaller neighborhood => higher complexity).
- Decision trees with depth k, or with k leaves (higher depth / more leaves => higher complexity).
- Regression with polynomials of order k = 0, 1, 2, ... (higher degree => higher complexity).
- Kernel regression with bandwidth h (smaller bandwidth => higher complexity).
How can we select the right complexity model?

Model Selection. Setup: model classes of increasing complexity. We can select the right complexity model in a data-driven/adaptive way:
- Cross-validation
- Structural risk minimization
- Complexity regularization
- Information criteria - AIC, BIC, Minimum Description Length (MDL)

Hold-out method. We would like to pick the model that has the smallest generalization error, and we can judge generalization error using an independent sample of data. Hold-out procedure (n data points available): 1) Split the data into two sets, a training dataset D_T and a validation dataset D_V (NOT the test data!). 2) Use D_T to train a predictor from each model class, with the empirical risk during training evaluated on D_T.

Hold-out method (continued). 3) Use D_V to select the model class with the smallest empirical error, evaluated on the validation dataset D_V. 4) The hold-out predictor is the one chosen in step 3. Intuition: a predictor that achieves small error on one set of data only by overfitting will not achieve small error on a randomly sub-sampled second set of data; this ensures the method is stable.
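None of this code is from the slides; a minimal scikit-learn sketch of hold-out model selection over polynomial degrees (the data, the candidate degrees, and the 70/30 split are illustrative assumptions):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (100, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.normal(size=100)

# 1) Split into training set D_T and validation set D_V (not the test set!)
X_T, X_V, y_T, y_V = train_test_split(X, y, test_size=0.3, random_state=0)

# 2) Train one predictor per model class (here: polynomial degree) on D_T
# 3) Pick the model class with the smallest empirical error on D_V
val_err = {}
for degree in [1, 2, 3, 5, 7, 9]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(X_T, y_T)
    val_err[degree] = mean_squared_error(y_V, model.predict(X_V))

best_degree = min(val_err, key=val_err.get)   # 4) the hold-out predictor's model class
print(best_degree, val_err[best_degree])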

Hold-out method. Drawbacks: we may not have enough data to afford setting one subset aside just to get a sense of generalization ability, and the validation error may be misleading (a bad estimate of the generalization error) if we get an unfortunate split. These limitations of hold-out can be overcome by a family of random sub-sampling methods, at the expense of more computation.

Cross-validation: K-fold cross-validation. Create a K-fold partition of the dataset. Form K hold-out predictors, each time using one partition as the validation set and the remaining K-1 as the training set. The final predictor is the average/majority vote over the K hold-out estimates. [Figure: Runs 1 through K, each with a different fold held out for validation.]
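Again not from the slides: a sketch of 10-fold cross-validation used to choose among model classes (here the neighborhood size k of a k-NN regressor; the data are synthetic):

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

# Synthetic data for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, (200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + 0.2 * rng.normal(size=200)

K = 10
kf = KFold(n_splits=K, shuffle=True, random_state=0)
for k in [1, 3, 5, 10, 25]:                    # candidate model classes (k in k-NN)
    fold_errs = []
    for train_idx, val_idx in kf.split(X):     # one fold held out per run
        model = KNeighborsRegressor(n_neighbors=k).fit(X[train_idx], y[train_idx])
        fold_errs.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))
    print(k, np.mean(fold_errs))               # CV error estimate = average over the K runs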

Cross-validation: Leave-one-out (LOO) cross-validation. A special case of K-fold with K = n partitions; equivalently, train on n-1 samples and validate on the single remaining sample, for n runs. [Figure: Runs 1 through n, each holding out a single point for validation.]

Cross-validation: Random subsampling. Randomly subsample a fixed fraction αn (0 < α < 1) of the dataset for validation, and form the hold-out predictor with the remaining data as training data. Repeat K times. The final predictor is the average/majority vote over the K hold-out estimates. [Figure: Runs 1 through K, each with a randomly chosen validation subset.]

Estimating generalization error. Hold-out (1-fold): the error estimate is the empirical error of the trained predictor on the validation set. K-fold/LOO/random sub-sampling: the error estimate is the average of the K validation errors. We want to estimate the error of a predictor based on n data points. If K is large (close to n), the bias of the error estimate is small, since each training set has close to n data points; however, the variance of the error estimate is high, since each validation set has few data points and its error might deviate a lot from the mean.
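The two error-estimate formulas appear only as "Error estimate =" fragments; a standard reconstruction:

\widehat{\mathrm{err}}_{\text{hold-out}}
= \frac{1}{|D_V|} \sum_{(X_i, Y_i)\in D_V} \ell\big(\hat f_{D_T}(X_i),\, Y_i\big),
\qquad
\widehat{\mathrm{err}}_{K\text{-fold}}
= \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|D_V^{(k)}|} \sum_{(X_i, Y_i)\in D_V^{(k)}} \ell\big(\hat f_{D_T^{(k)}}(X_i),\, Y_i\big).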

Practical Issues in Cross-validation. How do we decide the values for K and α?
Large K: + the bias of the error estimate will be small; - the variance of the error estimate will be large (few validation points per run); - the computational time will be very large as well (many experiments).
Small K: + the number of experiments and, therefore, the computation time are reduced; + the variance of the error estimate will be small (many validation points per run); - the bias of the error estimate will be large.
Common choice: K = 10, α = 0.1.

Structural Risk Minimization. Penalize models using a bound on the deviation between the true and empirical risks: with high probability, the true risk is upper-bounded by the empirical risk plus a penalty C(f) that is large for complex models (the bound on the deviation from the true risk comes from concentration bounds, covered later). The selected predictor minimizes this high-probability upper bound on the true risk.
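In symbols, a standard form of the bound and the selection rule (the exact penalty depends on the concentration inequality used, so this is a reconstruction rather than the slide's formula):

\text{With probability at least } 1 - \delta:\quad
R(f) \;\le\; \widehat{R}_n(f) + C(f, n, \delta) \quad \text{for all } f,
\qquad
\hat f \;=\; \arg\min_{f}\Big[\widehat{R}_n(f) + C(f, n, \delta)\Big],

where C(f, n, δ) is large for complex models.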

Structural Risk Minimization. Deviation bounds are typically pretty loose for small sample sizes; in practice, choose the penalty by cross-validation! Example problem: identify a flood plain from noisy satellite images. [Figures: noiseless image, noisy image, and the true flood plain (elevation level > x).]

Structural Risk Minimization. Deviation bounds are typically pretty loose for small sample sizes; in practice, choose the penalty by cross-validation! Example problem: identify a flood plain from noisy satellite images. [Figures: the true flood plain (elevation level > x) and the estimates obtained with zero penalty, the CV-chosen penalty, and the theoretical penalty.]

Occam's Razor. William of Ockham (1285-1349), Principle of Parsimony: "One should not increase, beyond what is necessary, the number of entities required to explain anything." Alternatively: seek the simplest explanation. Penalize complex models based on prior information (bias) or an information criterion (MDL, AIC, BIC).

Importance of Domain Knowledge. Examples: oil spill contamination; the distribution of photon arrivals (Compton Gamma-Ray Observatory, Burst and Transient Source Experiment, BATSE). [Figures accompany each example.]

Complexity Regularization. Penalize complex models using prior knowledge. Bayesian viewpoint: assign a prior probability p(f) to each predictor f; the cost of the model is the (negative) log prior, so the cost is small if f is highly probable and large if f is improbable. ERM (empirical risk minimization) over a restricted class F corresponds to a uniform prior on f ∈ F and zero probability for all other predictors.

Complexity Regularization (continued). Penalize complex models using prior knowledge, with the cost of the model given by the (negative) log prior. Examples: MAP estimators; regularized linear regression - ridge regression and the Lasso, which penalize models based on some norm of the regression coefficients. How do we choose the tuning parameter λ? Cross-validation.
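A minimal scikit-learn sketch (not from the lecture) of ridge regression and the Lasso with the tuning parameter λ, called alpha in scikit-learn, chosen by cross-validation; the data are synthetic:

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# Synthetic data for illustration: sparse true coefficients
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
w_true = np.zeros(20)
w_true[:3] = [2.0, -1.0, 0.5]
y = X @ w_true + 0.5 * rng.normal(size=200)

# Ridge: squared L2 penalty on the coefficients, lambda picked by CV over a grid
ridge = RidgeCV(alphas=np.logspace(-3, 3, 13)).fit(X, y)
# Lasso: L1 penalty, lambda picked by 10-fold CV (tends to give sparse coefficients)
lasso = LassoCV(cv=10).fit(X, y)

print("ridge lambda:", ridge.alpha_)
print("lasso lambda:", lasso.alpha_, " nonzero coefficients:", np.sum(lasso.coef_ != 0))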

Information Criteria - AIC, BIC. Penalize complex models based on their information content (the number of bits needed to describe f, its description length). AIC (Akaike Information Criterion): C(f) = # parameters; allows the number of parameters to grow without bound as the number of training data n becomes large. BIC (Bayesian Information Criterion): C(f) = # parameters * log n; penalizes complex models more heavily, limiting the complexity of models as the number of training data n becomes large.
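As a hedged reconstruction in the penalized-risk notation of this lecture (the scaling of the penalty relative to the empirical risk varies by convention; the classical likelihood-based forms are AIC = 2k - 2 log L and BIC = k log n - 2 log L):

\hat f = \arg\min_{f}\Big[\widehat{R}_n(f) + \tfrac{1}{n}\, C(f)\Big],
\qquad
C_{\text{AIC}}(f) = \#\text{parameters},
\qquad
C_{\text{BIC}}(f) = \#\text{parameters} \cdot \log n.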

Information Criteria - MDL. Penalize complex models based on their information content: C(f) is the number of bits needed to describe f (its description length). MDL = Minimum Description Length. Example: binary decision trees. A tree with k leaves has 2k-1 nodes, so it takes 2k-1 bits to encode the tree structure, plus k bits to encode the (0/1) label of each leaf. For example, 5 leaves => 9 bits to encode the structure (plus 5 bits for the leaf labels).

Summary:
- True and empirical risk; over-fitting.
- Approximation error vs. estimation error; the bias vs. variance tradeoff.
- Model selection and estimating generalization error: hold-out, K-fold cross-validation.
- Structural risk minimization.
- Complexity regularization.
- Information criteria: AIC, BIC, MDL.