COMPUTATIONAL INTELLIGENCE SEW (INTRODUCTION TO MACHINE LEARNING) SS18. Lecture 6: k-nn Cross-validation Regularization


LEARNING METHODS

Lazy vs eager learning
Eager learning: generalizes from the training data before any prediction query (e.g. neural networks). Fast prediction evaluation; summarizes the training set (noise reduction).
Lazy learning: waits for a prediction query before generalizing (e.g. k-nn). Local approximation; quick adaptation to changes in the training set; requires storage of the full training set; slow evaluation.

Instance-based learning
A type of lazy learning: store the training set in memory and compare a test sample to the samples in memory.

K-NEAREST NEIGHBORS (K-NN)

k-nn
Simple, non-differentiable, lazy learning.
The main idea: find the k closest training samples (for instance with the Euclidean distance) and assign the most frequent class among those k samples.
[Figure: 2D example in the (x1, x2) plane showing how the predicted class of a query point can change between k=1 and k=5.]
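A minimal sketch of this idea in Python (NumPy only; the function and variable names are illustrative, not part of the lecture material):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=5):
    """Classify x_query by majority vote among its k nearest training samples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)       # Euclidean distances
    nearest = np.argsort(dists)[:k]                          # indices of the k closest samples
    return Counter(y_train[nearest]).most_common(1)[0][0]    # most frequent class among them

# Toy usage
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.2, 0.1]), k=3))  # -> 0
```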

1-NN: Nearest Neighbor
No explicit decision boundary is computed. The decision boundary forms a subset of the Voronoi diagram: a Voronoi diagram is a partitioning of the plane into regions based on distance to the points of a specific subset of the plane. The resulting decision boundaries are irregular.
[Figure: Voronoi diagram and the corresponding 1-NN decision boundary.]

The influence of the number of neighbors
The best k is data dependent. Larger values of k give robustness to noise but fuzzier boundaries. Model selection (on a validation set) is the best heuristic to optimize k.

Variants
Training: save only preprocessed input (feature extraction and dimensionality reduction).
Testing:
Classification: majority vote of the k nearest neighbors.
Regression: average of the k nearest neighbors (see the sketch below).
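A sketch of the regression variant under the same assumptions (NumPy arrays; the function name is illustrative):

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=5):
    """Predict the average target value of the k nearest training samples."""
    dists = np.linalg.norm(X_train - x_query, axis=1)   # Euclidean distances
    nearest = np.argsort(dists)[:k]                     # indices of the k closest samples
    return y_train[nearest].mean()                      # average of their targets
```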

Pros and cons
Pros: easy to implement and understand; no training; can learn very complex decision boundaries; no information loss (all samples are kept).
Cons: requires storage of all data samples; slow at query time; bad performance if the metric or the feature vector is poorly chosen.

Simple concepts are never deprecated
Memory Networks (Weston et al., Facebook 2015) made tremendous progress in text processing and artificial reasoning: for each input x, find the memory instance m most similar to x, return a function f(x, m), and update the stored memories.
Differentiable versions of these simple algorithms can be designed to use back-propagation: Differentiable Neural Computer (Graves et al., Google 2016), End-to-End Memory Networks (Sukhbaatar et al., Facebook 2016).

Application tips
When to use k-nn: lots of data is available and the number of features is small.
What if the classes are not evenly represented? In that case the more frequent class tends to dominate the prediction for a new example; weighting heuristics can compensate (a sketch follows below).
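One common weighting heuristic (an illustration, not prescribed by the slide) is to weight each neighbor's vote by the inverse of its distance, so that nearby samples of a rare class can outweigh more distant samples of a frequent class:

```python
import numpy as np
from collections import defaultdict

def weighted_knn_predict(X_train, y_train, x_query, k=5, eps=1e-9):
    """Classify x_query with inverse-distance weighted voting among k neighbors."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    votes = defaultdict(float)
    for i in nearest:
        votes[y_train[i]] += 1.0 / (dists[i] + eps)   # closer neighbors vote with more weight
    return max(votes, key=votes.get)
```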

UNDERFITTING AND OVERFITTING

Under- and Overfitting
[Figure: training error (cost) and test error (cost) vs. model complexity (e.g. degree of polynomial terms), with the underfitting, "just right", and overfitting regions marked.]

Under- and Overfitting
Underfitting: the model is too simple; high training error and high test error.
Overfitting: the model is too complex (often: too many parameters relative to the number of training examples); low training error but high test error.
In between, "just right": moderate training error and the lowest test error.
[Figure: training and test error (cost) vs. model complexity.]

How to deal with overfitting
Use model selection to automatically select the right model complexity.
Use regularization to keep parameters small.
Collect more data (often not possible or inefficient).
Manually throw out features which are unlikely to contribute (often hard to guess which ones, with the risk of throwing out the wrong ones).
Pre-process, change the feature vector, or perform dimensionality reduction (endless effort, often not possible or inefficient).

Model selection: training/validation/test set workflow
[Diagram: training set → learning algorithms A, B, C (for example: linear regression, polynomial regression, artificial neural network) → hypotheses h_A, h_B, h_C → validation set → model selection → selected hypothesis h → test set → testing → test error/cost.]

CROSS-VALIDATION

Cross-validation
The goal: define a validation set to "pre-test" on during the training phase.
Why use it: to limit overfitting by keeping track of the predictive power.
The trick: recycle the data by using different training/validation partitions.

Model selection with cross-validation
1. Compute the averaged cross-validated error (CV) for each model:
   h_A: fold errors 0.6, 0.5, 0.7, 0.6 → CV = 0.6
   h_B: fold errors 0.4, 0.5, 0.5, 0.6 → CV = 0.5
2. Choose the model with the smallest CV (here h_B).

Cross-validation approaches
Disadvantages of a single validation set: little training data means the function is poorly fitted; little validation data means the true error is poorly estimated.
Tricks and warnings: beware if the variance of the error over the partitions is large; after selection, retrain the best model class on the full data; use the same partitions for all hypotheses.
Common types of partitioning: k-fold, 2-fold, leave-one-out, repeated random sub-sampling.

k-fold cross-validation
Useful when the training dataset is small.
Steps: split the data into k equal folds; repeat the cross-validation process k times so that each fold is used exactly once as the validation set and the rest as the training set; compute the mean and the variance over the k runs (a sketch follows below).
Disadvantage: it requires k runs of the algorithm, i.e. k times as much computation.
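A minimal sketch of the procedure (NumPy only; the `fit` and `error` callables are placeholders for whatever learning algorithm and error measure are being validated):

```python
import numpy as np

def k_fold_cv(X, y, fit, error, k=5, seed=0):
    """Return mean and variance of the validation error over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))           # shuffle the sample indices once
    folds = np.array_split(idx, k)          # k (nearly) equal folds
    errors = []
    for i in range(k):
        val_idx = folds[i]                  # fold i is the validation set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        errors.append(error(model, X[val_idx], y[val_idx]))
    return np.mean(errors), np.var(errors)
```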

2-fold cross-validation
The simplest approach, also called the holdout method.
Idea: randomly split the whole training data into 2 equal folds (k = 2); train on the first fold and validate on the second, and vice versa.
Advantage: both the training and the validation sets are fairly large, and each data point is used once for training and once for validation.

Leave-one-out cross-validation
A special case where k equals the number of samples in the training set.
Idea: use a single sample as the validation set and all the rest as the training set, repeated k times.
Used when the training set is really small.

Leave-one-out cross-validation example
[Figure: polynomial fits of increasing order with their leave-one-out CV errors: 1st order CV = 0.6, 3rd order CV = 1.5, 5th order CV = 6.0, 7th order CV = 15.6.]

Repeated random sub-sampling validation
Idea: randomly split the dataset into training and validation sets, k times.
Advantage: you can choose independently how large each validation set is and how many trials you average over.
Disadvantage: the validation subsets may overlap (and some samples may never be selected for validation).

REGULARIZATION

Bias-variance dilemma (tradeoff)
error = Bias^2 + Variance
Variance: how much the learned model varies across different training samples.
Bias: how far the (average) learned model is from the target.
Regularization provides techniques to reduce the variance.
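For squared-error loss this statement corresponds to the standard decomposition (with $\hat{f}$ the learned predictor, $f$ the true function, and $\sigma^2$ the irreducible noise term, which the slide omits):

```latex
\mathbb{E}\!\left[(y - \hat{f}(x))^2\right]
  = \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{Bias}^2}
  + \underbrace{\mathbb{E}\!\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{Variance}}
  + \sigma^2
```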

Under- and Overfitting
[Figure: training and test error (cost) vs. model complexity (e.g. degree of polynomial terms); the underfitting region has low variance and high bias, the overfitting region has high variance and low bias, with "just right" in between.]

Polynomial regression under-/overfitting
[Figure: polynomial fits of degree 1, 5, and 15 to the same training data (y vs. x).
Degree 1: MSE_train = 0.22, MSE_test = 0.29 (underfitting, high bias).
Degree 5: MSE_train = 0.01, MSE_test = 0.01 (just right).
Degree 15: MSE_train = 0.00, MSE_test = 3.17 (overfitting, high variance).]
Suppose we penalize large coefficients: this is the idea behind regularization.
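A comparison of this kind can be reproduced with a few lines of NumPy; the data below is made up for illustration, so the MSE values will not match the slide's:

```python
import numpy as np

rng = np.random.default_rng(0)
# Small noisy training set and a larger test set from the same (assumed) curve
x_train = np.sort(rng.uniform(-2, 2, 20)); y_train = np.sin(x_train) + 1.5 + 0.1 * rng.normal(size=20)
x_test = np.sort(rng.uniform(-2, 2, 100)); y_test = np.sin(x_test) + 1.5 + 0.1 * rng.normal(size=100)

for degree in (1, 5, 15):
    coeffs = np.polyfit(x_train, y_train, degree)                      # least-squares polynomial fit
    mse_train = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)  # training error
    mse_test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)     # test error
    print(f"degree {degree:2d}: MSE_train={mse_train:.2f}, MSE_test={mse_test:.2f}")
```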

Regularization
Prefer simple models; exclude extreme models.
How to do it: instead of minimizing the original cost $E(\mathbf{w})$, minimize $E(\mathbf{w}) + \lambda \lVert \mathbf{w} \rVert^2$, where $\lVert \cdot \rVert$ is the $L_2$ (Euclidean) norm.
A large $\lambda$ leads to underfitting (high bias); a low $\lambda$ leads to overfitting (high variance).

Regularized linear regression
Regularized cost function: $J(\mathbf{w}) = \frac{1}{2N}\sum_{i=1}^{N}\big(\mathbf{w}^{T}\mathbf{x}^{(i)} - y^{(i)}\big)^2 + \frac{\lambda}{2N}\lVert\mathbf{w}\rVert^2$
Analytical solution: $\mathbf{w} = (\mathbf{X}^{T}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^{T}\mathbf{y}$
Gradient descent solution: repeat $w_j \leftarrow w_j - \alpha\Big(\frac{1}{N}\sum_{i=1}^{N}\big(\mathbf{w}^{T}\mathbf{x}^{(i)} - y^{(i)}\big)x_j^{(i)} + \frac{\lambda}{N}w_j\Big)$
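A minimal sketch of the analytical solution above (NumPy only; for simplicity the bias column is regularized together with the other weights, which implementations often avoid):

```python
import numpy as np

def ridge_fit(X, y, lam=0.1):
    """Return w = (X^T X + lam*I)^-1 X^T y for a design matrix X and targets y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Toy usage: y ≈ 1 + 2x with noise; the first column of X is all ones for the bias
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 50)
X = np.column_stack([np.ones_like(x), x])
y = 1 + 2 * x + 0.1 * rng.normal(size=50)
print(ridge_fit(X, y, lam=0.1))   # roughly [1, 2]
```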

Regularization of NN
How many hidden layers and how many neurons? Fewer: risk of underfitting. More: risk of overfitting.
Reduce the complexity (reduce variance): weight decay; network structure (weight sharing).
Keep track of the predictive power (reduce the error directly): early stopping.

Weight decay
Weight decay is a norm regularization for neural networks: the weights of the NN enter as an additional penalty term in the error function,
$\tilde{E}(\mathbf{w}) = E(\mathbf{w}) + \frac{\lambda}{2}\sum_i w_i^2$
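In gradient descent this penalty simply adds $\lambda w$ to the gradient, which shrinks every weight a little at each step. A one-line sketch (names are illustrative; `grad_E` stands for the gradient of the original error):

```python
import numpy as np

def decayed_update(w, grad_E, lr=0.01, lam=1e-4):
    """One gradient step on E(w) + (lam/2) * sum(w**2)."""
    return w - lr * (grad_E + lam * w)   # equivalent to shrinking w by (1 - lr*lam) before the usual step
```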

Sparse structure
[Diagram: a fully connected hidden layer needs 32 x 32 x K weights, while a sparsely connected one needs only 8 x 8 x K weights for K hidden units.]
The hidden units take on different roles. Sparse = many weights set to 0.

Weight sharing
[Diagram: with 8 x 8 local receptive fields and K hidden units there are 8 x 8 x K weights; sharing them, W_i = W_j = W_0, reduces this to a single set of 8 x 8 weights.]
This gives spatial invariance: even positions where the pixels are always 0 may learn to recognize some shapes.

Early stopping
A form of regularization based on the model-selection scheme.
Steps: initialize the weights to small values; train while monitoring the error on the validation data; stop when the validation error starts to increase (a sketch follows below).
[Figure: error curves during training, showing the early-stopping point.]
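A minimal sketch of the loop (the `train_one_epoch` and `validation_error` callables are placeholders for the actual training and evaluation routines, which depend on the network and data):

```python
import copy

def train_with_early_stopping(model, train_one_epoch, validation_error,
                              patience=5, max_epochs=1000):
    """Train until the validation error has not improved for `patience` epochs."""
    best_err, best_model, bad_epochs = float("inf"), None, 0
    for _ in range(max_epochs):
        train_one_epoch(model)                # one pass over the training set
        err = validation_error(model)         # error on the held-out validation set
        if err < best_err:                    # validation error still decreasing
            best_err, best_model, bad_epochs = err, copy.deepcopy(model), 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:        # stop once it keeps increasing
                break
    return best_model                         # model with the lowest validation error
```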

SUMMARY (QUESTIONS)

Some questions
Difference between lazy and eager learning?
Training and testing procedure for k-nn?
How does the number of neighbors influence k-nn?
When to use k-nn, and what are its pros/cons?
What is overfitting and how to deal with it?
What is a validation set?
What is cross-validation?
Types of partitioning in cross-validation?
What is the bias-variance tradeoff?
What is regularization and how is it used?
What are regularization methods for NN?