Model Selection. Introduction to Machine Learning. Matt Gormley. Lecture 4, January 29, 2018


10-601 Introduction to Machine Learning. Machine Learning Department, School of Computer Science, Carnegie Mellon University. Model Selection. Matt Gormley. Lecture 4, January 29, 2018. 1

Q&A
Q: How do we deal with ties in k-Nearest Neighbors (e.g. an even k, or equidistant points)?
A: I would ask you all for a good solution!
Q: How do we define a distance function when the features are categorical (e.g. weather takes values {sunny, rainy, overcast})?
A: Step 1: Convert the categorical attributes to numeric features (e.g. binary). Step 2: Select an appropriate distance function (e.g. Hamming distance). 2
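A minimal sketch of that two-step recipe, assuming a one-hot (binary) encoding of each categorical attribute and Hamming distance between the resulting bit vectors; the vocabulary, values, and helper names below are illustrative, not from the slides.

```python
import numpy as np

def one_hot_encode(values, vocabulary):
    """Map each categorical value to a binary indicator vector over the vocabulary."""
    return np.array([[1 if v == w else 0 for w in vocabulary] for v in values])

def hamming_distance(a, b):
    """Number of positions where two binary vectors disagree."""
    return int(np.sum(a != b))

# Illustrative categorical attribute: weather in {sunny, rainy, overcast}
vocab = ["sunny", "rainy", "overcast"]
x1 = one_hot_encode(["sunny"], vocab)[0]     # [1, 0, 0]
x2 = one_hot_encode(["overcast"], vocab)[0]  # [0, 0, 1]
print(hamming_distance(x1, x2))              # -> 2
```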

Reminders: Homework 2: Decision Trees (Out: Wed, Jan 24; Due: Mon, Feb 5 at 11:59pm). 10-601 Notation Crib Sheet. 3

K-NEAREST NEIGHBORS 7

Chalkboard: k-Nearest Neighbors: KNN for binary classification; distance functions; efficiency of KNN; inductive bias of KNN; KNN properties. 8
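The classifier itself lives on the chalkboard, so here is a minimal sketch of KNN prediction for (binary) classification with Euclidean distance and O(N) distance computations per query; the function name, tie handling, and toy data are my own illustrative choices, not the lecture's.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the label of x_query by majority vote over its k nearest
    training points under Euclidean distance (O(N) prediction)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every training point
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    votes = Counter(y_train[nearest])
    return votes.most_common(1)[0][0]                  # majority label

# Tiny illustrative dataset: two features, binary labels
X = np.array([[1.0, 1.0], [1.2, 0.8], [4.0, 4.2], [3.8, 4.0]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([1.1, 1.0]), k=3))  # -> 0
```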

KNN ON FISHER IRIS DATA 9

Fisher Iris Dataset. Fisher (1936) used 150 measurements of flowers from 3 different species: Iris setosa (0), Iris virginica (1), Iris versicolor (2), collected by Anderson (1936).

Species   Sepal Length   Sepal Width   Petal Length   Petal Width
0         4.3            3.0           1.1            0.1
0         4.9            3.6           1.4            0.1
0         5.3            3.7           1.5            0.2
1         4.9            2.4           3.3            1.0
1         5.7            2.8           4.1            1.3
1         6.3            3.3           4.7            1.6
1         6.7            3.0           5.0            1.7

Full dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set 10

Fisher Iris Dataset. Fisher (1936) used 150 measurements of flowers from 3 different species: Iris setosa (0), Iris virginica (1), Iris versicolor (2), collected by Anderson (1936). Deleted two of the four features, so that the input space is 2D.

Species   Sepal Length   Sepal Width
0         4.3            3.0
0         4.9            3.6
0         5.3            3.7
1         4.9            2.4
1         5.7            2.8
1         6.3            3.3
1         6.7            3.0

Full dataset: https://en.wikipedia.org/wiki/Iris_flower_data_set 11

[Slides 12–36: a sequence of decision-boundary plots of KNN on the 2D Fisher Iris data, including the special cases of Nearest Neighbor (k = 1; slides 13, 16) and Majority Vote (slides 14, 36).]
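The plots themselves are not reproduced in this transcription. A rough way to recreate them, assuming scikit-learn and its bundled copy of the Iris data restricted to the two sepal features (the library choice and class encoding are my assumptions, and may differ from the slides' 0/1/2 labeling):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Keep only sepal length and sepal width, so the input space is 2D
X, y = load_iris(return_X_y=True)
X = X[:, :2]

k = 5  # try k = 1 (nearest neighbor) up to k = len(X) (majority vote)
clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)

# Evaluate the classifier on a dense grid to visualize the decision regions
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 300),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 300))
zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

plt.contourf(xx, yy, zz, alpha=0.3)
plt.scatter(X[:, 0], X[:, 1], c=y, edgecolors="k")
plt.xlabel("Sepal Length"); plt.ylabel("Sepal Width")
plt.title(f"KNN (k={k}) on Fisher Iris Data")
plt.show()
```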

KNN ON GAUSSIAN DATA 37

[Slides 38–62: a sequence of decision-boundary plots of KNN on synthetic data drawn from Gaussian distributions.]

K-NEAREST NEIGHBORS 63

Questions:
How could k-Nearest Neighbors (KNN) be applied to regression?
Can we do better than majority vote? (e.g. distance-weighted KNN)
Where does the Cover & Hart (1967) Bayes error rate bound come from? 64
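One common answer to the first two questions, sketched under my own notation rather than the lecture's: for regression, average the neighbors' target values; to improve on an unweighted vote or average, weight each neighbor by the inverse of its distance.

```python
import numpy as np

def knn_regress(X_train, y_train, x_query, k=3, eps=1e-12):
    """Distance-weighted KNN regression: a weighted average of the k nearest
    neighbors' targets, with weights w_i = 1 / (d_i + eps)."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    weights = 1.0 / (dists[nearest] + eps)  # closer neighbors get larger weight
    return np.sum(weights * y_train[nearest]) / np.sum(weights)

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.0, 4.0, 9.0])
print(knn_regress(X, y, np.array([1.5]), k=2))  # -> 2.5 (equidistant neighbors, equal weights)
```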

KNN Learning Objectives. You should be able to:
- Describe a dataset as points in a high dimensional space [CIML]
- Implement k-Nearest Neighbors with O(N) prediction
- Describe the inductive bias of a k-NN classifier and relate it to feature scale [a la CIML]
- Sketch the decision boundary for a learning algorithm (compare k-NN and DT)
- State Cover & Hart (1967)'s large sample analysis of a nearest neighbor classifier
- Invent "new" k-NN learning algorithms capable of dealing with even k
- Explain computational and geometric examples of the curse of dimensionality 65

k-Nearest Neighbors: But how do we choose k? 66

MODEL SELECTION 67

Model Selection. WARNING: In some sense, our discussion of model selection is premature. The models we have considered thus far are fairly simple. The models, and the many decisions available to the data scientist wielding them, will grow to be much more complex than what we've seen so far. 68

Model Selection
Statistics:
- Def: a model defines the data generation process (i.e. a set or family of parametric probability distributions)
- Def: model parameters are the values that give rise to a particular probability distribution in the model family
- Def: learning (aka. estimation) is the process of finding the parameters that best fit the data
- Def: hyperparameters are the parameters of a prior distribution over parameters
Machine Learning:
- Def: (loosely) a model defines the hypothesis space over which learning performs its search
- Def: model parameters are the numeric values or structure selected by the learning algorithm that give rise to a hypothesis
- Def: the learning algorithm defines the data-driven search over the hypothesis space (i.e. the search for good parameters)
- Def: hyperparameters are the tunable aspects of the model that the learning algorithm does not select 69

Model Selection. Example: Decision Tree
- model = set of all possible trees, possibly restricted by some hyperparameters (e.g. max depth)
- parameters = structure of a specific decision tree
- learning algorithm = ID3, CART, etc.
- hyperparameters = max depth, threshold for the splitting criterion, etc. 70
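To make the mapping concrete, here is a small sketch using scikit-learn's decision tree (a library choice of mine; the lecture does not prescribe one): the hyperparameters are fixed before learning, while the tree structure, the "parameters", is chosen by the fitting algorithm.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Hyperparameters: chosen by us (or by model selection), not by the learner
clf = DecisionTreeClassifier(max_depth=2, criterion="entropy")

# Parameters: the tree structure (splits, thresholds, leaf labels) found by fitting
clf.fit(X, y)
print(clf.get_depth(), clf.tree_.node_count)  # structure selected by the learning algorithm
```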

Model Selection. Example: k-Nearest Neighbors
- model = set of all possible nearest neighbors classifiers
- parameters = none (KNN is an instance-based or non-parametric method)
- learning algorithm = for the naïve setting, just storing the data
- hyperparameters = k, the number of neighbors to consider 71

Model Selection. Example: Perceptron
- model = set of all linear separators
- parameters = vector of weights (one for each feature)
- learning algorithm = mistake-based updates to the parameters
- hyperparameters = none (unless using some variant such as averaged perceptron) 72

Model Selection. If learning is all about picking the best parameters, how do we pick the best hyperparameters? 73

Model Selection. Two very similar definitions:
- Def: model selection is the process by which we choose the best model from among a set of candidates
- Def: hyperparameter optimization is the process by which we choose the best hyperparameters from among a set of candidates (could be called a special case of model selection)
Both assume access to a function capable of measuring the quality of a model. Both are typically done outside the main training algorithm: typically, training is treated as a black box. 74

Example of Hyperparameter Optimization. Chalkboard: special cases of k-Nearest Neighbors; choosing k with validation data; choosing k with cross-validation. 75
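A minimal sketch of the first chalkboard recipe, choosing k on a held-out validation set; the use of scikit-learn and the Iris data, the candidate k values, and the 70/30 split are all my own illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold out a validation set (the 70/30 split is an arbitrary illustrative choice)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

best_k, best_err = None, float("inf")
for k in [1, 3, 5, 7, 9, 15]:                   # candidate hyperparameter values
    clf = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    err = np.mean(clf.predict(X_val) != y_val)  # 0/1 validation error
    if err < best_err:
        best_k, best_err = k, err

print(best_k, best_err)
```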

Cross-Validation
Cross-validation is a method of estimating loss on held-out data.
Input: training data, learning algorithm, loss function (e.g. 0/1 error)
Output: an estimate of the loss function on held-out data
Key idea: rather than just a single validation set, use many! (The error estimate is more stable, but computation is slower.)
[Figure: the dataset D = {(x^(1), y^(1)), ..., (x^(N), y^(N))} partitioned into Folds 1-4.]
Algorithm: Divide the data into folds (e.g. 4).
1. Train on folds {1,2,3} and predict on {4}
2. Train on folds {1,2,4} and predict on {3}
3. Train on folds {1,3,4} and predict on {2}
4. Train on folds {2,3,4} and predict on {1}
Concatenate all the predictions and evaluate the loss (almost equivalent to averaging loss over the folds). 76
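A sketch of that algorithm in code, assuming scikit-learn's KFold splitter and a KNN classifier as the black-box learner; both are my choices for illustration, since the slide's algorithm is learner-agnostic.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

kf = KFold(n_splits=4, shuffle=True, random_state=0)  # e.g. 4 folds
y_pred = np.empty_like(y)

# Train on the other folds, predict on the held-out fold, then concatenate
for train_idx, heldout_idx in kf.split(X):
    clf = KNeighborsClassifier(n_neighbors=5).fit(X[train_idx], y[train_idx])
    y_pred[heldout_idx] = clf.predict(X[heldout_idx])

cv_error = np.mean(y_pred != y)  # 0/1 loss over the concatenated predictions
print(cv_error)
```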

Model Selection
WARNING (again): This section is only scratching the surface! Lots of methods for hyperparameter optimization (to talk about later): grid search, random search, Bayesian optimization, graduate-student descent.
Main Takeaway: Model selection / hyperparameter optimization is just another form of learning. 77
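As one concrete instance of those methods, a grid search over k for KNN, scored by cross-validated accuracy; this uses scikit-learn's GridSearchCV, which is a tooling assumption on my part rather than something the lecture specifies.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Exhaustively try each candidate k, scoring each by 5-fold cross-validated accuracy
param_grid = {"n_neighbors": [1, 3, 5, 7, 9, 15]}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

print(search.best_params_, search.best_score_)
```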

Model Selection Learning Objectives. You should be able to:
- Plan an experiment that uses training, validation, and test datasets to predict the performance of a classifier on unseen data (without cheating)
- Explain the difference between (1) training error, (2) validation error, (3) cross-validation error, (4) test error, and (5) true error
- For a given learning technique, identify the model, learning algorithm, parameters, and hyperparameters
- Define "instance-based learning" or "nonparametric methods"
- Select an appropriate algorithm for optimizing (aka. learning) hyperparameters 78