CSC411/2515 Tutorial: K-NN and Decision Tree


Mengye Ren, csc{411,2515}ta@cs.toronto.edu, September 25, 2016

Outline: Cross-validation, K-nearest-neighbours, Decision Trees

Review: Motivation for Validation
Framework: learning as optimization. Goal: optimize model complexity for our task. Formulation: minimize underfitting and overfitting. In particular, we want our model to generalize well without overfitting; we can ensure this by validating the model.

Types of Validation (1): Hold-out Validation
Split the data into a training set and a validation set; usually about 30% of the data is used as the hold-out set.
Problems: wastes part of the dataset, and the resulting estimate of the error rate may be misleading.
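As a concrete illustration of a hold-out split, here is a minimal NumPy sketch (not part of the tutorial); the 30% hold-out fraction, the array names, and the toy data are assumptions made for the example.

```python
import numpy as np

def holdout_split(X, y, val_fraction=0.3, seed=0):
    """Randomly split (X, y) into training and validation sets.

    val_fraction=0.3 mirrors the 'usually 30% as hold-out set' rule of thumb.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))          # shuffle example indices
    n_val = int(len(X) * val_fraction)     # size of the hold-out set
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Example: 100 two-dimensional points with binary labels.
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, y_tr, X_val, y_val = holdout_split(X, y)
print(X_tr.shape, X_val.shape)             # (70, 2) (30, 2)
```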

Types of Validation (2): Cross-validation
Random sub-sampling. [Figure from Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.]
Problem: more computationally expensive than hold-out validation.

Variants of Cross-validation (1): Leave-p-out
Use p examples as the validation set and the rest as training; repeat for all configurations of examples (e.g., p = 1 gives leave-one-out).
Problem: it is exhaustive: we are required to train and test $\binom{N}{p}$ times, where N is the number of training examples.
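To see why leave-p-out is exhaustive, the short sketch below (my own illustration, not the tutorial's code) enumerates all $\binom{N}{p}$ validation sets with itertools; the tiny values of N and p are chosen only to keep the count printable.

```python
import itertools

def leave_p_out_splits(n_examples, p):
    """Yield (train_indices, val_indices) for every size-p validation set."""
    all_idx = set(range(n_examples))
    for val in itertools.combinations(range(n_examples), p):
        yield sorted(all_idx - set(val)), list(val)

N, p = 6, 2
splits = list(leave_p_out_splits(N, p))
print(len(splits))   # C(6, 2) = 15 train/validate rounds; this count explodes as N grows
```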

Variants of Cross-validation (2): K-fold
Partition the training data into K equally sized subsamples; for each fold, use K−1 subsamples as training data and the remaining subsample as validation.
Advantages:
- All observations are used for both training and validation, and each observation is used for validation exactly once.
- Non-exhaustive, so more tractable than LpOCV.
Problems:
- Expensive for large N and K (since we train/test K models on N examples), but there are some efficient hacks to save time over the brute-force method.
- We can still overfit if we validate too many models! Solution: hold out an additional test set before doing any model selection, and check that the best model performs well even on that additional test set (nested cross-validation).
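A minimal sketch of K-fold cross-validation, assuming generic fit(X, y) and error(model, X, y) callbacks so any model can be plugged in; this is my own illustration, not code from the tutorial.

```python
import numpy as np

def k_fold_cv(X, y, fit, error, K=10, seed=0):
    """Estimate generalization error by K-fold cross-validation.

    fit(X_train, y_train) -> model and error(model, X_val, y_val) -> float
    are assumed interfaces, not part of any particular library.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, K)                 # K (roughly) equal subsamples
    errs = []
    for k in range(K):
        val_idx = folds[k]                         # one fold held out for validation
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train_idx], y[train_idx])
        errs.append(error(model, X[val_idx], y[val_idx]))
    return float(np.mean(errs))                    # average validation error over the folds

# Example with a trivial "predict the training mean" regressor.
X = np.random.randn(50, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.randn(50)
fit = lambda Xtr, ytr: ytr.mean()
error = lambda m, Xv, yv: float(np.mean((yv - m) ** 2))   # mean squared error
print(k_fold_cv(X, y, fit, error, K=5))
```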

Practical Tips for Using K-fold Cross-validation
Q: How many folds do we need?
A: With larger K, error estimation tends to be more accurate, but computation time will be greater.
In practice: usually choose K ≈ 10; for larger datasets, choose a smaller K.

Outline: Cross-validation, K-nearest-neighbours, Decision Trees

K-nearest-neighbours: Definition
Training: store all training examples (perfect memory).
Test: predict the value/class of an unseen (test) instance based on its closeness to the stored training examples, under some distance (similarity) measure. [Figure from Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.]

Predicting with K-nearest-neighbours
For K = 1, predict the same value/class as the nearest instance in the training set.
For K > 1, find the K closest training examples, and either
- predict the class by majority vote (in classification), or
- predict the value by an inverse-distance-weighted average (in regression).
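Both prediction rules can be written in a few lines of NumPy; the sketch below is an illustration under the assumption of a plain Euclidean distance, not the tutorial's MATLAB demo.

```python
import numpy as np
from collections import Counter

def knn_predict_class(X_train, y_train, x, K=3):
    """Classify x by majority vote among its K nearest training examples."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))   # Euclidean distances to x
    nn = np.argsort(d)[:K]                          # indices of the K closest points
    return Counter(y_train[nn]).most_common(1)[0][0]

def knn_predict_value(X_train, y_train, x, K=3, eps=1e-12):
    """Regress on x with an inverse-distance-weighted average of K neighbours."""
    d = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nn = np.argsort(d)[:K]
    w = 1.0 / (d[nn] + eps)                         # closer neighbours get larger weights
    return float(np.dot(w, y_train[nn]) / w.sum())

# Tiny example: two clusters with labels 0 and 1.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [1.1, 0.9]])
y_class = np.array([0, 0, 1, 1])
print(knn_predict_class(X_train, y_class, np.array([0.2, 0.1]), K=3))             # -> 0
print(knn_predict_value(X_train, y_class.astype(float), np.array([0.9, 1.0]), K=3))
```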

Practical Tips for Using KNN (1)
Ties may occur in a classification problem when K > 1.
- For binary classification: choose K odd to avoid ties.
- For multi-class classification: decrease the value of K until the tie is broken; if that doesn't work, use the class given by a 1-NN classifier.

Practical Tips for Using KNN (2)
Magnitude of K:
- smaller K: predictions have higher variance (less stable)
- larger K: predictions have higher bias (less true)

Aside: The Bias-variance Tradeoff
A learning procedure creates biased models if the predictive distribution of the models differs greatly from the target distribution.
A learning procedure creates models with high variance if the models make greatly different test predictions across different training sets drawn from the same target distribution.

Practical Tips for Using KNN (2), continued
Since the choice of K trades off bias against variance, cross-validation can help here!
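One possible way to choose K by cross-validation, sketched with scikit-learn (assuming it is available); the candidate K values and the synthetic two-class data are my own choices, not the tutorial's.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic two-class data purely for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# 5-fold cross-validated accuracy for a few odd values of K.
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
          for k in [1, 3, 5, 7, 9, 15]}
best_k = max(scores, key=scores.get)
print(scores, "-> best K:", best_k)
```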

Practical Tips for Using KNN (3)
The choice of distance measure affects the results! e.g., standard Euclidean distance:
$$d_E(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_i (x_i - y_i)^2}$$
Problems:
- assumes features have equal variance. Solution: standardize / scale the feature values.
- assumes features are uncorrelated. Solution: use a more complex model (e.g., a Gaussian).
- in high-dimensional space, noisy features dominate. Solution: apply (learn?) feature weightings.
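For the equal-variance problem, here is a minimal sketch of standardizing (z-scoring) features before computing Euclidean distances; the two-feature example mixing metres and grams is invented for illustration.

```python
import numpy as np

def standardize(X_train, X_test):
    """Scale each feature to zero mean and unit variance using training statistics only."""
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0) + 1e-12          # avoid division by zero for constant features
    return (X_train - mu) / sigma, (X_test - mu) / sigma

# One feature in metres, one in grams: the raw Euclidean distance is dominated by grams.
X_train = np.array([[1.70, 70000.0], [1.60, 60000.0], [1.80, 80000.0]])
X_test = np.array([[1.75, 65000.0]])
X_train_s, X_test_s = standardize(X_train, X_test)
print(np.sqrt(((X_train_s - X_test_s) ** 2).sum(axis=1)))   # distances now weigh both features
```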

Practical Tips for Using KNN (4)
K-NN has perfect memory, so computational complexity is an issue: at test time it needs O(ND) computations per test point (N training examples, D features).
Solutions:
- dimensionality reduction: sample features, or project the data to a lower-dimensional space
- sample training examples
- use clever data structures, like k-d trees
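As one example of a "clever data structure", the sketch below uses SciPy's cKDTree (assuming SciPy is installed): the tree is built once and then answers nearest-neighbour queries much faster than a brute-force O(ND) scan in low to moderate dimensions.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X_train = rng.normal(size=(10000, 3))      # N = 10000 training points in D = 3 dimensions
y_train = (X_train[:, 0] > 0).astype(int)

tree = cKDTree(X_train)                    # build once, reuse for every query
dist, idx = tree.query(rng.normal(size=(5, 3)), k=3)   # 3 nearest neighbours of 5 test points
print(idx.shape)                           # (5, 3): neighbour indices per test point
# A majority vote over y_train[idx] would give the K-NN class predictions.
```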

MATLAB Demo

Outline: Cross-validation, K-nearest-neighbours, Decision Trees

Decision Trees: Definition
Goal: approximate a discrete-valued target function.
Representation: a tree in which
- each internal (non-leaf) node tests an attribute,
- each branch corresponds to an attribute value,
- each leaf node assigns a class.
[Example from Mitchell, T. (1997). Machine Learning. McGraw Hill.]

Decision Trees: Induction
The ID3 algorithm: while the training examples are not perfectly classified, do
- choose the most informative attribute θ that has not already been used as the decision attribute for the next node N (greedy selection),
- for each value (discrete θ) / range (continuous θ), create a new descendant of N,
- sort the training examples to the descendants of N.
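Here is a compact sketch of ID3 for discrete attributes (my own rendering of the loop above, not the tutorial's code); it greedily selects the attribute with the highest information gain and recurses, and it omits the continuous-attribute case.

```python
import math
from collections import Counter

def entropy(labels):
    """H(labels) in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Information gain of splitting (examples, labels) on attribute attr."""
    n = len(labels)
    by_value = {}
    for ex, lab in zip(examples, labels):
        by_value.setdefault(ex[attr], []).append(lab)
    remainder = sum(len(ls) / n * entropy(ls) for ls in by_value.values())
    return entropy(labels) - remainder

def id3(examples, labels, attrs):
    """Return a decision tree: either a class label (leaf) or (attr, {value: subtree})."""
    if len(set(labels)) == 1:                      # perfectly classified: make a leaf
        return labels[0]
    if not attrs:                                  # no attributes left: majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, labels, a))   # greedy selection
    children = {}
    for value in {ex[best] for ex in examples}:    # one descendant per attribute value
        sub = [(ex, lab) for ex, lab in zip(examples, labels) if ex[best] == value]
        sub_ex, sub_lab = zip(*sub)
        children[value] = id3(list(sub_ex), list(sub_lab), [a for a in attrs if a != best])
    return (best, children)

# Tiny example: examples are dicts mapping attribute -> value.
ex = [{"Outlook": "Sunny", "Wind": "Weak"}, {"Outlook": "Sunny", "Wind": "Strong"},
      {"Outlook": "Rain", "Wind": "Weak"}]
print(id3(ex, ["No", "No", "Yes"], ["Outlook", "Wind"]))
```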

Decision Trees: Example PlayTennis

After splitting the training examples first on Outlook... What should we choose as the next attribute under the branch Outlook = Sunny?

Choosing the Most Informative Attribute
Formulation: maximise the information gain over attributes Y:
$$\mathrm{InfoGain}(\mathrm{PlayTennis} \mid Y) = H(\mathrm{PlayTennis}) - H(\mathrm{PlayTennis} \mid Y)$$
$$= -\sum_x P(\mathrm{PlayTennis}=x)\,\log P(\mathrm{PlayTennis}=x) \;+\; \sum_{x,y} P(\mathrm{PlayTennis}=x, Y=y)\,\log\frac{P(\mathrm{PlayTennis}=x, Y=y)}{P(Y=y)}$$

Information Gain Computation (1)
$$\mathrm{InfoGain}(\mathrm{PlayTennis} \mid \mathrm{Humidity}) = .970 - \tfrac{3}{5}(0.0) - \tfrac{2}{5}(0.0) = .970$$

Information Gain Computation (2)
$$\mathrm{InfoGain}(\mathrm{PlayTennis} \mid \mathrm{Temperature}) = .970 - \tfrac{2}{5}(0.0) - \tfrac{2}{5}(1.0) - \tfrac{1}{5}(0.0) = .570$$

Information Gain Computation (3)
$$\mathrm{InfoGain}(\mathrm{PlayTennis} \mid \mathrm{Wind}) = .970 - \tfrac{2}{5}(1.0) - \tfrac{3}{5}(0.918) = .019$$
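These three values can be checked mechanically. The sketch below (my own) recomputes them on the five Outlook = Sunny examples from Mitchell's PlayTennis table, which is the subset the previous slide refers to.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    labels = [r["PlayTennis"] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r["PlayTennis"])
    return entropy(labels) - sum(len(g) / len(rows) * entropy(g) for g in groups.values())

# The five Outlook = Sunny examples from Mitchell (1997), Table 3.2.
sunny = [
    {"Temperature": "Hot",  "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Temperature": "Hot",  "Humidity": "High",   "Wind": "Strong", "PlayTennis": "No"},
    {"Temperature": "Mild", "Humidity": "High",   "Wind": "Weak",   "PlayTennis": "No"},
    {"Temperature": "Cool", "Humidity": "Normal", "Wind": "Weak",   "PlayTennis": "Yes"},
    {"Temperature": "Mild", "Humidity": "Normal", "Wind": "Strong", "PlayTennis": "Yes"},
]
for a in ["Humidity", "Temperature", "Wind"]:
    print(a, round(info_gain(sunny, a), 3))
# Prints roughly 0.971, 0.571, 0.020, matching the slides' .970 / .570 / .019 up to rounding.
```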

The Decision Tree for PlayTennis

Recall: The Bias-variance Tradeoff
Where do decision trees naturally lie in this space?
Answer: high variance.
How to fix: pruning (e.g., reduced-error pruning, rule post-pruning).