IRDS: Evaluation, Debugging, and Diagnostics

Charles Sutton, University of Edinburgh

Partitioning Data

Split the data into training, validation, and test sets:
- Training: running the learning algorithms.
- Validation: tuning parameters of the learning algorithm (e.g., regularization parameters).
- Test: estimating performance in a new situation. Ideally the test set is used only once, but this is never really possible (whole research fields overfit to their standard test sets).

Cross-Validation

Split the data into K equal partitions (folds). For each partition, train on all the others and evaluate on the held-out fold, then average performance over the K folds. (Figure: five folds, with one fold held out in turn for testing or validation, plus a separate test set.) This way every example is used as a test example, which is useful if data are scarce.

Cross-Validation for parameter tuning (e.g., k in k-nearest neighbour)

- First partition the data into a training set and a test set.
- Then, on the training set only: for each value of the parameter, e.g., k in {1, 2, 5, 10, ...}, run K-fold CV to estimate performance.
- Finally, train one model with the best k on the entire training set (see the sketch below).
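As a concrete illustration of the tuning recipe above, here is a minimal sketch assuming scikit-learn and a synthetic dataset; the candidate values of k and the fold count are illustrative choices, not prescribed by the slides.

```python
# Minimal sketch: K-fold cross-validation on the training set to pick k for k-NN,
# then a single final fit and one test-set evaluation (assumes scikit-learn).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# First partition into training and test set; the test set is not touched during tuning.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# On the training set only: run K-fold CV for each candidate k.
candidate_ks = [1, 2, 5, 10]
cv_scores = {}
for k in candidate_ks:
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_train, y_train, cv=5)   # 5-fold CV
    cv_scores[k] = scores.mean()

# Train one model with the best k on the entire training set,
# then estimate performance once on the held-out test set.
best_k = max(cv_scores, key=cv_scores.get)
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
print("best k:", best_k, "test accuracy:", final_model.score(X_test, y_test))
```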

Measures for Regression

The test set is denoted \{(x_i, y_i)\}_{i=1}^{N} and f is the learned regression function.

Root Mean Squared Error (RMSE):
RMSE = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \big( y_i - f(x_i) \big)^2 }

Mean Absolute Error (MAE):
MAE = \frac{1}{N} \sum_{i=1}^{N} \big| y_i - f(x_i) \big|

Measures for Classification

Accuracy (percentage correctly labeled):
ACC = \frac{TP + TN}{N}

Precision ("when I say +, how often am I right?"):
P = \frac{TP}{TP + FP}

Recall ("of the real +, how many do I find?"):
R = \frac{TP}{TP + FN}

Here N is the total number of test instances and TP + FP + TN + FN = N:
- TP (true positives): number of test instances where true label == + and predicted label == +.
- FP (false positives): true label == -, predicted label == +.
- FN (false negatives): true label == +, predicted label == -.
- TN (true negatives): true label == -, predicted label == -.
(For a multi-class problem, P and R can be computed for each class.)

Interesting Facts about P, R

Accuracy is a simple measure and a single number; this is good. Precision and recall can be interesting when:
- The classes are highly skewed.
- You want to understand performance on individual classes, e.g., one class is more important (as in information retrieval), or there are many classes and you want to break results down per class.
- You want to understand performance as a function of how conservative the predictions are.
P and R are an interesting pair because they are in conflict (a good principle for pairs of evaluation measures).
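The measures above can be computed directly from their formulas. Below is a minimal sketch using NumPy; the toy label arrays are made up for illustration.

```python
# Minimal sketch of the evaluation measures, implemented directly from the formulas.
import numpy as np

def rmse(y, f):
    return np.sqrt(np.mean((y - f) ** 2))

def mae(y, f):
    return np.mean(np.abs(y - f))

def classification_counts(y_true, y_pred, positive=1):
    tp = np.sum((y_true == positive) & (y_pred == positive))
    fp = np.sum((y_true != positive) & (y_pred == positive))
    fn = np.sum((y_true == positive) & (y_pred != positive))
    tn = np.sum((y_true != positive) & (y_pred != positive))
    return tp, fp, fn, tn

y_true = np.array([1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 1, 0])
tp, fp, fn, tn = classification_counts(y_true, y_pred)

accuracy  = (tp + tn) / len(y_true)   # ACC = (TP + TN) / N
precision = tp / (tp + fp)            # when I say +, how often am I right?
recall    = tp / (tp + fn)            # of the real +, how many do I find?
print(accuracy, precision, recall)
```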

Debugging and Diagnostics
(Based on slides from Stephen Gould and Andrew Ng)

What do I do now?

You build a classifier (e.g., a spam filter using logistic regression) and the error is too high. What do you do to fix it? There are lots of things you could try:
- Collect more training data.
- Add different features (e.g., from the email header).
- Try fewer features (e.g., exclude rare words from the classifier).
- Try an SVM instead of logistic regression.
- Fix a bug in your stochastic gradient descent procedure.
You could do trial and error, but it is better to think in terms of diagnostics.

Bias-Variance Tradeoff

Let \theta \in \mathbb{R} be some quantity we estimate by a random variable \hat\theta. Example:
\theta = \int x \, p(x) \, dx, \qquad \hat\theta = \frac{1}{N} \sum_{i=1}^{N} x_i, \qquad x_1, \ldots, x_N \sim p.
Define the bias and variance:
\mathrm{Bias}(\hat\theta) = E(\hat\theta) - \theta, \qquad \mathrm{Var}(\hat\theta) = E\big[ (\hat\theta - \mu)^2 \big], \qquad \text{where } \mu = E(\hat\theta) = \int \cdots \int \hat\theta \, p(x_1, \ldots, x_N) \, dx_1 \cdots dx_N.
Both are averages across all data sets that we might have seen. Fun to look up: the bias-variance decomposition. These concepts are useful more generally, and they trade off.

Bias-Variance Tradeoff (continued)

(Figure from Hastie, Tibshirani, and Friedman, 2009: prediction error on a training sample and a test sample as a function of model complexity; low complexity gives high bias and low variance, high complexity gives low bias and high variance, and the test error is lowest in between.)
This is not just a cartoon; it can be used as a diagnostic. On the x-axis you could put:
- the number of features (sorted in some meaningful way),
- a model parameter that controls complexity: k in k-nearest neighbour, the number of trees in boosting or random forests, regularization parameters,
- or perhaps you have access to more complex models, e.g., naive Bayes versus an HMM.
(A sketch of this diagnostic follows below.)
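To use the complexity curve as a diagnostic in practice, one can sweep a complexity parameter and record training and test error. The following sketch assumes scikit-learn and synthetic data, and sweeps k in k-nearest neighbour; the particular values of k are arbitrary.

```python
# Minimal sketch of the model-complexity diagnostic: training vs. test error as a
# function of k in k-NN (small k = complex model, large k = simple model).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for k in [1, 3, 5, 11, 21, 51, 101]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    train_err = 1 - model.score(X_train, y_train)
    test_err = 1 - model.score(X_test, y_test)
    # Very small k: training error near zero, larger test error (high variance).
    # Very large k: training and test error both high (high bias).
    print(f"k={k:4d}  train error={train_err:.3f}  test error={test_err:.3f}")
```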

Learning Curves

Plot the error rate against the training set size, for both the training set and the test set.

Learning Curve Example 1
(Figure: training and test error rate versus training set size; both curves flatten out above the target error rate and end up close together. Q: why is the training error going up?)
- Test error is no longer decreasing.
- Even the training error is too high.
- Not much difference between training and test error.
This pattern suggests high bias.

Learning Curve Example 2
(Figure: training and test error rate versus training set size; the test error is still decreasing, and there is a big gap between it and the training error, which sits below the target error rate.)
- Test error is still decreasing.
- Big gap between training and test error.
This pattern suggests high variance. (A sketch of computing such curves follows below.)
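A learning curve like the two examples above can be computed by training on increasing subsets of the training data. The sketch below assumes scikit-learn, synthetic data, and logistic regression; the subset sizes are arbitrary.

```python
# Minimal sketch of a learning-curve diagnostic: train on growing subsets and
# record training and test error.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=30, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

for n in [50, 100, 200, 500, 1000, 2000, len(X_train)]:
    model = LogisticRegression(max_iter=1000).fit(X_train[:n], y_train[:n])
    train_err = 1 - model.score(X_train[:n], y_train[:n])
    test_err = 1 - model.score(X_test, y_test)
    # High bias: both curves flatten at a high error, close together.
    # High variance: test error still falling, big gap between the curves.
    print(f"n={n:5d}  train error={train_err:.3f}  test error={test_err:.3f}")
```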

Optimization in the Loop

Often learning methods work by optimizing some objective function. For example, recall logistic regression:
p(y = 1 \mid x) = \frac{1}{1 + \exp\{-w^\top x\}}.
To learn the weights from data points (x^{(i)}, y^{(i)}) for i in 1 ... N, we solve
\max_w L(w) = \sum_{i=1}^{N} \log p(y = y^{(i)} \mid x = x^{(i)}),
and maybe we optimise this using gradient descent. When this performs poorly, we now have two questions:
- Is my numerical optimization algorithm performing poorly?
- Or is the objective function L not doing what I want? (Simple example: spam filtering with a cost-sensitive error.)
This comes up especially often during research in data science: we often introduce new models (== new objective functions), which might be harder to optimize.

Optimization Example

Example: to optimize
\max_w L(w) = \sum_{i=1}^{N} \log p(y = y^{(i)} \mid x = x^{(i)}),
a simple choice is batch gradient descent:
\nabla_w L(w) = \sum_{i=1}^{N} \nabla_w \big[ \log p(y = y^{(i)} \mid x = x^{(i)}) \big].
This will be slow if N is big. Alternative: stochastic gradient descent. Simplest version:
- Sample i ~ Uniform({1, 2, ..., N}).
- Compute \nabla_w \log p(y = y^{(i)} \mid x = x^{(i)}) (a single instance).
- Update using this gradient.
(This is standard in deep learning, for example.)

Optimization Diagnostic

You run a logistic regression spam filter on 100,000 training instances. Using batch gradient descent, you get an accuracy of 85%. Not good enough, so you get a larger set of 100,000,000 examples. Batch gradient descent is too slow, so you switch to SGD. Now you only get 80% accuracy (??).
Diagnostic: check the batch training objective
L(w) = \sum_{i=1}^{100{,}000{,}000} \log p(y = y^{(i)} \mid x = x^{(i)}).
Compute this for the final result of batch GD, w_GD, and of SGD, w_SGD. If L(w_SGD) \le L(w_GD), then your SGD procedure is screwed up (maybe try a different step size?). This kind of thing happens far more generally. (A sketch of this diagnostic follows below.)

The Numerical Gradient Check

Often optimization packages require you to implement functions for both w \mapsto L(w) and w \mapsto \nabla_w L(w) (although automatic differentiation is becoming more popular). It is easy to have a bug in one function but not the other. In that case, check whether
\frac{\partial L}{\partial w_j}(w) \approx \frac{1}{\epsilon} \big( L(w + \epsilon e_j) - L(w) \big)
for each coordinate j, and do this for different settings of w. MATLAB does this automatically if you ask it to. (A sketch of this check also follows below.)

Nested Models

Often complicated models contain simpler models as a special case. For logistic regression,
p(y = 1 \mid x) = \frac{1}{1 + \exp\{-w^\top x\}},
so if w = 0, the distribution over y is uniform. Is that what happens in your code? If not, there is a bug.
Another example: a hidden Markov model and a mixture model,
p(x, z) = \prod_t p(x_t \mid z_t) \, p(z_t \mid z_{t-1});
if p(z_t \mid z_{t-1}) ignores z_{t-1}, then all the x_t are independent ... a mixture model. There are lots of ways to get diagnostics from this:
- The training error of the HMM should be strictly better.
- Force your HMM code to fit the observation distributions only. Do you get the same distributions as the mixture model?
Logistic regression: for the numerical gradient check, try it first at w = 0. It will be easier to debug there.
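The optimization diagnostic above (comparing the batch objective at the GD and SGD solutions) can be reproduced on synthetic data. The sketch below implements logistic regression with batch gradient ascent and single-instance SGD from scratch in NumPy; the step sizes and iteration counts are arbitrary choices, not values from the slides.

```python
# Minimal sketch of the optimization diagnostic: fit by batch GD and by SGD,
# then compare the batch training objective L(w) at the two solutions.
import numpy as np

rng = np.random.default_rng(0)
N, D = 5000, 10
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = (rng.random(N) < 1 / (1 + np.exp(-X @ w_true))).astype(float)

def log_likelihood(w):
    # L(w) = sum_i log p(y^(i) | x^(i)) for logistic regression, with y in {0, 1}
    z = X @ w
    return np.sum(y * z - np.log1p(np.exp(z)))

def gradient(w, idx):
    # gradient of the log-likelihood restricted to the instances in idx
    z = X[idx] @ w
    return X[idx].T @ (y[idx] - 1 / (1 + np.exp(-z)))

# Batch gradient ascent on L (i.e., gradient descent on -L).
w_gd = np.zeros(D)
for _ in range(200):
    w_gd += 0.1 / N * gradient(w_gd, np.arange(N))

# Stochastic gradient ascent, single instance per update.
w_sgd = np.zeros(D)
for _ in range(200 * 50):
    i = rng.integers(N)
    w_sgd += 0.01 * gradient(w_sgd, np.array([i]))

# The diagnostic: if L(w_SGD) <= L(w_GD), the SGD procedure is suspect.
print("L(w_GD) =", log_likelihood(w_gd), " L(w_SGD) =", log_likelihood(w_sgd))
```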

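The numerical gradient check can likewise be written in a few lines. The sketch below uses a central-difference approximation (the one-sided version from the slides works too) on a stand-in quadratic objective; substitute your own L and grad_L.

```python
# Minimal sketch of the numerical gradient check: compare a hand-coded gradient
# against a finite-difference approximation at several settings of w.
import numpy as np

def L(w):                      # example objective (a stand-in for the log-likelihood)
    return -0.5 * np.sum(w ** 2) + np.sum(w)

def grad_L(w):                 # its hand-coded gradient
    return -w + 1.0

def gradient_check(L, grad_L, w, eps=1e-5):
    analytic = grad_L(w)
    numeric = np.zeros_like(w)
    for j in range(len(w)):
        e_j = np.zeros_like(w)
        e_j[j] = eps
        numeric[j] = (L(w + e_j) - L(w - e_j)) / (2 * eps)  # central difference
    return np.max(np.abs(analytic - numeric))               # should be tiny

# Do this for different settings of w; try w = 0 first, it is easier to debug there.
for w in [np.zeros(5), np.ones(5), np.random.default_rng(0).normal(size=5)]:
    print(gradient_check(L, grad_L, w))
```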
Pipelines of Predictions

Practical systems use predictors at multiple points, e.g., finding company mergers from newswire text:
- Split the article into sentences.
- Add part-of-speech tags.
- Recognize company names.
- Syntactic parsing.
- Classify the merger relationship.
Many steps rely on learning and will make errors. Is one step a weak link, or are errors slowly propagating? Debug by replacing intermediate predictions with the gold standard (human annotations); a sketch is given at the end of this section.

Overall Advice

For practical work: try quick and dirty first, and iterate quickly.
Different diagnostics:
- Learning curves, as a function of the size of the training set, as a function of model complexity, and additionally as a function of the number of iterations of the learning algorithm.
- Optimization diagnostics.
- Diagnostics using model nesting.
- Breaking chains of predictions.
Sometimes diagnostics require a bit of ingenuity.
Trust no one: just because something is true in the maths doesn't mean it is true in your code. Imagine how you think the method is probably behaving and check whether that happens. (This holds for research too.)
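The chain-breaking diagnostic can be illustrated with a toy two-stage pipeline. Everything in the sketch below (the stages, the documents, the gold annotations) is a hypothetical placeholder; the point is only the pattern of swapping a learned stage's output for the gold standard and comparing end-task accuracy.

```python
# Minimal sketch of debugging a prediction pipeline by replacing one stage's output
# with gold-standard annotations and seeing how the end-task accuracy changes.
def noisy_tagger(sentence):        # stage 1: a pretend POS tagger that sometimes errs
    return ["NOUN" if w[0].isupper() else "VERB" for w in sentence.split()]

def relation_classifier(tags):     # stage 2: end-task predictor built on stage 1 output
    return "merger" if tags.count("NOUN") >= 2 else "no-merger"

documents   = ["Acme buys Widgets", "initech merges with globex"]
gold_tags   = [["NOUN", "VERB", "NOUN"], ["NOUN", "VERB", "ADP", "NOUN"]]
gold_labels = ["merger", "merger"]

def accuracy(preds, gold):
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)

# End-to-end: both learned stages in the chain.
end_to_end = [relation_classifier(noisy_tagger(d)) for d in documents]
# Chain broken: stage 1 replaced by gold annotations.
with_gold_tags = [relation_classifier(t) for t in gold_tags]

print("end-to-end accuracy:    ", accuracy(end_to_end, gold_labels))
print("accuracy with gold tags:", accuracy(with_gold_tags, gold_labels))
# If the second number is much higher, the tagger is the weak link;
# if not, the errors come from the relation classifier itself.
```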