Notes and Announcements

Transcription:

Notes and Announcements. Midterm exam: Oct 20, Wednesday, in class. Late homeworks: turn in hardcopies to Michelle. DO NOT ask Michelle for extensions. Note down the date and time of submission. If submitting a softcopy, email it to the 10-701 instructors list. Software needs to be submitted via Blackboard. HW2 is out today; watch your email.

Projects. Hands-on experience with machine learning algorithms: understand when they work and fail, develop new ones! Project ideas are online; discuss with the TAs, and note that every project must have a TA mentor. Proposal (10%): Oct 11. Mid-term report (25%): Nov 8. Poster presentation (20%): Dec 2, 3-6 pm, NSH Atrium. Final project report (45%): Dec 6.

Project Proposal. Proposal (10%): Oct 11, 1 page maximum. Describe the data set, the project idea (approx. two paragraphs), the software you will need to write, and 1-3 relevant papers (read at least one before submitting your proposal). Teammate: maximum team size is 2; describe the division of work. What is the project milestone for the mid-term report? Include experimental results.

Recitation tomorrow! Linear & non-linear regression, nonparametric methods. Strongly recommended!! Place: NSH 1507 (note the room). Time: 5-6 pm. TK.

Non-parametric methods: kernel density estimate, kNN classifier, kernel regression. Aarti Singh, Machine Learning 10-701/15-781, Sept 29, 2010.

Parametric methods. Assume some functional form (Gaussian, Bernoulli, Multinomial, logistic, linear) for P(X_i | Y) and P(Y), as in Naïve Bayes, or for P(Y | X), as in logistic regression. Estimate the parameters (μ, σ², θ, w, b) using MLE/MAP and plug them in. Pro: need few data points to learn the parameters. Con: strong distributional assumptions, often not satisfied in practice.

Example. [Figure: hand-written digit images projected as points in a two-dimensional (nonlinear) feature space, each point labeled by its digit class.]

Non-parametric methods. Typically don't make any distributional assumptions. As we have more data, we should be able to learn more complex models: let the number of parameters scale with the number of training data points. Today, we will see some nonparametric methods for density estimation, classification, and regression.

Histogram density estimate. Partition the feature space into distinct bins with widths Δ_i and count the number of observations, n_i, in each bin. Often, the same width is used for all bins, Δ_i = Δ. Δ acts as a smoothing parameter. (Image source: Bishop's book.)
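For reference, the estimator this slide refers to is presumably the standard histogram density estimate (as in Bishop): for x falling in bin i,

$$\hat{p}(x) = \frac{n_i}{n\,\Delta_i},$$

which integrates to 1 over the feature space.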

Effect of histogram bin width. # bins = 1/Δ. Bias of the histogram density estimate: assuming the density is roughly constant in each bin (holds true if Δ is small).

Bias-variance tradeoff: choice of # bins. # bins = 1/Δ. Small Δ means p(x) is approximately constant per bin; large Δ means more data per bin, hence a stable estimate. Bias: how close the mean of the estimate is to the truth. Variance: how much the estimate varies around its mean. Small Δ, large # bins: small bias, large variance. Large Δ, small # bins: large bias, small variance. This is the bias-variance tradeoff.

MSE = Bias² + Variance. Choice of # bins: # bins = 1/Δ. For fixed n, as Δ decreases, n_i decreases. (Image sources: Bishop's book; Larry's book.)
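Written out, the decomposition behind this slide is the standard pointwise identity (not specific to histograms):

$$\mathrm{MSE}(x) = \mathbb{E}\big[(\hat{p}_n(x) - p(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{p}_n(x)] - p(x)\big)^2}_{\text{bias}^2} + \underbrace{\mathbb{E}\big[(\hat{p}_n(x) - \mathbb{E}[\hat{p}_n(x)])^2\big]}_{\text{variance}}.$$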

Histogram as MLE. Class of density estimates: constant on each bin. Parameters: p_j, the density in bin j. Note that the p_j must satisfy Σ_j p_j Δ_j = 1, since the density integrates to 1. Maximize the likelihood of the data under the probability model with parameters p_j. Show that the histogram density estimate is the MLE under this model (HW/Recitation).

Kernel density estimate. Histogram: blocky estimate. Kernel density estimate, aka Parzen/moving window method: smooth estimate. [Figures: histogram estimate vs. kernel density estimate of the same data.]

Kernel density estimate, more generally. [Figure: boxcar kernel on [-1, 1].]
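The general form the slide refers to is presumably the standard kernel density estimate with bandwidth h (shown here for one-dimensional data):

$$\hat{p}_n(x) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{h}\, K\!\left(\frac{x - X_i}{h}\right),$$

where the boxcar kernel $K(u) = \tfrac{1}{2}\,\mathbf{1}[|u| \le 1]$ recovers the moving-window estimate.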

Kernel density estimation. Place small "bumps" at each data point, determined by the kernel function. The estimator consists of a (normalized) "sum of bumps". (Image source: Wikipedia: Gaussian bumps (red) around six data points and their sum (blue).) Note that where the points are denser, the density estimate has higher values.
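A minimal numerical sketch of this construction, assuming a Gaussian kernel; the data points and bandwidth below are made up for illustration and are not from the slide:

```python
import numpy as np

def gaussian_kernel(u):
    # Standard normal density: the "bump" placed at each data point
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x_grid, data, h):
    # One bump per data point, scaled by 1/h; the estimate is their average
    bumps = gaussian_kernel((x_grid[:, None] - data[None, :]) / h) / h
    return bumps.mean(axis=1)

# Illustrative data (six points) and bandwidth, chosen arbitrarily
data = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2])
x_grid = np.linspace(-6.0, 10.0, 400)
estimate = kde(x_grid, data, h=1.0)
print(estimate.sum() * (x_grid[1] - x_grid[0]))  # approx. 1: a valid density
```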

Kernels. Any function K that satisfies K(u) ≥ 0 and ∫ K(u) du = 1 can be used as a kernel.

Kernels. Finite support: only need local points to compute the estimate. Infinite support: need all points to compute the estimate, but quite popular since the resulting estimate is smoother (more in 10-702).

Choice of kernel bandwidth. [Figures: the "Bart-Simpson" density estimated with a bandwidth that is too small, just right, and too large. Image source: Larry's book, All of Nonparametric Statistics.]

Histograms vs. kernel density estimation: Δ and h, respectively, act as the smoother.

Bias-variance tradeoff: simulations.

k-NN (nearest neighbor) density estimation. Histogram / kernel density estimate: fix Δ, estimate the number of points within Δ of x (n_i or n_x) from the data. k-NN density estimate: fix n_x = k, estimate Δ from the data (the volume of the ball around x that contains k training points).
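In formula form, a sketch of the standard k-NN density estimate, where V_k(x) denotes the volume of the smallest ball around x containing k training points:

$$\hat{p}_n(x) = \frac{k}{n\,V_k(x)}.$$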

k-NN density estimation. Not very popular for density estimation: expensive to compute, bad estimates. But a related version for classification is quite popular. k acts as the smoother.

From Density Estimation to Classification

k-NN classifier. [Figure: training documents in feature space, labeled Sports, Science, Arts.]

k-NN classifier. [Figure: a test document among training documents labeled Sports, Science, Arts.]

k-NN classifier (k=4). [Figure: a test document and its 4 nearest neighbors within distance D_{k,x}, among training documents labeled Sports, Science, Arts.] What should we predict? Average? Majority? Why?

k-NN classifier. Optimal classifier: predict argmax_y P(X = x | Y = y) P(Y = y). k-NN classifier (majority vote): estimate P(X = x | Y = y) using (# training points of class y that lie within the D_k ball) / (# total training points of class y), and P(Y = y) by the class fraction; combining the two reduces to a majority vote among the k nearest neighbors.
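A minimal sketch of the majority-vote rule in code, assuming a Euclidean distance metric; the tiny training set and labels below are made up for illustration:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3):
    # Distance from the test point to every training point (Euclidean)
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k nearest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority vote over their class labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Illustrative 2-D training data with class labels
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.1], [0.9, 1.0]])
y_train = np.array(["Sports", "Sports", "Arts", "Arts"])
print(knn_predict(np.array([0.95, 1.05]), X_train, y_train, k=3))  # -> Arts
```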

1-Nearest Neighbor (kNN) classifier. [Figure: Sports, Science, Arts.]

2-Nearest Neighbor (kNN) classifier. Even values of k are not used in practice (votes can tie). [Figure: Sports, Science, Arts.]

3-Nearest Neighbor (kNN) classifier. [Figure: Sports, Science, Arts.]

5-Nearest Neighbor (kNN) classifier. [Figure: Sports, Science, Arts.]

What is the best k? Bias-variance tradeoff. Larger k => the predicted label is more stable. Smaller k => the predicted label is more accurate (lower bias). Similar to density estimation. Choice of k: in the next class.

1-NN classifier decision boundary: the Voronoi diagram (k = 1).

k-NN classifier decision boundary. k acts as a smoother (bias-variance tradeoff). Guarantee: as n → ∞, the error rate of the 1-nearest-neighbour classifier is never more than twice the optimal (Bayes) error.

Case study: kNN for web classification. Dataset: 20 Newsgroups (20 classes), download: http://people.csail.mit.edu/jrennie/20newsgroups/. 61,118 words, 18,774 documents. [Table: class labels and descriptions.]

Experimental setup. Training/test sets: 50%-50% random split; 10 runs, report average results. Evaluation criterion: classification accuracy.

Results: binary classes. [Figure: accuracy vs. k for alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, and comp.windows.x vs. rec.motorcycles.]

From Classification to Regression

Temperature sensing. What is the temperature in the room? At location x? [Figure: sensor readings; global average vs. local average around x.]

Kernel regression (bandwidth h). Aka local regression, the Nadaraya-Watson kernel estimator: a weighted average of the training responses, where each training point is weighted based on its distance to the test point. A boxcar kernel yields the local average.
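A minimal sketch of the Nadaraya-Watson estimator, $\hat{f}(x) = \sum_i K\!\big(\tfrac{x - x_i}{h}\big)\, y_i \,/\, \sum_i K\!\big(\tfrac{x - x_i}{h}\big)$, assuming a Gaussian kernel; the training data and bandwidth are made up for illustration:

```python
import numpy as np

def gaussian_kernel(u):
    return np.exp(-0.5 * u**2)

def nadaraya_watson(x, x_train, y_train, h=0.5):
    # Weight each training point by its distance to the query point x
    w = gaussian_kernel((x - x_train) / h)
    # Weighted average of the training responses
    return np.sum(w * y_train) / np.sum(w)

# Illustrative noisy training data from y = sin(x)
rng = np.random.default_rng(0)
x_train = np.linspace(0.0, 10.0, 50)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(50)
print(nadaraya_watson(5.0, x_train, y_train, h=0.5))  # close to sin(5.0)
```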

Kernels. [Figure: examples of kernel functions.]

Choice of kernel bandwidth h. [Figures: h=1 too small, h=10 too small, h=50 just right, h=200 too large. Image source: Larry's book, All of Nonparametric Statistics.] The choice of kernel is not that important.

Kernel regression as weighted least squares. Kernel regression corresponds to the locally constant estimator obtained from (locally) weighted least squares, i.e., set f(x_i) = b (a constant).

Kernel regression as weighted least squares (continued). Set f(x_i) = b (a constant); notice that minimizing the weighted squared error over b recovers the Nadaraya-Watson estimator.
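A sketch of the calculation behind "notice that": minimizing the locally weighted least squares objective over the constant b gives

$$\hat{b}(x) = \arg\min_{b} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)(Y_i - b)^2 = \frac{\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right) Y_i}{\sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)},$$

which is exactly the Nadaraya-Watson estimator (set the derivative with respect to b to zero).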

Local linear/polynomial regression. Local polynomial regression corresponds to the locally polynomial estimator obtained from (locally) weighted least squares, i.e., set f(x_i) to a local polynomial of degree p around the query point x. More in the HW and in 10-702 (statistical machine learning).
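For concreteness, a sketch of the locally weighted least squares objective for a local polynomial of degree p around the query point x (local linear regression is the case p = 1):

$$\hat{\beta}(x) = \arg\min_{\beta_0,\dots,\beta_p} \sum_{i=1}^{n} K\!\left(\frac{x - X_i}{h}\right)\left(Y_i - \sum_{j=0}^{p} \beta_j (X_i - x)^j\right)^{2}, \qquad \hat{f}(x) = \hat{\beta}_0(x).$$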

Summary: instance-based/non-parametric approaches. Four things make a memory-based learner: 1. A distance metric, dist(x, x_i): Euclidean (and many more). 2. How many nearby neighbors / what radius to look at: k, Δ/h. 3. A weighting function (optional): W based on a kernel K. 4. How to fit with the local points: average, majority vote, weighted average, polynomial fit.

Summary: parametric vs. nonparametric approaches. Nonparametric models place very mild assumptions on the data distribution and provide good models for complex data; parametric models rely on very strong (simplistic) distributional assumptions. Nonparametric models (other than histograms) require storing and computing with the entire data set; parametric models, once fitted, are much more efficient in terms of storage and computation.

What you should know. Histograms and kernel density estimation: effect of bin width / kernel bandwidth, bias-variance tradeoff. k-NN classifier: nonlinear decision boundaries. Kernel (local) regression: interpretation as weighted least squares; local constant/linear/polynomial regression.