Machine Learning. Classification


10-701 Machine Learning Classification

Where we are: a density estimator takes inputs and outputs a probability; a classifier takes inputs and predicts a category (today's topic); a regressor takes inputs and predicts a real number (covered later).

Classification. Assume we want to teach a computer to distinguish between cats and dogs.

Bayes decision rule. Let x be the input feature set and y the label. If we know the conditional probability p(x|y) and the class priors p(y), we can determine the appropriate class by using Bayes rule:

q_i(x) := P(y = i | x) = P(x | y = i) P(y = i) / P(x)

We can use q_i(x) to select the appropriate class: we choose class 0 if q_0(x) ≥ q_1(x) and class 1 otherwise. This is termed the Bayes decision rule and leads to optimal classification: it minimizes our probability of making a mistake. However, it is often very hard to compute. Note that P(x) does not affect our decision.

Bayes decision rule. Recall q_i(x) := P(y = i | x) = P(x | y = i) P(y = i) / P(x). We can also use the resulting probabilities to determine our confidence in the class assignment by looking at the ratio

L(x) = q_0(x) / q_1(x)

also known as the likelihood ratio; we will talk more about this later.

Bayes decision rule: example with normal (Gaussian) class-conditional densities over features x_1 and x_2. [Figure panels omitted.]
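
As a concrete illustration of the rule above, here is a minimal sketch in Python that applies the Bayes decision rule to two 1-D Gaussian class-conditional densities. The means, standard deviations, and priors are made-up values for illustration, not taken from the slides.

# Minimal sketch of the Bayes decision rule for two classes with
# (hypothetical) 1-D Gaussian class-conditional densities p(x|y).
import math

# Assumed, illustrative parameters (not from the lecture):
params = {0: {"mean": 0.0, "std": 1.0, "prior": 0.5},
          1: {"mean": 2.0, "std": 1.0, "prior": 0.5}}

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

def posterior(x, y):
    # q_i(x) = p(x|y=i) p(y=i) / p(x)
    joint = {i: gaussian_pdf(x, p["mean"], p["std"]) * p["prior"] for i, p in params.items()}
    return joint[y] / sum(joint.values())

def bayes_decision(x):
    # Choose class 0 if q_0(x) >= q_1(x), class 1 otherwise.
    return 0 if posterior(x, 0) >= posterior(x, 1) else 1

print(bayes_decision(0.3), posterior(0.3, 0))   # chosen class and its posterior (confidence)
print(posterior(0.3, 0) / posterior(0.3, 1))    # the ratio L(x) = q_0(x) / q_1(x)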

Bayes error. For the Bayes decision rule we can calculate the probability of an error: the probability that we assign a sample to the wrong class, also known as the risk. Writing p_i(x) = p(x | y = i), the two posteriors are p_1(x) P(y = 1) / p(x) and p_0(x) P(y = 0) / p(x), and the risk for sample x is

R(x) = min{ p_1(x) P(y = 1), p_0(x) P(y = 0) } / p(x)

[Figure: the two scaled class densities, marking the x values for which we will make errors.] Risk can be used to determine a reject region.

Bayes error. The probability that we assign a sample to the wrong class is known as the risk. The risk for sample x is

R(x) = min{ p_1(x) P(y = 1), p_0(x) P(y = 0) } / p(x)

We can also compute the expected risk (the risk over the entire range of values of x), where L_1 is the region in which we assign instances to class 1 and L_0 the region in which we assign them to class 0:

E[R(x)] = ∫ R(x) p(x) dx = ∫ min{ p_1(x) P(y = 1), p_0(x) P(y = 0) } dx = P(y = 0) ∫_{L_1} p_0(x) dx + P(y = 1) ∫_{L_0} p_1(x) dx

Loss function. The risk value we computed assumes that both errors (assigning instances of class 1 to class 0 and vice versa) are equally harmful. However, this is not always the case. Why? (For example, in medical diagnosis a missed disease is usually far more costly than a false alarm.) In general our goal is to minimize the expected loss, defined by a loss function L_{0,1}, the penalty we pay when assigning instances of class 0 to class 1 (and similarly L_{1,0} for the opposite error):

E[L] = L_{0,1} P(y = 0) ∫_{L_1} p_0(x) dx + L_{1,0} P(y = 1) ∫_{L_0} p_1(x) dx
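
To make the expected-loss formula concrete, here is a minimal numerical sketch. It reuses the same hypothetical Gaussian class-conditionals as the earlier sketch and made-up loss weights, and approximates E[L] on a grid; the loss-aware decision rule at the end is the standard minimizer of this expression, not something stated explicitly on the slide.

# Numerically approximate the expected loss E[L] on a grid, assuming
# hypothetical 1-D Gaussian class-conditionals and illustrative loss weights.
import math

def gaussian_pdf(x, mean, std):
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

p0 = lambda x: gaussian_pdf(x, 0.0, 1.0)   # p(x | y = 0), assumed
p1 = lambda x: gaussian_pdf(x, 2.0, 1.0)   # p(x | y = 1), assumed
prior0, prior1 = 0.5, 0.5                  # class priors, assumed
L01, L10 = 1.0, 5.0                        # loss weights: L10 > L01 makes missing class 1 costlier

def expected_loss(decide, lo=-10.0, hi=12.0, step=0.001):
    total = 0.0
    x = lo
    while x < hi:
        if decide(x) == 1:                 # x falls in region L_1: we pay L01 if the true class is 0
            total += L01 * prior0 * p0(x) * step
        else:                              # x falls in region L_0: we pay L10 if the true class is 1
            total += L10 * prior1 * p1(x) * step
        x += step
    return total

# Loss-aware rule: pick class 1 when its loss-weighted posterior mass is larger.
decide = lambda x: 1 if L10 * prior1 * p1(x) >= L01 * prior0 * p0(x) else 0
print(expected_loss(decide))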

Types of classifiers. We can divide the large variety of classification approaches into roughly three main types:
1. Instance-based classifiers: use the observations directly (no models), e.g. K nearest neighbors.
2. Generative: build a generative statistical model, e.g. Naïve Bayes.
3. Discriminative: directly estimate a decision rule/boundary, e.g. decision trees.

Classification. Assume we want to teach a computer to distinguish between cats and dogs. Several steps:
1. Feature transformation
2. Model / classifier specification
3. Model / classifier estimation (with regularization)
4. Feature selection

Classification. Assume we want to teach a computer to distinguish between cats and dogs. Several steps:
1. Feature transformation
2. Model / classifier specification
3. Model / classifier estimation (with regularization)
4. Feature selection
How do we encode the picture? A collection of pixels? Do we use the entire image or a subset?

Classification. Assume we want to teach a computer to distinguish between cats and dogs. Several steps:
1. Feature transformation
2. Model / classifier specification
3. Model / classifier estimation (with regularization)
4. Feature selection
What type of classifier should we use?

Classification. Assume we want to teach a computer to distinguish between cats and dogs. Several steps:
1. Feature transformation
2. Model / classifier specification
3. Model / classifier estimation (with regularization)
4. Feature selection
How do we learn the parameters of our classifier? Do we have enough examples to learn a good model?

Classification. Assume we want to teach a computer to distinguish between cats and dogs. Several steps:
1. Feature transformation
2. Model / classifier specification
3. Model / classifier estimation (with regularization)
4. Feature selection
Do we really need all the features? Can we use a smaller number and still achieve the same (or better) results?

Supervised learning. Classification is one of the key components of supervised learning. Unlike other learning paradigms, in supervised learning the teacher (us) provides the algorithm with the solutions (labels) for some of the instances, and the goal is to generalize so that the resulting model / method can be used to determine the labels of unobserved samples. [Diagram: the teacher supplies labeled pairs (X, Y); the classifier learns parameters w_1, w_2 and maps new inputs X to labels Y.]

Types of classifiers. We can divide the large variety of classification approaches into roughly three main types:
1. Instance-based classifiers: use the observations directly (no models), e.g. K nearest neighbors.
2. Generative: build a generative statistical model, e.g. Bayesian networks.
3. Discriminative: directly estimate a decision rule/boundary, e.g. decision trees.

K nearest neighbors

K nearest neighbors (KNN). A simple, yet surprisingly efficient algorithm. Requires the definition of a distance function or similarity measure between samples. Select the class based on the majority vote among the k closest points.

K nearest neighbors (KNN). We need to determine an appropriate value for k. What happens if we choose k=1? What if k=3?

K nearest neighbors (KNN). The choice of k influences the smoothness of the resulting classifier. In that sense it is similar to kernel methods (discussed later in the course). However, the smoothness of the function is determined by the actual distribution of the data, p(x), and not by a predefined parameter.

The effect of increasing k

The effect of increasing k. We will be using Euclidean distance to determine the k nearest neighbors:

d(x, x') = sqrt( Σ_i (x_i − x_i')^2 )
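
A minimal sketch (in Python, with hypothetical data) of the neighbor-retrieval step: compute the Euclidean distance from a query point to every training sample and keep the labels of the k closest. The training points and labels are made up for illustration.

# Minimal neighbor-retrieval sketch: Euclidean distance to every training
# point, then keep the k smallest. Data below is hypothetical.
import math

train_X = [(1.0, 2.0), (2.0, 1.0), (4.0, 4.0), (5.0, 3.0)]   # assumed feature vectors
train_y = ["Red", "Red", "Blue", "Green"]                     # assumed labels

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(query, k):
    # Return the labels of the k training points closest to the query.
    order = sorted(range(len(train_X)), key=lambda i: euclidean(query, train_X[i]))
    return [train_y[i] for i in order[:k]]

print(k_nearest((3.0, 3.0), k=3))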

KNN with k=1

KNN with k=3. Ties are broken using the order: Red, Green, Blue.

KNN with k=5. Ties are broken using the order: Red, Green, Blue.
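
Following the tie-breaking convention above (Red before Green before Blue), here is a minimal sketch of the voting step. It takes the neighbor labels produced by the retrieval sketch earlier; the example inputs are hypothetical, and this is just one way to implement the stated rule.

# Majority vote over neighbor labels, breaking ties by the fixed
# priority Red > Green > Blue (as stated on the slide).
from collections import Counter

TIE_ORDER = ["Red", "Green", "Blue"]

def vote(neighbor_labels):
    counts = Counter(neighbor_labels)
    best = max(counts.values())
    tied = [label for label, c in counts.items() if c == best]
    # Among tied classes, pick the one that comes first in TIE_ORDER.
    return min(tied, key=TIE_ORDER.index)

print(vote(["Red", "Blue", "Blue"]))                    # Blue wins outright
print(vote(["Red", "Green", "Blue", "Red", "Green"]))   # Red vs Green tie -> Red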

Comparison of different k's. [Figure panels: K = 1, K = 3, K = 5.]

A probabilistic interpretation of KNN. The decision rule of KNN can be viewed using a probabilistic interpretation: what KNN is trying to do is approximate the Bayes decision rule on a subset of the data. To do that we need to compute certain quantities, including the conditional probability of the data given the class, p(x|y), the prior probability of each class, p(y), and the marginal probability of the data, p(x). These quantities are computed for some small region around our sample, and the size of that region depends on the distribution of the test samples.* (*Remember this idea; we will return to it when discussing kernel functions.)

Computing probabilities for KNN. Let V be the volume of the m-dimensional ball around z containing the k nearest neighbors of z (where m is the number of features), and define:
- z: new data point to classify
- V: volume of the selected ball
- P: probability that a random point falls in V
- N: total number of samples
- K: number of nearest neighbors
- N_1: total number of samples from class 1
- K_1: number of samples from class 1 among the K neighbors

Then we can write p(z) V ≈ P = K / N, so p(z) ≈ K / (N V); similarly p(z | y = 1) ≈ K_1 / (N_1 V) and p(y = 1) = N_1 / N. Using Bayes rule we get:

p(y = 1 | z) = p(z | y = 1) p(y = 1) / p(z) = (K_1 / (N_1 V)) · (N_1 / N) / (K / (N V)) = K_1 / K

Computing probabilities for KNN. Using Bayes rule (with N, V, K, N_1, K_1 defined as above) we get:

p(y = 1 | z) = p(z | y = 1) p(y = 1) / p(z) = K_1 / K

Using the Bayes decision rule we choose the class with the highest posterior probability, which in this case is simply the class with the highest number of samples among the K nearest neighbors.
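
A minimal sketch of these estimates on hypothetical 2-D data with k = 3: it takes the ball radius from the k-th nearest neighbor, forms p(z) ≈ K/(NV), p(z|y=1) ≈ K_1/(N_1 V), and p(y=1) = N_1/N, and checks that the resulting posterior equals K_1/K. The data points are made up for illustration.

# Sketch of the KNN density/posterior estimates, assuming hypothetical 2-D data.
import math

train = [((1.0, 2.0), 1), ((2.0, 1.0), 1), ((4.0, 4.0), 0), ((5.0, 3.0), 0), ((3.0, 2.5), 1)]
z, k = (3.0, 3.0), 3

dists = sorted((math.dist(z, x), y) for x, y in train)
radius = dists[k - 1][0]                      # distance to the k-th nearest neighbor
V = math.pi * radius ** 2                     # volume (area) of the 2-D ball
N = len(train)
N1 = sum(1 for _, y in train if y == 1)
K1 = sum(1 for _, y in dists[:k] if y == 1)

p_z = k / (N * V)                             # p(z)       ~ K / (N V)
p_z_given_1 = K1 / (N1 * V)                   # p(z|y=1)   ~ K_1 / (N_1 V)
prior1 = N1 / N                               # p(y=1)     =  N_1 / N
posterior1 = p_z_given_1 * prior1 / p_z       # Bayes rule
print(posterior1, K1 / k)                     # both equal K_1 / K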

Important points
- Optimal decision using Bayes rule
- Types of classifiers
- The effect of the value of k on KNN classifiers
- The probabilistic interpretation of KNN