Machine Learning 10-701, Fall 2016. Nonparametric Methods for Classification. Eric Xing. Lecture 2, September 12, 2016. Reading: 1

Classification. Representing data; hypothesis (classifier). 2

Clustering 3

Supervised vs. Unsupervised Learning 4

Univariate prediction without using a model: good or bad? Nonparametric classifiers (instance-based learning): nonparametric density estimation, the K-nearest-neighbor classifier, and the optimality of knn. Spectral clustering: clustering, graph partition and normalized cut, and the spectral clustering algorithm. Very little learning is involved in these methods, but they are indeed among the most popular and powerful machine learning methods. 5

Decision-making as dividing a high-dimensional space. Class-specific distribution: $p(X \mid Y)$. Class prior (i.e., "weight"): $p(Y)$. For two classes: $p(X \mid Y_1) = p_1(X; \mu_1, \Sigma_1)$ and $p(X \mid Y_2) = p_2(X; \mu_2, \Sigma_2)$. 6

The Bayes Decision Rule for Minimum Error. The a posteriori probability of a sample: $q_i(X) = P(Y=i \mid X) = \frac{p(X \mid Y=i)\,\pi_i}{\sum_j p(X \mid Y=j)\,\pi_j}$, where $\pi_i = P(Y=i)$. Bayes test: decide class 1 if $q_1(X) > q_2(X)$, and class 2 otherwise. Likelihood ratio: $\ell(X) = \frac{p(X \mid Y=1)}{p(X \mid Y=2)} \gtrless \frac{\pi_2}{\pi_1}$. Discriminant function: $h(X) = -\log \ell(X)$. 7

Example of Decision Rules. When each class-conditional density is a normal distribution, we can write the decision boundary analytically in some cases (homework!). 8
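
To make the rule above concrete, here is a minimal sketch (not from the slides) of the Bayes decision rule for two classes with 1-D Gaussian class-conditional densities; the function names, priors, means, and variances are made-up illustration values.

from math import exp, pi, sqrt

def gaussian_pdf(x, mu, sigma):
    # 1-D normal density p(x; mu, sigma)
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * sqrt(2 * pi))

def bayes_decide(x, classes):
    # classes: list of (prior, mu, sigma) per class. Picking the argmax of
    # prior * likelihood is equivalent to comparing posteriors q_i(x),
    # since the evidence p(x) is common to all classes.
    scores = [prior * gaussian_pdf(x, mu, sigma) for prior, mu, sigma in classes]
    return max(range(len(scores)), key=lambda i: scores[i])

# Hypothetical two-class problem: class 0 ~ N(0, 1) with prior 0.6, class 1 ~ N(2, 1) with prior 0.4.
classes = [(0.6, 0.0, 1.0), (0.4, 2.0, 1.0)]
print(bayes_decide(0.5, classes))  # -> 0
print(bayes_decide(1.8, classes))  # -> 1

With equal priors and equal variances, the boundary reduces to the midpoint between the two means, which is one of the cases that can be written analytically.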

Bayes Error. We must calculate the probability of error: the probability that a sample is assigned to the wrong class. Given a datum $X$, the risk is $r(X) = \min[q_1(X), q_2(X)]$. The Bayes error (the expected risk): $\varepsilon = E[r(X)] = \int \min[q_1(X), q_2(X)]\, p(X)\, dX$. 9

More on Bayes Error. The Bayes error is a lower bound on the probability of classification error; the Bayes classifier is the theoretically best classifier, in that it minimizes the probability of classification error. Computing the Bayes error is in general a very complex problem. Why? It requires density estimation and then integrating the density function over the decision regions. 10

Learning a Classifier. The decision rule: $y = h(X)$. Learning strategies: generative learning, discriminative learning, and instance-based learning (store all past experience in memory). Instance-based learning is a special case of nonparametric classification; in the K-nearest-neighbor classifier, $h(X)$ is represented by ALL the data, together with an algorithm. 11

Recall: Vector Space Representation. Each document is a vector, one component for each term (= word).
         Doc 1   Doc 2   Doc 3   ...
Word 1     3       0       0     ...
Word 2     0       8       1     ...
Word 3    12       1      10     ...
...        0       1       3     ...
...        0       0       0     ...
Normalize to unit length. High-dimensional vector space: terms are axes (10,000+ dimensions, or even 100,000+); docs are vectors in this space. 12
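
As a small illustration of this representation (a sketch, not part of the slides), the following builds term-count vectors for toy documents over a hypothetical four-word vocabulary and normalizes them to unit length:

import numpy as np

vocab = ["ball", "team", "atom", "paint"]          # hypothetical vocabulary
docs = {
    "Doc 1": {"ball": 3, "team": 12},              # made-up term counts
    "Doc 2": {"atom": 8, "team": 1},
    "Doc 3": {"paint": 1, "atom": 10, "team": 3},
}

vectors = {}
for name, counts in docs.items():
    v = np.array([counts.get(w, 0) for w in vocab], dtype=float)
    vectors[name] = v / np.linalg.norm(v)          # normalize to unit length

# On unit-length vectors, Euclidean distance and cosine similarity induce the same ranking.
d12 = np.linalg.norm(vectors["Doc 1"] - vectors["Doc 2"])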

Test Document =? Sports Science Arts 13

1-Nearest Neighbor (knn) classifier Sports Science Arts 14

2-Nearest Neighbor (knn) classifier Sports Science Arts 15

3-Nearest Neighbor (knn) classifier Sports Science Arts 16

K-Nearest Neighbor (knn) classifier Voting knn Sports Science Arts 17
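
A minimal sketch of the voting knn classifier just illustrated, assuming plain Euclidean distance and an unweighted majority vote; the 2-D points and labels below are placeholders echoing the Sports/Science/Arts picture.

import numpy as np
from collections import Counter

def knn_predict(query, X, y, k=3):
    # X: (N, d) array of training vectors, y: list of N class labels.
    dists = np.linalg.norm(X - query, axis=1)      # distance from the query to every training point
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    votes = Counter(y[i] for i in nearest)         # majority vote among their labels
    return votes.most_common(1)[0][0]

X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9], [2.0, 0.1], [2.1, 0.2]])
y = ["Sports", "Sports", "Science", "Science", "Arts", "Arts"]
print(knn_predict(np.array([1.1, 1.0]), X, y, k=3))   # -> "Science"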

Classes in a Vector Space Sports Science Arts 18

knn Is Close to Optimal (Cover and Hart, 1967). Asymptotically, the error rate of 1-nearest-neighbor classification is less than twice the Bayes rate [the error rate of a classifier that knows the model that generated the data]. In particular, the asymptotic error rate is 0 if the Bayes rate is 0. Decision boundary: 19

Where does knn come from? How do we estimate $p(X)$? Nonparametric density estimation: the Parzen density estimate. E.g. (kernel density estimate): $\hat p(X) = \frac{1}{N}\sum_{n=1}^{N} \frac{1}{h^d}\, K\!\left(\frac{X - x_n}{h}\right)$. More generally: $\hat p(X) \approx \frac{k}{N V}$, where $k$ is the number of samples falling in a region of volume $V$ around $X$. 20
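
A sketch of the kernel (Parzen) estimate in the 1-D case of the form written above, with a Gaussian kernel; the sample data and bandwidth are made up.

import numpy as np

def parzen_kde(x, samples, h):
    # p_hat(x) = (1/N) * sum_n (1/h) * K((x - x_n) / h), with a 1-D Gaussian kernel K.
    u = (x - samples) / h
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return kernel.sum() / (len(samples) * h)

samples = np.random.default_rng(0).normal(loc=0.0, scale=1.0, size=500)  # toy data from N(0, 1)
print(parzen_kde(0.0, samples, h=0.3))   # roughly 0.4, the true N(0, 1) density at 0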

Where does knn come from? Nonparametric density estimation: the Parzen density estimate fixes the volume and counts the samples inside, whereas the knn density estimate fixes $k$ and grows the volume $V$ until it contains the $k$ nearest neighbors, again giving $\hat p(X) \approx \frac{k}{N V}$. A Bayes classifier based on the knn density estimator yields the voting knn classifier: pick $K_1$ and $K_2$ implicitly by picking $K_1 + K_2 = K$, $V_1 = V_2$, $N_1 = N_2$. 21
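
The step from the knn density estimates to the voting rule can be made explicit; a short derivation consistent with the quantities above (written out here because the slide's equations were not transcribed): with $K_i$ of the $K$ nearest neighbors belonging to class $i$ and $N_i$ class-$i$ training points out of $N$,
$$\hat P(Y=i \mid X) \;=\; \frac{\hat p(X \mid Y=i)\,\hat P(Y=i)}{\hat p(X)} \;=\; \frac{\frac{K_i}{N_i V}\cdot\frac{N_i}{N}}{\frac{K}{N V}} \;=\; \frac{K_i}{K},$$
so picking the class with the most votes among the $K$ nearest neighbors maximizes the estimated posterior.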

Asymptotic Analysis. Conditional risk: $r_k(X, X_{NN})$, where $X$ is the test sample and $X_{NN}$ its nearest-neighbor sample. Denote the event that $X$ belongs to class $i$ as $X_i$. Assuming $k = 1$: when an infinite number of samples is available, $X_{NN}$ will be arbitrarily close to $X$. 22

Asymptotic Analysis, cont. Recall the conditional Bayes risk: $r^*(X) = \min[q_1(X), q_2(X)]$. Thus the asymptotic conditional risk can be expanded (this is a Maclaurin series expansion), and it can be shown that the asymptotic 1-NN risk is at most twice the Bayes risk. This is remarkable, considering that the procedure does not use any information about the underlying distributions and only the class of the single nearest neighbor determines the outcome of the decision. 23
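
For reference, the standard two-class Cover & Hart result being invoked here, stated explicitly since the slide's equations did not survive transcription: with $r^*(X)$ the conditional Bayes risk and $R^*$ its expectation (the Bayes error),
$$r_\infty(X) = 2\,r^*(X)\bigl(1 - r^*(X)\bigr), \qquad R^* \le R_\infty \le 2R^*\bigl(1 - R^*\bigr) \le 2R^*,$$
which matches the earlier claim that the asymptotic 1-NN error rate is less than twice the Bayes rate and is 0 whenever the Bayes rate is 0.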

In fact Example: 24

knn is an instance of Instance-Based Learning What makes an Instance-Based Learner? A distance metric How many nearby neighbors to look at? A weighting function (optional) How to relate to the local points? 25

Distance Metric. Euclidean distance: $D(x, x') = \sqrt{\sum_i \sigma_i^2 (x_i - x_i')^2}$, or equivalently, $D(x, x') = \sqrt{(x - x')^T \Sigma (x - x')}$. Other metrics: $L_1$ norm: $\sum_i |x_i - x_i'|$; $L_\infty$ norm: $\max_i |x_i - x_i'|$ (elementwise); Mahalanobis: the same quadratic form where $\Sigma$ is full and symmetric; correlation; angle; Hamming distance; Manhattan distance. 26
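
A small sketch computing several of the metrics listed above on two vectors; the weight matrix used for the Mahalanobis-style distance is a made-up example.

import numpy as np

x = np.array([1.0, 2.0, 3.0])
xp = np.array([2.0, 0.0, 3.5])
diff = x - xp

euclidean = np.linalg.norm(diff)            # sqrt(sum_i (x_i - x'_i)^2)
l1 = np.abs(diff).sum()                     # L1 (Manhattan) distance
linf = np.abs(diff).max()                   # L-infinity norm: elementwise max of |x - x'|

Sigma = np.array([[1.5, 0.2, 0.0],          # full, symmetric positive-definite weight matrix
                  [0.2, 1.0, 0.0],          # (inverse covariance in the usual convention);
                  [0.0, 0.0, 2.0]])         # the values here are arbitrary
mahalanobis = np.sqrt(diff @ Sigma @ diff)

cos_angle = x @ xp / (np.linalg.norm(x) * np.linalg.norm(xp))   # angle-based similarity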

Case Study: knn for Web Classification. Dataset: 20 Newsgroups (20 classes), download from http://people.csail.mit.edu/jrennie/20newsgroups/; 61,118 words, 18,774 documents. Class labels and descriptions. 27

Experimental Setup. Training/test sets: 50%-50% randomly split; 10 runs, report average results. Evaluation criterion: classification accuracy (fraction of test documents correctly classified). 28

Results: Binary Classes. [Plot: accuracy vs. k for alt.atheism vs. comp.graphics, rec.autos vs. rec.sport.baseball, and comp.windows.x vs. rec.motorcycles] 29

Results: Multiple Classes. [Plot: accuracy vs. k for (a) 5 classes randomly selected out of 20, repeated over 10 runs and averaged, and (b) all 20 classes] 30

Is knn ideal? more later 31

Effect of Parameters. Sample size: the more the better, but an efficient search algorithm for NN is needed. Dimensionality: curse of dimensionality. Density: how smooth? Metric: the relative scalings in the distance metric affect region shapes. Weight: spurious or less relevant points need to be downweighted. K. 32

Sample size and dimensionality. From page 316, Fukunaga. 33

Neighborhood size. From page 350, Fukunaga. 34

knn for image classification: basic set-up Antelope? Trombone Jellyfish Kangaroo German Shepherd 35

Voting? [Bar chart: counts (0-3) of the 5 nearest neighbors per class: Antelope, Jellyfish, German Shepherd, Kangaroo, Trombone; the predicted label is Kangaroo] 36

10K classes, 4.5M queries, 4.5M training? (Background image courtesy: Antonio Torralba) 37

knn on 10K classes: 10K classes, 4.5M queries, 4.5M training images; features: BOW, GIST. (Deng, Berg, Li & Fei-Fei, ECCV 2010) 38

Nearest Neighbor Search in a High-Dimensional Metric Space. Linear search: e.g., scanning 4.5M images! k-d trees: axis-parallel partitions of the data, only effective for low-dimensional data. Large-scale approximate indexing: Locality Sensitive Hashing (LSH), Spill-Tree, NV-Tree; all of the above run on a single machine with all data in memory and scale to millions of images. Web-scale approximate indexing: parallel variants of Spill-Tree and NV-Tree on distributed systems scale to billions of images on disks across multiple machines. 39

Locality sensitive hashing. Approximate knn is good enough in practice and can get around the curse of dimensionality. Locality sensitive hashing: nearby feature points (likely) receive the same hash values, so a query needs to examine only the matching hash-table bucket. 40

Example: Random projection. $h(x) = \mathrm{sgn}(x \cdot r)$, where $r$ is a random unit vector; each $h(x)$ gives 1 bit. Repeat and concatenate. $\Pr[h(x) = h(y)] = 1 - \theta(x, y)/\pi$. [Figure: three panels showing points $x$ and $y$ relative to the hyperplane defined by $r$; depending on the angle $\theta$, the bits agree ($h(x)=0, h(y)=0$) or disagree ($h(x)=0, h(y)=1$), and the concatenated codes (e.g., 000, 101) index buckets of a hash table] 41

Example: Random projection (cont.). [Second build of the same figure: in all three panels $h(x)=0$ and $h(y)=0$, i.e., the two points fall on the same side of the random hyperplane and hash into the same bucket] 42
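
A minimal sketch of the random-projection hashing described above (illustrative only; the number of bits, table layout, and data are made up, and real systems use several independent tables to trade memory for recall):

import numpy as np

def make_hash(dim, n_bits, rng):
    # One random unit vector per bit; the sign of the projection gives that bit.
    R = rng.normal(size=(n_bits, dim))
    R /= np.linalg.norm(R, axis=1, keepdims=True)
    return lambda x: tuple((R @ x > 0).astype(int))

rng = np.random.default_rng(0)
h = make_hash(dim=128, n_bits=8, rng=rng)

# Index: bucket each database vector under its 8-bit code.
data = rng.normal(size=(1000, 128))
table = {}
for idx, v in enumerate(data):
    table.setdefault(h(v), []).append(idx)

# Query: candidate neighbors are the points sharing the query's bucket; only these
# few candidates are ranked by exact distance instead of scanning all 1000 points.
q = rng.normal(size=128)
candidates = table.get(h(q), [])
ranked = sorted(candidates, key=lambda i: np.linalg.norm(data[i] - q))

By the formula above, two vectors at angle theta agree on any single bit with probability 1 - theta/pi, so longer codes separate distant points more aggressively at the cost of recall.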

Locality sensitive hashing. [Figure: the query is hashed into a bucket of the hash table, and the points in that bucket are returned as the retrieved NNs] 43

Locality sensitive hashing: 1000X speed-up with 50% recall of the top 10-NN, on 1.2M images with 1000-dimensional features. [Plot: recall of exact NN (recall of L1Prod at top 10, 0.3-0.7) vs. scan cost (percentage of points scanned), comparing L1Prod LSH + L1Prod ranking with RandHP LSH + L1Prod ranking] 44

Summary: Nearest-Neighbor Learning Algorithm. Learning is just storing the representations of the training examples in D. For a test instance x: compute the similarity between x and all examples in D, and assign x the category of the most similar example in D. The method does not explicitly compute a generalization or a category prototype, and efficient indexing is needed for high-dimensional, large-scale problems. Also called: case-based learning, memory-based learning, lazy learning. 45

Summary (continued). The Bayes classifier is the best classifier: it minimizes the probability of classification error. Nonparametric vs. parametric classifiers: a nonparametric classifier does not rely on any assumption concerning the structure of the underlying density function. A classifier becomes the Bayes classifier if its density estimates converge to the true densities as an infinite number of samples is used; the resulting error is the Bayes error, the smallest achievable error given the underlying distributions.