Learning to Learn: additional notes

MASSACHUSETTS INSTITUTE OF TECHNOLOGY
Department of Electrical Engineering and Computer Science
6.034 Artificial Intelligence, Fall 2008
Recitation, October 23
Learning to Learn: additional notes
Bob Berwick

Agenda
1. Learning: general issues
2. Features; Nearest Neighbors
3. ID trees: learning how to split

1. In general, a pattern classifier carves up (or tessellates or partitions) the feature space into volumes called decision regions. All feature vectors in a decision region are assigned to the same category. The decision regions are often simply connected, but they can be multiply connected as well, consisting of two or more non-touching regions. The decision regions are separated by surfaces called decision boundaries. These separating surfaces represent points where there are ties between two or more categories. For a minimum-distance classifier like nearest neighbors, the decision boundaries are the points that are equally distant from two or more of the templates. With a Euclidean metric, the decision boundary between region i and region j lies on the line or plane that is the perpendicular bisector of the segment from m_i to m_j. Analytically, these linear boundaries are a consequence of the fact that the discriminant functions are linear (a short derivation is given below).

[Figure: nearest-template decision boundaries]

How well the classifier works depends upon how closely the input patterns to be classified resemble the templates. In the example sketched below, the correspondence is very close, and one can anticipate excellent performance. However, things are not always this good in practice, and one should understand the limitations of simple classifiers.
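To check the claim above that Euclidean decision boundaries are perpendicular bisectors, here is the one-line algebra, using the same notation m_i, m_j as above. A point x is equidistant from the two templates exactly when ||x - m_i||^2 = ||x - m_j||^2. Expanding both sides and cancelling the common ||x||^2 term leaves 2(m_j - m_i) · x = ||m_j||^2 - ||m_i||^2, which is a linear equation in x. Its solution set is a line (in two dimensions) or a plane, with normal vector m_j - m_i, and it passes through the midpoint (m_i + m_j)/2: precisely the perpendicular bisector.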

Template matching can easily be expressed mathematically. Let x be the feature vector for the unknown input, and let m_1, m_2, ..., m_c be templates (i.e., perfect, noise-free feature vectors) for the c classes. Then the error in matching x against m_k is given by ||x - m_k||. Here ||u|| is called the norm of the vector u. A minimum-error classifier computes ||x - m_k|| for k = 1 to c and chooses the class for which this error is minimum. Since ||x - m_k|| is also the distance from x to m_k, we call this a minimum-distance classifier. Clearly, a template-matching system is a minimum-distance classifier.

There is more than one way to define the norm ||u||, and these correspond to different ways of measuring distance, i.e., to different metrics. Two of the most common are:

1. Euclidean metric: ||u|| = sqrt(u_1^2 + u_2^2 + ... + u_d^2)
2. Manhattan (or taxicab) metric: ||u|| = |u_1| + |u_2| + ... + |u_d|

In our template-matching example of classifying characters by counting the number of disagreements, we were implicitly using a Manhattan metric. For the rest of these notes, we will use either the Euclidean distance or something called the Mahalanobis distance.
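As a concrete sketch (not part of the original notes), here is a minimum-distance classifier in Python under both metrics; the templates, class labels, and function names are invented purely for illustration.

import math

# Hypothetical templates (noise-free feature vectors) for c = 3 classes.
TEMPLATES = {
    "A": [1.0, 0.0],
    "B": [0.0, 1.0],
    "C": [1.0, 1.0],
}

def euclidean(u, v):
    # ||u - v|| = square root of the sum of squared coordinate differences
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan(u, v):
    # ||u - v|| = sum of absolute coordinate differences
    return sum(abs(a - b) for a, b in zip(u, v))

def classify(x, templates=TEMPLATES, metric=euclidean):
    # Choose the class k whose template m_k minimizes ||x - m_k||.
    return min(templates, key=lambda k: metric(x, templates[k]))

print(classify([0.9, 0.2]))                    # -> A
print(classify([0.2, 0.9], metric=manhattan))  # -> B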

We will see that:

Contours of constant Euclidean distance are circles (or spheres)
Contours of constant Manhattan distance are squares (or boxes)
Contours of constant Mahalanobis distance are ellipses (or ellipsoids)

Limitations

If a simple minimum-distance classifier is satisfactory, there is no reason to use anything more complicated. However, it frequently happens that such a classifier makes too many errors. There are several possible reasons for this:

1. The features may be inadequate to distinguish the different classes
2. The features may be highly correlated
3. The decision boundary may have to be curved
4. There may be distinct subclasses in the data
5. The feature space may simply be too complex

1. Inadequate features. If the features simply do not contain the information needed to separate the classes, it doesn't matter how much effort you put into designing the classifier. Solution: go back and design better features.

2. Correlated features. It often happens that two features that were meant to measure different characteristics are influenced by some common mechanism and tend to vary together. For example, the perimeter and the maximum width of a figure will both vary with scale; larger figures will have both larger perimeters and larger maximum widths. This degrades the performance of a classifier based on Euclidean distance to a template. A pattern at the extreme of one class can be closer to the template for another class than to its own template. A similar problem occurs if features are badly scaled, for example, by measuring one feature in microns and another in kilometers. Solution: use the Mahalanobis metric (what's that??).
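To make the bad-scaling point concrete, here is a small numeric sketch; the feature values and the per-feature scale are invented for illustration, not taken from the notes.

import numpy as np

# Feature 0 is the informative one; feature 1 is an uninformative
# measurement recorded on a much larger numeric scale.
m_a = np.array([1.0, 950.0])    # template for class A
m_b = np.array([5.0, 1050.0])   # template for class B
x   = np.array([4.8, 900.0])    # feature 0 clearly says "class B"

def euclid(u, v):
    return float(np.linalg.norm(u - v))

# Raw Euclidean distance is dominated by the large-scale feature,
# so x looks closer to A (about 50 vs. 150): the wrong answer.
print(euclid(x, m_a), euclid(x, m_b))

# Dividing each feature by an (assumed) per-feature standard deviation
# undoes the scaling; the Mahalanobis metric does this automatically,
# and also corrects for correlation between features.
scale = np.array([1.0, 100.0])
print(euclid(x / scale, m_a / scale),   # about 3.8
      euclid(x / scale, m_b / scale))   # about 1.5 -> correctly closer to B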

The use of the Mahalanobis metric removes several of the limitations of the Euclidean metric:

1. It automatically accounts for the scaling of the coordinate axes
2. It corrects for correlation between the different features
3. It can provide curved as well as linear decision boundaries

You don't need to know about this for this course (but you should). You can visualize this as a set of ellipses, as shown above. More precisely, the quantity r in

r^2 = (x - m_x)^T C_x^{-1} (x - m_x)

is called the Mahalanobis distance from the feature vector x to the mean vector m_x, where C_x is the covariance matrix for x. It can be shown that the surfaces on which r is constant are ellipsoids that are centered about the mean m_x. In the special case where the features are uncorrelated and the variances in all directions are the same, these surfaces are spheres, and the Mahalanobis distance becomes equivalent to the Euclidean distance.

One can use the Mahalanobis distance in a minimum-distance classifier as follows. Let m_1, m_2, ..., m_c be the means (templates) for the c classes, and let C_1, C_2, ..., C_c be the corresponding covariance matrices. We classify a feature vector x by measuring the Mahalanobis distance from x to each of the means, and assigning x to the class for which the Mahalanobis distance is minimum.

3. Curved boundaries. The linear boundaries produced by a minimum-Euclidean-distance classifier may not be flexible enough. For example, if x_1 is the perimeter and x_2 is the area of a figure, x_1 will grow linearly with scale, while x_2 will grow quadratically. This will warp the feature space and prevent a linear discriminant function from performing well. Solutions:

1. Redesign the feature set (e.g., let x_2 be the square root of the area)
2. Try using the Mahalanobis distance, which can produce quadratic decision boundaries (see later, or maybe not here actually; see the end of these notes)
3. Try using a neural network or a support vector machine (see later in the course)

4. Subclasses. It frequently happens that the classes defined by the end user are not the natural classes. For example, while the goal in classifying English letters might be just to decide on one of 26 categories, if both upper-case and lower-case letters will be encountered, it might make more sense to treat them separately and to build a 52-category classifier, OR-ing the outputs at the end. Of course, it is not always so obvious that there are distinct subclasses. Solution: use a clustering procedure to find subclasses in the data, and treat each subclass as a distinct class.
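Here is a minimal Python sketch of such a Mahalanobis minimum-distance classifier; the sample-generating distributions and the test point are invented for illustration, and the per-class means and covariance matrices are simply estimated from the sampled data.

import numpy as np

def mahalanobis(x, mean, cov):
    # r = sqrt((x - m)^T C^{-1} (x - m))
    d = x - mean
    return float(np.sqrt(d @ np.linalg.inv(cov) @ d))

def classify(x, means, covs):
    # Assign x to the class whose mean is nearest in Mahalanobis distance.
    return int(np.argmin([mahalanobis(x, m, C) for m, C in zip(means, covs)]))

# Illustrative two-class data; mean and covariance are estimated per class.
rng = np.random.default_rng(0)
samples = [
    rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=200),
    rng.multivariate_normal([3.0, 0.0], [[1.0, -0.5], [-0.5, 1.0]], size=200),
]
means = [s.mean(axis=0) for s in samples]
covs  = [np.cov(s, rowvar=False) for s in samples]

print(classify(np.array([1.0, 1.0]), means, covs))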

5. Complex feature space. It can happen that the variability that makes it difficult to distinguish patterns is due to complexity rather than noise. For example, it would make no sense to try to use a nearest-distance classifier to detect syntactic errors in expressions in a programming language. Similar problems arise in classifying sensory data. Whenever the patterns can be subject to well-defined transformations such as translation or rotation, there is a danger of introducing a large amount of complexity into a primitive feature space. In general, one should employ features that are invariant to such transformations, rather than forcing the classifier to handle them. In human communication, larger patterns are frequently composed of smaller patterns, which in turn are composed of smaller patterns. For example, sentences are made up of words, which are made up of letters, which are made up of strokes.

Validation. One of the key questions about any classifier is how well it will perform, i.e., what its error rate is. One way to estimate the error rate is to see how well the classifier classifies the examples used to estimate its parameters. This is called testing on the training data. It is a good necessary condition: if the classifier cannot correctly classify the examples used to train it, it is definitely worthless. However, it is a very poor measure of generalization ability. It inevitably promises much better performance than will be obtained with independent test data.

One simple validation procedure is to divide the available data into two disjoint subsets -- a subset used to train the classifier, and a subset used to test it. This is called the holdout method. It works reasonably well, but it often results in suboptimal performance: if you hold out enough examples to get a good test, you will not have enough examples left for training.

There are several other more sophisticated alternatives. For example, in k-fold cross-validation, the examples are divided into k subsets of equal size. (A common choice is k = 10.) The system is designed k times, each time leaving out one of the subsets from training and using the omitted subset for testing. Although this approach is time consuming, most of the data can be used for training while still having enough independent tests to estimate the error rate. In a related approach called bootstrapping, one forms the test set by randomly sampling the set of examples.
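A short Python sketch of the k-fold idea described above; the train and test_error arguments are placeholders for whatever classifier is being validated, and nothing here is specific to these notes.

import random

def k_fold_error(examples, train, test_error, k=10, seed=0):
    # examples   : list of (feature_vector, label) pairs
    # train      : training_examples -> classifier
    # test_error : (classifier, test_examples) -> error rate in [0, 1]
    examples = examples[:]                      # don't disturb the caller's list
    random.Random(seed).shuffle(examples)
    folds = [examples[i::k] for i in range(k)]  # k roughly equal subsets
    errors = []
    for i in range(k):
        held_out = folds[i]                     # test on the omitted subset
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        errors.append(test_error(train(training), held_out))
    return sum(errors) / k                      # average error over the k runs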