Chapter 7
Unsupervised learning in Vision

The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual data, and the aim of the latter is to give a computer the ability to find patterns in data in order to understand what is happening and to predict what will happen in new situations. It should therefore not be surprising that Machine Learning has become a necessary and very powerful tool to solve many challenging problems in Computer Vision. In this chapter we'll look at a few very basic unsupervised learning approaches in the context of Computer Vision: specifically dimensionality reduction (a type of unsupervised feature learning), and clustering (for segmenting images into meaningful parts).

7.1 Dimensionality reduction

Consider the set of 2D points in Figure 7.1(a). Each of these points is described by two values: its x- and y-coordinates. However, by transforming the data as in Figure 7.1(b) and (c), we see that it is actually possible to describe every point (in an approximate sense) by only one value: its $x'$-coordinate. The idea behind dimensionality reduction is to automatically learn the lower-dimensional subspace in which a given dataset can be described more efficiently. We will limit the discussion here to linear dimensionality reduction by way of the SVD, a technique sometimes referred to as principal component analysis (PCA). Other more powerful techniques exist, including kernel PCA, generalized discriminant analysis and, more recently, ones that rely on variational autoencoding (a type of deep neural network). Let us first review an important property of the SVD.

Figure 7.1: Toy example of dimensionality reduction: this set of 2D points can be transformed to a 1D subspace.

7.1.1 The SVD and low-rank approximation

The singular value decomposition (SVD) of an $m \times n$ matrix $A$,

$$A = U \Sigma V^T, \qquad (7.1)$$

factorizes $A$ into an $m \times m$ orthogonal matrix $U$, an $m \times n$ diagonal matrix $\Sigma$, and an $n \times n$ orthogonal matrix $V$. The main diagonal of $\Sigma$ contains the singular values of $A$ in non-increasing order, and the number of nonzero singular values is $r = \mathrm{rank}(A)$. The first $r$ columns of $U$ form an orthonormal basis for the column space of $A$. Let $\sigma_j$ be the $j$th singular value of $A$, and $u_j$ and $v_j$ the $j$th columns of $U$ and $V$ respectively. It follows that

$$A = \sum_{j=1}^{r} \sigma_j u_j v_j^T. \qquad (7.2)$$

One of the many powerful consequences of the SVD is that it allows us to approximate $A$ with a lower-rank matrix, by simply ignoring the smallest singular values. We calculate

$$A_\nu = \sum_{j=1}^{\nu} \sigma_j u_j v_j^T, \qquad (7.3)$$

with $\nu < r$. It can be shown that $A_\nu$ is the best approximation to $A$ (in both the 2-norm and Frobenius-norm sense) over all $m \times n$ matrices of rank less than or equal to $\nu$.
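To make equation (7.3) concrete, here is a minimal NumPy sketch that computes a rank-$\nu$ approximation from the SVD; the example matrix and the choice $\nu = 2$ are arbitrary.

```python
import numpy as np

# A small example matrix; any m x n matrix would do.
A = np.random.default_rng(0).normal(size=(6, 4))

# Full SVD: U is m x m, Vt is n x n, and s holds the singular values
# in non-increasing order, as in equation (7.1).
U, s, Vt = np.linalg.svd(A)

def low_rank_approx(U, s, Vt, nu):
    """Rank-nu approximation A_nu = sum_{j=1..nu} sigma_j u_j v_j^T, equation (7.3)."""
    return U[:, :nu] @ np.diag(s[:nu]) @ Vt[:nu, :]

A2 = low_rank_approx(U, s, Vt, nu=2)

# By the optimality property, the 2-norm of the error equals the first
# discarded singular value sigma_3.
print(np.linalg.norm(A - A2, 2), s[2])
```

The two printed values agree, illustrating the optimality property stated above.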

7.1.2 Learning a lower-dimensional subspace from images

One of the uses of dimensionality reduction in Computer Vision is to represent images of particular objects (we'll use faces, for example) more efficiently. The idea is to learn this representation from a given set of images. Such lower-dimensional representations can then feed into a face recognition system, as we'll briefly discuss at the end of the section.

Suppose the pixels in a face image are stored in the $p \times q$ matrix $A$. We stack the columns of $A$ sequentially into a vector of length $m = pq$. Note that all $p \times q$ matrices can be reshaped in this way, so that they all occupy an $m$-dimensional space. The value of $m$ may be quite large, e.g. for small $256 \times 256$ images we already have $m = 65{,}536$. There is a possibility, however, that the specific matrices under consideration occupy (by approximation) some lower-dimensional subspace. To find such a space we train the system on a representative set of face images and then use the SVD to calculate a reduced basis for the space they span.

To this end, consider a collection of $n$ vectors with $m$ components each, $f_i$, $i = 1, \ldots, n$. This collection will be called the training set of the system, and $n$ is typically much smaller than $m$. We calculate the average vector as

$$a = \frac{1}{n} \sum_{i=1}^{n} f_i, \qquad (7.4)$$

and subtract it from every vector in the training set to obtain column vectors $x_i = f_i - a$, $i = 1, \ldots, n$. An $m \times n$ matrix $X$ is built as

$$X = \frac{1}{n} \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix}. \qquad (7.5)$$

The average is subtracted because vectors constructed from similar images (such as face images) are likely to be clustered around their average, distant from the origin. We wish to determine an orthogonal basis for the space spanned by them, and centring around the origin improves the ability of such a basis to describe a larger range of vectors.

Finding an orthogonal basis for the training set is now a matter of finding a basis for the column space of $X$. Because the columns are somewhat similar with respect to the entire $m$-dimensional space, the singular values of $X$ should decrease rapidly. As described in Section 7.1.1 we can now infer an approximate dimension $\alpha$ for the column space by regarding singular values below a certain cut-off as zero. In the case of face images, the first $\alpha$ columns of $U$ in the SVD of $X$ are called the eigenfaces of the training set. These vectors span, by approximation, the column space of $X$. Note that since $\alpha < n$ and $n \ll m$, we have $\alpha \ll m$.
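As a concrete sketch of the procedure above, the following NumPy function builds the average face and the eigenface basis $U_\alpha$ from a stack of training images. The `(n, p, q)` array layout and the flattening order are choices made for this example (any fixed pixel ordering works), and the scale factor from (7.5) is kept even though it does not affect the singular vectors.

```python
import numpy as np

def eigenface_basis(images, alpha):
    """Compute the average face and the first alpha eigenfaces.

    images: array of shape (n, p, q) holding the n training face images.
    Returns (a, U_alpha) with a of length m = p*q and U_alpha of shape (m, alpha).
    """
    n, p, q = images.shape
    # Flatten each image into a vector of length m (any fixed pixel ordering works).
    F = images.reshape(n, p * q).T.astype(float)   # shape (m, n); columns are the f_i
    a = F.mean(axis=1)                             # average vector, equation (7.4)
    X = (F - a[:, None]) / n                       # centred data matrix, equation (7.5)
    # The 1/n factor only scales the singular values; it does not change U.
    # Thin SVD: only the first n left singular vectors are needed since n << m.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return a, U[:, :alpha]                         # alpha must be at most n
```

With, say, `alpha = 20`, the individual eigenfaces can be inspected by reshaping each column of `U_alpha` back to a $p \times q$ image.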

If the training set is sufficiently representative, the vectors constructed from arbitrary face images should also be contained (approximately) within that $\alpha$-dimensional subspace. Indeed, an arbitrary $m$-dimensional vector $f$ is projected orthogonally onto the subspace spanned by the eigenfaces by solving the over-determined linear system $U_\alpha y = f - a$ for $y$, in a least-squares sense. Here $U_\alpha$ contains the eigenfaces as columns. Therefore, since those columns are orthonormal,

$$y = U_\alpha^T (f - a). \qquad (7.6)$$

The $\alpha \times 1$ vector $y$ in the above expression is called the eigenface representation of $f$. We can reconstruct $f$ from its eigenface representation as

$$\tilde{f} = U_\alpha y + a. \qquad (7.7)$$

Note that $y$ is determined as a least-squares solution so that, in general, $\tilde{f}$ is slightly different from the original vector $f$. However, if the training set is representative then $\tilde{f}$ should include sufficient information to distinguish the face from those of other individuals. A small sketch of this projection and reconstruction is given at the end of the subsection.

Figure 7.2(a) shows a few example images from a database¹ taken of 40 individuals. The calculated average face is shown in (b), reshaped into a $p \times q$ image, and a few eigenfaces (columns of $U_\alpha$) in (c).

Figure 7.2: Examples of face images from the AT&T database, an average face and the first few eigenfaces (reshaped and represented as images). (a) The first five face images of our training set; (b) the average face; (c) the first four eigenfaces, reshaped and scaled to images.

¹ The AT&T Database of Faces, www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
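Continuing the sketch above, projection and reconstruction according to (7.6) and (7.7) are one-liners; the helper names follow the previous sketch and are purely illustrative.

```python
import numpy as np

def eigenface_representation(f, a, U_alpha):
    """Project a flattened face image f onto the eigenface subspace, equation (7.6)."""
    return U_alpha.T @ (f - a)

def reconstruct(y, a, U_alpha):
    """Map an eigenface representation back to pixel space, equation (7.7)."""
    return U_alpha @ y + a

# Example usage, given a and U_alpha from the previous sketch and a new
# p x q face image `img` flattened with the same pixel ordering:
#   f = img.reshape(-1).astype(float)
#   y = eigenface_representation(f, a, U_alpha)
#   f_tilde = reconstruct(y, a, U_alpha)
```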

7.1.3 A simple face recognition system

In this section we provide brief details of a classic algorithm for automatic face recognition that uses the concepts discussed in the previous section. This algorithm is based on one introduced by Sirovich and Kirby², which has been developed into a baseline for image-based face recognition systems.

The first step is to learn the feature representation, that is, to construct $a$ and $U_\alpha$ from a representative set of face images. In practice this vector and matrix are calculated once, and all face images that the system will encounter are mapped to their eigenface representations using these parameters. The idea is now that eigenface representations carry more distinctiveness within the class of all faces, and that these representations of different images should be compared rather than the raw pixels directly. One option might be to measure the distance between two eigenface representations $y_1$ and $y_2$ as the $L_2$ norm $\|y_1 - y_2\|$, and if this value is lower than some threshold the system can classify the two faces as belonging to the same person.

² L. Sirovich & M. Kirby, Low-dimensional procedure for the characterization of human faces, Journal of the Optical Society of America A, 4:519–524, 1987.

7.2 Clustering

Clustering is an example of unsupervised learning that attempts to group an unlabelled set of datapoints into separate classes. The points in every class should share some similarity, while points in different classes should be different somehow. One use of this in Computer Vision is image segmentation, where the goal is to break up an image into meaningful, perceptually similar regions. In this section we very briefly sketch the main ideas behind a few common clustering techniques.

7.2.1 Agglomerative clustering

One of the simplest forms of clustering is to iteratively merge similar clusters until some desired level has been reached. We may start with each pixel as its own cluster, then iteratively merge the closest pair of clusters. Here we would need a metric to compare two clusters, and options include the average distance between points in the two clusters, a maximum distance, a minimum distance, the distance between means, etc. Note that the points can be the raw pixel values (in 3D, if we're considering colour images) or some other feature. A minimal code sketch is given at the end of this subsection.

Advantages: Agglomerative clustering is simple to implement. No assumptions are made on cluster shapes, and these adapt to the image content. We obtain a hierarchy of clusters which can be useful in some cases.

Disadvantages: Clusters may become imbalanced. There are thresholds to be chosen (e.g. in deciding when to stop).
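As a rough sketch, pixel colours can be clustered agglomeratively with SciPy's hierarchical clustering routines; the 'average' linkage and the target number of clusters below are arbitrary choices, and the pairwise linkage step needs memory quadratic in the number of pixels, so this is only practical for small or subsampled images.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def agglomerative_segmentation(image, n_clusters=4):
    """Cluster pixels by colour using bottom-up (agglomerative) merging.

    image: array of shape (rows, cols, 3) with RGB values.
    Returns an array of shape (rows, cols) of integer cluster labels.
    """
    rows, cols, _ = image.shape
    features = image.reshape(-1, 3).astype(float)   # one 3D colour feature per pixel
    # 'average' linkage merges the pair of clusters with the smallest average
    # inter-point distance; 'single', 'complete' and 'centroid' correspond to the
    # minimum-distance, maximum-distance and distance-between-means criteria above.
    Z = linkage(features, method='average')
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    return labels.reshape(rows, cols)
```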

7.2.2 K-means clustering

The idea behind the k-means algorithm is to initialize k cluster centres, and to then iteratively re-assign points to their nearest centres and update those centres. In image segmentation we consider every pixel as a point in some d-dimensional feature space. We may consider the RGB values of a pixel as its features, in which case d = 3, or supplement the colour with the position of the pixel in the image for a 5D feature space. K-means can be performed as follows (a minimal implementation is sketched after this list):

1. Select k cluster centres randomly in feature space.
2. Repeat until convergence:
   2.1 Assign every point to the cluster centre closest to it.
   2.2 Compute new cluster centres based on the points in the clusters.

Advantages: K-means is very popular, and also relatively simple to implement.

Disadvantages: We need to choose k beforehand, and there is a risk of under-segmentation if k is too small or over-segmentation if k is too large. The method is sensitive to outliers: one distant point can pull the cluster mean towards itself and impede that cluster from converging correctly. The method can converge to a local minimum, and is sensitive to initialization. Multiple restarts can help here.
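A minimal NumPy implementation of the two alternating steps might look as follows; the random initialization from the data points and the fixed iteration cap are simple choices made for this example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Basic k-means. points: array of shape (N, d). Returns (labels, centres)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the data points as initial cluster centres.
    centres = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iters):
        # Step 2.1: assign every point to the cluster centre closest to it.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # assignments unchanged: converged
        labels = new_labels
        # Step 2.2: recompute each centre as the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):                 # guard against empty clusters
                centres[j] = points[labels == j].mean(axis=0)
    return labels, centres

# Example: segment an RGB image by colour (d = 3):
#   pixels = image.reshape(-1, 3).astype(float)
#   labels, centres = kmeans(pixels, k=5)
#   segmentation = labels.reshape(image.shape[:2])
```

For a 5D feature space one would simply build `pixels` with five columns (colour plus image coordinates) instead of three.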

7.2.3 Mean shift clustering

The idea behind mean shift clustering is to view points in the feature space as samples from a non-parametric density function. The aim is to find the modes of this density. We start by subdividing the feature space into windows of a fixed size. The mean of the datapoints in every window is determined, and the window is shifted to be centred around this mean. The process repeats until convergence, and the final window means are the modes of the density function. Note that multiple windows can converge to the same mode.

Every mode has a basin of attraction: all points that converged to this mode during the iterations of the mean shift algorithm. The various basins of attraction form the clusters.

Advantages: Mean shift clustering is regarded as a good, general-purpose segmentation algorithm. It adapts the number of clusters and cluster shapes according to the data, and is robust against outliers.

Disadvantages: A window size has to be chosen, and the method struggles with high-dimensional features (where there are many modes to locate). A straightforward implementation is also very slow, but computational speedups to the algorithm do exist.
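As a rough sketch of the iteration, the following NumPy code runs mean shift with a flat circular window started at every data point (a common variant of the fixed-window subdivision described above); the window radius and the tolerance used to merge nearby modes are arbitrary choices.

```python
import numpy as np

def mean_shift(points, radius, n_iters=50, merge_tol=1e-2):
    """Mean shift with a flat (uniform) window of the given radius.

    points: array of shape (N, d).
    Returns (labels, modes); labels[i] indexes the mode whose basin of
    attraction point i belongs to.
    """
    points = points.astype(float)
    shifted = points.copy()                 # one window started at every point
    for _ in range(n_iters):
        for i in range(len(shifted)):
            # Shift the window to the mean of the data points it currently covers.
            in_window = np.linalg.norm(points - shifted[i], axis=1) <= radius
            shifted[i] = points[in_window].mean(axis=0)
    # Merge converged window positions that lie within merge_tol of each other;
    # the surviving positions are the modes of the density.
    modes = []
    labels = np.empty(len(points), dtype=int)
    for i, p in enumerate(shifted):
        for j, m in enumerate(modes):
            if np.linalg.norm(p - m) < merge_tol:
                labels[i] = j
                break
        else:
            modes.append(p)
            labels[i] = len(modes) - 1
    return labels, np.array(modes)
```

The double loop costs O(N²) per iteration, which is the slowness mentioned above; practical implementations mitigate this with spatial data structures or by binning the feature space.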