Chapter 7
Unsupervised learning in Vision

The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual data, and the aim of the latter is to give a computer the ability to find patterns in data in order to understand what is happening and to predict what will happen in new situations. It should therefore not be surprising that Machine Learning has become a necessary and very powerful tool to solve many challenging problems in Computer Vision. In this chapter we'll look at a few very basic unsupervised learning approaches in the context of Computer Vision: specifically dimensionality reduction (a type of unsupervised feature learning), and clustering (for segmenting images into meaningful parts).

7.1 Dimensionality reduction

Consider the set of 2D points in Figure 7.1(a). Each of these points is described by two values: its x- and y-coordinates. However, by transforming the data as in Figure 7.1(b) and (c), we see that it is actually possible to describe every point (in an approximate sense) by only one value: its $x'$-coordinate. The idea behind dimensionality reduction is to automatically learn the lower-dimensional subspace in which a given dataset can be described more efficiently. We will limit the discussion here to linear dimensionality reduction by way of the SVD, a technique sometimes referred to as principal component analysis (PCA). Other more powerful techniques exist, including kernel PCA, generalized discriminant analysis and, more recently, ones that rely on variational autoencoding (a type of deep neural network). Let us first review an important property of the SVD.

Figure 7.1: Toy example of dimensionality reduction: this set of 2D points can be transformed to a 1D subspace.

7.1.1 The SVD and low-rank approximation

The singular value decomposition (SVD) of an $m \times n$ matrix $A$,

$$A = U \Sigma V^T, \qquad (7.1)$$

factorizes $A$ into an $m \times m$ orthogonal matrix $U$, an $m \times n$ diagonal matrix $\Sigma$, and an $n \times n$ orthogonal matrix $V$. The main diagonal of $\Sigma$ contains the singular values of $A$ in non-increasing order, and the number of nonzero singular values is $r = \mathrm{rank}(A)$. The first $r$ columns of $U$ form an orthonormal basis for the column space of $A$. Let $\sigma_j$ be the $j$th singular value of $A$, and $u_j$ and $v_j$ the $j$th columns of $U$ and $V$ respectively. It follows that

$$A = \sum_{j=1}^{r} \sigma_j u_j v_j^T. \qquad (7.2)$$

One of the many powerful consequences of the SVD is that it allows us to approximate $A$ with a lower-rank matrix, by simply ignoring the smallest singular values. We calculate

$$A_\nu = \sum_{j=1}^{\nu} \sigma_j u_j v_j^T, \qquad (7.3)$$

with $\nu < r$. It can be shown that $A_\nu$ is the best approximation to $A$ (in both the 2-norm and Frobenius-norm sense) over all $m \times n$ matrices of rank less than or equal to $\nu$.
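To make equation (7.3) concrete, here is a minimal NumPy sketch that computes a rank-$\nu$ approximation from the SVD; the example matrix and the choice $\nu = 2$ are arbitrary.

```python
import numpy as np

# A small example matrix; any m x n matrix would do.
A = np.random.default_rng(0).normal(size=(6, 4))

# Full SVD: U is m x m, Vt is n x n, and s holds the singular values
# in non-increasing order, as in equation (7.1).
U, s, Vt = np.linalg.svd(A)

def low_rank_approx(U, s, Vt, nu):
    """Rank-nu approximation A_nu = sum_{j=1..nu} sigma_j u_j v_j^T, equation (7.3)."""
    return U[:, :nu] @ np.diag(s[:nu]) @ Vt[:nu, :]

A2 = low_rank_approx(U, s, Vt, nu=2)

# By the optimality property, the 2-norm of the error equals the first
# discarded singular value sigma_3.
print(np.linalg.norm(A - A2, 2), s[2])
```

The two printed values agree, illustrating the optimality property stated above.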

7.1.2 Learning a lower-dimensional subspace from images

One of the uses of dimensionality reduction in Computer Vision is to represent images of particular objects (we'll use faces, for example) more efficiently. The idea is to learn this representation from a given set of images. Such lower-dimensional representations can then feed into a face recognition system, as we'll briefly discuss at the end of the section.

Suppose the pixels in a face image are stored in the $p \times q$ matrix $A$. We stack the columns of $A$ sequentially into a vector of length $m = pq$. Note that all $p \times q$ matrices can be reshaped in this way, so that they all occupy an $m$-dimensional space. The value of $m$ may be quite large, e.g. for small $256 \times 256$ images we already have $m = 65{,}536$. There is a possibility, however, that the specific matrices under consideration occupy (by approximation) some lower-dimensional subspace. To find such a space we train the system on a representative set of face images and then use the SVD to calculate a reduced basis for the space they span.

To this end, consider a collection of $n$ vectors with $m$ components each, $f_i$, $i = 1, \ldots, n$. This collection will be called the training set of the system, and $n$ is typically much smaller than $m$. We calculate the average vector as

$$a = \frac{1}{n} \sum_{i=1}^{n} f_i, \qquad (7.4)$$

and subtract it from every vector in the training set to obtain column vectors $x_i = f_i - a$, $i = 1, \ldots, n$. An $m \times n$ matrix $X$ is built as

$$X = \frac{1}{n} \begin{bmatrix} x_1 & \cdots & x_n \end{bmatrix}. \qquad (7.5)$$

The average is subtracted because vectors constructed from similar images (such as face images) are likely to be clustered around their average, distant from the origin. We wish to determine an orthogonal basis for the space spanned by them, and centring around the origin improves the ability of such a basis to describe a larger range of vectors.

Finding an orthogonal basis for the training set is now a matter of finding a basis for the column space of $X$. Because the columns are somewhat similar with respect to the entire $m$-dimensional space, the singular values of $X$ should decrease rapidly. As described in Section 7.1.1 we can now infer an approximate dimension $\alpha$ for the column space by regarding singular values below a certain cut-off as zero. In the case of face images, the first $\alpha$ columns of $U$ in the SVD of $X$ are called the eigenfaces of the training set. These vectors span, by approximation, the column space of $X$. Note that since $\alpha < n$ and $n \ll m$, we have $\alpha \ll m$.
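As a concrete sketch of the procedure above, the following NumPy function builds the average face and the eigenface basis $U_\alpha$ from a stack of training images. The `(n, p, q)` array layout and the flattening order are choices made for this example (any fixed pixel ordering works), and the scale factor from (7.5) is kept even though it does not affect the singular vectors.

```python
import numpy as np

def eigenface_basis(images, alpha):
    """Compute the average face and the first alpha eigenfaces.

    images: array of shape (n, p, q) holding the n training face images.
    Returns (a, U_alpha) with a of length m = p*q and U_alpha of shape (m, alpha).
    """
    n, p, q = images.shape
    # Flatten each image into a vector of length m (any fixed pixel ordering works).
    F = images.reshape(n, p * q).T.astype(float)   # shape (m, n); columns are the f_i
    a = F.mean(axis=1)                             # average vector, equation (7.4)
    X = (F - a[:, None]) / n                       # centred data matrix, equation (7.5)
    # The 1/n factor only scales the singular values; it does not change U.
    # Thin SVD: only the first n left singular vectors are needed since n << m.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return a, U[:, :alpha]                         # alpha must be at most n
```

With, say, `alpha = 20`, the individual eigenfaces can be inspected by reshaping each column of `U_alpha` back to a $p \times q$ image.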

If the training set is sufficiently representative, the vectors constructed from arbitrary face images should also be contained (approximately) within that $\alpha$-dimensional subspace. Indeed, an arbitrary $m$-dimensional vector $f$ is projected orthogonally onto the subspace spanned by the eigenfaces by solving the over-determined linear system $U_\alpha y = f - a$ for $y$, in a least-squares sense. Here $U_\alpha$ contains the eigenfaces as columns. Therefore, since those columns are orthonormal,

$$y = U_\alpha^T (f - a). \qquad (7.6)$$

The $\alpha \times 1$ vector $y$ in the above expression is called the eigenface representation of $f$. We can reconstruct $f$ from its eigenface representation as

$$\tilde{f} = U_\alpha y + a. \qquad (7.7)$$

Note that $y$ is determined as a least-squares solution so that, in general, $\tilde{f}$ is slightly different from the original vector $f$. However, if the training set is representative then $\tilde{f}$ should include sufficient information to distinguish the face from those of other individuals. A small sketch of this projection and reconstruction is given at the end of the subsection.

Figure 7.2(a) shows a few example images from a database¹ taken of 40 individuals. The calculated average face is shown in (b), reshaped into a $p \times q$ image, and a few eigenfaces (columns of $U_\alpha$) in (c).

Figure 7.2: Examples of face images from the AT&T database, an average face and the first few eigenfaces (reshaped and represented as images). (a) The first five face images of our training set; (b) the average face; (c) the first four eigenfaces, reshaped and scaled to images.

¹ The AT&T Database of Faces, www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html.
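Continuing the sketch above, projection and reconstruction according to (7.6) and (7.7) are one-liners; the helper names follow the previous sketch and are purely illustrative.

```python
import numpy as np

def eigenface_representation(f, a, U_alpha):
    """Project a flattened face image f onto the eigenface subspace, equation (7.6)."""
    return U_alpha.T @ (f - a)

def reconstruct(y, a, U_alpha):
    """Map an eigenface representation back to pixel space, equation (7.7)."""
    return U_alpha @ y + a

# Example usage, given a and U_alpha from the previous sketch and a new
# p x q face image `img` flattened with the same pixel ordering:
#   f = img.reshape(-1).astype(float)
#   y = eigenface_representation(f, a, U_alpha)
#   f_tilde = reconstruct(y, a, U_alpha)
```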

7.1.3 A simple face recognition system

In this section we provide brief details of a classic algorithm for automatic face recognition that uses the concepts discussed in the previous section. This algorithm is based on one introduced by Sirovich and Kirby², which has been developed into a baseline for image-based face recognition systems.

The first step is to learn the feature representation, that is, to construct $a$ and $U_\alpha$ from a representative set of face images. In practice this vector and matrix are calculated once, and all face images that the system will encounter are mapped to their eigenface representations using these parameters. The idea is now that eigenface representations carry more distinctiveness within the class of all faces, and that these representations of different images should be compared rather than the raw pixels directly. One option might be to measure the distance between two eigenface representations $y_1$ and $y_2$ as the $L_2$ norm $\|y_1 - y_2\|$, and if this value is lower than some threshold the system can classify the two faces as belonging to the same person.

² L. Sirovich & M. Kirby, Low-dimensional procedure for the characterization of human faces, Journal of the Optical Society of America A, 4:519–524, 1987.

7.2 Clustering

Clustering is an example of unsupervised learning that attempts to group an unlabelled set of datapoints into separate classes. The points in every class should share some similarity, while points in different classes should be different somehow. One use of this in Computer Vision is image segmentation, where the goal is to break up an image into meaningful, perceptually similar regions. In this section we very briefly sketch the main ideas behind a few common clustering techniques.

7.2.1 Agglomerative clustering

One of the simplest forms of clustering is to iteratively merge similar clusters until some desired level has been reached. We may start with each pixel as its own cluster, then iteratively merge the closest pair of clusters. Here we would need a metric to compare two clusters, and options include the average distance between points in the two clusters, a maximum distance, a minimum distance, the distance between means, etc. Note that the points can be the raw pixel values (in 3D, if we're considering colour images) or some other feature. A minimal code sketch is given at the end of this subsection.

Advantages: Agglomerative clustering is simple to implement. No assumptions are made on cluster shapes, and these adapt to the image content. We obtain a hierarchy of clusters which can be useful in some cases.

Disadvantages: Clusters may become imbalanced. There are thresholds to be chosen (e.g. in deciding when to stop).
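As a rough sketch, pixel colours can be clustered agglomeratively with SciPy's hierarchical clustering routines; the 'average' linkage and the target number of clusters below are arbitrary choices, and the pairwise linkage step needs memory quadratic in the number of pixels, so this is only practical for small or subsampled images.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def agglomerative_segmentation(image, n_clusters=4):
    """Cluster pixels by colour using bottom-up (agglomerative) merging.

    image: array of shape (rows, cols, 3) with RGB values.
    Returns an array of shape (rows, cols) of integer cluster labels.
    """
    rows, cols, _ = image.shape
    features = image.reshape(-1, 3).astype(float)   # one 3D colour feature per pixel
    # 'average' linkage merges the pair of clusters with the smallest average
    # inter-point distance; 'single', 'complete' and 'centroid' correspond to the
    # minimum-distance, maximum-distance and distance-between-means criteria above.
    Z = linkage(features, method='average')
    labels = fcluster(Z, t=n_clusters, criterion='maxclust')
    return labels.reshape(rows, cols)
```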

7.2.2 K-means clustering

The idea behind the k-means algorithm is to initialize k cluster centres, and to then iteratively re-assign points to their nearest centres and update those centres. In image segmentation we consider every pixel as a point in some d-dimensional feature space. We may consider the RGB values of a pixel as its features, in which case d = 3, or supplement the colour with the position of the pixel in the image for a 5D feature space. K-means can be performed as follows (a minimal implementation is sketched after this list):

1. Select k cluster centres randomly in feature space.
2. Repeat until convergence:
   2.1 Assign every point to the cluster centre closest to it.
   2.2 Compute new cluster centres based on the points in the clusters.

Advantages: K-means is very popular, and also relatively simple to implement.

Disadvantages: We need to choose k beforehand, and there is a risk of under-segmentation if k is too small or over-segmentation if k is too large. The method is sensitive to outliers: one distant point can pull the cluster mean towards itself and impede that cluster from converging correctly. The method can converge to a local minimum, and is sensitive to initialization. Multiple restarts can help here.
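A minimal NumPy implementation of the two alternating steps might look as follows; the random initialization from the data points and the fixed iteration cap are simple choices made for this example.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    """Basic k-means. points: array of shape (N, d). Returns (labels, centres)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k of the data points as initial cluster centres.
    centres = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(n_iters):
        # Step 2.1: assign every point to the cluster centre closest to it.
        dists = np.linalg.norm(points[:, None, :] - centres[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                   # assignments unchanged: converged
        labels = new_labels
        # Step 2.2: recompute each centre as the mean of the points assigned to it.
        for j in range(k):
            if np.any(labels == j):                 # guard against empty clusters
                centres[j] = points[labels == j].mean(axis=0)
    return labels, centres

# Example: segment an RGB image by colour (d = 3):
#   pixels = image.reshape(-1, 3).astype(float)
#   labels, centres = kmeans(pixels, k=5)
#   segmentation = labels.reshape(image.shape[:2])
```

For a 5D feature space one would simply build `pixels` with five columns (colour plus image coordinates) instead of three.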

7.2.3 Mean shift clustering

The idea behind mean shift clustering is to view points in the feature space as samples from a non-parametric density function. The aim is to find the modes of this density. We start by subdividing the feature space into windows of a fixed size. The mean of the datapoints in every window is determined, and the window is shifted to be centred around this mean. The process repeats until convergence, and the final window means are the modes of the density function. Note that multiple windows can converge to the same mode.

Every mode has a basin of attraction: all points that converged to this mode during the iterations of the mean shift algorithm. The various basins of attraction form the clusters.

Advantages: Mean shift clustering is regarded as a good, general-purpose segmentation algorithm. It adapts the number of clusters and cluster shapes according to the data, and is robust against outliers.

Disadvantages: A window size has to be chosen, and the method struggles with high-dimensional features (where there are many modes to locate). A straightforward implementation is also very slow, but computational speedups to the algorithm do exist.
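As a rough sketch of the iteration, the following NumPy code runs mean shift with a flat circular window started at every data point (a common variant of the fixed-window subdivision described above); the window radius and the tolerance used to merge nearby modes are arbitrary choices.

```python
import numpy as np

def mean_shift(points, radius, n_iters=50, merge_tol=1e-2):
    """Mean shift with a flat (uniform) window of the given radius.

    points: array of shape (N, d).
    Returns (labels, modes); labels[i] indexes the mode whose basin of
    attraction point i belongs to.
    """
    points = points.astype(float)
    shifted = points.copy()                 # one window started at every point
    for _ in range(n_iters):
        for i in range(len(shifted)):
            # Shift the window to the mean of the data points it currently covers.
            in_window = np.linalg.norm(points - shifted[i], axis=1) <= radius
            shifted[i] = points[in_window].mean(axis=0)
    # Merge converged window positions that lie within merge_tol of each other;
    # the surviving positions are the modes of the density.
    modes = []
    labels = np.empty(len(points), dtype=int)
    for i, p in enumerate(shifted):
        for j, m in enumerate(modes):
            if np.linalg.norm(p - m) < merge_tol:
                labels[i] = j
                break
        else:
            modes.append(p)
            labels[i] = len(modes) - 1
    return labels, np.array(modes)
```

The double loop costs O(N²) per iteration, which is the slowness mentioned above; practical implementations mitigate this with spatial data structures or by binning the feature space.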