Nonlinear Image Interpolation using Manifold Learning


Christoph Bregler
Computer Science Division
University of California
Berkeley, CA 94720
bregler@cs.berkeley.edu

Stephen M. Omohundro*
Int. Computer Science Institute
1947 Center Street, Suite 600
Berkeley, CA 94704
om@research.nj.nec.com

*New address: NEC Research Institute, Inc., 4 Independence Way, Princeton, NJ 08540

Abstract

The problem of interpolating between specified images in an image sequence is a simple, but important task in model-based vision. We describe an approach based on the abstract task of "manifold learning" and present results on both synthetic and real image sequences. This problem arose in the development of a combined lip-reading and speech recognition system.

1 Introduction

Perception may be viewed as the task of combining impoverished sensory input with stored world knowledge to predict aspects of the state of the world which are not directly sensed. In this paper we consider the task of image interpolation, by which we mean hypothesizing the structure of images which occurred between given images in a temporal sequence. This task arose during the development of a combined lip-reading and speech recognition system [3], because the time windows for auditory and visual information are different (30 frames per second for the camera vs. 100 feature vectors per second for the acoustic information). It is an excellent visual test domain in general, however, because it is easy to generate large amounts of test and training data, and the performance measure is largely "theory independent". The test consists of simply presenting two frames from a movie and comparing the hypothesized intermediate frames to the actual ones. It is easy to use footage of a particular visual domain as training data in the same way.

Most current approaches to model-based vision require hand-constructed CAD-like models. We are developing an alternative approach in which the vision system builds up visual models automatically by learning from examples. One of the central components of this kind of learning is the abstract problem of inducing a smooth nonlinear constraint manifold from a set of examples from the manifold. We call this "manifold learning" and have developed several approaches closely related to neural networks for doing it [2]. In this paper we apply manifold learning to the image interpolation problem and numerically compare the results of this "nonlinear" process with simple linear interpolation. We find that the approach works well when the underlying model space is low-dimensional. In more complex examples, manifold learning cannot be directly applied to images but is still a central component in a more complex system (not discussed here).

We present several approaches to using manifold learning for this task. We compare the performance of these approaches to that of simple linear interpolation. Figure 1 shows the results of linear interpolation of lip images from the lip-reading system. Even in the short period of 33 milliseconds, linear interpolation can produce an unnatural lip image. The problem is that linear interpolation of two images just averages the two pictures. The interpolated image in Fig. 1 has two lower lip parts instead of just one. The desired interpolated image is shown in Fig. 2, and consists of a single lower lip positioned at a location between the lower lip positions in the two input pictures.

Figure 1: Linear interpolated lips. Figure 2: Desired interpolation.
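The averaging behavior of the linear baseline is easy to see on a toy example; the following sketch is purely illustrative (the arrays stand in for gray-level images and are not data from the paper):

```python
import numpy as np

def linear_interpolate(img_a, img_b, t=0.5):
    """Blend two gray-level images pixel-wise; t=0.5 gives the midpoint frame."""
    return (1.0 - t) * img_a + t * img_b

# Two toy 1-D "images": a bright bar at the left vs. at the right.
a = np.array([1.0, 0.0, 0.0, 0.0])
b = np.array([0.0, 0.0, 0.0, 1.0])

mid = linear_interpolate(a, b)
# The blend contains *both* bars at half intensity -- the analogue of the
# "two lower lips" artifact of Fig. 1 -- rather than one bar in between.
print(mid)  # [0.5 0.  0.  0.5]
```

A model-based interpolant would instead produce a single bar at an intermediate position, which is what the manifold-constrained methods below aim for.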
Our interpolation technique is nonlinear, and is constrained to produce only images from an abstract manifold in "lip space" induced by learning. Section 2 describes the manifold learning procedure, Section 4 introduces the interpolation technique based on the induced manifold, and Sections 5 and 6 describe our experiments on artificial and natural images.

2 Manifold Learning

Each n * m gray-level image may be thought of as a point in an n * m-dimensional space. A sequence of lip images produced by a speaker uttering a sentence lies on a 1-dimensional trajectory in this space (Figure 3). If the speaker were to move her lips in all possible ways, the images would define a low-dimensional submanifold (or nonlinear surface) embedded in the high-dimensional space of all possible gray-level images. If we could compute this nonlinear manifold, we could limit any interpolation algorithm to generate only images contained in it. Images not on the manifold cannot be generated by the speaker under normal circumstances.

Figure 3: Linear vs. nonlinear interpolation in gray-level dimensions (16x16 pixels = 256-dimensional space).

Figure 3 compares a curve of interpolated images lying on this manifold to straight-line interpolation, which generally leaves the manifold and enters the domain of images which violate the integrity of the model.

To represent this kind of nonlinear manifold embedded in a high-dimensional feature space, we use a mixture model of local linear patches. Any smooth nonlinear manifold can be approximated arbitrarily well in each local neighborhood by a linear "patch". In our representation, local linear patches are "glued" together with smooth "gating" functions to form a globally defined nonlinear manifold [2].

We use the "nearest-point query" to define the manifold. Given an arbitrary point near the manifold, this returns the closest point on the manifold. We answer such queries with a weighted sum of the linear projections of the point to each local patch. The weights are defined by an "influence function" associated with each linear patch, which we usually define by a Gaussian kernel. The weight for each patch is the value of its influence function at the point divided by the sum of all influence functions ("partition of unity"). Figure 4 illustrates the nearest-point query. Because Gaussian kernels die off quickly, the effect of distant patches may be ignored, improving computational performance.
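The nearest-point query can be sketched as follows, with each patch stored as a (center, orthonormal basis) pair and an isotropic Gaussian influence function; the function names, the shared kernel width, and the storage convention are illustrative assumptions, not details from the paper:

```python
import numpy as np

def project_to_patch(x, center, basis):
    """Orthogonal projection of x onto the affine patch {center + basis.T @ c}.
    basis has shape (patch_dim, n_dims) with orthonormal rows."""
    return center + basis.T @ (basis @ (x - center))

def nearest_point_query(x, patches, sigma=1.0):
    """Weighted sum of per-patch projections; Gaussian influence weights
    are normalized so they form a partition of unity."""
    projections, weights = [], []
    for center, basis in patches:
        projections.append(project_to_patch(x, center, basis))
        # Gaussian influence function centered on the patch
        weights.append(np.exp(-np.sum((x - center) ** 2) / (2 * sigma ** 2)))
    weights = np.array(weights)
    weights /= weights.sum()  # partition of unity
    return sum(w * p for w, p in zip(weights, projections))
```

With a single patch the query reduces to a plain linear projection; with several patches, the normalized Gaussian weights blend neighboring projections smoothly, and distant patches contribute negligibly, as noted above.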
The linear projections themselves consist of a dot product and so are computationally inexpensive.

For learning, we must fit such a mixture of local patches to the training data. An initial estimate of the patch centers is obtained from k-means clustering. We fit a patch to each local cluster using a local principal components analysis. Fine tuning of the model is done using the EM (expectation-maximization) procedure. This approach is related to the mixture-of-experts architecture [4], and to the manifold representation in [6]. Our EM implementation is related to [5], which uses a hierarchical gating function and local experts that compute linear mappings from one space to another space. In contrast, our approach uses a "one-level" gating function and local patches that project a space into itself.

Figure 4: Local linear patches glued together to form a nonlinear manifold. The nearest-point query is P(x) = Σ_i g_i(x) P_i(x) / Σ_i g_i(x), where g_i is the influence function of patch i and P_i(x) its linear projection.

3 Linear Preprocessing

Dealing with very high-dimensional domains (e.g. 256 * 256 gray-level images) requires large memory and computational resources. Much of this computation is not relevant to the task, however. Even if the space of images is nonlinear, the nonlinearity does not necessarily appear in all of the dimensions. Earlier experiments in the lip domain [3] have shown that images projected onto a 10-dimensional linear subspace still accurately represent all possible lip configurations. We therefore first project the high-dimensional images into such a linear subspace and then induce the nonlinear manifold within this lower-dimensional linear subspace. This preprocessing is similar to purely linear techniques [7, 10, 9].

4 Constraint Interpolation

Geometrically, linear interpolation between two points in n-space may be thought of as moving along the straight line joining the two points. In our nonlinear approach to interpolation, the point moves along a curve joining the two points which lies in the manifold of legal images. We have studied several algorithms for estimating the shortest manifold trajectory connecting two given points. For the performance results, we studied the point which is halfway along the shortest trajectory.
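The patch-fitting step of Section 2 (k-means initialization of the centers, followed by a local PCA per cluster) can be sketched as below; the EM fine-tuning is omitted, and the function name, parameters, and the plain SVD-based PCA are illustrative assumptions rather than the paper's exact implementation:

```python
import numpy as np

def fit_local_patches(data, n_patches, patch_dim, n_iter=20, seed=0):
    """Fit a mixture of local linear patches to data of shape (n_points, n_dims).
    Returns a list of (center, basis) pairs, basis of shape (patch_dim, n_dims)."""
    rng = np.random.default_rng(seed)
    # --- k-means: initial patch centers from random data points ---
    centers = data[rng.choice(len(data), n_patches, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        labels = np.argmin(
            ((data[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for k in range(n_patches):
            if np.any(labels == k):
                centers[k] = data[labels == k].mean(axis=0)
    # --- local PCA: leading right-singular vectors of each centered cluster ---
    patches = []
    for k in range(n_patches):
        cluster = data[labels == k] - centers[k]
        _, _, vt = np.linalg.svd(cluster, full_matrices=False)
        patches.append((centers[k], vt[:patch_dim]))
    return patches
```

The rows of `vt` are the principal directions of the cluster, so `vt[:patch_dim]` gives an orthonormal basis for the local linear patch.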

4.1 "Free-Fall"

The computationally simplest approach is to simply project the linearly interpolated point onto the nonlinear manifold. The projection is accurate when the point is close to the manifold. In cases where the linearly interpolated point is far away (i.e. no weight of the partition of unity dominates all the other weights), the closest-point query does not result in a good interpolant. As a worst case, consider a point in the middle of a circle or sphere. All local patches have the same weight, and the weighted sum of all projections is the center point itself, which is not a manifold point. Furthermore, near such "singular" points, the final result is sensitive to small perturbations in the initial position.

4.2 "Manifold-Walk"

A better approach is to "walk" along the manifold itself rather than relying on the linear interpolant. Each step of the walk is linear and in the direction of the target point, but the result is immediately projected onto the manifold. This new point is then moved toward the target point and projected onto the manifold, etc. When the target is finally reached, the arc length of the curve is approximated by the accumulated lengths of the individual steps. The point halfway along the curve is chosen as the interpolant. This algorithm is far more robust than the first one, because it only uses local projections, even when the two input points are far from each other. Figure 5b illustrates this algorithm.

4.3 "Manifold-Snake"

This approach combines aspects of the first two algorithms. It begins with the linearly interpolated points and iteratively moves the points toward the manifold. The Manifold-Snake is a sequence of n points preferentially distributed along a smooth curve with equal distances between them.
An energy function is defined on such sequences of points so that the energy minimum tries to satisfy these constraints (smoothness, equidistance, and nearness to the manifold):

    E = Σ_i ( α ||v_{i+1} - 2 v_i + v_{i-1}||² + β ||v_i - P(v_i)||² )    (1)

where P(v_i) is the nearest-point projection of v_i onto the manifold and α, β weight the two terms. E has value 0 if all v_i are evenly distributed on a straight line and also lie on the manifold. In general E can never be 0 if the manifold is nonlinear, but a minimum of E represents an optimizing solution. We begin with a straight line between the two input points and perform gradient descent in E to find this optimizing solution.

5 Synthetic Examples

To quantify the performance of these approaches to interpolation, we generated a database of 16 * 16 pixel images consisting of rotated bars. The bars were rotated for each image by a specific angle. The images lie on a one-dimensional nonlinear manifold embedded in a 256-dimensional image space. A nonlinear manifold represented by 16 local linear patches was induced from the 256 images. Figure 6a shows

two bars and their linear interpolation. Figure 6b shows the nonlinear interpolation using the Manifold-Walk algorithm. Figure 7 shows the average pixel mean squared error of linear and nonlinear interpolated bars. The x-axis represents the relative angle between the two input points. Figure 8 shows some iterations of a Manifold-Snake interpolating 7 points along a 1-dimensional manifold embedded in a 2-dimensional space.

Figure 5: Proposed interpolation algorithms: a) "Free Fall", b) "Surface Walk", c) "Surface Snake".

Figure 6: a) Linear interpolation, b) nonlinear interpolation.

Figure 7: Average pixel mean squared error of linear and nonlinear interpolated bars.
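The Manifold-Walk of Section 4.2 can be sketched as follows, assuming a `project(x)` routine implementing the nearest-point query; the step size, iteration cap, and halfway-point selection by accumulated arc length are illustrative choices:

```python
import numpy as np

def manifold_walk(start, target, project, step=0.1, max_steps=1000):
    """Walk from start toward target, re-projecting onto the manifold after
    each linear step; return the point halfway along the accumulated arc."""
    path = [project(start)]
    arc = [0.0]  # accumulated arc length at each path point
    for _ in range(max_steps):
        current = path[-1]
        gap = target - current
        if np.linalg.norm(gap) < step:
            break
        # one linear step toward the target, then snap back to the manifold
        nxt = project(current + step * gap / np.linalg.norm(gap))
        arc.append(arc[-1] + np.linalg.norm(nxt - current))
        path.append(nxt)
    # the interpolant: the path point closest to half the total arc length
    half = arc[-1] / 2.0
    idx = int(np.argmin([abs(a - half) for a in arc]))
    return path[idx]

# Degenerate check: with project = identity (the "manifold" is the whole
# space), the walk reduces to linear interpolation and returns the midpoint.
identity = lambda x: np.asarray(x, dtype=float)
mid = manifold_walk(np.array([0.0, 0.0]), np.array([1.0, 0.0]), identity)
```

With `project` given by the nearest-point query of a learned patch mixture, every intermediate point stays on the induced manifold, which is what distinguishes the walk from the straight-line baseline.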

Figure 8: Manifold-Snake iterations (0, 1, 2, 5, 10, and 30 iterations) on an induced 1-dimensional manifold embedded in 2 dimensions.

Figure 9: 16x16 images. Top row: linear interpolation. Bottom row: nonlinear "Manifold-Walk" interpolation.

6 Natural Lip Images

We experimented with two databases of natural lip images taken from two different subjects. Figure 9 shows a case of linearly interpolated and nonlinearly interpolated 16 * 16 pixel lip images using the Manifold-Walk algorithm. The manifold consists of 16 4-dimensional local linear patches. It was induced from a training set of 1931 lip images recorded with a 30 frames per second camera from a subject uttering various sentences. The nonlinear interpolated image is much closer to a realistic lip configuration than the linear interpolated image.

Figure 10 shows a case of linearly interpolated and nonlinearly interpolated 45 * 72 pixel lip images using the Manifold-Snake algorithm. The images were recorded with a high-speed 100 frames per second camera.* Because of the much higher dimensionality of the images, we projected the images into a 16-dimensional linear subspace. Embedded in this subspace, we induced a nonlinear manifold consisting of 16 4-dimensional local linear patches, using a training set of 2560 images. The linearly interpolated lip image shows upper and lower teeth, but with smaller contrast, because it is the average image of the open mouth and the closed mouth. The nonlinearly interpolated lip images show only the upper teeth and the lips half way closed, which is closer to the real lip configuration.

*The images were recorded in the UCSD Perceptual Science Lab by Michael Cohen.

7 Discussion

We have shown how induced nonlinear manifolds can be used to constrain the interpolation of gray-level images. Several interpolation algorithms were proposed

and experimental studies have shown that constrained nonlinear interpolation works well both in artificial domains and on natural lip images. Among various other nonlinear image interpolation techniques, the work of [1], using a Gaussian Radial Basis Function network, is most closely related to our approach. Their approach is based on feature locations found by pixelwise correspondence, whereas our approach directly interpolates gray-level images. Another related approach is presented in [8]. Their images are also first projected into a linear subspace and then modelled by a nonlinear surface, but they require their training examples to lie on a grid in parameter space so that they can use spline methods.

Figure 10: 45x72 images projected into a 16-dimensional subspace. Top row: linear interpolation. Bottom row: nonlinear "Manifold-Snake" interpolation.

References

[1] D. Beymer, A. Shashua, and T. Poggio, Example Based Image Analysis and Synthesis, M.I.T. A.I. Memo No. 1431, Nov. 1993.
[2] C. Bregler and S. Omohundro, Surface Learning with Applications to Lip-Reading, in Advances in Neural Information Processing Systems 6, Morgan Kaufmann, 1994.
[3] C. Bregler and Y. Konig, "Eigenlips" for Robust Speech Recognition, in Proc. of IEEE Int. Conf. on Acoustics, Speech, and Signal Processing, Adelaide, Australia, 1994.
[4] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton, Adaptive Mixtures of Local Experts, in Neural Computation, 3, 79-87.
[5] M.I. Jordan and R.A. Jacobs, Hierarchical Mixtures of Experts and the EM Algorithm, Neural Computation, Vol. 6, Issue 2, March 1994.
[6] N. Kambhatla and T.K. Leen, Fast Non-Linear Dimension Reduction, in Advances in Neural Information Processing Systems 6, Morgan Kaufmann, 1994.
[7] M. Kirby, F. Weisser, and G. Dangelmayr, A Model Problem in Representation of Digital Image Sequences, in Pattern Recognition, Vol. 26, No. 1, 1993.
[8] H. Murase and S.K. Nayar, Learning and Recognition of 3-D Objects from Brightness Images, Proc. AAAI, Washington D.C., 1993.
[9] P. Simard, Y. Le Cun, and J. Denker, Efficient Pattern Recognition Using a New Transformation Distance, Advances in Neural Information Processing Systems 5, Morgan Kaufmann, 1993.
[10] M. Turk and A. Pentland, Eigenfaces for Recognition, Journal of Cognitive Neuroscience, Volume 3, Number 1, MIT 1991.