Stacked Denoising Autoencoders for Face Pose Normalization


Yoonseop Kang 1, Kang-Tae Lee 2, Jihyun Eun 2, Sung Eun Park 2 and Seungjin Choi 1

1 Department of Computer Science and Engineering, Pohang University of Science and Technology, 77 Cheongam-ro, Nam-gu, Pohang 790-784, Korea
2 KT Advanced Institute of Technology, 17 Woomyeon-dong, Seocho-gu, Seoul, Korea
1 {e0en,seungjin}@postech.ac.kr
2 {kangtae.lee,eunjihyun,sungeun.park}@kt.com

Abstract. The performance of face recognition systems is significantly degraded by pose variations in face images. In this paper, a global pose normalization method is proposed for pose-invariant face recognition. The proposed method uses a deep network to convert non-frontal face images into frontal face images. Unlike existing part-based methods that require complex appearance models or multiple face part detectors, the proposed method relies only on a face detector. Experimental results on the Georgia Tech face database demonstrate the advantages of the proposed method.

Keywords: Pose normalization, face recognition, autoencoder, stacked denoising autoencoder

1 Introduction

Unlike traditional face recognition systems that recognize a large number of people, mobile devices or televisions need to recognize only a small set of people with high accuracy. Although many benchmark datasets are available, collecting enough training images of the specific people of interest is still very difficult. Therefore, we need a face recognition system that guarantees high accuracy with a small amount of training data.

Given only a small number of training samples, it is hard to generalize over the many kinds of variation, including pose and illumination. While the effect of illumination changes can be reduced by preprocessing techniques such as histogram normalization, pose changes are difficult to handle with simple preprocessing.
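As an aside, one common form of the histogram normalization mentioned above is histogram equalization. The following NumPy sketch is purely illustrative (it assumes an 8-bit grayscale image with at least two distinct gray levels) and is not part of the proposed method:

```python
import numpy as np

def equalize_histogram(img):
    """Remap gray levels so their cumulative distribution becomes roughly uniform."""
    hist = np.bincount(img.ravel(), minlength=256)
    cdf = np.cumsum(hist).astype(float)
    cdf_min = cdf[cdf > 0][0]  # CDF value of the darkest level actually present
    # Stretch the CDF to cover 0..255 (assumes at least two distinct gray levels)
    lut = np.clip(np.round(255 * (cdf - cdf_min) / (cdf[-1] - cdf_min)), 0, 255)
    return lut.astype(np.uint8)[img]

rng = np.random.default_rng(0)
img = rng.integers(100, 156, size=(8, 8), dtype=np.uint8)  # low-contrast image
out = equalize_histogram(img)  # gray levels now span the full 0..255 range
```

After equalization the darkest and brightest gray levels present in the image map to 0 and 255, spreading a low-contrast image over the full dynamic range.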
One popular approach to overcoming the difficulty of generalizing over pose changes is to apply pose normalization to face images. Pose normalization refers to methods that infer a frontal face image from a given non-frontal face image. Most pose normalization algorithms are part-based:

they locate face parts and re-locate them to their standard positions [1][2]. The main drawback of part-based methods is that they require all face parts to be located with high accuracy. Locating face parts is done by sophisticated models such as AAMs [3], and these models are not suitable for mobile devices with limited processor speed and memory.

Instead of a part-based method, we suggest a global method that uses only a single detector: a face detector. The proposed method takes a whole non-frontal face image and converts it into a frontal face image. To learn the complex mapping between non-frontal and frontal faces, the proposed method uses a deep network called a stacked denoising autoencoder. In this paper, we first review stacked denoising autoencoders, and then describe our framework for pose normalization and face recognition. Experiments on a benchmark dataset show the usefulness of the proposed method in improving face recognition accuracy.

2 Stacked Denoising Autoencoder

2.1 Denoising Autoencoder

The term autoencoder refers to an unsupervised, deterministic neural network that 1) generates a hidden representation from an input, and 2) reconstructs the input from the hidden representation [4]. An autoencoder is therefore composed of two parts: an encoding function and a decoding function (Fig. 1).

Fig. 1. Structure of an autoencoder with encoding function f(x, θ_f) and decoding function g(y, θ_g).

The encoding function f(x, θ_f) maps an input x to a hidden representation y using an affine transformation with a projection matrix W and a bias b, followed by a non-linear squashing function. The sigmoid function σ(x) = 1/(1 + exp(−x)) is typically used as the squashing function:

  y = f(x, θ_f) = σ(Wx + b)    (1)

The decoding function g(y, θ_g) then maps the hidden representation back to a reconstruction z of the input. The decoding function can be either linear or non-linear. An affine transformation is often used when the input takes real values, and a sigmoid squashing function is applied when the input is binary:

  z = g(y, θ_g) = W'y + b'  or  σ(W'y + b')    (2)

Training an autoencoder is done by minimizing the mean-squared reconstruction error with respect to the parameters θ_f = {W, b} and θ_g = {W', b'}:

  argmin_{θ_f, θ_g} E{ ||x − z||² }    (3)

Although autoencoders learn effective encodings that are capable of reconstructing inputs, they suffer from overfitting when the dimension of the hidden representation becomes high. Moreover, such autoencoders are likely to learn a trivial identity mapping instead of learning useful features from the data.

Fig. 2. Structure of a denoising autoencoder with encoding function f(x̃, θ_f), decoding function g(y, θ_g), and stochastic corruption q_D(x̃|x).

The denoising autoencoder (DAE) was proposed to overcome these limitations by reconstructing clean inputs x from corrupted, noisy inputs x̃ [5]. DAEs avoid overfitting and learn better, non-trivial features by applying stochastic noise to the training samples. One may generate a corrupted input x̃ from its original value x with several different stochastic corruption criteria q_D(x̃|x), including adding Gaussian random noise, randomly masking dimensions to zero, and adding salt-and-pepper noise (Fig. 2):

  x̃ ~ q_D(x̃|x)    (4)
  y = f(x̃, θ_f) = σ(Wx̃ + b)    (5)
  z = g(y, θ_g) = W'y + b'    (6)

The objective function of DAEs remains the same as for ordinary autoencoders. Note that the objective minimizes the discrepancy between the reconstruction and the original, uncorrupted input x, not the corrupted input x̃. A DAE is trained using back-propagation, just like an ordinary multi-layer perceptron.
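To make Eqs. (1)-(6) concrete, here is a minimal NumPy sketch of one DAE training step with a sigmoid encoder, an affine decoder, and masking noise. This is an illustrative toy, not the paper's implementation; the 8-4 dimensions, learning rate, and noise level are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corrupt(x, mask_prob=0.2):
    # q_D(x~|x): randomly mask input dimensions to zero
    return x * (rng.random(x.shape) > mask_prob)

def dae_step(x, W, b, W2, b2, lr=0.1):
    """One gradient step of a DAE; the loss compares z to the CLEAN input x."""
    x_t = corrupt(x)                  # x~ ~ q_D(x~|x), Eq. (4)
    y = sigmoid(x_t @ W + b)          # y = sigma(W x~ + b), Eq. (5)
    z = y @ W2 + b2                   # z = W' y + b', Eq. (6)
    err = z - x                       # gradient of 0.5 * ||x - z||^2
    # back-propagation through the two layers
    gW2 = y.T @ err; gb2 = err.sum(0)
    dy = (err @ W2.T) * y * (1 - y)   # sigmoid derivative
    gW = x_t.T @ dy; gb = dy.sum(0)
    for p, g in ((W, gW), (b, gb), (W2, gW2), (b2, gb2)):
        p -= lr * g / len(x)          # in-place parameter update
    return np.mean(err ** 2)

d, h = 8, 4
W = 0.1 * rng.standard_normal((d, h)); b = np.zeros(h)
W2 = 0.1 * rng.standard_normal((h, d)); b2 = np.zeros(d)
x = rng.random((32, d))
losses = [dae_step(x, W, b, W2, b2) for _ in range(200)]
```

Despite the stochastic corruption resampled at every step, the reconstruction error against the clean inputs decreases over training.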

2.2 Stacked DAEs

Stacking DAEs on top of each other allows the model to learn a more complex mapping from inputs to hidden representations [6]. As with other deep models such as deep belief networks [7], training a stacked DAE is done in two phases: layerwise greedy pre-training, followed by fine-tuning.

Fig. 3. Pre-training of stacked DAEs. (a) Train a bottom-layer DAE with clean and corrupted inputs (in our case, frontal faces and non-frontal faces), then (b) train another DAE that reconstructs z(2) from the hidden representation y(1) extracted by the bottom-layer DAE.

Unlike typical deep models that are extended by adding layers from bottom to top during pre-training, stacked DAEs are extended by adding layers in the middle. More specifically, pre-training of a stacked DAE proceeds as follows. First, train the bottom-layer DAE with encoding function y(1) = f(1)(x, θ_f(1)) and decoding function z(1) = g(1)(y(1), θ_g(1)) (Fig. 3(a)). Once the bottom-layer DAE is trained, train a new DAE that takes the hidden representations y(1) of the bottom-layer DAE as training data. Stochastic noise q_D(ỹ(1)|y(1)) is added to y(1) to generate the corrupted input ỹ(1):

  y(1) = f(1)(x, θ_f(1))    (7)
  ỹ(1) ~ q_D(ỹ(1)|y(1))    (8)
  y(2) = f(2)(ỹ(1), θ_f(2))    (9)
  z(2) = g(2)(y(2), θ_g(2))    (10)

Train more DAEs in the same way until the desired number of layers is reached. After pre-training, the weights and biases of the stacked DAE are fine-tuned by back-propagation, as in ordinary neural networks.
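The layerwise pre-training just described can be sketched in NumPy as follows. This is an illustrative toy, not the paper's configuration: the layer sizes, step counts, and the Gaussian corruption used for both layers are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def train_dae(x_clean, x_corrupt, n_hidden, n_steps=300, lr=0.5):
    """Train one DAE layer and return its encoder parameters (W, b).
    x_corrupt plays the role of samples from q_D(x~|x)."""
    d = x_clean.shape[1]
    W = 0.1 * rng.standard_normal((d, n_hidden)); b = np.zeros(n_hidden)
    V = 0.1 * rng.standard_normal((n_hidden, d)); c = np.zeros(d)
    for _ in range(n_steps):
        y = sigmoid(x_corrupt @ W + b)       # encode corrupted input
        z = y @ V + c                        # affine decode
        err = (z - x_clean) / len(x_clean)   # loss targets the CLEAN input
        dy = (err @ V.T) * y * (1 - y)
        V -= lr * (y.T @ err); c -= lr * err.sum(0)
        W -= lr * (x_corrupt.T @ dy); b -= lr * dy.sum(0)
    return W, b

# Layer 1: clean vs. corrupted inputs (Gaussian noise as a stand-in corruption)
x = rng.random((64, 16))
x_noisy = x + 0.1 * rng.standard_normal(x.shape)
W1, b1 = train_dae(x, x_noisy, 8)

# Layer 2 (Eqs. 7-10): the hidden codes of layer 1 become the new "clean" data,
# and are corrupted again before training the next DAE
y1 = sigmoid(x @ W1 + b1)
y1_noisy = y1 + 0.1 * rng.standard_normal(y1.shape)
W2, b2 = train_dae(y1, y1_noisy, 4)
```

The pattern repeats for each additional layer; fine-tuning would then back-propagate through the unrolled encoder-decoder stack.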

3 Pose Normalization using Stacked DAEs

Fig. 4. Schematic of the proposed pose normalization and face recognition system, composed of a stacked DAE and SVMs: non-frontal faces are fed to the stacked DAE, and the reconstructed frontal faces are classified into subject labels by a support vector machine.

The number of face images collected from users is often insufficient to cover the various changes in pose. On the other hand, it is relatively easy to obtain a large number of face images with pose variations from existing databases. It is therefore desirable to exploit such face databases to improve the performance of classifiers trained on the relatively small number of images provided by users. Pose normalization methods can assist classifiers through the following procedure:

1. Train a pose normalization algorithm that learns a general mapping from non-frontal faces to frontal faces, using existing face image databases.
2. Collect face images from users and train a classifier with the collected images.
3. Given a non-frontal face image as a query, run pose normalization on the query, then feed the pose-normalized image to the classifier for recognition.

As the mapping from non-frontal to frontal faces is highly complex, deep models like stacked DAEs are an ideal choice for learning it. Moreover, the fact that the learning procedure of DAEs is not affected by the type of corruption applied to the inputs suggests that one can use more sophisticated corruption procedures than simply adding random noise to samples. We therefore consider non-frontal face images as corrupted versions of a frontal face, and learn the mapping between non-frontal and frontal face images using a stacked DAE for pose normalization and face recognition.

As the images take real values, an affine decoding function is used for the bottom layer of the stacked DAE, and the sigmoid function for the remaining layers. The sigmoid function is used as the encoding function in all layers. As described above, the corrupted inputs for the bottom layer are the non-frontal images, and the inputs to the higher layers are corrupted by Gaussian noise with standard deviation σ (i.e., q_D(ỹ(1)|y(1)) = N(y(1), σ²I)).

4 Face Recognition on the Georgia Tech Face Database

4.1 Data Pre-processing

The Georgia Tech face database [8] consists of pictures of 50 subjects in 15 different poses. One of the 15 poses is frontal, and the variation among the poses is relatively high (Fig. 5).

Fig. 5. Frontal and 10 non-frontal faces of 10 subjects from the Georgia Tech face database.

We consider the frontal faces as uncorrupted inputs for the stacked DAE, and the remaining 14 non-frontal faces as corrupted inputs. The resulting dataset is still too small to train a large deep network. Therefore, we applied the following additional corruptions to every frontal and non-frontal image to generate more corrupted samples:

- Translate horizontally and vertically by up to 2 pixels.
- Rotate by -30 to 30 degrees.
- Flip horizontally.

By applying these corruption procedures, we expanded the original dataset of 750 images into a larger dataset of 26,550 images. All corrupted images were converted to grayscale and histogram-normalized.

4.2 Experimental Settings

Two different experimental settings were used to measure the performance of the proposed method.

1. Setting #1: To simulate the situation of having multiple training samples per subject, the whole dataset was randomly partitioned into a training set and a test set containing 80% and 20% of the samples, respectively. The training set was used to train both the stacked DAE and the SVMs, and the test set was used to evaluate the proposed face recognition system.
2. Setting #2: To simulate the extreme case of having only a single image per subject, we used the face images of 80% of the subjects for training the stacked DAE, and used the remaining 20% as the training set for the SVMs and the test set for the proposed face recognition system.

For pose normalization, we trained a 3-layer stacked DAE with 2000 and 1000 latent dimensions for the hidden layers (resulting in a network with 1024-2000-1000-2000-1024 nodes), and ran stochastic gradient updates for 1,200 epochs with batch size 100. The noise level σ was set to 0.2. We used a linear support vector machine (SVM) and a kernel SVM with an RBF kernel to classify the faces. As a baseline, we also ran the SVMs on the raw non-frontal face images (without passing them through the stacked DAE).

Table 1. Face recognition accuracies on the Georgia Tech face database.

  settings    baseline(linear)  baseline(rbf)  proposed(linear)  proposed(rbf)
  setting #1  0.108             0.098          0.843             0.806
  setting #2  0.235             0.289          0.454             0.409

4.3 Experimental Results

Before quantitatively analyzing the face recognition results, we first visualized the weights of the stacked DAE learned from pairs of corrupted non-frontal faces and frontal faces. Most of the weights contained circular filters that capture the rotations of the faces (Fig. 6(a)).

Fig. 6. (a) Subset of weights learned by the first layer of the stacked DAE trained on the Georgia Tech face database. Each small square corresponds to the weight values between one hidden node and all input pixels; higher pixel intensity indicates a larger weight value. (b) 10 examples of corrupted face images from the Georgia Tech face database (left), their pose-normalized versions obtained by the proposed method (middle), and the ground-truth frontal face images (right).

The reconstructed frontal images obtained by passing non-frontal, corrupted face images through the stacked DAE were somewhat blurry, but still retained the important characteristics of the ground-truth frontal face images (Fig. 6(b)). The quantitative comparison reveals the effectiveness of the proposed approach more clearly. When a relatively large number of training samples was provided (setting #1), the improvement in accuracy over the naive baseline was dramatic (84.3% vs. 10.8%). Even when only one training image per subject was given (setting #2), the proposed method still significantly improved the classification accuracy (45.4% vs. 23.5%).
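The dataset expansion described in Sec. 4.1 (translation by up to 2 pixels, rotation within -30 to 30 degrees, and horizontal flipping) can be sketched as below. This is an illustrative NumPy re-implementation, not the authors' code; the 32x32 image size and nearest-neighbour rotation are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def translate(img, dx, dy):
    # shift by (dx, dy) pixels, zero-filling the exposed border
    out = np.zeros_like(img)
    h, w = img.shape
    out[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)] = \
        img[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
    return out

def rotate(img, deg):
    # nearest-neighbour rotation about the image centre
    h, w = img.shape
    cy, cx = (h - 1) / 2, (w - 1) / 2
    t = np.deg2rad(deg)
    ys, xs = np.mgrid[0:h, 0:w]
    sy = np.round(cy + (ys - cy) * np.cos(t) - (xs - cx) * np.sin(t)).astype(int)
    sx = np.round(cx + (ys - cy) * np.sin(t) + (xs - cx) * np.cos(t)).astype(int)
    ok = (sy >= 0) & (sy < h) & (sx >= 0) & (sx < w)
    out = np.zeros_like(img)
    out[ys[ok], xs[ok]] = img[sy[ok], sx[ok]]
    return out

def augment(img):
    # one random corruption: translate (+/-2 px), rotate (+/-30 deg), maybe flip
    img = translate(img, rng.integers(-2, 3), rng.integers(-2, 3))
    img = rotate(img, rng.uniform(-30, 30))
    if rng.random() < 0.5:
        img = img[:, ::-1]  # horizontal flip
    return img

face = rng.random((32, 32))                       # stand-in for a face image
augmented = [augment(face) for _ in range(35)]    # many variants per image
```

Applied to every frontal and non-frontal image, a few dozen random corruptions per image suffice to grow 750 originals into the 26,550-image training set reported above.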

5 Conclusion

In this paper, we introduced a pose normalization and face recognition system based on stacked DAEs. By learning a general mapping from non-frontal faces to frontal faces with stacked DAEs, the proposed method performs pose normalization without any sophisticated appearance models. Experiments on the Georgia Tech face database demonstrated the effectiveness of the proposed system in terms of face recognition accuracy, both in the usual setting and in an extreme setting with only a single training sample per subject.

Acknowledgments. This work was supported by the POSTECH & KT Open R&D Program.

References

1. Du, S., Ward, R.: Component-wise pose normalization for pose-invariant face recognition. In: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP). (2009) 873-876
2. Asthana, A., Marks, T.K., Jones, M.J., Tieu, K.H., MV, R.: Fully automatic pose-invariant face recognition via 3D pose normalization. In: Proceedings of the International Conference on Computer Vision (ICCV). (2011) 937-944
3. Cootes, T., Walker, K., Taylor, C.J.: View-based active appearance models. In: Proceedings of the 4th IEEE International Conference on Automatic Face and Gesture Recognition. (2000) 227-232
4. Becker, S.: Unsupervised learning procedures for neural networks. The International Journal of Neural Systems 1&2 (1991) 17-33
5. Vincent, P., Larochelle, H., Bengio, Y., Manzagol, P.: Extracting and composing robust features with denoising autoencoders. In: Proceedings of the International Conference on Machine Learning (ICML). (2008) 1096-1103
6. Vincent, P., Larochelle, H., Lajoie, I., Bengio, Y., Manzagol, P.: Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11 (2010) 3371-3408
7. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Computation 18(7) (2006) 1527-1554
8. Nefian, A.V.: Georgia Tech face database. http://www.anefian.com/research/face_reco.htm (1999)