Robust LSTM-Autoencoders for Face De-Occlusion in the Wild

Fang Zhao, Jiashi Feng, Jian Zhao, Wenhan Yang, Shuicheng Yan

arXiv:1612.08534v1 [cs.cv] 27 Dec 2016

Abstract—Face recognition techniques have developed significantly in recent years. However, recognizing faces with partial occlusion is still challenging for existing face recognizers, and is heavily desired in real-world applications concerning surveillance and security. Although much research effort has been devoted to developing face de-occlusion methods, most of them can only work well under constrained conditions, such as all the faces coming from a pre-defined closed set. In this paper, we propose a robust LSTM-Autoencoders (RLA) model to effectively restore partially occluded faces even in the wild. The RLA model consists of two LSTM components, which aim at occlusion-robust face encoding and recurrent occlusion removal respectively. The first one, named the multi-scale spatial LSTM encoder, reads facial patches of various scales sequentially to output a latent representation, and occlusion-robustness is achieved owing to the fact that the influence of occlusion falls only upon some of the patches. Receiving the representation learned by the encoder, the LSTM decoder with a dual-channel architecture reconstructs the overall face and detects occlusion simultaneously, and by virtue of the LSTM, the decoder breaks down the task of face de-occlusion into restoring the occluded part step by step. Moreover, to minimize identity information loss and guarantee face recognition accuracy over recovered faces, we introduce an identity-preserving adversarial training scheme to further improve RLA. Extensive experiments on both synthetic and real datasets of faces with occlusion clearly demonstrate the effectiveness of our proposed RLA in removing different types of facial occlusion at various locations. The proposed method also provides a significantly larger performance gain than other de-occlusion methods in promoting recognition performance over partially occluded faces.

I. INTRODUCTION

In recent years, human face recognition techniques have demonstrated promising performance in many large-scale practical applications. However, in real-life images or videos, various occlusions can often be observed on human faces, such as sunglasses, masks and hands. The occlusion, as a type of spatially contiguous and additive gross noise, severely contaminates the discriminative features of human faces and harms the performance of traditional face recognition approaches that are not robust to such noise. To address this issue, a promising solution is to automatically remove facial occlusion before recognizing the faces [1], [2], [3], [4], [5]. However, most existing methods can only remove facial occlusions well under rather constrained environments, e.g., faces come from a pre-defined closed set or there is only a single type of occlusion. Thus those methods are not applicable to complex real scenarios like surveillance.

Fig. 1. We address the task of face de-occlusion with various types of occlusions under the condition of open test sets (i.e., test samples have no identical subject with training samples). (a) Original occlusion-free faces; (b) Occluded faces; (c) Recovered faces by our proposed method.

Fang Zhao, Jiashi Feng, Jian Zhao and Shuicheng Yan are with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore, e-mail: {elezhf, elefjia}@nus.edu.sg, zhaojian90@u.nus.edu, eleyans@nus.edu.sg. Wenhan Yang is with the Institute of Computer Science and Technology, Peking University, Beijing, 100080, P.R. China, e-mail: yangwenhan@pku.edu.cn.
In this work, we aim to address this challenging problem, face de-occlusion in the wild, where the faces can come from an open test set and the occlusions can be of various types (see Fig. 1). To solve this problem, we propose a novel face de-occlusion framework built upon our robust LSTM-Autoencoders (RLA). In real scenarios, facial occlusion often presents rather complex patterns and it is difficult to recover a clean face from the occluded one in a single step. Different from existing methods pursuing a one-stop solution to de-occlusion, the proposed RLA model removes occlusion in several successive processes to restore the occluded face parts progressively. Each step can benefit from the recovered results provided by the previous step. More concretely, the RLA model works as follows. Given a new face image with occlusion, RLA first employs a multi-scale spatial LSTM encoder to read patches of the image sequentially, so as to alleviate the contamination from occlusion in the encoding process. RLA produces an occlusion-robust latent representation of the face because the influence of occlusion falls only upon some of the patches. Then, a dual-channel LSTM decoder takes this representation as input and jointly reconstructs the occlusion-free face and detects the occluded regions from coarse to fine. The dual-channel LSTM decoder contains two complementary sub-networks, i.e., a face reconstruction network and an occlusion detection network.

These two networks collaborate with each other to localize and remove the facial occlusion. In particular, the hidden units of the reconstruction network feed forward the decoding information of face reconstruction at each step to the detection network to help the occlusion localization, and the detection network back-propagates the occlusion detection information into the reconstruction network to make it focus on reconstructing the occluded parts. Finally, the reconstructed face is integrated with the occluded face in an occlusion-aware manner to produce the recovered occlusion-free face.

We train the overall RLA in an end-to-end way, through minimizing the mean square error (MSE) between each recovered face and its ground-truth face. We observe that purely minimizing MSE usually over-smooths the restored facial parts and leads to loss of the discriminative features needed for recognizing person identity. This would hurt the performance of face recognition. Therefore, in order to preserve the identity information of recovered faces, we introduce an identity-based supervised CNN to encourage RLA to preserve the discriminative details during face recovery. However, this kind of supervised CNN results in severe artifacts in the recovered faces. We thus further introduce an adversarial discriminator [6], which learns to distinguish recovered faces from original occlusion-free faces, to remove the artifacts and enhance the visual quality of recovered faces. As can be seen in the experiments, introducing such discriminative regularization indeed effectively preserves the identity information of recovered faces and facilitates the following face recognition.

Our main contributions include the following three aspects. 1) We propose a novel LSTM autoencoder to remove facial occlusion step by step. To the best of our knowledge, this is the first research attempt to exploit the potential of LSTM autoencoders for face de-occlusion in the wild. 2) We introduce a dual-channel decoding process for jointly reconstructing faces and detecting occlusion. 3) We further develop a person-identity-diagnostic de-occlusion model, which is able to preserve more facial details and identity information in the recovered faces through employing a supervised and adversarial learning method.

II. RELATED WORK

A. Face De-Occlusion

There are some existing methods based on analytic-synthetic techniques for face de-occlusion. Wright et al. [1] proposed to apply sparse representation to encoding faces and demonstrated certain robustness of the extracted features to occlusion. Park et al. [2] showed that eye areas occluded by glasses can be recovered using PCA reconstruction and recursive error compensation. Li et al. [3] proposed a local non-negative matrix factorization (LNMF) method to learn spatially localized and part-based subspace representations to recover and recognize occluded faces. Tang et al. [4] presented a robust Boltzmann machine based model to deal with occlusion and noise. This unsupervised model uses multiplicative gating to induce a scale mixture of two Gaussians over pixels. Cheng et al. [5] introduced a stacked sparse denoising autoencoder with two channels to detect noise through exploiting the difference between the activations of the two SSDAs, which requires faces from the training and test sets to have the same occluded location. None of these methods consider open test sets. Test samples in their experiments share identical subjects with the training samples, which is too limited for practical applications.

B. Image Inpainting

Our work is also related to image inpainting, which mainly aims to fill in small image gaps or restore large background regions with similar structures. Classical image inpainting methods are usually based on local non-semantic algorithms. Bertalmio et al. [7] proposed to smoothly propagate information from the surrounding areas in the isophote direction for digital inpainting of still images. Criminisi et al. [8] introduced a best-first algorithm to propagate the confidence in the synthesized pixel values in a manner similar to the propagation of information in inpainting, and to compute the actual colour values using exemplar-based synthesis. Osher et al. [9] proposed an iterative regularization procedure for restoring noisy and blurry images using total variation regularization. It is difficult for those methods to remove gross spatially contiguous noise like facial occlusion, because too much structural information is lost in that case, e.g., when the entire eye or mouth is occluded. Recently, some methods based on global context features have been developed.
Xie et al. [10] proposed the stacked sparse denoising autoencoder (SSDA) for image denoising and inpainting through combining sparse coding and pre-trained deep networks. Pathak et al. [11] trained context encoders to generate images for inpainting or hole-filling, simultaneously learning feature representations which capture the appearance and semantics of visual structures. However, for these methods the locations of the image regions that require filling in are provided beforehand. By contrast, our method does not need to know the locations of the corrupted regions and identifies those regions automatically.

III. ROBUST LSTM-AUTOENCODERS FOR FACE DE-OCCLUSION

In this section we first briefly review the Long Short-Term Memory (LSTM). Then we elaborate the proposed robust LSTM-Autoencoders in detail, including the multi-scale spatial LSTM encoder, the dual-channel LSTM decoder and the identity-preserving component.

A. Long Short-Term Memory

Long Short-Term Memory (LSTM) [12] is a popular architecture of recurrent neural networks. It consists of a memory unit $c$, a hidden state $h$ and three types of gates: the input gate $i$, the forget gate $f$ and the output gate $o$. These gates are used to regulate reading and writing to the memory unit. More concretely, at each time step $t$, the LSTM first receives an input $x_t$ and the previous hidden state $h_{t-1}$, then computes the activations of the gates, and finally updates the memory unit to $c_t$ and the hidden state to $h_t$.

Fig. 2. Illustration of the framework of our proposed robust LSTM-Autoencoders and its training process. It consists of a multi-scale spatial LSTM encoder and a dual-channel LSTM decoder for concurrent face reconstruction and occlusion detection.

The involved computation is given as follows:

$$
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i), \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f), \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c), \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o), \\
h_t &= o_t \odot \tanh(c_t),
\end{aligned} \tag{1}
$$

where $\sigma(x) = 1/(1+\exp(-x))$ is the logistic sigmoid function, $\odot$ denotes the point-wise product, and $W$ and $b$ are the weights and biases of the three gates and the memory unit.

A major obstacle in using gradient descent to optimize standard RNN models is that the gradient may vanish quickly during back-propagation along the sequence. LSTM alleviates this issue effectively. Its memory unit sums up activities over all time steps, which guarantees that the gradients are distributed over the factors of the summation. Thus, back-propagation no longer suffers from the vanishing issue when applying LSTM to long sequence data, and LSTM can better memorize long-range context information. Owing to this property, LSTM has been extensively exploited to address a variety of problems concerning sequential data analysis, e.g., speech recognition [13], image captioning [14], action recognition [15] and video representation learning [16], as well as some problems that can be cast as sequence analysis, e.g., scene labeling [17] and image generation [18]. Here we utilize LSTM networks to build our face de-occlusion model, in which facial occlusion is removed by sequential processing that eliminates the effect of occlusion step by step.
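To make Eqn. (1) concrete, a minimal PyTorch-style sketch of one LSTM update is given below. Packing the four gate pre-activations into a single linear layer and the tensor shapes are our own assumptions for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class LSTMCellSketch(nn.Module):
    """Minimal LSTM cell following Eqn. (1): gates i, f, o and memory c."""
    def __init__(self, input_dim, hidden_dim):
        super().__init__()
        # One affine map producing pre-activations of i, f, candidate c and o.
        self.affine = nn.Linear(input_dim + hidden_dim, 4 * hidden_dim)

    def forward(self, x_t, h_prev, c_prev):
        z = self.affine(torch.cat([x_t, h_prev], dim=-1))
        i, f, g, o = z.chunk(4, dim=-1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)  # gates
        c_t = f * c_prev + i * torch.tanh(g)                            # memory update
        h_t = o * torch.tanh(c_t)                                       # hidden state
        return h_t, c_t
```

Running the cell over a sequence simply reuses `(h_t, c_t)` as the state of the next step.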
B. Robust LSTM-Autoencoders

In this work, we solve the problem of recovering an occlusion-free face from its noisy observation with occlusion. Let $X^{occ}$ denote an occluded face and let $X$ denote its corresponding occlusion-free face. Face de-occlusion then aims to find a function $f$ that removes the occlusion on $X^{occ}$ by minimizing the difference between the recovered face $f(X^{occ})$ and the occlusion-free face $X$:

$$\min_f \; \big\| f(X^{occ}) - X \big\|_F^2. \tag{2}$$

We propose to parameterize the recovering function $f$ using an autoencoder, which has been exploited for image denoising and inpainting [10]. The recovering function can then be expressed as

$$f(X^{occ}) = f_{dec}\big(f_{enc}(X^{occ}; W, b); W', b'\big), \tag{3}$$

where $\{W, b\}$ and $\{W', b'\}$ encapsulate the weights and biases of the encoder function and decoder function respectively. In image denoising and inpainting, the goal is to remove distributed noise, e.g., Gaussian noise, or contiguous noise with low magnitude, e.g., text. Unlike them, one cannot apply the autoencoder directly to remove facial occlusion. It is difficult to remove such a large area of spatially contiguous noise like occlusion in one step, especially in unconstrained environments where face images may have various resolutions, illuminations, poses and expressions, or may never appear in the training data. Inspired by divide-and-conquer algorithms [19] in computer science, here we propose an LSTM based autoencoder to divide the problem of de-occlusion into a series of sub-problems of occlusion detection and removal. Fig. 2 illustrates the framework of our proposed robust LSTM-Autoencoders (RLA) model. We now proceed to explain each of its components and how they work jointly to remove facial occlusion step by step.
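As a rough illustration of Eqns. (2) and (3), the recovering function is just an encoder-decoder composition trained against a squared Frobenius-norm loss. The `encoder` and `decoder` arguments below are generic placeholders, not the specific networks defined later in this section:

```python
import torch

def recover(encoder, decoder, x_occ):
    """f(X^occ) = f_dec(f_enc(X^occ)) as in Eqn. (3)."""
    return decoder(encoder(x_occ))

def deocclusion_loss(x_recovered, x_clean):
    """Squared Frobenius norm ||f(X^occ) - X||_F^2 of Eqn. (2)."""
    return ((x_recovered - x_clean) ** 2).sum()
```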

1) Multi-scale Spatial LSTM Encoder: Given the architecture shown in Fig. 2, we first explain the built-in LSTM encoder. The LSTM encoder learns representations from the input occluded face $X^{occ}$. It is worth noting that if the LSTM encoder takes the whole face as a single input, the occlusion will be involved in the overall encoding process and eventually contaminate the generated representation. In order to alleviate the negative effect of occlusion, as shown in the left panel of Fig. 2, we first divide the face image into $M \times N$ patches, denoted as $\{x_{i,j}\}_{i,j=1}^{M,N}$ (here $M = N = 2$), and feed them to a spatial LSTM network sequentially. Spatial LSTM is an extension of LSTM for analyzing two-dimensional signals [20]. It sequentializes the input image in a pre-defined order (here, from left to right and top to bottom). By doing so, some of the encoding steps see occlusion-free patches and thus are not affected by noise. Besides, the noisy information from occluded patches is not directly encoded into the feature representation, but is controlled by the gates of the spatial LSTM for the sake of the subsequent occlusion detection. At each step, the LSTM also encodes a larger region $x^{coarse}_{i,j}$ around the current patch but with a lower resolution to learn more contextual information. Here the whole image is used as $x^{coarse}_{i,j}$ and concatenated with $x_{i,j}$ as a joint input of the encoder. For each location $(i, j)$ in the $M \times N$ grid dividing the image, the multi-scale spatial LSTM encoder learns representations from the patch centered at $(i, j)$ as follows,

$$
\begin{pmatrix} i_{i,j} \\ f_{i-1,j} \\ f_{i,j-1} \\ \tilde{c}_{i,j} \\ o_{i,j} \end{pmatrix}
=
\begin{pmatrix} \sigma \\ \sigma \\ \sigma \\ \tanh \\ \sigma \end{pmatrix}
F_{W,b}\!\begin{pmatrix} x_{i,j} \\ x^{coarse}_{i,j} \\ h_{i-1,j} \\ h_{i,j-1} \end{pmatrix}, \tag{4}
$$

$$
c_{i,j} = f_{i-1,j} \odot c_{i-1,j} + f_{i,j-1} \odot c_{i,j-1} + i_{i,j} \odot \tilde{c}_{i,j}, \qquad
h_{i,j} = o_{i,j} \odot \tanh(c_{i,j}),
$$

where $F_{W,b}$ is an affine transformation w.r.t. the parameters $\{W, b\}$ of the memory unit and gates respectively (cf. Eqn. (1)). The memory unit $c_{i,j}$ is connected with the two previous memory units $c_{i-1,j}$ and $c_{i,j-1}$ in the 2-D space. It takes the information of neighboring patches into consideration when learning the representation for the current patch. After reading in all patches sequentially, the spatial LSTM encoder outputs its last hidden state $h_{M,N}$ in the sequence as the feature representation $h_{enc}$ of the occluded face. The representation is then recurrently decoded to extract face and occlusion information for face recovery.
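One update of the multi-scale spatial LSTM encoder in Eqn. (4) could be sketched as below: the fine patch, its coarse (whole-image) counterpart and the hidden/memory states of the two spatial predecessors are consumed together, with one forget gate per predecessor. The fused linear layer and flattened patch vectors are our own simplifications, not the authors' implementation:

```python
import torch
import torch.nn as nn

class SpatialLSTMEncoderCell(nn.Module):
    """One update of Eqn. (4): spatial predecessors (i-1, j) and (i, j-1)."""
    def __init__(self, patch_dim, hidden_dim):
        super().__init__()
        # Inputs: fine patch x_ij, coarse context x_ij^coarse, two neighbour states.
        in_dim = 2 * patch_dim + 2 * hidden_dim
        # Pre-activations for i, f_up, f_left, candidate c, o  ->  5 blocks.
        self.affine = nn.Linear(in_dim, 5 * hidden_dim)

    def forward(self, x_ij, x_coarse, h_up, c_up, h_left, c_left):
        z = self.affine(torch.cat([x_ij, x_coarse, h_up, h_left], dim=-1))
        i, f_up, f_left, g, o = z.chunk(5, dim=-1)
        i, f_up, f_left, o = map(torch.sigmoid, (i, f_up, f_left, o))
        c_ij = f_up * c_up + f_left * c_left + i * torch.tanh(g)  # 2-D memory update
        h_ij = o * torch.tanh(c_ij)
        return h_ij, c_ij
```

Patches on the first row or first column would receive zero states for the missing predecessor, which is a common convention we assume here.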
2) Dual-Channel LSTM Decoder: Given the representation $h_{enc}$ of an occluded face produced by the encoder, an LSTM decoder follows to map the learned representation back into an occlusion-free face. Traditional autoencoders, which have been used in image denoising, usually perform the decoding only once. However, as explained above, faces may contain a variety of occlusions in the real world. This kind of spatially contiguous noise corrupts images in a more malicious way than general stochastic noise such as Gaussian noise, because it incurs the loss of important structural information of the face. As a result, the face cannot be recovered very well by one-step decoding. Therefore, we propose to use an LSTM decoder to progressively restore the occluded part. As shown in the top right panel of Fig. 2, the LSTM decoder takes $h_{enc}$ as its input $h^{rec}_0$ at the first step, initializes its memory unit with the last memory state $c_{enc}$ of the encoder, and then keeps revising the output $\tilde{X}_t$ at each step based on the previous output $\tilde{X}_{t-1}$. The operations of the LSTM decoder for face reconstruction can be summarized as

$$
\begin{pmatrix} i^{rec}_t \\ f^{rec}_t \\ \tilde{c}^{rec}_t \\ o^{rec}_t \end{pmatrix}
=
\begin{pmatrix} \sigma \\ \sigma \\ \tanh \\ \sigma \end{pmatrix}
F^{rec}_{W,b}(h^{rec}_{t-1}), \tag{5}
$$

$$c^{rec}_t = f^{rec}_t \odot c^{rec}_{t-1} + i^{rec}_t \odot \tilde{c}^{rec}_t, \tag{6}$$

$$h^{rec}_t = o^{rec}_t \odot \tanh(c^{rec}_t), \tag{7}$$

$$\tilde{X}_t = \tilde{X}_{t-1} + W^{rec} h^{rec}_t + b^{rec}, \tag{8}$$

where "rec" indicates that the parameters belong to the reconstruction network. The final reconstructed face $X^{rec} = \sigma(\tilde{X}_T)$ is obtained by passing the output at the last step $T$ through a sigmoid function, and can be seen as a result refined by multiple rounds of decoding.

In the above decoding and reconstruction process, the decoder is applied to both the non-occluded and occluded parts. Thus, pixels of the non-occluded parts run the risk of being corrupted in the decoding process. To address this issue, we introduce another LSTM decoder which aims to detect the occlusion. Being aware of the location of occlusion, one can simply compensate the values of the non-occluded pixels using the original pixel values in the input. In particular, for each pixel, the occlusion detector estimates the probability of its being occluded. As illustrated in Fig. 2 (bottom right), at each step the LSTM detection network receives the hidden state $h^{rec}_t$ of the reconstruction network and updates its current occlusion scores $S_t$ based on the previous detection result. Here the cross-network connection provides the decoding information of face reconstruction at each step, allowing the detection network to better localize the occlusion. More formally, the LSTM decoder detects occlusion as follows,

$$
\begin{pmatrix} i^{de}_t \\ f^{de}_t \\ \tilde{c}^{de}_t \\ o^{de}_t \end{pmatrix}
=
\begin{pmatrix} \sigma \\ \sigma \\ \tanh \\ \sigma \end{pmatrix}
F^{de}_{W,b}\!\begin{pmatrix} h^{rec}_t \\ h^{de}_{t-1} \end{pmatrix}, \tag{9}
$$

$$c^{de}_t = f^{de}_t \odot c^{de}_{t-1} + i^{de}_t \odot \tilde{c}^{de}_t, \tag{10}$$

$$h^{de}_t = o^{de}_t \odot \tanh(c^{de}_t), \tag{11}$$

$$S_t = S_{t-1} + W^{de} h^{de}_t + b^{de}, \tag{12}$$

where "de" indicates that the parameters belong to the detection network. Similar to the reconstruction network, the final occlusion scores are given by $S^{de} = \sigma(S_T)$. Then combining the reconstructed face $X^{rec}$ and the occluded face $X^{occ}$ according to the occlusion scores $S^{de}$ gives the recovered face $\hat{X}$ with compensated pixels:

$$\hat{X} = X^{rec} \odot S^{de} + X^{occ} \odot (1 - S^{de}). \tag{13}$$

Note that $\hat{X}$ is actually a weighted sum of $X^{rec}$ and $X^{occ}$ using $S^{de}$. The pixel value in the reconstructed face $X^{rec}$ is fully preserved if the score is one, and the pixel value equals the one from the occluded face $X^{occ}$ if its occlusion score is zero.
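The interaction between the two decoding channels and the occlusion-aware blending of Eqn. (13) can be sketched as follows. We reuse PyTorch's `nn.LSTMCell` and feed $h^{rec}_{t-1}$ as the cell input, which is a simplification of Eqns. (5) and (9); module names, tensor shapes and the flattened image layout are assumptions, not the authors' code:

```python
import torch
import torch.nn as nn

class DualChannelDecoderSketch(nn.Module):
    """Sketch of Eqns. (5)-(13): recurrent face reconstruction + occlusion detection."""
    def __init__(self, hidden_dim, img_dim, steps=8):
        super().__init__()
        self.steps = steps
        self.rec_cell = nn.LSTMCell(hidden_dim, hidden_dim)  # driven by its own previous state
        self.det_cell = nn.LSTMCell(hidden_dim, hidden_dim)  # driven by h_rec of the same step
        self.to_face = nn.Linear(hidden_dim, img_dim)         # residual term of Eqn. (8)
        self.to_score = nn.Linear(hidden_dim, img_dim)        # residual term of Eqn. (12)

    def forward(self, h_enc, c_enc, x_occ):
        b = h_enc.size(0)
        x_occ = x_occ.view(b, -1)                     # flattened image, length img_dim
        h_rec, c_rec = h_enc, c_enc                   # initialised from the encoder
        h_det = torch.zeros_like(h_enc)
        c_det = torch.zeros_like(c_enc)
        x_tilde = torch.zeros_like(x_occ)             # accumulated face, Eqn. (8)
        s = torch.zeros_like(x_occ)                   # accumulated scores, Eqn. (12)
        for _ in range(self.steps):
            h_rec, c_rec = self.rec_cell(h_rec, (h_rec, c_rec))
            h_det, c_det = self.det_cell(h_rec, (h_det, c_det))
            x_tilde = x_tilde + self.to_face(h_rec)
            s = s + self.to_score(h_det)
        x_rec = torch.sigmoid(x_tilde)                # X^rec = sigmoid(X_T)
        s_de = torch.sigmoid(s)                       # S^de  = sigmoid(S_T)
        x_hat = x_rec * s_de + x_occ * (1.0 - s_de)   # occlusion-aware blend, Eqn. (13)
        return x_hat, x_rec, s_de
```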

Fig. 3. Pipeline of identity-preserving RLA with the supervised and adversarial CNNs. The gray box indicates that the parameters of the network are fixed during the fine-tuning stage. MSE: mean square error.

3) Optimization: Given a training dataset $\{X^{occ}_i, X_i\}_{i=1}^{K}$, substituting Eqn. (13) into Eqn. (2), we have the following mean square error function that RLA optimizes:

$$L_{mse}(W,b) = \frac{1}{2K}\sum_{i=1}^{K}\big\| X^{rec}_i \odot S^{de}_i + X^{occ}_i \odot (1 - S^{de}_i) - X_i \big\|_F^2, \tag{14}$$

which can be minimized by standard stochastic gradient descent. Taking its derivatives w.r.t. $X^{rec}_i$ and $S^{de}_i$ gives the gradients:

$$\frac{\partial L}{\partial X^{rec}_i} = \frac{1}{K}\big(\hat{X}_i - X_i\big) \odot S^{de}_i, \tag{15}$$

$$\frac{\partial L}{\partial S^{de}_i} = \frac{1}{K}\big(\hat{X}_i - X_i\big) \odot \big(X^{rec}_i - X^{occ}_i\big). \tag{16}$$

These are then used in error back-propagation to update the parameters of each LSTM network. Note that in Eqn. (15), the gradients for the non-occluded parts are suppressed to zero by the occlusion scores $S^{de}$, and thus the reconstruction network will prefer to reconstruct the occluded parts with the help of the occlusion detection network.

Since the model contains three networks, i.e., the encoder network, the face reconstruction network and the occlusion detection network, directly training the three networks simultaneously hardly gives a good local optimum and may converge slowly. To ease the optimization, we adopt a multi-stage optimization strategy. We first ignore the parameters of the occlusion detection network and pre-train the encoder and decoder to minimize the reconstruction error $\sum_{i=1}^{K} \| X^{rec}_i - X_i \|_F^2$. Then we fix their parameters and pre-train the decoder for occlusion detection to minimize the joint loss in Eqn. (14). These two rounds of separate pre-training provide us with sufficiently good initial parameters, and we then proceed to retrain all three networks jointly. We observe that this strategy usually gives better results and a faster convergence rate in the experiments.
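A minimal sketch of the loss in Eqn. (14), assuming all image tensors are flattened to shape (K, P), is given below. Under automatic differentiation its gradients with respect to $X^{rec}$ and $S^{de}$ coincide with Eqns. (15) and (16), so the hand-derived expressions need not be coded explicitly:

```python
import torch

def rla_mse_loss(x_rec, s_de, x_occ, x_clean):
    """Eqn. (14): MSE between the blended recovery and the ground-truth face.

    All tensors are assumed to have shape (K, P): K faces of P pixels each.
    """
    x_hat = x_rec * s_de + x_occ * (1.0 - s_de)   # occlusion-aware blend, Eqn. (13)
    k = x_clean.size(0)
    return ((x_hat - x_clean) ** 2).sum() / (2.0 * k)
```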
C. Identity-Preserving Face De-Occlusion

Although the RLA model introduced above is able to restore facial structural information (e.g., eyes, mouth and their spatial configuration) from occluded faces, it only minimizes the mean squared error between occlusion-free and recovered faces, which tends to over-smooth the restored parts and lose the discriminative details needed for recognizing person identity. To preserve the identity information and enhance the visual quality of recovered faces, we therefore introduce an identity-based supervised CNN together with an adversarial CNN to provide extra guidance for RLA during face recovery. Fig. 3 illustrates the resulting pipeline for identity-preserving RLA (IP-RLA).

A pre-trained CNN is concatenated to the decoder for classifying recovered faces with identity labels $\{y_i\}_{i=1}^{K}$, and helps tune RLA to simultaneously minimize the pixel-wise mean squared error $L_{mse}$ in Eqn. (14) and the classification loss

$$L_{sup} = -\frac{1}{K}\sum_{i=1}^{K} \log P\big(y_i \mid \hat{X}_i\big), \tag{17}$$

where $P(y_i \mid \hat{X}_i)$ denotes the probability that the recovered face $\hat{X}_i$ is assigned to its identity label $y_i$ by the supervised CNN. In this way we preserve high-level facial identity information while recovering the low-level structural information of faces.
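Eqn. (17) is an ordinary negative log-likelihood over identity labels, evaluated on recovered faces with the classifier held fixed. In the sketch below, `id_cnn` is a hypothetical stand-in for the pre-trained supervised CNN and is assumed to return identity logits:

```python
import torch
import torch.nn.functional as F

def identity_loss(id_cnn, x_hat, labels):
    """Eqn. (17): -1/K * sum_i log P(y_i | X_hat_i), with the supervised CNN fixed."""
    for p in id_cnn.parameters():          # classifier weights are not updated,
        p.requires_grad_(False)            # but gradients still flow back to x_hat
    logits = id_cnn(x_hat)                 # (K, num_identities)
    return F.cross_entropy(logits, labels) # averages -log P(y_i | X_hat_i) over the batch
```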

Fig. 5. Occlusion removal process of RLA. The middle columns display outputs of RLA from step 1 to 8. The first row shows outputs of the decoder with both the face reconstruction and occlusion detection components. The second row shows outputs from face reconstruction only (without occlusion detection). The third row displays the detection results of the occlusion.

However, we observe that the model produces severe artifacts in the recovered face images when fitting to the classification network. Similar to generative adversarial nets (GAN) [6], we introduce an adversarial discriminator to alleviate these artifacts. In particular, let $G$ denote the generator modeled by RLA and $D$ denote the adversarial discriminator modeled by a CNN. The optimization procedure can be viewed as a minimax game between $G$ and $D$, where $D$ is trained to discriminate original occlusion-free faces from faces recovered by $G$ through maximizing the log probability of predicting the correct labels (original or recovered) for both of them:

$$\min_G \max_D L_{adv}(D, G) = \frac{1}{K}\sum_{i=1}^{K}\Big[\log D(X_i) + \log\big(1 - D(G(X^{occ}_i))\big)\Big], \tag{18}$$

while $G$ is trained to recover more realistic faces which cannot be discriminated by $D$ through minimizing $L_{adv}(D, G)$. Both $G$ and $D$ are optimized alternately using stochastic gradient descent, as described in [6].

We first train the RLA model according to Eqn. (14) using the multi-stage optimization strategy mentioned previously, and then train the supervised CNN on original occlusion-free face data and the adversarial CNN on both original and recovered faces to obtain a good supervisor and discriminator. In the fine-tuning stage, we initialize the networks in Fig. 3 using these pre-trained parameters and update the parameters of RLA to optimize the following joint loss function in an end-to-end way:

$$L = L_{mse} + L_{sup} + \max_D L_{adv}(D, G). \tag{19}$$

Here the parameters of the supervised CNN are fixed because it has already learned correct filters from original occlusion-free faces. On the other side, we update the parameters of the adversarial CNN to maximize $L_{adv}(D, G)$ in Eqn. (18).
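One alternating update of the minimax game in Eqn. (18) combined with the joint objective of Eqn. (19) might look like the sketch below. The `rla`, `id_cnn` and `disc` modules are placeholders (with `rla` assumed to return the blended face, the raw reconstruction and the occlusion scores, all flattened to (K, P)), the discriminator is assumed to output probabilities, and the non-saturating generator loss is a standard GAN substitute for minimizing $\log(1 - D(\cdot))$ rather than the paper's exact formulation:

```python
import torch
import torch.nn.functional as F

def finetune_step(rla, id_cnn, disc, opt_g, opt_d, x_occ, x_clean, labels):
    """One alternating update of Eqns. (18)-(19) during IP-RLA fine-tuning."""
    # --- D step: maximize log D(X) + log(1 - D(G(X^occ)))               (Eqn. 18)
    with torch.no_grad():
        x_hat, _, _ = rla(x_occ)
    d_real, d_fake = disc(x_clean), disc(x_hat)
    loss_d = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- G (RLA) step: L = L_mse + L_sup + L_adv                        (Eqn. 19)
    x_hat, x_rec, s_de = rla(x_occ)
    k = x_clean.size(0)
    l_mse = ((x_rec * s_de + x_occ * (1 - s_de) - x_clean) ** 2).sum() / (2 * k)  # Eqn. (14)
    l_sup = F.cross_entropy(id_cnn(x_hat), labels)        # Eqn. (17); id_cnn kept frozen
    l_adv = F.binary_cross_entropy(disc(x_hat), torch.ones_like(d_fake))  # non-saturating form
    loss_g = l_mse + l_sup + l_adv
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()
```

Here `opt_g` is assumed to hold only RLA's parameters, so the frozen supervised CNN is never updated even though gradients pass through it.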

Fig. 6. Qualitative comparison of faces recovered by various de-occlusion methods on the occluded LFW dataset. The first row shows occlusion-free faces, the second row shows occluded faces, and the remaining rows show the results of PCA, AE, SRC, SSDA, the face reconstruction channel of RLA (Face Rec), RLA and IP-RLA, respectively. Best viewed at three times the size.

IV. EXPERIMENTS

To demonstrate the effectiveness of the proposed model, we evaluate it on two occluded face datasets, one containing synthesized occlusion and the other containing real occlusion. We present qualitative results of occlusion removal as well as a quantitative evaluation on face recognition.

A. Datasets

1) Training Data: Since it is hard to collect sufficient occluded faces and the corresponding occlusion-free ones in real life to model occluded faces in the wild, we train our model on a dataset synthesized from the CASIA-WebFace dataset [24]. CASIA-WebFace contains 10,575 subjects and 494,414 face images crawled from the Web. We select around 380,000 near-frontal faces (within −45° to +45°) from the dataset and synthesize occlusions caused by 9 types of common objects on these faces. The occluding objects we use include glasses, sunglasses, masks, hands, eye masks, scarfs, phones, books and cups. Each type of occluding object has 100 different templates, out of which half are used for generating occlusion on the training data and the rest are used for the test data. For each face, we randomly select one template from the 9 types of occlusion to generate the occluded face. Some occlusion templates require a correct location, such as sunglasses, glasses and masks. We add these templates onto specific locations of the faces with reference to detected facial landmarks. The other templates are added onto random locations of the faces to enhance the diversity of the produced data. All face images are cropped and coarsely aligned by three key points located at the centers of the eyes and the mouth, and then resized to 128 × 128 gray-level images. Fig. 4 illustrates some examples of occluded faces generated using this approach. We will release the dataset for training upon acceptance.

Fig. 4. Samples from the occluded CASIA-WebFace dataset which is used for training. In total 9 types of occlusion (50 templates for each type) are synthesized, including sunglasses, masks, hands, glasses, eye masks, scarfs, phones, books and cups.

2) Test Data: We use two datasets for testing, i.e., LFW [25] and 50OccPeople; the latter is constructed by ourselves. The LFW dataset contains a total of 13,233 face images of 5,749 subjects, which were collected from the Web. Note that LFW does not have any overlap with CASIA-WebFace [24]. In order to analyze the effects of various occlusions on face recognition, we add all 9 types of occlusion to every face in the dataset in a way similar to generating the training data. Our 50OccPeople dataset contains face images with real occlusion; it consists of 50 subjects and 1,200 images. Each subject has one normal face image and 23 face images taken under realistic illumination conditions with the same 9 types of occlusions. The test images are preprocessed in the same way as the training images. Note that both test datasets have completely different occlusion templates and subjects from the training dataset.

B. Settings and Implementation Details

Our model uses a two-layer LSTM network for the encoder and the decoder respectively, and each LSTM has 2,048 hidden units. Each face image is divided into four non-overlapping 64 × 64 patches, which is a reasonable size for capturing facial structures while reducing the negative effect of the occlusion. The LSTM encoder reads the facial patches from left to right and top to bottom, and meanwhile the whole image is resized to the same size and used as a different-scale input of the encoder.
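Following these settings (128 × 128 gray-level faces, four non-overlapping 64 × 64 patches read left-to-right and top-to-bottom, each paired with the downscaled whole image as the coarse input), the encoder's input sequence could be prepared as in this sketch; the flattening into vectors is an assumption made to match the simple linear layers used in the earlier sketches:

```python
import torch
import torch.nn.functional as F

def make_patch_sequence(face, patch=64):
    """face: (B, 1, 128, 128) gray-level images -> list of (fine, coarse) step inputs.

    Patches are visited left-to-right, top-to-bottom; the coarse input is the whole
    face resized to the patch size and shared by every step.
    """
    b, c, h, w = face.shape
    coarse = F.interpolate(face, size=(patch, patch), mode='bilinear',
                           align_corners=False).reshape(b, -1)
    steps = []
    for i in range(0, h, patch):          # rows, top to bottom
        for j in range(0, w, patch):      # columns, left to right
            fine = face[:, :, i:i + patch, j:j + patch].reshape(b, -1)
            steps.append((fine, coarse))
    return steps
```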

Fig. 8. Qualitative results of recovered faces in the presence of the 4 categories of occlusion for the proposed IP-RLA on the occluded LFW dataset, including (a) a quarter of the face at different locations, (b) the left or right half of the face, (c) the upper face and (d) the lower face. The first row shows occlusion-free faces, the second row shows occluded faces and the third row displays the recovered results of IP-RLA. Note that when the eyes or mouth are occluded completely, the restored results may not be very similar to the originals, but it is still possible to correctly predict some general facial attributes, such as gender, size, skin color and expression.

Fig. 7. Qualitative comparison of recovered faces of different subjects for the proposed IP-RLA under the condition of the same type of occlusion at the same location on the occluded LFW dataset. The first row shows occluded faces and the second row displays the recovered results of IP-RLA.

We set the number of steps of the decoder to 8 as a trade-off between effectiveness and computational complexity. We use the GoogLeNet [26] architecture for both the supervised and adversarial CNNs, and the original CASIA-WebFace dataset is used to pre-train the CNNs. For comparison, a standard autoencoder (AE) with four 2048-dimensional hidden layers (the same as our model) is implemented as a baseline method. We use Principal Component Analysis (PCA) as another baseline, which projects an occluded face image onto a 400-dimensional subspace and then takes the PCA reconstruction to be the recovered face. We also include comparisons with Sparse Representation-based Classification (SRC) [1] and the Stacked Sparse Denoising Autoencoder (SSDA) [10]. We test SRC using a subset of 20K images from CASIA-WebFace; even on this sampled training set, the estimation of SRC is already impractically slow. For SSDA, we use the same hyper-parameters as [10] and the same number and dimensions of hidden layers as our model. All experiments in this paper are conducted on a standard desktop with an Intel Core i7 CPU and GTX Titan GPUs.
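For reference, the PCA baseline described above (projecting an occluded face onto a 400-dimensional subspace and taking the PCA reconstruction as the recovery) amounts to the following sketch; the use of scikit-learn and fitting the subspace on occlusion-free training faces are our assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_baseline(train_faces, occluded_faces, n_components=400):
    """train_faces, occluded_faces: (N, 128*128) arrays of flattened gray-level faces."""
    pca = PCA(n_components=n_components)
    pca.fit(train_faces)                      # subspace learned from training faces (assumed occlusion-free)
    codes = pca.transform(occluded_faces)     # project onto the 400-D subspace
    recovered = pca.inverse_transform(codes)  # PCA reconstruction serves as the "recovered" face
    return np.clip(recovered, 0.0, 1.0)
```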

Fig. 9. Some examples of recovered faces by RLA and IP-RLA on the 50OccPeople dataset. The first row shows normal faces, the second row shows occluded faces, and the remaining rows display the results of RLA and IP-RLA.

C. Results and Comparisons

1) Occlusion Removal: We first look into the intermediate outputs of the proposed RLA during the process of occlusion removal, which are visualized in Fig. 5. It can be observed that our model does remove occlusion step by step. Specifically, at the first step, the face reconstruction network of the model produces a probable profile of the face, where occluded parts may not be as clear as non-occluded parts. The occlusion prediction network provides a coarse estimation of the occlusion region. Then the outputs are refined progressively based on the states of the previous steps. For example, one can see that more and more structures and textures are added to the face profile, and the shape of the occlusion region becomes sharper and sharper.

To verify the ability of our proposed model for occlusion removal, we present qualitative comparisons with several methods, including Principal Component Analysis (PCA), the Autoencoder (AE), Sparse Representation-based Classification (SRC) [1] and the Stacked Sparse Denoising Autoencoder (SSDA) [10], under different types of occlusion. We also evaluate the contribution of the components of our model in an ablation study, which includes the face reconstruction channel of RLA (Face Rec), RLA and the identity-preserving RLA (IP-RLA). Fig. 6 gives example results of occlusion removal on faces from the occluded LFW dataset. From the figure, one can see that for each type of occlusion, RLA restores the occluded parts well and in the meantime retains the original appearance of the non-occluded parts. This also demonstrates that the detection of occlusion is rather accurate. Although our model is trained on CASIA-WebFace, which shares no subjects or occlusion templates with the test datasets, it can still remove occlusion effectively without knowing the type and location of the occlusion. Through fine-tuning with the supervised and adversarial CNNs, the proposed IP-RLA further recovers and sharpens some discriminative patterns, such as edges and textures, in the occluded parts. Note that using only the face reconstruction network in the decoder of RLA damages fine-grained structures of the non-occluded parts. This is undesirable because it might lose key information for the following face recognition task. By comparison, PCA cannot remove the occlusion and only blurs it. SRC does not appropriately reconstruct the occluded parts and severely damages or changes the appearance of the non-occluded parts. AE and SSDA remove the occlusion but over-smooth many details, which results in recovered faces biased toward an average face. This clearly demonstrates the advantage of removing occlusion progressively in a recurrent framework.

We also test the proposed IP-RLA on faces of different subjects corrupted by the same type of occlusion at the same location. The results are shown in Fig. 7. It can be seen that our method recovers diverse results for different subjects, which demonstrates that our method does not simply produce the mean of the occluded facial parts over the training dataset but predicts meaningful appearances according to the non-occluded parts of different subjects.

Furthermore, based on the occluded location and area, we divide the occlusions into 4 categories: a quarter of the face at different locations, the left or right half of the face, the upper face and the lower face. Fig. 8 compares the recovered results of IP-RLA under different occlusion categories on the occluded LFW dataset. As one can see, a quarter-face occlusion is removed easily by our model. When the left or right half of a face is occluded, although the occluded area is large, our model still produces recovered faces with high similarity to the original occlusion-free faces. The model may exploit facial symmetry and learn specific feature information from the non-occluded half of the face. When the upper or lower part of a face is occluded, our model can also remove the occlusion, but the restored parts may not be very similar to the original parts, as in the 4th column of Fig. 8(c).
This is because it is extremely challenging to infer exactly the appearance of the lower (upper) face from the upper (lower) face. However, it is still possible to correctly predict some general facial attributes, such as gender, size, skin color and expression. Besides the synthetic occluded face dataset, we also test our model on the 50OccPeople dataset, a real occluded face dataset, to verify its performance in practice. Some results are illustrated in Fig. 9. One can see that our model still obtains good de-occlusion results although it is trained only on synthetic occluded faces.

TABLE I
EQUAL ERROR RATES (EER) OF FACE RECOGNITION FOR OCCLUSIONS OF DIFFERENT TYPES ON THE OCCLUDED LFW DATASET.
(IP-RLA, RLA and Face Rec are variants of our model.)

Category                  Occlusion    IP-RLA   RLA     Face Rec  SSDA    SRC     AE      PCA     Occluded face
Quarter of face           Hand          5.6%     6.2%   14.8%     37.0%   40.8%   26.8%   12.4%    6.5%
Quarter of face           Book          5.9%     6.5%   15.0%     37.7%   42.0%   27.8%   12.5%    7.0%
Left/right half of face   Hand          9.3%    10.5%   20.3%     39.1%   40.1%   30.5%   18.6%   12.8%
Left/right half of face   Book          9.8%    11.4%   21.4%     40.4%   43.0%   32.6%   21.0%   13.4%
Upper face                Glasses       5.8%     5.8%   12.8%     36.3%   32.7%   25.0%   11.8%    6.7%
Upper face                Sunglasses    9.9%    10.7%   22.9%     42.9%   38.8%   33.6%   20.7%   10.0%
Upper face                Eye mask     25.5%    27.3%   34.4%     44.2%   43.2%   39.5%   33.3%   27.2%
Lower face                Mask          9.2%    12.1%   20.9%     40.3%   44.7%   31.9%   21.4%   12.1%
Lower face                Phone         7.2%     7.8%   15.3%     37.8%   42.3%   28.9%   14.2%    8.7%
Lower face                Cup           5.7%     6.1%   15.3%     37.9%   41.3%   28.4%   12.7%    5.8%
Lower face                Scarf         9.3%    11.0%   20.9%     39.8%   44.7%   33.7%   17.3%   10.0%

TABLE II
AVERAGE EER OF FACE RECOGNITION FOR ALL TYPES OF OCCLUSION ON THE 50OCCPEOPLE DATASET.

IP-RLA   RLA     Face Rec  SSDA    SRC     AE      PCA     Occluded face
18.0%    18.2%   23.2%     42.6%   45.5%   35.0%   25.6%   19.1%

2) Face Recognition: We carry out face verification experiments on the faces recovered by the de-occlusion methods to further investigate the ability of our model in recognizing occluded faces. We first extract feature vectors for a pair of face images (one is an occlusion-free face, and the other is a recovered face or an occluded face) and compute the similarity between the two feature vectors using Joint Bayesian [27] to decide whether the pair of faces comes from the same subject. A CNN is adopted to extract face features in the experiments. We train a GoogLeNet model on CASIA-WebFace, and a 6,144-dimensional feature vector is obtained by concatenating the activation outputs of the hidden layers before the three loss layers. After reducing the dimension with PCA, we obtain an 800-dimensional feature vector for each face image.

We first evaluate the recognition performance for different types of occlusion on the occluded LFW dataset. We compute the equal error rates (EER) on the pre-defined pairs of faces provided by the dataset website. The pair set contains 10 mutually exclusive folds, each with 300 positive pairs and 300 negative pairs. By alternately occluding the two faces in a pair, a total of 12,000 pairs are generated for testing. Table I reports the verification results for various occlusions and de-occlusion methods. We compare our proposed model with the other methods, including PCA, AE, SRC and SSDA, and also list the verification performance on occluded face images for reference. As one can see, IP-RLA performs the best for all types of occlusion, as it produces more discriminative occlusion-free faces than the other methods. Note that combining with the occlusion detection significantly reduces the error rate compared with recovering faces without occlusion detection. This is because utilizing occlusion detection to retain the non-occluded parts effectively preserves the discriminative information contained in these parts. SRC does not achieve the performance reported in [1] because the open test set shares no identical subjects with the training dataset. SSDA performs even worse than the standard Autoencoder (AE), which shows that it cannot handle a large area of spatially contiguous noise like occlusion well, although it is effective for removing Gaussian noise and contiguous noise with low magnitude like text. Note that using only face reconstruction (Face Rec) still achieves better performance than the standard Autoencoder (AE). This demonstrates the effectiveness of the progressive recovery framework.
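The equal error rate reported in Tables I and II is the operating point at which the false accept rate equals the false reject rate. A small NumPy sketch of estimating it from verification scores follows; the brute-force threshold sweep is our own illustration, not the paper's evaluation code:

```python
import numpy as np

def equal_error_rate(scores, labels):
    """scores: similarity for each pair; labels: 1 = same subject, 0 = different."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_gap, eer = np.inf, 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)   # false accept rate
        frr = np.mean(scores[labels == 1] < t)    # false reject rate
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer
```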
Similar to the observations made in the qualitative analysis, occlusion removal for a quarter or the left/right half of the face improves the performance of occluded face recognition more, because the appearance of the occluded facial parts can be predicted from the non-occluded parts by utilizing facial symmetry. However, recovered faces for upper- or lower-face occlusion still achieve lower error rates compared with occluded faces, which indicates that our model can learn relations between the upper and lower face and extract discriminative features from the non-occluded upper (lower) face to recover the occluded lower (upper) face.

We also compare the overall verification performance for all types of occlusion on the 50OccPeople dataset. We randomly sample 10,000 pairs (5,000 positive pairs and 5,000 negative pairs) of faces for testing. The EERs averaged over all types of occlusion are listed in Table II. The verification results show that our model outperforms the other methods and generalizes to real occluded face data.

V. CONCLUSIONS

In this paper we have proposed robust LSTM-Autoencoders to address the problem of face de-occlusion in the wild. The proposed model is shown to be able to effectively recover occluded facial parts progressively.

The proposed model contains a spatial LSTM network that encodes face patches sequentially at different scales to extract a feature representation, and a dual-channel LSTM network that decodes the representation to reconstruct the face and detect occlusion step by step. Extra supervised and adversarial CNNs are introduced to fine-tune the robust LSTM autoencoder and enhance the discriminative information about person identity in the recovered faces. Extensive experiments on synthetic and real occlusion datasets demonstrate that the proposed model outperforms other de-occlusion methods in terms of both the quality of recovered faces and the accuracy of occluded face recognition.

REFERENCES

[1] J. Wright, A. Yang, A. Ganesh, S. Sastry, and Y. Ma, "Robust face recognition via sparse representation," IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), vol. 31, no. 2, pp. 210-227, 2009.
[2] J. Park, Y. Oh, S. Ahn, and S. Lee, "Glasses removal from facial image using recursive error compensation," IEEE Trans. Pattern Analysis and Machine Intelligence (TPAMI), vol. 27, no. 5, pp. 805-811, 2005.
[3] S. Li, X. Hou, H. Zhang, and Q. Cheng, "Learning spatially localized, parts-based representation," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn. (CVPR), 2001, pp. I-207-I-212.
[4] Y. Tang, R. Salakhutdinov, and G. Hinton, "Robust Boltzmann machines for recognition and denoising," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn. (CVPR), 2012, pp. 2264-2271.
[5] L. Cheng, J. Wang, Y. Gong, and Q. Hou, "Robust deep auto-encoder for occluded face recognition," in Proc. 23rd ACM Int. Conf. on Multimedia, 2015, pp. 1099-1102.
[6] I. Goodfellow, J. Pouget-Abadie, M. Mirza, et al., "Generative adversarial nets," in Proc. Adv. Neural Info. Process. Syst. (NIPS), 2014, pp. 2672-2680.
[7] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester, "Image inpainting," in Proc. ACM Conf. Comp. Graphics (SIGGRAPH), 2000, pp. 417-424.
[8] A. Criminisi, P. Perez, and K. Toyama, "Region filling and object removal by exemplar-based image inpainting," IEEE Trans. on Image Processing (TIP), vol. 13, no. 9, pp. 1200-1212, 2004.
[9] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin, "An iterative regularization method for total variation-based image restoration," Multiscale Modeling & Simulation, vol. 4, no. 2, pp. 460-489, 2005.
[10] J. Xie, L. Xu, and E. Chen, "Image denoising and inpainting with deep neural networks," in Proc. Adv. Neural Info. Process. Syst. (NIPS), 2012, pp. 341-349.
[11] D. Pathak, P. Krahenbuhl, J. Donahue, et al., "Context encoders: Feature learning by inpainting," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn. (CVPR), 2016.
[12] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735-1780, 1997.
[13] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in Proc. Int. Conf. Mach. Learn. (ICML), 2014, pp. 1764-1772.
[14] K. Xu, J. Ba, R. Kiros, A. Courville, et al., "Show, attend and tell: Neural image caption generation with visual attention," CoRR, vol. abs/1502.03044, 2015.
[15] J. Donahue, L. A. Hendricks, S. Guadarrama, and M. Rohrbach, "Long-term recurrent convolutional networks for visual recognition and description," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn. (CVPR), 2015, pp. 2625-2634.
[16] N. Srivastava, E. Mansimov, and R. Salakhutdinov, "Unsupervised learning of video representations using LSTMs," in Proc. Int. Conf. Mach. Learn. (ICML), 2015, pp. 843-852.
[17] W. Byeon, T. Breuel, F. Raue, and M. Liwicki, "Scene labeling with LSTM recurrent neural networks," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn. (CVPR), 2015, pp. 3547-3555.
[18] K. Gregor, I. Danihelka, A. Graves, and D. Wierstra, "DRAW: A recurrent neural network for image generation," in Proc. Int. Conf. Mach. Learn. (ICML), 2015, pp. 1462-1471.
[19] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, 2001.
[20] L. Theis and M. Bethge, "Generative image modeling using spatial LSTMs," in Proc. Adv. Neural Info. Process. Syst. (NIPS), 2015, pp. 1918-1926.
[21] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, "DeepFace: Closing the gap to human-level performance in face verification," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn. (CVPR), 2014, pp. 1701-1708.
[22] Y. Sun, X. Wang, and X. Tang, "Deep learning face representation from predicting 10,000 classes," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn. (CVPR), 2014, pp. 1891-1898.
[23] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn. (CVPR), 2015, pp. 815-823.
[24] D. Yi, Z. Lei, S. Liao, and S. Li, "Learning face representation from scratch," CoRR, vol. abs/1411.7923, 2014.
[25] G. Huang, M. Ramesh, T. Berg, and E. Learned-Miller, "Labeled faces in the wild: A database for studying face recognition in unconstrained environments," University of Massachusetts, Amherst, Tech. Rep. 07-49, October 2007.
[26] C. Szegedy, W. Liu, Y. Jia, et al., "Going deeper with convolutions," in Proc. IEEE Conf. Comp. Vis. Pattern Recogn. (CVPR), 2015, pp. 1-9.
[27] D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, "Bayesian face revisited: A joint formulation," in Proc. Eur. Conf. Comp. Vis. (ECCV), 2012, pp. 566-579.