Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network. Nathan Sun CIS601



1 Disguised Face Identification (DFI) with Facial KeyPoints using Spatial Fusion Convolutional Network Nathan Sun CIS601

2 Introduction
Face identification (face ID) is complicated by alterations to an individual's appearance: beard, glasses, sunglasses, wig, hairstyle, hair color, hat, etc. These alterations decrease performance. Facial key-points are required to analyze the shape of the face. There are two main state-of-the-art methods:
1. Use a feature extraction algorithm (e.g. Gabor features) with texture-based and shape-based features to detect different facial key-points.
2. Use probabilistic graphical models to capture the relationship between pixels and features to detect facial key-points.
Using a DNN for this task is very challenging because the available datasets are small; a larger training dataset gives better performance.

3 Transfer Learning
The lack of data means designers have to use transfer learning. Transfer learning is a machine learning research problem in which knowledge gained from solving one problem is applied to a different but related problem (e.g. knowledge gained identifying cars can be used to identify trucks). Performance might be sufficient, but the approach may under-perform because the lack of data makes it impossible to fine-tune the pre-trained DNNs properly.

4 Contributions of this Paper
Disguised Face Identification (DFI) Framework: uses a Spatial Fusion Deep Convolutional Network (DCN) to extract 14 key-points (essential to describe the facial structure). The extracted points are connected to form a star-net, and the orientations of the points are used by the classification framework for face ID.
Simple and Complex Face Disguise Datasets: proposes two Face Disguise (FG) datasets, one with simple and one with complex backgrounds, that researchers can use in future to train DCNs for facial key-point detection.

5 14 Essential Facial key-points (S. Zhang et al. 2016)


6 Simple and Complex Face Disguise Datasets
Databases for disguise-related research have limited disguise variations. A DCN requires images of people with beard, glasses, different hairstyles, scarf, cap, etc. The paper proposes two Face Disguise datasets of 2000 photos each, with Simple and Complex backgrounds and varied illuminations: 8 different backgrounds, 25 subjects, 10 different disguises. Note that images with complex backgrounds have a higher percentage of background in the picture as a whole.

7 Convolutional Neural Networks: A Review


8 Overview of DCN Process
8 convolution layers extract increasingly specific data and end in the Loss 1 function (solves the regression problem by comparing output with ground truth). 5 spatial fusion layers end in the Loss 2 function (solves the classification problem by finding the mean squared error). Heat maps of the 14 key-points are generated and form the star-net structure. Classification is based on the star-net orientation of the points.
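The step from heat maps to key-point locations can be sketched as follows. This is a minimal illustration, assuming each key-point is read off as the position of the largest value in its heat map; the function names `heatmap_peak` and `keypoints_from_heatmaps` are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

HeatMap = List[List[float]]  # indexed as heat_map[row][col]


def heatmap_peak(heat_map: HeatMap) -> Tuple[int, int]:
    """Return (row, col) of the largest value in a single heat map."""
    best_row, best_col = 0, 0
    best_val = heat_map[0][0]
    for r, row in enumerate(heat_map):
        for c, val in enumerate(row):
            if val > best_val:
                best_row, best_col, best_val = r, c, val
    return best_row, best_col


def keypoints_from_heatmaps(heat_maps: List[HeatMap]) -> List[Tuple[int, int]]:
    """Assumed decoding rule: one key-point per heat map, taken at its peak."""
    return [heatmap_peak(h) for h in heat_maps]
```

With 14 heat maps of size 64 x 64, `keypoints_from_heatmaps` would return 14 (row, col) pairs in heat-map coordinates.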

9 Disguised Face Identification (DFI) Framework
The Spatial Fusion Convolutional Network predicts and temporally aligns the facial key points of all neighboring frames to a particular frame by warping backwards and forwards in time using tracks from dense optical flow. Optical flow is the pattern of apparent motion caused by relative motion between an observer and a scene. Dense optical flow takes every pixel into account, while sparse optical flow uses only a subset of the pixels. The confidence in the particular frame is strengthened with a set of expert opinions (with corresponding confidences) from frames in the neighborhood, from which the facial key points can be estimated accurately. The spatial fusion network is more accurate in this respect than other DNNs. The points are connected into a star-net and used in classification.

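10 Facial KeyPoint Detection
Key-point detection is a regression problem modeled by the Spatial Fusion Convolutional Network. The CNN takes an image and outputs the pixel coordinates of each key-point. The output of the last layer is an i × j × k dimensional cube (here 64 × 64 × 14, one 64 × 64 map for each of the 14 key-points). Training objective: estimate the network weights lambda (λ) from the available training data set D = (x, y) and the regressor Φ:
$$\arg\min_{\lambda} \sum_{(x,y) \in D} \sum_{i,j,k} \left\| G_{i,j,k}(y_k) - \Phi_{i,j,k}(x, \lambda) \right\|^2$$
Φ() is the activation function (rate of action potential firing in the neurons). The Gaussian function G_{i,j,k}(y_k), centered on the ground-truth key-point y_k = (y_k^1, y_k^2), is:
$$G_{i,j,k}(y_k) = \frac{1}{2\pi\sigma^2} \exp\left( -\frac{(y_k^1 - i)^2 + (y_k^2 - j)^2}{2\sigma^2} \right)$$
CNNs aren't scale/shift invariant, so a Gaussian distribution is applied to put feature values in a known range. The Loss 2 function is computed on the squared pixel-wise differences between the predicted and ground-truth heat-maps. MatConvNet is used to train and validate the Fusion Convolutional Network in MATLAB.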
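The heat-map regression target can be illustrated with a short sketch. It assumes an isotropic 2-D Gaussian centered on the ground-truth key-point and a plain sum of squared pixel-wise differences as the loss; the names `gaussian_heatmap` and `squared_error` are hypothetical, and the default variance of 1.5 and map size of 64 are the values quoted on the training slide. The slides use MatConvNet in MATLAB; Python is used here only for illustration.

```python
import math
from typing import List, Tuple

HeatMap = List[List[float]]  # indexed as heat_map[row][col]


def gaussian_heatmap(center: Tuple[float, float],
                     size: int = 64,
                     variance: float = 1.5) -> HeatMap:
    """Ground-truth heat map: a 2-D Gaussian centered at (row, col).

    Assumption: isotropic Gaussian with the given variance (sigma squared).
    """
    cy, cx = center
    norm = 1.0 / (2.0 * math.pi * variance)
    return [
        [norm * math.exp(-((r - cy) ** 2 + (c - cx) ** 2) / (2.0 * variance))
         for c in range(size)]
        for r in range(size)
    ]


def squared_error(predicted: HeatMap, target: HeatMap) -> float:
    """Sum of squared pixel-wise differences between two heat maps."""
    return sum(
        (p - t) ** 2
        for pred_row, target_row in zip(predicted, target)
        for p, t in zip(pred_row, target_row)
    )
```

For a full training example the error would be summed over all 14 key-point maps; that outer loop is omitted here.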

11 Facial KeyPoint Detection Cont.
The locations (coordinates) produced by the network from the last slide are connected into a star network, whose angles are used later for classification. The nose key point is used as the reference point in determining the angles for the other points.
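One way to compute star-net orientations is sketched below. The slides do not spell out the angle definition, so this assumes each orientation is the angle, in degrees from the positive x axis, of the line from the nose key-point to the other key-point. The name `starnet_angles` is hypothetical, and the default `nose_index=10` assumes the nose is P11 in a zero-based list of 14 points.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates


def starnet_angles(keypoints: List[Point], nose_index: int = 10) -> List[float]:
    """Angle (degrees) of every non-nose key-point as seen from the nose.

    Assumption: angles are measured with atan2 relative to the x axis,
    so results fall in the range (-180, 180].
    """
    nose_x, nose_y = keypoints[nose_index]
    angles = []
    for i, (x, y) in enumerate(keypoints):
        if i == nose_index:
            continue  # the reference point has no orientation of its own
        angles.append(math.degrees(math.atan2(y - nose_y, x - nose_x)))
    return angles
```

For 14 key-points this yields 13 angles, one per edge of the star.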

12 Disguised Face Classification
Compare the disguised face to 5 non-disguised faces (including the person in the disguise). Classification is accurate if tau (τ) is the minimum for the comparison between the disguised image and the non-disguised image of the same person. Similarity is estimated by computing the L1 norm between the orientations of the different key points (from the star-net structure): τ = Σ_i |θ_i − φ_i|, where τ is the similarity, θ_i is the orientation of the i-th key point of the disguised image, and φ_i is the corresponding angle in the non-disguised image.
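The comparison step can be sketched as a nearest-match search under the L1 norm. This is a minimal illustration: `l1_similarity` and `classify` are hypothetical names, the gallery is assumed to be a dictionary from person name to a list of star-net angles, and angle wrap-around (e.g. 179 versus -179 degrees) is deliberately not handled.

```python
from typing import Dict, List


def l1_similarity(theta: List[float], phi: List[float]) -> float:
    """tau: sum of absolute differences between corresponding orientations."""
    return sum(abs(t - p) for t, p in zip(theta, phi))


def classify(disguised_angles: List[float],
             gallery: Dict[str, List[float]]) -> str:
    """Return the gallery identity whose angles give the smallest tau."""
    return min(
        gallery,
        key=lambda name: l1_similarity(disguised_angles, gallery[name]),
    )
```

In the setup described on the slide, the gallery would hold 5 non-disguised faces, one of which belongs to the disguised person.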

13 Experimental Results
Results are split between the Simple Background Face Disguise dataset and the Complex Background Face Disguise dataset. Individual key point accuracy is presented along with a comparison with other architectures. Classification performance is then analyzed.

14 Spatial Fusion ConvNet Training
The Spatial Fusion CNN is trained on 1000 images (plus 500 validation images and 500 test images). The network is trained for 90 cycles with a batch size of x248 sub-image randomly cropped from every input image, randomly flipped, randomly rotated between -40 and 40 degrees, and resized to 256x256 to be passed as input into the CNN. The variance of the Gaussian is set to 1.5. The heat-map size is 64x64. The base learning rate is 10^(-5), decreased to 10^(-6) after 20 iterations. Momentum is 0.9; the momentum update results in better convergence on deep networks (based on a physical perspective of the optimization problem).
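The learning-rate drop and momentum update can be sketched in a few lines. This is a generic illustration of classical momentum SGD, not MatConvNet's implementation; `learning_rate` and `momentum_step` are hypothetical names, and the sketch assumes the rate switches at step index 20 (counting from 0).

```python
from typing import List, Tuple


def learning_rate(step: int,
                  base_lr: float = 1e-5,
                  reduced_lr: float = 1e-6,
                  drop_after: int = 20) -> float:
    """Step schedule: base rate for the first `drop_after` steps, then reduced."""
    return base_lr if step < drop_after else reduced_lr


def momentum_step(weights: List[float],
                  velocity: List[float],
                  grads: List[float],
                  lr: float,
                  momentum: float = 0.9) -> Tuple[List[float], List[float]]:
    """One classical momentum update.

    velocity <- momentum * velocity - lr * grad
    weights  <- weights + velocity
    """
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity
```

The velocity term accumulates past gradients, which is the "physical" picture referred to on the slide: the weights behave like a particle with inertia rolling down the loss surface.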

15 Key Point Detection
Row 1: disguised images. Row 2: key point mapping. Row 3: star-net construction.

16 Key-Point Detection Performance
A key point is deemed correct if it is located within d pixels of the marked key point. Accuracy increases as d increases. Green: complex background. Red: simple background.
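The accuracy criterion can be sketched as the fraction of predictions that fall within d pixels of the marked location. This assumes Euclidean distance, which the slides do not state explicitly; `keypoint_accuracy` is a hypothetical name.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) pixel coordinates


def keypoint_accuracy(predicted: List[Point],
                      ground_truth: List[Point],
                      d: float) -> float:
    """Fraction of key-points predicted within d pixels of the ground truth."""
    hits = sum(
        1
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
        if math.hypot(px - gx, py - gy) <= d
    )
    return hits / len(ground_truth)
```

Because the set of hits can only grow as the threshold is relaxed, the accuracy computed this way never decreases as d increases.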

17 Key-Point Detection Performance Cont.
The simple background gives higher accuracy than the complex background. The complex background has lower performance because background clutter interferes with identifying the facial key points in the outer region.

18 Key-Point Performance Analysis with Reference to Background Clutter
Background clutter significantly interferes with key point detection performance. The effect of background clutter is observed by analyzing key-point detection in the lips, nose and eye regions.


19 Eye Region Key-Points Detection
Relevant key points: P1 to P10. P1, P4, P5, and P10 are prominently affected (closest to the face border). Accuracy at pixel distances closer to the ground truth is significantly higher for the simple background than for the complex background.

20 Nose Key-Point Detection Performance
The nose key-point (P11) is not affected by background clutter, probably because P11 is buffered by the surrounding key points.

21 Lips Region Key-Point Detection Performance
P12, P13, and P14 comprise the lips region. P12 and P14 are affected by background clutter while P13 is not, because P12 and P14 are closer to the face edge than P13.


22 Facial Key-Points Detection: Multiple Persons
Use the Viola-Jones face detector to find all faces in the image, then use DFI on each face. Key-point detection classification performance on the simple and complex datasets, respectively:
Single face: 85% and 56%
2 faces in the image: 80% and 50%
3 faces in the image: 76% and 43%
Accuracy decreases as the number of faces increases.

23 Comparison of KeyPoint Detection Performance with Other Architectures
Architectures compared: CN = CoordinateNet, CNE = CoordinateNet Extended, and SpatialNet. Accuracy is measured at d = 5 pixels from ground truth. In accordance with findings from other architectures, background clutter decreases accuracy.

24 Classification Performance and Comparison with the State-of-the-Art
The more heavily a face is disguised, the lower the accuracy. The state-of-the-art method used for comparison is unnamed. This paper's framework outperforms the current state-of-the-art.

25 Conclusion
The paper proposes two datasets that can be used to train future disguised face recognition networks. Background clutter affects the outer-region key points, so images should be taken with the simplest background possible for highest accuracy. The Disguised Face Identification (DFI) Framework outperforms the state-of-the-art by first detecting 14 facial key points and then connecting them into a star-net.

26 References

27 Thank you!
