Introduction to Deep Learning for Facial Understanding Part III: Regional CNNs

Size: px

Start display at page:

Download "Introduction to Deep Learning for Facial Understanding Part III: Regional CNNs"

Camron Carroll
5 years ago
Views:

1 Introduction to Deep Learning for Facial Understanding Part III: Regional CNNs Raymond Ptucha, Rochester Institute of Technology, USA Tutorial-9 May 19, R. Ptucha 18 1 Fair Use Agreement This agreement covers the use of all slides in this document, please read carefully. You may freely use these slides, if: You send me an telling me the conference/venue/company name in advance, and which slides you wish to use. You receive a positive confirmation back from me. My name (R. Ptucha) appears on each slide you use. (c) Raymond Ptucha, rwpeec@rit.edu R. Ptucha

Agenda 9-9:3am Part I: Introduction 9:3-1:am Part II: Convolutional Neural Nets

IV: Facial Understanding 11:15-11:35am Part V: Recurrent Neural Nets

2 Agenda 9-9:3am Part I: Introduction 9:3-1:am Part II: Convolutional Neural Nets 1:-1:3am Part III: Fully Convolutional Nets 1:3-1:45am Break 1:45-11:15am Part IV: Facial Understanding 11:15-11:35am Part V: Recurrent Neural Nets 11:35-11:55pm Part VI: Generative Adversarial Nets 11:55-12:3am Hands-on with NVIDIA DIGITS R. Ptucha 18 3 Vision Tasks Classification Classification + Localization Object Detection Instance Segmentation Single Object Multiple Objects R. Ptucha

3 Classification vs. Classification + Localization Classification Input: Image Output: Class label Evaluation metric: Accuracy CAT Classification + Localization Input: Image Output: Class label, Box coordinates Evaluation metric: Intersection over Union (IoU) (CAT,x,y,w,h) R. Ptucha (,) Classification with Localization.(1,1) Lets allow a few classes: 1. Car 2. Truck 3. Pedestrian 4. Motorcycle Image from: deeplearning.ai, C4W3L1 For now, lets assume one object per image. Each object has {x, y, w, h} For this image, object location {x, y, w, h} = {.4,.6,.4,.3} R. Ptucha

4 Classification with Localization Four classes: 1. Car 2. Truck 3. Pedestrian 4. Motorcycle Localization {x, y, w, h} Define y label: y = Image from: deeplearning.ai, C4W3L1 P c b x b y b w b h C 1 C 2 C 3 C 4 Probability of an object Bounding box location /1 for each class R. Ptucha 18 7 Classification with Localization Four classes: 1. Car 2. Truck 3. Pedestrian 4. Motorcycle Localization {x, y, w, h} Cost function (squared error): Loss = 9 i=1 y i y i 2 Loss = y 1 y 1 2 If y 1 =1 If y 1 = y = P c b x b y b w bh C 1 C 2 C 3 C 4 y = = don t care Image from: deeplearning.ai, C4W3L1 R. Ptucha

Motorcycle Localization {x, y, w, h} Alternate cost function: y 1 can be

entropy Image from: deeplearning.ai, C4W

Ptucha 18 9 = don t care Snapchat Facewarp Traditional approach: Viola

5 Classification with Localization Four classes: 1. Car 2. Truck 3. Pedestrian 4. Motorcycle Localization {x, y, w, h} Alternate cost function: y 1 can be logistic loss y 2 y 5 can be squared error y 6 y 9 can be softmax cross entropy Image from: deeplearning.ai, C4W3L1 y = P c b x b y b w bh C 1 C 2 C 3 C 4 y = R. Ptucha 18 9 = don t care Snapchat Facewarp Traditional approach: Viola Jones Face Detection Search for actual point locations using Mahalanobis distance Average eye and 82 facial feature points Repeat ~3-5 Restrict based on PCA statistics R. Ptucha

Snapchat Facewarp Deep Learning approach:

216] Candidate faces Refine face selection

CNN would output: Face pt1x pt1y pt2x pt2y.

6 Snapchat Facewarp Deep Learning approach: MT-CNN [Zhang et al. 216] Candidate faces Refine face selection Facial feature points R. Ptucha Facial feature points Localization Each face has 68 points, so CNN would output: Face pt1x pt1y pt2x pt2y... pt68x pt68y 137 outputs Of course, need GT for thousands of faces to train model. R. Ptucha

7 Can do same with Body Pose Pishchulin et al. CVPR 16 R. Ptucha Object Detection Image Credit: Redmon, Joseph, et al [4] R. Ptucha

8 More than one object per image Training set: x y 1 Car detection example 1 1 Images from: deeplearning.ai, C4W3L3 R. Ptucha Sliding Window Detection Images from: deeplearning.ai, C4W3L3 R. Ptucha

Testing Testing Training Computing FC layers with Convolution MAX POOL FC FC 14 14 3 (16) 5 5 3 2 2 1 1 16 5 5 16 4 4 y softmax (4) MAX POOL

Ptucha 18 19 Replacing Sliding Windows w/fully Convolutional CNNs MAX POOL (16) 5 5 3 2 2 (4) 5 5 16 (4) 1 1 4 (4) 1 1 4 14 14 3 1 1 16 5 5

9 Testing Testing Training Computing FC layers with Convolution MAX POOL FC FC (16) y softmax (4) MAX POOL (16) (4) (4) (4) Images from: deeplearning.ai, C4W3L4 R. Ptucha Replacing Sliding Windows w/fully Convolutional CNNs MAX POOL (16) (4) (4) (4) R. Ptucha 18 2 Images from: deeplearning.ai, C4W3L4 9

10 Replacing Sliding Windows w/fully Convolutional CNNs Sliding window approach: Sequentially evaluate one window at a time Fully convolutional approach: Evaluate 64 regions at once R. Ptucha Images from: deeplearning.ai, C4W3L4 Replacing Sliding Windows w/fully Convolutional CNNs Can think of this as evaluating 8 8 grid, where each of the 64 cells is independently checked for an object: Each cell has a y label: y = P c b x b y b w b h C 1 C 2 C 3 C 4 Prob. of an object Object location /1 for each class (Four classes in this example) R. Ptucha

11 Replacing Sliding Windows w/fully Convolutional CNNs Overlay GT of object Cell where centroid lie is responsible. Each cell has a y label: y = P c b x b y b w b h C 1 C 2 C 3 C 4 Prob. of an object Object location /1 for each class (Four classes in this example) R. Ptucha Replacing Sliding Windows w/fully Convolutional CNNs y = P c b x b y b w b h C pers C car C truck C motorc y1 = y2 = y44 = Note1: cell upper left (,); cell lower right (1,1) Note2: b w and b h can be > R. Ptucha

How is Done in Practice 1. Region proposals- use algorithms such as selective search or edge boxes to suggest potential bounding box locations. Uijlings et al.

12 How is Done in Practice 1. Region proposals- use algorithms such as selective search or edge boxes to suggest potential bounding box locations. Uijlings et al., Selective search for Object Recognition, IJCV 213. R. Ptucha How is Done in Practice 2. Fully convolutional methods- replace fully connected layers with convolution. Testing Training (16) MAX POOL 2 2 (4) (4) (4) R. Ptucha

13 Anchor 2 Anchor 1 How is Done in Practice 3. Anchor boxes. What if two object centroids land in same cell y = P c b x b y b w b h C pers C car C truck C motorc R. Ptucha How is Done in Practice Anchor boxes. Allow multiple detections per cell. Define multiple anchor boxes per cell. Assign each GT object to closest Anchor shape (use max IoU). Anchor 1: Anchor 2: y = P c b x b y b w b h C pers C car C truck C motorc P c b x b y b w b h C pers C car C truck C motorc R. Ptucha

14 Faster R-CNN: Region Proposal Network Image Credit: Ren et al. [3] Add Additional Region Proposal Network (RPN) after last convolution layer. Train RPN to extract region proposals After RPN, use ROI Pooling which feeds a classifier and bbox regressor. R. Ptucha Faster R-CNN: Region Proposal Network Image Credit: Kaiming He Use 3 3 sliding window over last conv layer (conv5) feature map Create small networks for object classification and bounding box coordinates. At each location, we determine if object present, and the bounding box [x,y,w,h] coordinates gives exact localization in input image. R. Ptucha

15 Faster R-CNN: Region Proposal Network For each of the n anchor boxes: P(object) [x,y,w,h] Image Credit: Ren, Shaoqing, et al.[3] Use n anchor boxes at each location: 3 aspect ratios 2:1,1:1,2:1 3 scales, box areas of 128 2, 256 2, Use same anchor boxes at each location. Regression gives offset/scale of each anchor box. Classification gives probability of object at each anchor box. R. Ptucha Mask R-CNN Extension of Faster R-CNN Faster R-CNN used region proposal network (3 3 sliding window over conv5) predicts n anchor boxes at each location. Each anchor box has [P(object),x,y,w,h] Each anchor box then used for final network to predict [P(object),x,y,w,h,C 1,C 2,,C C ] Extend to also predict pixel mask of object R. Ptucha '

Mask R-CNN Extension of Faster R-CNN Simultaneously

Shown to work well on instance segmentation, bounding-box

16 Mask R-CNN Extension of Faster R-CNN Simultaneously predict bounding box along with object mask. Shown to work well on instance segmentation, bounding-box object detection, and person keypoint (body joint) detection. R. Ptucha '18 35 Mask R-CNN Built on: ResNext [Xie et al., CVPR 17] Feature Pyramid Net [Lin eta al., CVPR 17] R. Ptucha '

17 R. Ptucha '18 39 FAIR Mask R-CNN, COCO + Places Workshop, ICCV 217 R. Ptucha '

18 FAIR Mask R-CNN, COCO + Places Workshop, ICCV 217 R. Ptucha '18 41 How get such nice masks with only pix per mask R. Ptucha '

19 R. Ptucha '18 43 How get such nice masks R. Ptucha '

20 (keypoint==bodyjoint) R. Ptucha '18 45 FAIR Mask R-CNN, COCO + Places Workshop, ICCV 217 R. Ptucha '

21 FAIR Mask R-CNN, COCO + Places Workshop, ICCV 217 R. Ptucha '18 47 Good Object Detection Libraries to Explore Tensorflow library for object detection (good for Tensorflow users, lots of algorithms) /object_detection Facebook library for object detection (includes Mask R- CNN,based on Caffe2) YOLO R. Ptucha '

22 Thank you!! Ray Ptucha R. Ptucha

Spatial Localization and Detection. Lecture 8-1

Lecture 8: Spatial Localization and Detection Lecture 8-1 Administrative - Project Proposals were due on Saturday Homework 2 due Friday 2/5 Homework 1 grades out this week Midterm will be in-class on Wednesday