Todo before next class
Each project group should submit a short project report (4 pages of presentation slides) including:
1. Problem definition
2. Related work
3. Preliminary results
4. Future plan
Submission: email to chad.dechant@columbia.edu by April 5.
Note: your slides will be put on the course website. The submission must be a PDF file, named by your group number.

Deep Networks for Image Classification and Detection
Liangliang Cao
llcao.net/cu-deeplearning17

Outline
- Differences between vision, speech, and NLP
- ImageNet and model adaptation
- Recent trends in industry and academia
- Two recent works
  - HyperFace
  - Mask R-CNN with ResNet

Image Recognition is Lucky
Why?
- Data: Images are easier to label than speech/language
- Data: Fei-Fei et al. made a lot of effort to release ImageNet
- Platform: Nvidia's cuDNN standardizes the most important computations
- Platform: A number of great toolkits are built on cuDNN

ImageNet LSVRC

Treasure from ImageNet Dataset
By adapting models trained on ImageNet, we can build a decent classifier with limited data.
- Very few new labels
  - Tune the last layer
  - Or use the last layer as a feature for an SVM
  - Example code: http://caffe.berkeleyvision.org/gathered/examples/finetune_flickr_style.html
- Enough new labels, or new tasks
  - Tune the whole network
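The "tune the last layer" option can be sketched in a few lines. This is only an illustration under stated assumptions, not the Caffe example linked on the slide: the feature vectors stand in for activations of a frozen ImageNet model, and `train_last_layer`, `predict`, the learning rate, and the toy data are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_last_layer(features, labels, lr=0.5, epochs=200):
    """Fit a logistic-regression 'last layer' on fixed (frozen) features."""
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the cross-entropy loss w.r.t. the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0

# Hypothetical stand-ins for features extracted by a frozen pretrained network.
features = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
labels = [1, 1, 0, 0]

w, b = train_last_layer(features, labels)
predictions = [predict(w, b, x) for x in features]
```

Only `w` and `b` are updated here; the features never change, which is what makes this option workable with very few labels. Tuning the whole network would also update the layers that produce the features.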

More Data, More Computation
- 2006: Caltech101, 8K images
- 2010: ImageNet, 1.2M images
- 2016: Yahoo YFCC, 100M images
Will this trend plateau or keep expanding?
[Figure: Nvidia stock price]

Deep Learning in Industry
- Data cleaning
- Startup examples
- More data
- Better model
- Model evolving

Deep Learning for Competition
- Kaggle small-scale image recognition
  - Adapt several ImageNet models
  - Data augmentation
  - Study the failure examples, and find ways to conquer them
- ImageNet LSVRC
  - Fuse complementary features using multi-GPU systems
  - Identify the problem of existing models and fix it
- Larger scale (e.g., YouTube-8M video)
  - Explore new scalable models

Plan for the remaining time:
- Fusing complementary features
  - HyperFace: Ranjan, Patel, Chellappa, arXiv 1603.01249, 2016
- Integrating multiple tasks
  - Mask R-CNN (w. ResNet): He, Gkioxari, Dollár, Girshick, arXiv 1703.06870, 2017
- Deeper understanding of the challenge for existing models

HyperFace
A Deep Multi-task Learning Framework for Face Detection, Landmark Localization, Pose Estimation, and Gender Recognition
Rajeev Ranjan, Vishal M. Patel, Rama Chellappa
arXiv 1603.01249, 2016

Tasks of HyperFace
- Face Detection
- Landmark Localization
- Pose Estimation
- Gender Recognition

HyperFace: Basic Idea
- Lower layers respond to edges and corners, and hence have better localization properties.
- Higher layers are class-specific and suitable for semantic recognition, including face recognition and gender.
- Features from lower and higher layers are complementary. Fuse them!
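One way to picture "fuse them" is to bring a lower-layer feature map down to the spatial size of a higher-layer map and then stack the two along the channel axis. The sketch below shows only that general idea. The layer sizes, the use of max pooling for downsampling, and the helper names `max_pool` and `fuse` are assumptions for illustration, not the HyperFace configuration.

```python
def max_pool(fmap, k):
    """Non-overlapping k x k max pooling on a 2-D feature map (list of rows)."""
    h, w = len(fmap), len(fmap[0])
    return [[max(fmap[i + di][j + dj] for di in range(k) for dj in range(k))
             for j in range(0, w - k + 1, k)]
            for i in range(0, h - k + 1, k)]

def fuse(low_maps, high_maps):
    """Pool low-layer maps down to the high-layer size, then stack channels.

    Assumes square maps whose sizes differ by an integer factor.
    """
    target = len(high_maps[0])
    k = len(low_maps[0]) // target
    pooled = [max_pool(m, k) for m in low_maps]
    return pooled + high_maps  # channel-wise concatenation

# Toy maps: one 4x4 low-layer channel and one 2x2 high-layer channel.
low = [[[r * 4 + c for c in range(4)] for r in range(4)]]
high = [[[0.5, 0.1], [0.2, 0.9]]]

fused = fuse(low, high)
```

After fusion every channel has the same 2x2 spatial size, so later layers can read localization cues and semantic cues at the same position.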

HyperFace Network

Baseline

Loss function
- Detection
- Landmark location
- Visibility
- Pose
- Gender
- Total
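The formulas on this slide did not survive extraction, so the following is only a hedged sketch of how a multi-task total could be assembled as a weighted sum of per-task losses. The specific loss forms (cross-entropy for detection and gender, visibility-masked squared error for landmarks, mean squared error for visibility and pose), the weights, and all function names are assumptions for illustration.

```python
import math

def cross_entropy(p_true_class):
    """Cross-entropy given the predicted probability of the correct class."""
    return -math.log(p_true_class)

def landmark_loss(pred, target, visible):
    """Squared error over landmark coordinates, counted only where visible."""
    n = len(pred)
    return sum(v * (p - t) ** 2 for p, t, v in zip(pred, target, visible)) / (2 * n)

def mse(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def total_loss(losses, weights):
    """Weighted sum of the per-task losses."""
    return sum(weights[name] * value for name, value in losses.items())

# Toy predictions (hypothetical numbers).
losses = {
    "detection": cross_entropy(0.5),
    "landmarks": landmark_loss([1.0, 2.0], [0.0, 2.0], [1, 1]),
    "visibility": mse([1.0, 0.0], [1.0, 1.0]),
    "pose": mse([0.0, 0.0, 3.0], [0.0, 0.0, 0.0]),
    "gender": cross_entropy(0.5),
}
# Hypothetical task weights; they control how much each task drives training.
weights = {"detection": 1.0, "landmarks": 2.0, "visibility": 1.0,
           "pose": 1.0, "gender": 1.0}

total = total_loss(losses, weights)
```

Because all tasks share the same network, the weights decide which task dominates the shared gradients, so they are usually tuned rather than left equal.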

Procedure
1. Selective search to get candidate regions
2. Normalize and scale each region
3. Predict the four tasks
4. Refine the prediction based on landmark detections

Face Detection Performance
Annotated Faces in the Wild (AFW)

Face Detection Performance
Face Detection Dataset and Benchmark (FDDB)

Landmark localization
AFW

Landmark localization
Annotated Facial Landmarks in the Wild (AFLW)

Pose Estimation
AFW

Gender Recognition

Speed
- GTX Titan-X GPUs
- 3 seconds per image
  - 2s for selective search to generate region proposals
  - 0.2s to evaluate the HyperFace network
Questions or comments?
OK, my question: will a better region detector help HyperFace?

Mask R-CNN
Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick
arXiv 1703.06870, 2017

History
- Selective Search
- R-CNN
- Fast R-CNN
- Faster R-CNN
- Residual Network
- Mask R-CNN

R-CNN and Fast R-CNN
- R-CNN
- Fast R-CNN: 200x faster than R-CNN in the testing stage

Faster R-CNN → Mask R-CNN
- No longer use Selective Search; instead use a network for the region-proposal task
- Add a segmentation (mask) branch in addition to detection
- RoI pooling -> RoI align

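RoI Pooling Layer
Typical pooling layer: with a w x h pooling window, the size of the output is 1/(wh) of the input size.
RoI pooling layer: the size of the output is fixed (7x7) no matter how large the input is.

    // Loop structure of the forward pass. The slide omits how hstart, hend,
    // wstart, wend (the bounds of bin (ph, pw) inside the RoI), index, and
    // pool_index are computed.
    for (int n = 0; n < num_rois; ++n) {
      for (int c = 0; c < channels_; ++c) {
        for (int ph = 0; ph < pooled_height_; ++ph) {
          for (int pw = 0; pw < pooled_width_; ++pw) {
            for (int h = hstart; h < hend; ++h) {
              for (int w = wstart; w < wend; ++w) {
                // Keep the maximum activation seen in this bin.
                if (batch_data[index] > top_data[pool_index]) {
                  top_data[pool_index] = batch_data[index];
                }
              }
            }
          }
        }
      }
    }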
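To make the fixed output size concrete, here is a small self-contained sketch of RoI max pooling on a single-channel feature map. It follows the usual floor/ceil rule for bin boundaries, but the function name `roi_max_pool`, the inclusive `(x1, y1, x2, y2)` RoI convention, and the toy grids are assumptions for illustration, not code from the slide.

```python
import math

def roi_max_pool(fmap, roi, pooled_h=7, pooled_w=7):
    """Max-pool the region roi=(x1, y1, x2, y2), inclusive, into a fixed grid."""
    height, width = len(fmap), len(fmap[0])
    x1, y1, x2, y2 = roi
    bin_h = max(y2 - y1 + 1, 1) / pooled_h
    bin_w = max(x2 - x1 + 1, 1) / pooled_w
    out = [[0.0] * pooled_w for _ in range(pooled_h)]
    for ph in range(pooled_h):
        for pw in range(pooled_w):
            # Hard quantization: bin edges are snapped to integer pixels.
            hstart = min(max(y1 + math.floor(ph * bin_h), 0), height)
            hend = min(max(y1 + math.ceil((ph + 1) * bin_h), 0), height)
            wstart = min(max(x1 + math.floor(pw * bin_w), 0), width)
            wend = min(max(x1 + math.ceil((pw + 1) * bin_w), 0), width)
            values = [fmap[h][w] for h in range(hstart, hend)
                      for w in range(wstart, wend)]
            out[ph][pw] = max(values) if values else 0.0
    return out

grid4 = [[r * 4 + c for c in range(4)] for r in range(4)]
grid8 = [[r * 8 + c for c in range(8)] for r in range(8)]

small = roi_max_pool(grid4, (0, 0, 3, 3), pooled_h=2, pooled_w=2)
large = roi_max_pool(grid8, (0, 0, 7, 7), pooled_h=2, pooled_w=2)
uneven = roi_max_pool(grid4, (0, 0, 2, 2), pooled_h=2, pooled_w=2)
```

Both the 4x4 and the 8x8 region come out as a 2x2 grid. In the `uneven` case the 3x3 region does not divide evenly, so the snapped bins overlap; this is the kind of quantization effect that RoI align is meant to avoid.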

From RoI pooling to RoI align (source code pending)
- RoI pooling is not designed for pixel-to-pixel alignment
- RoI align
  - Use bilinear interpolation instead of hard quantization
  - Sample four locations per RoI bin and aggregate them
- The idea of RoI align seems simple, but it may require some effort to implement efficiently on GPUs
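Since the slide notes that source code was still pending, the following is only a rough CPU sketch of the two ideas listed above: bilinear interpolation at fractional locations, and four samples per bin. Averaging is used as the aggregation, and the sample offsets (0.25 and 0.75 of a bin), the names `bilinear` and `roi_align`, and the toy feature map are assumptions for illustration.

```python
import math

def bilinear(fmap, y, x):
    """Bilinearly interpolate fmap at the continuous location (y, x)."""
    h, w = len(fmap), len(fmap[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def roi_align(fmap, roi, pooled_h=2, pooled_w=2):
    """Average four bilinear samples per bin; roi=(x1, y1, x2, y2) may be fractional."""
    x1, y1, x2, y2 = roi
    bin_h = (y2 - y1) / pooled_h
    bin_w = (x2 - x1) / pooled_w
    out = []
    for ph in range(pooled_h):
        row = []
        for pw in range(pooled_w):
            # No rounding of bin edges: sample at fractional positions.
            samples = [
                bilinear(fmap, y1 + (ph + fy) * bin_h, x1 + (pw + fx) * bin_w)
                for fy in (0.25, 0.75) for fx in (0.25, 0.75)
            ]
            row.append(sum(samples) / len(samples))
        out.append(row)
    return out

# Toy feature map with value x + 10 * y, so interpolation is exact.
fmap = [[c + 10 * r for c in range(4)] for r in range(4)]
aligned = roi_align(fmap, (0.5, 0.5, 2.5, 2.5))
```

The RoI corners here are fractional (0.5 and 2.5) and are used as they are, whereas RoI pooling would first round them to integer pixels.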

Use Residual Network structure
Benefits of Residual Network in ImageNet/COCO 2016

Why Residual Network?
- Problem: Is learning better networks as simple as stacking more layers?
- Deep network + residual learning can solve this problem.
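A minimal sketch of residual learning: each block computes y = relu(F(x) + x), so the input is carried forward by a shortcut and the layers only need to learn a correction F. For brevity the sketch uses small fully connected layers instead of convolutions; that simplification, the helper names, and the toy weights are assumptions for illustration.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def linear(weights, v):
    """Matrix-vector product with weights given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in weights]

def residual_block(v, w1, w2):
    """y = relu(F(x) + x), where F is two linear layers with a ReLU between."""
    f = linear(w2, relu(linear(w1, v)))
    return relu([fi + xi for fi, xi in zip(f, v)])

x = [1.0, 2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]

# With F = 0 the block reduces to the identity on non-negative inputs.
identity_out = residual_block(x, zero, zero)

# Stacking many such blocks still passes the input through unchanged.
stacked = x
for _ in range(10):
    stacked = residual_block(stacked, zero, zero)
```

This illustrates why extra residual blocks are easy to add: a block can fall back to doing nothing, whereas a plain stacked layer has to learn the identity mapping explicitly.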

Residual net

Back to Mask R-CNN
- Combined cost for
  - Classification
  - Detection
  - Mask (new)
- Training speed: 32-40 hours on an 8-GPU machine to train on COCO data
- Testing speed: 200ms per image on Tesla M40
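The combined cost can be written as the multi-task loss used in the Mask R-CNN paper. Reading the slide's "Detection" term as the bounding-box regression loss is an assumption made here; the symbols follow the paper's notation.

```latex
L = L_{\mathrm{cls}} + L_{\mathrm{box}} + L_{\mathrm{mask}}
```

Here $L_{\mathrm{cls}}$ is the classification loss, $L_{\mathrm{box}}$ the box regression loss, and $L_{\mathrm{mask}}$ the loss of the new mask branch, computed per RoI.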

Experiments

Summary of this class
Now we have covered a number of vision applications:
- Image classification (programming in class 3)
- Face recognition/detection/alignment
- Object detection/segmentation

Any questions so far?
- No good results for your projects?
- Problem with GoogleCloud/Paperspace?
- Problem with Keras/TensorFlow/Caffe?
- Others?