Deep Learning in Visual Recognition. Thanks Da Zhang for the slides

Similar documents
Convolutional Neural Networks. Computer Vision Jia-Bin Huang, Virginia Tech

Know your data - many types of networks

Machine Learning. Deep Learning. Eric Xing (and Pengtao Xie) , Fall Lecture 8, October 6, Eric CMU,

Machine Learning 13. week

Deep Learning for Computer Vision II

Computer Vision Lecture 16

Study of Residual Networks for Image Recognition

Deep Learning with Tensorflow AlexNet

ImageNet Classification with Deep Convolutional Neural Networks

Deep Learning. Visualizing and Understanding Convolutional Networks. Christopher Funk. Pennsylvania State University.

Convolutional Neural Networks

Machine Learning. MGS Lecture 3: Deep Learning

Spatial Localization and Detection. Lecture 8-1

Intro to Deep Learning. Slides Credit: Andrej Karapathy, Derek Hoiem, Marc Aurelio, Yann LeCunn

Computer Vision Lecture 16

Computer Vision Lecture 16

Inception and Residual Networks. Hantao Zhang. Deep Learning with Python.

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

CS 2750: Machine Learning. Neural Networks. Prof. Adriana Kovashka University of Pittsburgh April 13, 2016

Rotation Invariance Neural Network

COMP9444 Neural Networks and Deep Learning 7. Image Processing. COMP9444 c Alan Blair, 2017

INTRODUCTION TO DEEP LEARNING

Convolutional Neural Networks: Applications and a short timeline. 7th Deep Learning Meetup Kornel Kis Vienna,

Fuzzy Set Theory in Computer Vision: Example 3, Part II

Fully-Convolutional Siamese Networks for Object Tracking

DEEP LEARNING REVIEW. Yann LeCun, Yoshua Bengio & Geoffrey Hinton Nature Presented by Divya Chitimalla

Deep Neural Networks:

Structured Prediction using Convolutional Neural Networks

CSE 559A: Computer Vision

Extend the shallow part of Single Shot MultiBox Detector via Convolutional Neural Network

Deconvolutions in Convolutional Neural Networks

Deep Residual Learning

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Classification of objects from Video Data (Group 30)

CMU Lecture 18: Deep learning and Vision: Convolutional neural networks. Teacher: Gianni A. Di Caro

Seminars in Artifiial Intelligenie and Robotiis

Object detection with CNNs

REGION AVERAGE POOLING FOR CONTEXT-AWARE OBJECT DETECTION

SEMANTIC COMPUTING. Lecture 8: Introduction to Deep Learning. TU Dresden, 7 December Dagmar Gromann International Center For Computational Logic

Mask R-CNN. Kaiming He, Georgia, Gkioxari, Piotr Dollar, Ross Girshick Presenters: Xiaokang Wang, Mengyao Shi Feb. 13, 2018

Deep Learning for Object detection & localization

Final Report: Smart Trash Net: Waste Localization and Classification

Yiqi Yan. May 10, 2017

3D model classification using convolutional neural network

Fei-Fei Li & Justin Johnson & Serena Yeung

Ryerson University CP8208. Soft Computing and Machine Intelligence. Naive Road-Detection using CNNS. Authors: Sarah Asiri - Domenic Curro

Channel Locality Block: A Variant of Squeeze-and-Excitation

Convolutional Neural Networks

Fuzzy Set Theory in Computer Vision: Example 3

Object Detection. CS698N Final Project Presentation AKSHAT AGARWAL SIDDHARTH TANWAR

CNNS FROM THE BASICS TO RECENT ADVANCES. Dmytro Mishkin Center for Machine Perception Czech Technical University in Prague

Weighted Convolutional Neural Network. Ensemble.

Deep Learning Workshop. Nov. 20, 2015 Andrew Fishberg, Rowan Zellers

INF 5860 Machine learning for image classification. Lecture 11: Visualization Anne Solberg April 4, 2018

Deep learning for dense per-pixel prediction. Chunhua Shen The University of Adelaide, Australia

Using Machine Learning for Classification of Cancer Cells

Deep Learning Explained Module 4: Convolution Neural Networks (CNN or Conv Nets)

CENG 783. Special topics in. Deep Learning. AlchemyAPI. Week 11. Sinan Kalkan

Deep Learning For Video Classification. Presented by Natalie Carlebach & Gil Sharon

Lecture 37: ConvNets (Cont d) and Training

Learning Deep Representations for Visual Recognition

COMP 551 Applied Machine Learning Lecture 16: Deep Learning

Deep Convolutional Neural Networks. Nov. 20th, 2015 Bruce Draper

Project 3 Q&A. Jonathan Krause

Convolu'onal Neural Networks

Learning Deep Features for Visual Recognition

Classifying a specific image region using convolutional nets with an ROI mask as input

Deep Learning. Vladimir Golkov Technical University of Munich Computer Vision Group

ECE 6504: Deep Learning for Perception

Two-Stream Convolutional Networks for Action Recognition in Videos

Object Detection and Its Implementation on Android Devices

Convolutional Neural Networks + Neural Style Transfer. Justin Johnson 2/1/2017

Deep Learning & Neural Networks

Convolutional Neural Networks. CSC 4510/9010 Andrew Keenan

Convolutional Neural Network Layer Reordering for Acceleration

Residual Networks And Attention Models. cs273b Recitation 11/11/2016. Anna Shcherbina

Tutorial on Keras CAP ADVANCED COMPUTER VISION SPRING 2018 KISHAN S ATHREY

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks

Content-Based Image Recovery

Advanced Video Analysis & Imaging

Object Detection Lecture Introduction to deep learning (CNN) Idar Dyrdal

Deep Tracking: Biologically Inspired Tracking with Deep Convolutional Networks

Encoder-Decoder Networks for Semantic Segmentation. Sachin Mehta

A FRAMEWORK OF EXTRACTING MULTI-SCALE FEATURES USING MULTIPLE CONVOLUTIONAL NEURAL NETWORKS. Kuan-Chuan Peng and Tsuhan Chen

YOLO9000: Better, Faster, Stronger

Introduction to Neural Networks

Convolution Neural Network for Traditional Chinese Calligraphy Recognition

Object detection using Region Proposals (RCNN) Ernest Cheung COMP Presentation

Global Optimality in Neural Network Training

LSTM and its variants for visual recognition. Xiaodan Liang Sun Yat-sen University

Fully-Convolutional Siamese Networks for Object Tracking

Real-time convolutional networks for sonar image classification in low-power embedded systems

Perceptron: This is convolution!

Real-time Object Detection CS 229 Course Project

6. Convolutional Neural Networks

Todo before next class

ECE 5470 Classification, Machine Learning, and Neural Network Review

Lecture 2 Notes. Outline. Neural Networks. The Big Idea. Architecture. Instructors: Parth Shah, Riju Pahwa

Advanced Introduction to Machine Learning, CMU-10715

Convolutional Neural Networks for Facial Expression Recognition

Transcription:

Deep Learning in Visual Recognition Thanks Da Zhang for the slides

Deep Learning is Everywhere 2

Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object Tracking Our Work Conclusion and Discussion 3

Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object Tracking Our Work Conclusion and Discussion 4

Computer Visual Recognition Definition Visual Recognition deals with how computers can be made to gain high-level understanding from digital images or videos. It seeks to automate tasks that the human visual system can do. Image classification Object detection Semantic segmentation 5

Computer Visual Recognition Semantic gap between human and computer Colorful image 3-D intensity matrix Human Vision Computer Vision 6

Feature Descriptor Feature Descriptor is an algorithm which takes an image and outputs feature vectors. Feature Descriptors encode interesting information and can be used to differentiate one feature from another. Feature Descriptor f Predictions 7

Feature Descriptor Toy Example Feature Descriptor f Class Scores Sky Grass 8

Feature Descriptor - Before Deep Learning Finding better features Corner, blob, SIFT, HoG Training better classifiers Logistic regression, Random forest, Adaboost, Support Vector Machine 9

From Feature Descriptor to Deep Learning Feature Descriptor f Training Class Scores Problems with traditional feature descriptors Hand-crafted features. Only machine learning model is tuned during training. Features are low-level and lack generality. Deep Learning Hierarchical and general features are automatically learned through an end-to-end training process. 10

Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object Tracking Our Work Conclusion and Discussion 11

Auto Encoder It is hard to train multiple layers of neurons together How about training layer-by-layer? But what should be the merit of each layered training? What is the cost function? Auto encoder: best way to reproduce inputs Y = W T X implies X = W -T Y = W -T W T X = X In reality, X = UZV, each W preserves some common patterns in the training sets Think about mapping pixels to super pixels and do that recursively 12

Auto Encoder From individual pixel Sample mxn neighborhood from All training images All neighborhoods of mxn Form an mxn vector SVD of such (mxn)xu matrix captures most common patterns Each hidden layer encodes on such pattern Auto encoder can best reproduce the input pattern by using transpose weight matrix From individual super -pixel Sample kxl neighborhood from All training images All mxn super pixel neighborhood Form an kxl vector SVD of such (kxl)xu matrix captures most common patterns Each hidden layer encodes on such pattern Auto encoder can best reproduce the input pattern by using transpose weight matrix 13

From Pixel Collect all mxn patterns in all training images and all neighbors Linearize into mxn vectors Find From Superpixel Collect all mxn superpixel patterns 14

Convolution 64x64x3 image x 5x5x3 filter w 64 height Convolve the filter with the image Slide over the image spatially, computing dot products 64 width 3 depth 15

Convolution 64x64x3 image x 64 5x5x3 filter w 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image. w T x + b 64 3 16

Convolution 64x64x3 image x activation/feature map 64 60 Convolve over all spatial locations 3 5x5x3 filter w 64 1 60 17

Convolution - Intuition Intuition of convolution Similarity measurement: image chunk and convolutional filter. Looking for local patterns. Patterns are not fixed and will be learned during training. 18

Convolutional Layer Stacking multiple convolutions in one layer activation/feature map 64 Convolution Layer 60 3 5x5x3 filter w 64 4 60 19

Convolutional Feature Maps A feature hierarchy by stacking convolutional layers image x features f1 features f2 64 60 56 CONV, Non-linear CONV, Non-linear CONV, Non-linear 56 3 64 4 60 16 20

Max Pooling Max pooling operation Slides a small non-overlapping window. Picks the maximum value inside each window. 2x2 max pooling 1 maximum number: the maximum value inside 2x2 window. 21

Max Pooling Max pooling operation Slides a small non-overlapping window. Picks the maximum value inside each window. 2x2 max pooling 32 64 Max pooling 32 64 22

Max Pooling Layer Advantages Reduces the redundancy of convolutions. Makes the representations smaller and more manageable. Reduces the number of parameters, controls overfitting. Invariant to small transformations, distortions and transitions. 224x224x256 112x112x256 Max pooling Layer 23

Fully Connected Layer A classifier on top of feature maps Maps high-dimensional matrix to predictions (1-D vector) feature maps fully connected layers Predictions 24

Convolutional Neural Network First CNN trained with backpropagation Three different layers: convolution, subsampling, fully-connected Training CNN: gradient-decent optimization (backpropagation) [1] LeCun, Yann, et al. "Gradient-based learning applied to document recognition." Proceedings of the IEEE 86.11 (1998): 2278-2324. 25

Training Convolution Neural Network X Gradient Descent Randomly initialize weights at each layer. Compute a forward pass and calculate the cost function. Calculate the with respect to each weight during a backward pass. Update weights. Compute activations Y Loss Compute gradients and update weights L 26

Traditional vs. Deep Learning Feature Descriptor f Training Predictions f Training Predictions 27

Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object Tracking Our Work Conclusion and Discussion 28

AlexNet 2012 IMAGENET classification task winner Advances in AlexNet ReLU activation function Dropout regularization GPU Implementation and dataset benchmark [1] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012. 29

Rectified Linear Unit (ReLU) Calculate gradient with backpropagation Advantages of ReLU Fast convergence. Better for image data. 30

Dropout Regularization Randomly set some neurons to zero in the forward pass. Interpretation Hidden units cannot co-adapt to other units. Hidden units must be more generally useful. 31

Benchmark and GPU Computation Dataset benchmarking GPU Computation 32

Visualizing Convolutional Networks Deconvolutional Network Map activations back to the input pixel space, show what input pattern originally caused a given activation. Steps Compute activations at a specific layer. Keep one activation and set all others to zero. Unpooling and deconvolution Construct input image [1] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European Conference on Computer Vision. Springer International Publishing, 2014. 33

Visualizing Convolutional Networks Layer1 [1] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European Conference on Computer Vision. Springer International Publishing, 2014. 34

Visualizing Convolutional Networks Layer2 [1] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European Conference on Computer Vision. Springer International Publishing, 2014. 35

Visualizing Convolutional Networks Layer3 [1] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European Conference on Computer Vision. Springer International Publishing, 2014. 36

Visualizing Convolutional Networks Layer4 [1] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European Conference on Computer Vision. Springer International Publishing, 2014. 37

Visualizing Convolutional Networks Layer5 [1] Zeiler, Matthew D., and Rob Fergus. "Visualizing and understanding convolutional networks." European Conference on Computer Vision. Springer International Publishing, 2014. 38

VGGNet Characteristics Much deeper Only 3x3 CONV and 2x2 MAX POOL are used [1] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arxiv preprint arxiv:1409.1556(2014). 39

ResNet Problems with stacking more layers Optimization Problem [1] He, Kaiming, et al. "Deep residual learning for image recognition." arxiv preprint arxiv:1512.03385 (2015). 40

ResNet Train weight layers to fit F(x) rather than H(x) Easy to find identity mapping and small fluctuations. [1] He, Kaiming, et al. "Deep residual learning for image recognition." arxiv preprint arxiv:1512.03385 (2015). 41

ResNet [1] He, Kaiming, et al. "Deep residual learning for image recognition." arxiv preprint arxiv:1512.03385 (2015). 42

ResNet [1] He, Kaiming, et al. "Deep residual learning for image recognition." arxiv preprint arxiv:1512.03385 (2015). 43

Evolution of CNN Evolutional trend Performance getting better Architecture goes deeper Design becomes simpler Features CNN Prediction CNN is the basic building block for recent visual system Hierarchical feature descriptor. Designed to solve different recognition problems. 44

Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object Tracking Our Work Conclusion and Discussion 45

Object Detection Image Classification CNN CNN Class Scores Object Detection CNN? Class Scores Location 46

Region Proposal Find potential regions that might have objects. Example: Edge Boxes The number of contours that are wholly contained in a bounding box is indicative of the likelihood of the box containing an object Good Bad [1] Zitnick, C. Lawrence, and Piotr Dollár. "Edge boxes: Locating object proposals from edges." European Conference on Computer Vision. Springer International Publishing, 2014. 47

Region CNN [1] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014. 48

Fast R-CNN [1] Girshick, Ross. "Fast r-cnn." Proceedings of the IEEE International Conference on Computer Vision. 2015. 49

Faster R-CNN An end-to-end trainable neural network architecture using the same CNN feature map for both region proposal and proposal classification. [1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015. 50

Model Comparison R-CNN Traditional region proposal + CNN classifier for each proposal Fast R-CNN Traditional region proposal + CNN classifier for entire image Faster R-CNN An unified CNN architecture for region proposal & proposal classification R-CNN Fast R-CNN Faster R-CNN Test time 50 s 2 s 0.2 s map (%) 66.0 66.9 66.9 AlexNet VGG-16 ResNet-101 map (%) 62.1 73.2 76.4 Using different CNNs in faster R-CNN 51

Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object Tracking Our Work Conclusion and Discussion 52

Object Tracking Image Classification & Object Detection (discrete, static) Object Tracking (continuous, temporal) 53

Tracking by Classification Tracking by classification Train a CNN classifier. Randomly sample potential bounding boxes in new frame. Classify each bounding box and pick the optimal one. Update CNN weights during tracking. [1] Wang, Naiyan, et al. "Transferring rich feature hierarchies for robust visual tracking." arxiv preprint arxiv:1501.04587 (2015). [2] Wang, Lijun, et al. "Visual tracking with fully convolutional networks."proceedings of the IEEE International Conference on Computer Vision. 2015. [3] Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." arxiv preprint arxiv:1510.07945 (2015). 54

MDNet Train a CNN classifier offline Training videos Negative sample Classification score Positive sample [1] Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." arxiv preprint arxiv:1510.07945 (2015). 55

MDNet Apply the classifier for each consecutive potential regions [1] Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." arxiv preprint arxiv:1510.07945 (2015). 56

MDNet Fine-tune the classifier online during tracking [1] Nam, Hyeonseob, and Bohyung Han. "Learning multi-domain convolutional neural networks for visual tracking." arxiv preprint arxiv:1510.07945 (2015). 57

Tracking by Regression Two Convolutional neural networks A search region from the current frame. A target from the previous frame. Compare crops and find target object. [1] Held, David, Sebastian Thrun, and Silvio Savarese. "Learning to Track at 100 FPS with Deep Regression Networks." arxiv preprint arxiv:1604.01802(2016). 58

Roadmap Introduction Convolutional Neural Network Application Image Classification Object Detection Object Tracking Our Work Conclusion and Discussion 59

Recurrent Neural Network Recurrent Neural Network Better for processing sequential information. Creates and updates a hidden state. The hidden state captures a history of inputs. s t = f(ux t + Ws t 1 ) 60

Different RNN Cells s t RNN Cell s t+1 s t = f W (x t, s t 1 ) x t RNN Cells Basic RNN Cell, LSTM Cell, GRU Cell and etc. 61

Proposed Framework 62

Preliminary Results 63

Discussion Computer Visual Recognition Discrete, static images Continuous, temporal videos Image Classification: image => class scores Convolution Neural Network AlexNet, VGGNet, ResNet Object Detection: image => location+scores Region proposal+cnn End-to-end CNN design Object Tracking: video => location sequence Tracking by classification Tracking by regression Our work: Recurrent Neural Network Reinforcement Learning End-to-end tracker design [1] https://docs.google.com/spreadsheets/d/1q-61cqxzt3ehisumsnzeqmmmviorftijtjjoewrh3-0/edit#gid=0 64

Others: LSTM Traditional RNN have very simple internal structures Long-term propagation can be tricky 65

Others: LSTM LSTM has more elaborate internal structure Think about fancy Kalman Filter or HMM 66

Others: LSTM It maintains an internal state (C t a bit string) 67

Others: LSTM State vector evolves with time Influenced by inputs (x t ), prior state (C t-1 ) and prior output (h t-1 ) Can be forgotten (f t : remembering, or forgotten, factor) 68

Others: LSTM State vector evolves with time Influenced by inputs (x t ), prior state (C t-1 ) and prior output (h t-1 ) Can be augmented and changed by current input i t : changed (new) factor C t : changed (new)state 69

Others: LSTM State vector evolves with time Influenced by inputs (x t ), prior state (x t-1 ) and prior output (h t-1 ) These factors are then combined to generate next state 70

Others: LSTM Output Vectors Determined by state and input 71

Others: LSTM Previously, state (C) is assumed to be hidden (not directly observable), similar to Kalman Filter HMM 72

Others: Style Transfer Network Networks with two kinds of losses Content losses Style losses 73

74

Others: GAN Dual competing networks: Generative + discriminative Generator Seeded with randomized inputs with predefined distribution Back propagate with negation of classification error of the discriminator Discriminator CNN: trained with certain data sets Goal: Generator to learn the Discriminator distribution as to generate results that are indistinguishable from training data to the Discriminator 75

Conclusion and Discussion Advances make CNN very efficient today Big data and great computational power. Research advancements: ReLU, dropout and etc. CNN is the basic building block for computer vision system Hierarchical and general features. End-to-end training. Still lots of unsolved problems in visual recognition Special image analysis (satellite and etc.) Video analysis Multi-view analysis (activity recognition and etc.) New architecture + new application 76