Deep Neural Networks: Part II, Convolutional Neural Network (CNN)


Deep Neural Networks: Part II, Convolutional Neural Network (CNN). Yuan-Kai Wang, 2016. Course web site: http://pattern-recognition.weebly.com. Source: CNN for Image Classification, by S. Lazebnik, 2014.

A Convolutional Neural Network (CNN) for Image Classification. It is deep: 6 hidden layers, each with a special purpose. 2

Overview: shallow vs. deep architectures; background; stages of CNN architecture; visualizing CNNs; state-of-the-art results; packages. 3

Traditional Recognition Approach: Image/Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class. Features are not learned; the trainable classifier is often generic (e.g. SVM). 4

Traditional Recognition Approach: Features are key to recent progress in recognition, and a multitude of hand-designed features are currently in use (SIFT, HOG, ...). Where next? Better classifiers, or keep building more features? Felzenszwalb, Girshick, McAllester and Ramanan, PAMI 2007; Yan & Huang (winner of the PASCAL 2010 classification competition). 5

What about learning the features? Learn a feature hierarchy all the way from pixels to classifier; each layer extracts features from the output of the previous layer; train all layers jointly. Image/Video Pixels → Layer 1 → Layer 2 → Layer 3 → Simple Classifier. 6

Shallow vs. deep architectures. Traditional recognition (shallow architecture): Image/Video Pixels → Hand-designed feature extraction → Trainable classifier → Object Class. Deep learning (deep architecture): Image/Video Pixels → Layer 1 → ... → Layer N → Simple classifier → Object Class. 7

Background: Perceptrons. Inputs x_1, x_2, ..., x_d with weights w_1, w_2, ..., w_d. Output: σ(w·x + b). Sigmoid function: σ(t) = 1 / (1 + e^(-t)). 8
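
A minimal NumPy sketch of this perceptron (the input, weights, and bias below are made-up illustrative values, not anything from the slides):

```python
import numpy as np

def sigmoid(t):
    # Sigmoid non-linearity: squashes any real value into (0, 1).
    return 1.0 / (1.0 + np.exp(-t))

def perceptron(x, w, b):
    # Weighted sum of the inputs plus a bias, passed through the sigmoid.
    return sigmoid(np.dot(w, x) + b)

# Illustrative values only.
x = np.array([0.5, -1.2, 3.0])   # inputs x_1 ... x_d
w = np.array([0.4,  0.1, -0.6])  # weights w_1 ... w_d
b = 0.2                          # bias
print(perceptron(x, w, b))       # a value between 0 and 1
```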

Inspiration: Neuron cells 9

Background: Multi-Layer Neural Networks. A nonlinear classifier. Training: find the network weights w that minimize the error between the true training labels y_i and the estimated labels f_w(x_i): E(w) = Σ_{i=1}^{N} (y_i - f_w(x_i))^2. Minimization can be done by gradient descent, provided f is differentiable; this training method is called back-propagation. 10
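
As a toy illustration of this training procedure, the sketch below fits a one-hidden-layer network to a small made-up dataset by back-propagating the squared error and taking gradient-descent steps; the layer sizes, learning rate, and data are arbitrary choices for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))            # 20 toy inputs with 3 features
y = rng.normal(size=(20, 1))            # toy regression targets

# One hidden layer of sigmoid units, linear output unit.
W1 = rng.normal(scale=0.1, size=(3, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 1)); b2 = np.zeros(1)
sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))

lr = 0.1
for step in range(500):
    # Forward pass: compute f_w(x).
    h = sigmoid(X @ W1 + b1)
    pred = h @ W2 + b2
    err = pred - y                       # dE/dpred (up to a factor of 2)
    # Backward pass: chain rule, i.e. back-propagation.
    dW2 = h.T @ err; db2 = err.sum(0)
    dh = err @ W2.T * h * (1 - h)
    dW1 = X.T @ dh; db1 = dh.sum(0)
    # Gradient descent step on E(w) = sum_i (y_i - f_w(x_i))^2.
    W2 -= lr * dW2 / len(X); b2 -= lr * db2 / len(X)
    W1 -= lr * dW1 / len(X); b1 -= lr * db1 / len(X)
```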

Hubel/Wiesel Architecture. D. Hubel and T. Wiesel (1959, 1962; Nobel Prize 1981): the visual cortex consists of a hierarchy of simple, complex, and hyper-complex cells. 11

Convolutional Neural Networks (CNN, ConvNet). A neural network with a specialized connectivity structure: stack multiple stages of feature extractors; higher stages compute more global, more invariant features; a classification layer sits at the end. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, Proceedings of the IEEE 86(11): 2278-2324, 1998. 12

Convolutional Neural Networks (CNN, ConvNet). Feed-forward feature extraction: 1. convolve the input with learned filters; 2. non-linearity; 3. spatial pooling; 4. normalization. Supervised training of the convolutional filters by back-propagating the classification error. Pipeline: Input Image → Convolution (learned) → Non-linearity → Spatial pooling → Normalization → Feature maps. 13
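
Such a stage can be expressed, for example, with PyTorch building blocks; this is a hedged sketch in which the channel counts, kernel sizes, and the choice of local response normalization are illustrative, not the slides' exact architecture:

```python
import torch
import torch.nn as nn

# Input image -> Convolution (learned) -> Non-linearity -> Spatial pooling -> Normalization
stage = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1, padding=2),  # learned filters
    nn.ReLU(),                     # per-element non-linearity
    nn.MaxPool2d(kernel_size=2),   # spatial pooling
    nn.LocalResponseNorm(size=5),  # normalization across feature maps
)

x = torch.randn(1, 3, 64, 64)      # a dummy RGB input image
feature_maps = stage(x)
print(feature_maps.shape)          # torch.Size([1, 16, 32, 32])
```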

1. Convolution. Dependencies are local; translation invariance; few parameters (the filter weights); the stride can be greater than 1 (faster, less memory). (Diagram: Input → Feature Map.) 14
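
A minimal NumPy sketch of one strided 2-D convolution (implemented, as in most CNN libraries, as cross-correlation); the filter and stride are arbitrary examples:

```python
import numpy as np

def conv2d(image, filt, stride=1):
    # Slide one small filter over the image; each output value depends only
    # on a local patch, and the same few filter weights are reused everywhere.
    H, W = image.shape
    k = filt.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + k, j * stride:j * stride + k]
            out[i, j] = np.sum(patch * filt)
    return out

image = np.random.rand(8, 8)
filt = np.array([[1., 0., -1.]] * 3)        # a simple 3x3 edge-like filter
print(conv2d(image, filt, stride=2).shape)  # (3, 3): stride > 1 shrinks the feature map
```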

2. Non-Linearity. Applied per element (independently). Options: tanh; sigmoid 1/(1+exp(-x)); rectified linear unit (ReLU), which simplifies backpropagation, makes learning faster, and avoids saturation issues → the preferred option. 15
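
The three options, written as element-wise NumPy functions (a simple sketch):

```python
import numpy as np

def tanh(x):     return np.tanh(x)                 # saturates at -1 / +1
def sigmoid(x):  return 1.0 / (1.0 + np.exp(-x))   # saturates at 0 / 1
def relu(x):     return np.maximum(0.0, x)         # no saturation for x > 0

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(tanh(x), sigmoid(x), relu(x), sep="\n")
```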

3. Spatial Pooling. Sum or max, over non-overlapping or overlapping regions. Role of pooling: invariance to small transformations, and larger receptive fields (each output sees more of the input). 16
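
A NumPy sketch of non-overlapping max pooling (the pool size is an illustrative choice):

```python
import numpy as np

def max_pool(feature_map, size=2):
    # Take the maximum over each non-overlapping size x size region:
    # small shifts of the input leave the output largely unchanged,
    # and each output value now "sees" a larger part of the input.
    H, W = feature_map.shape
    H, W = H - H % size, W - W % size            # crop to a multiple of the pool size
    blocks = feature_map[:H, :W].reshape(H // size, size, W // size, size)
    return blocks.max(axis=(1, 3))

fm = np.random.rand(6, 6)
print(max_pool(fm, size=2).shape)   # (3, 3)
```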

4. Normalization. Within or across feature maps; before or after spatial pooling. (Figure: feature maps before and after contrast normalization.) 17
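
As one concrete instance of normalization across feature maps, here is a NumPy sketch of AlexNet-style local response normalization; the constants k, alpha, beta and the neighbourhood size n are the commonly published defaults, used here purely for illustration:

```python
import numpy as np

def local_response_norm(a, n=5, k=2.0, alpha=1e-4, beta=0.75):
    # a: feature maps of shape (channels, H, W). Each activation is divided by
    # a term that grows with the squared activations at the same spatial
    # position in n neighbouring feature maps.
    C = a.shape[0]
    out = np.empty_like(a)
    for i in range(C):
        lo, hi = max(0, i - n // 2), min(C, i + n // 2 + 1)
        denom = (k + alpha * np.sum(a[lo:hi] ** 2, axis=0)) ** beta
        out[i] = a[i] / denom
    return out

maps = np.random.rand(16, 8, 8)
print(local_response_norm(maps).shape)   # (16, 8, 8)
```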

Compare: SIFT Descriptor (Lowe, IJCV 2004). Image Pixels → Apply oriented filters → Spatial pool (sum) → Normalize to unit length → Feature Vector. 18

Compare: Spatial Pyramid Matching (Lazebnik, Schmid, Ponce, CVPR 2006). SIFT features → Filter with Visual Words → Take max VW response (L-inf normalization) → Multi-scale spatial pool (sum) → Global image descriptor. 19

ConvNet Successes. Handwritten text/digits: MNIST (0.17% error, Ciresan et al. 2011); Arabic and Chinese (Ciresan et al. 2012). Simpler recognition benchmarks: CIFAR-10 (9.3% error, Wan et al. 2013); traffic sign recognition (0.56% error vs. 1.16% for humans, Ciresan et al. 2011). But until recently, less good at more complex datasets, e.g. Caltech-101/256 (few training examples). 20

ImageNet Challenge 2012. ~14 million labeled images, 20k classes; images gathered from the Internet; human labels via Amazon Mechanical Turk [Deng et al. CVPR 2009]. Challenge: 1.2 million training images, 1000 classes. A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012. 21

ImageNet Challenge 2012. Similar framework to LeCun '98, but: a bigger model (7 hidden layers, 650,000 units, 60,000,000 parameters); more data (10^6 vs. 10^3 images); a GPU implementation (50x speedup over CPU), trained on two GPUs for a week; better regularization for training (Dropout). A. Krizhevsky, I. Sutskever, and G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, NIPS 2012. 22

ImageNet Challenge 2012. Krizhevsky et al.: 16.4% top-5 error; next best (non-ConvNet): 26.2% error. (Bar chart: top-5 error rate (%) for SuperVision, ISI, Oxford, INRIA, Amsterdam.) 23

Visualizing ConvNets. M. Zeiler and R. Fergus, Visualizing and Understanding Convolutional Networks, arXiv preprint, 2013. 24

Meaning of Each Layer in ConvNets. It is deep: 7 hidden layers. 25

Layer 1 Filters 26

Layer 1: Top-9 Patches 27

Layer 2: Top-9 Patches. Patches from validation images that give maximal activation of a given feature map. 28

Layer 2: Top-9 Patches 29

Layer 3: Top-9 Patches 30

Layer 3: Top-9 Patches 31

Layer 4: Top-9 Patches 32

Layer 4: Top-9 Patches 33

Layer 5: Top-9 Patches 34

Layer 5: Top-9 Patches 35

Evolution of Features During Training 36

Evolution of Features During Training 37

Diagnosing Problems. Visualization of Krizhevsky et al.'s architecture showed some problems with layers 1 and 2: a large stride of 4 was used. Altering the architecture (smaller stride and filter size) makes the visualizations look better and improves performance. (Comparison: Krizhevsky et al. vs. Zeiler and Fergus.) 38

Occlusion Experiment. Mask parts of the input with an occluding square and monitor the output. 39

(Slides 40-45: three example images. For each, one panel shows the input image with p(true class) and the most probable class as the occluder moves; another shows the total activation in the most active 5th-layer feature map and other activations from the same feature map.)

ImageNet Classification 2013 Results. http://www.image-net.org/challenges/lsvrc/2013/results.php Demo: http://www.clarifai.com/ (Bar chart: test error (top-5) of the 2013 entries, roughly 0.11-0.17.) 46

How important is depth? Architecture of Krizhevsky et al. (8 layers total), trained on ImageNet: 18.1% top-5 error. Input Image → Layer 1: Conv + Pool → Layer 2: Conv + Pool → Layer 3: Conv → Layer 4: Conv → Layer 5: Conv + Pool → Layer 6: Full → Layer 7: Full → Softmax Output. 47
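
For reference, an eight-layer stack of this kind can be instantiated and inspected with torchvision's AlexNet re-implementation; this is a hedged sketch, not the exact two-GPU 2012 model, but its layer structure and parameter count are close:

```python
from torchvision import models

# An AlexNet-style model: 5 convolutional layers (some followed by pooling),
# then 3 fully connected layers, ending in a 1000-way softmax.
net = models.alexnet()   # weights can optionally be loaded from ImageNet pretraining
print(net)

n_params = sum(p.numel() for p in net.parameters())
print(f"{n_params / 1e6:.1f}M parameters")   # roughly 61M, close to the slides' 60,000,000
```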

How important is depth? Remove the top fully connected layer (Layer 7): this drops 16 million parameters, with only a 1.1% drop in performance! 48

How important is depth? Remove both fully connected layers (Layers 6 and 7): this drops ~50 million parameters, with a 5.7% drop in performance. 49

How important is depth? Now try removing the upper feature-extractor layers (Layers 3 and 4): this drops ~1 million parameters, with a 3.0% drop in performance. 50

How important is depth? Now try removing the upper feature-extractor layers and the fully connected layers (Layers 3, 4, 6, 7), leaving only 4 layers: a 33.5% drop in performance. → Depth of the network is key. 51

Tapping off Features at Each Layer. Plug the features from each layer into a linear SVM or softmax.

CNN Libraries (open source): Cuda-convnet (Google): C/C++, Python; Caffe (Berkeley): C/C++, Matlab, Python; Overfeat (NYU): C/C++; TensorFlow (Google): C++, Python; Torch: Lua, C/C++; ConvNetJS: JavaScript; MatConvNet (VLFeat): Matlab; DeepLearn Toolbox: Matlab. 53

Using CNN Features on Other Datasets. Take a model trained on, e.g., the ImageNet 2012 training set; take the outputs of the 6th or 7th layer, before or after the nonlinearity; classify the test set of the new dataset. Optional: fine-tune the features and/or classifier on the new dataset. 54
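
A hedged sketch of this recipe using a modern toolchain, with torchvision's pretrained AlexNet standing in for the 2012 model; dataset loading is omitted and the image batch below is a placeholder:

```python
import torch
from torchvision import models

# 1. Take a model trained on ImageNet (torchvision's AlexNet weights, downloaded on first use).
net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
net.eval()

# 2. Tap the output of the second fully connected layer (fc7-like, 4096-d features),
#    i.e. drop the final 1000-way classification layer.
feature_extractor = torch.nn.Sequential(net.features, net.avgpool, torch.nn.Flatten(),
                                        *list(net.classifier.children())[:-1])

# 3. Extract features for the new dataset (dummy batch of 4 preprocessed 224x224 images here),
#    then train any simple classifier (linear SVM, logistic regression, ...) on them.
with torch.no_grad():
    images = torch.randn(4, 3, 224, 224)
    feats = feature_extractor(images)
print(feats.shape)   # torch.Size([4, 4096])

# 4. Optional: fine-tune by un-freezing some layers and continuing training on the new data.
```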

Results on misc. benchmarks: [1] Caltech-101 (30 samples per class); [1] SUN 397 dataset (DeCAF); [1] Caltech-UCSD Birds (DeCAF); [2] MIT-67 Indoor Scenes dataset (OverFeat). [1] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, arXiv preprint, 2014. [2] A. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, CNN Features off-the-shelf: an Astounding Baseline for Recognition, arXiv preprint, 2014. 55

CNN features for detection. Object detection system overview: the system (1) takes an input image, (2) extracts around 2000 bottom-up region proposals, (3) computes features for each proposal using a large convolutional neural network (CNN), and then (4) classifies each region using class-specific linear SVMs. R-CNN achieves a mean average precision (mAP) of 53.7% on PASCAL VOC 2010. For comparison, Uijlings et al. (2013) report 35.1% mAP using the same region proposals but with a spatial pyramid and bag-of-visual-words approach; the popular deformable part models perform at 33.4%. R. Girshick, J. Donahue, T. Darrell, and J. Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation, CVPR 2014, to appear. 56
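
In outline, the pipeline the caption describes looks like the pseudocode sketch below; the callables propose_regions, warp, cnn_features, and svm_per_class are hypothetical placeholders, not the authors' actual code:

```python
def rcnn_detect(image, propose_regions, warp, cnn_features, svm_per_class):
    """Sketch of the R-CNN detection pipeline (Girshick et al., CVPR 2014).

    All four callables are hypothetical stand-ins: a region-proposal method
    (e.g. selective search), a function warping a crop to the CNN input size,
    a pretrained CNN used as a feature extractor, and one linear SVM per class.
    """
    detections = []
    # (2) Extract ~2000 bottom-up region proposals.
    for box in propose_regions(image):
        crop = warp(image, box)            # resize the proposal to the CNN's fixed input size
        feats = cnn_features(crop)         # (3) compute CNN features for the proposal
        # (4) Score the region with each class-specific linear SVM.
        for cls, svm in svm_per_class.items():
            score = svm.decision_function([feats])[0]
            if score > 0:
                detections.append((cls, box, score))
    # Per-class non-maximum suppression would normally follow here.
    return detections
```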

CNN features for face verification Y. Taigman, M. Yang, M. Ranzato, L. Wolf, DeepFace: Closing the Gap to Human-Level Performance in Face Verification, CVPR 2014, to appear. 57

References: CNNs.
Classic CNN paper (LeNet): Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 86(11): 2278-2324, 1998.
Deep belief nets (layer-wise pretraining): G. E. Hinton, S. Osindero, and Y.-W. Teh, "A Fast Learning Algorithm for Deep Belief Nets," Neural Computation, vol. 18, no. 7, pp. 1527-1554, Jul. 2006.
Dropout: N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, "Dropout: A Simple Way to Prevent Neural Networks from Overfitting," Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929-1958, Jan. 2014.
Transfer learning and pretraining: Y. Bengio, "Deep learning of representations for unsupervised and transfer learning," JMLR W&CP: Proc. Unsupervised and Transfer Learning, 2012. 58

References: R-CNN.
R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Region-Based Convolutional Networks for Accurate Object Detection and Segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 1, pp. 142-158, 2016.
S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," Advances in Neural Information Processing Systems, pp. 91-99, 2015. 59