Learning Hierarchical Features for Scene Labeling

1 Learning Hierarchical Features for Scene Labeling
FB Informatik, Knowledge Engineering Group, Prof. Dr. Johannes Fürnkranz
Seminar Machine Learning
Author: Tanya Harizanova
Seminar aus maschinellem Lernen 11

2 Contents
- Introduction
- Multiscale Feature Extraction for Scene Parsing
- Scene Labeling Strategies
- Experiments
- Important Insights on the Experiments
- Conclusion
- Questions / Discussion

3 Introduction: Scene Parsing
Scene parsing (full-scene labeling): labeling every pixel in an image with the category of the object it belongs to.

4 Introduction: Scene Parsing (2)
Questions in scene parsing:
- How to produce a good internal representation of the visual information?
- How to use contextual information to ensure the self-consistency of the interpretation?
The paper presents a scene parsing system that relies on deep learning methods to address both questions.
Main idea: use a convolutional network operating on a large input window to produce label hypotheses for each pixel location.

5 Introduction: Convolutional Network
Convolutional networks are trainable hierarchical architectures composed of multiple stages, each of which contains three layers: a filter bank module, a non-linearity module, and a spatial pooling module. A typical convolutional network is composed of two or three such stages, followed by a classification module.
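A minimal sketch of one such stage (filter bank, then non-linearity, then spatial pooling) in NumPy may help make the structure concrete. The 3x3 kernel size, the choice of tanh, the 2x2 max pooling, and all function names are assumptions of this sketch, not details taken from the slide.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Plain 'valid' 2D cross-correlation of a single-channel image with one kernel."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            out[y, x] = np.sum(image[y:y + kh, x:x + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = (feature_map.shape[0] // size) * size
    w = (feature_map.shape[1] // size) * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def stage(image, kernels, biases):
    """One stage: filter bank -> tanh non-linearity -> spatial pooling."""
    maps = [max_pool(np.tanh(conv2d_valid(image, k) + b))
            for k, b in zip(kernels, biases)]
    return np.stack(maps)

rng = np.random.default_rng(0)
image = rng.standard_normal((12, 12))      # toy single-channel input
kernels = rng.standard_normal((4, 3, 3))   # 4 hypothetical 3x3 filters
biases = np.zeros(4)
features = stage(image, kernels, biases)
print(features.shape)  # (4, 5, 5)
```

Stacking two or three calls of this kind, with the output maps of one stage feeding the next, gives the hierarchy described on the slide.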

6 Introduction: Convolutional Network (2)
Problem: labeling each pixel by looking at a small region around it is difficult. The category of a pixel may depend on relatively short-range information, but may also depend on long-range information.
Solution: a multi-scale convolutional network can take a large input window into account while keeping the number of free parameters to a minimum.

7 Introduction: Scene Parsing Architecture
The scene parsing architecture of this system relies on two main components:
1. Multi-scale convolutional representation
2. Graph-based classification: superpixels, a conditional random field over superpixels, or a multilevel cut with a class purity criterion

9 Multiscale Feature Extraction for Scene Parsing: Scale-invariant, scene-level feature extraction
- Good internal representations are hierarchical.
- Convolutional networks provide a simple framework to learn such hierarchies of features, composed of multiple stages.
- The feature extractor of this model is a three-stage convolutional network.
- The convolutional kernels are the parameters that are actually subject to training.

10 Multiscale Feature Extraction for Scene Parsing: Scale-invariant, scene-level feature extraction
Convention: banks of images are treated as 3D arrays.
The maps of the pyramid are computed using a scaling/normalisation function $g_s$ as
$$X_s = g_s(I).$$
For each network $f_s$, $s \in \{1, \dots, N\}$, with $L$ layers:
$$f_s(X_s; \theta_s) = W_L H_{L-1},$$
where the vector of hidden units at layer $l$ is
$$H_l = \mathrm{pool}(\tanh(W_l H_{l-1} + b_l)).$$
Written per feature map $p$:
$$H_{lp} = \mathrm{pool}\Big(\tanh\Big(b_{lp} + \sum_{q \in \mathrm{parents}(p)} w_{lpq} \ast H_{l-1,q}\Big)\Big).$$
The outputs of the $N$ networks are upsampled and concatenated to produce $F$:
$$F = [f_1, u(f_2), \dots, u(f_N)],$$
where $u$ is an upsampling function.
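The pyramid, per-scale network, and concatenation steps can be sketched as follows. This is only an illustration under stated assumptions: `downsample` stands in for $g_s$ (average pooling, with the normalisation omitted), `toy_network` is a placeholder for $f_s$, the same weights are reused at every scale, and nearest-neighbour repetition plays the role of $u$. None of these choices is given on the slide.

```python
import numpy as np

def downsample(image, factor):
    """Average-pool by an integer factor (stand-in for the scaling function g_s)."""
    h, w = image.shape[0] // factor, image.shape[1] // factor
    cropped = image[:h * factor, :w * factor]
    return cropped.reshape(h, factor, w, factor).mean(axis=(1, 3))

def upsample(feature_maps, factor):
    """Nearest-neighbour upsampling u(.) of a (channels, H, W) array."""
    return feature_maps.repeat(factor, axis=1).repeat(factor, axis=2)

def toy_network(x, weights):
    """Placeholder for f_s: one tanh feature map per weight (hypothetical)."""
    return np.stack([np.tanh(w * x) for w in weights])

image = np.arange(64, dtype=float).reshape(8, 8) / 64.0
weights = [0.5, 1.0, 2.0]   # assumed to be shared across all scales
scales = [1, 2, 4]          # N = 3 pyramid levels

outputs = []
for factor in scales:
    x_s = downsample(image, factor)         # X_s = g_s(I)
    f_s = toy_network(x_s, weights)         # f_s(X_s; theta_s)
    outputs.append(upsample(f_s, factor))   # u(f_s); factor 1 leaves f_1 unchanged

F = np.concatenate(outputs, axis=0)         # F = [f_1, u(f_2), ..., u(f_N)]
print(F.shape)  # (9, 8, 8)
```

Each pixel location ends up with a 9-dimensional feature vector here: three maps from each of the three scales.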

11 Multiscale Feature Extraction for Scene Parsing: Learning discriminative scale-invariant features
Training uses a multiclass cross-entropy loss function.
For each pixel $i$, the normalized predicted probability distribution over classes $\hat{c}_i$ is compared with the target distribution $c_i$. The prediction is computed with the softmax function:
$$\hat{c}_{i,a} = \frac{e^{w_a^T F_i}}{\sum_{b \in \mathrm{classes}} e^{w_b^T F_i}}$$
$W$ is a temporary weight matrix only used to learn the features.
The cross entropy between the predicted class distribution $\hat{c}$ and the target class distribution $c$ penalizes their deviation and is measured by
$$L_{cat} = -\sum_{i \in \mathrm{pixels}} \sum_{a \in \mathrm{classes}} c_{i,a} \ln(\hat{c}_{i,a})$$
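The softmax and cross-entropy formulas above can be checked numerically with a small example. The feature vectors, weight matrix, and targets below are made-up toy values, and storing the class vectors $w_a$ as columns of `W` is a convention chosen for this sketch.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax; subtracting the row maximum avoids overflow."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(predicted, target):
    """L_cat = -sum_i sum_a c_{i,a} * ln(c_hat_{i,a})."""
    return -np.sum(target * np.log(predicted))

# Toy data: 3 pixels with 2-dimensional feature vectors F_i, 2 classes.
F = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
W = np.array([[2.0, -1.0],
              [-1.0, 2.0]])          # column a holds w_a
c = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 0.0]])           # one-hot target distributions

c_hat = softmax(F @ W)               # c_hat[i, a] as in the softmax formula
loss = cross_entropy(c_hat, c)
print(c_hat.round(3))
print(round(float(loss), 4))
```

The third pixel has equal scores for both classes, so its predicted distribution is uniform and it contributes $\ln 2$ to the loss.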

13 Scene Labeling Strategies
The simplest strategy for scene labeling is to use a linear classifier and assign to each pixel the argmax of the prediction at its location. The resulting labeling $l$, although fairly accurate, is not visually satisfying, as it lacks spatial consistency and precise delineation of objects.
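As a small illustration of the argmax strategy, assuming the classifier output is available as an array of per-pixel class scores (the values below are invented):

```python
import numpy as np

# Hypothetical class scores for a 2x2 image and 3 classes, shape (H, W, classes).
scores = np.array([[[0.1, 0.7, 0.2], [0.6, 0.3, 0.1]],
                   [[0.2, 0.2, 0.6], [0.3, 0.4, 0.3]]])

# Each pixel gets the index of its highest-scoring class, independently of its neighbors.
labels = scores.argmax(axis=-1)
print(labels)
# [[1 0]
#  [2 1]]
```

Because every pixel is decided on its own, nothing in this step enforces agreement between neighboring pixels, which is where the lack of spatial consistency comes from.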

14 Scene Labeling Strategies: Superpixels
- Predicting the class of each pixel independently from its neighbors yields noisy predictions.
- Classify each location of the image densely and aggregate these predictions in each superpixel by computing the average class distribution within the superpixel.
- Superpixels do not involve a global understanding of the scene.
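The aggregation step can be sketched as below, assuming a superpixel segmentation is already available as an integer map. How the superpixels are computed is outside this sketch, and the function name and toy values are hypothetical.

```python
import numpy as np

def label_superpixels(probs, segments):
    """Average class distributions inside each superpixel, then take the argmax.

    probs:    (H, W, C) per-pixel class distributions
    segments: (H, W) integer superpixel ids, numbered from 0
    """
    n_classes = probs.shape[-1]
    flat_probs = probs.reshape(-1, n_classes)
    flat_seg = segments.ravel()
    n_seg = flat_seg.max() + 1
    sums = np.zeros((n_seg, n_classes))
    np.add.at(sums, flat_seg, flat_probs)            # accumulate per superpixel
    counts = np.bincount(flat_seg, minlength=n_seg)[:, None]
    mean = sums / np.maximum(counts, 1)              # average class distribution
    return mean.argmax(axis=1)[segments]             # broadcast label back to pixels

probs = np.array([[[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]],
                  [[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]])
segments = np.array([[0, 0, 1],
                     [0, 1, 1]])

per_pixel = probs.argmax(axis=-1)
smoothed = label_superpixels(probs, segments)
print(per_pixel)   # [[0 1 1], [0 0 1]]: two pixels disagree with their superpixel
print(smoothed)    # [[0 0 1], [0 1 1]]: labels now follow the superpixel boundaries
```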

15 Scene Labeling Strategies: Conditional Random Fields
- A classical CRF model is constructed on the superpixels.
- Each pixel in the image is a vertex in a graph, edges are added between all neighboring nodes, and an energy function is defined on the graph.
- The CRF energy is minimized using alpha-expansions.
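To make the notion of an energy function concrete, the sketch below evaluates a simple grid CRF energy for two candidate labelings. It assumes a 4-connected pixel grid, negative log-probabilities as the unary term, and a constant Potts penalty `gamma` for neighboring pixels with different labels. These are illustrative choices, not necessarily the exact terms used in the paper, and the alpha-expansion minimizer itself is not implemented here.

```python
import numpy as np

def crf_energy(labels, unary, gamma=1.0):
    """Energy = sum of unary costs + gamma * number of label changes between 4-neighbors."""
    h, w = labels.shape
    rows, cols = np.indices((h, w))
    data_term = unary[rows, cols, labels].sum()
    horizontal = (labels[:, :-1] != labels[:, 1:]).sum()
    vertical = (labels[:-1, :] != labels[1:, :]).sum()
    return data_term + gamma * (horizontal + vertical)

# 3x3 image, 2 classes: every pixel favors class 0 except a slightly noisy center pixel.
probs = np.tile(np.array([0.8, 0.2]), (3, 3, 1))
probs[1, 1] = [0.4, 0.6]
unary = -np.log(probs)

noisy = probs.argmax(axis=-1)              # per-pixel argmax keeps the isolated center label
smooth = np.zeros((3, 3), dtype=int)       # spatially consistent alternative

energy_noisy = crf_energy(noisy, unary, gamma=0.5)
energy_smooth = crf_energy(smooth, unary, gamma=0.5)
print(energy_noisy > energy_smooth)  # True
```

The isolated center label pays for four broken edges, which outweighs its small unary advantage, so a minimizer would prefer the smooth labeling.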

16 Scene Labeling Strategies: Parameter-free Multilevel Parsing
Observation level problem: a parameter-free multilevel parsing method analyzes a family of segmentations and automatically discovers the best observation level for each pixel in the image.

17 Scene Labeling Strategies: Parameter-free Multilevel Parsing (2)
Optimal purity cover: an optimization problem that searches, for each pixel, for the best-adapted neighborhood, i.e. the component $C_k$ containing the pixel that best explains it with the minimum cost $S_k$.
For each pixel $i$, we wish to find the index
$$k^*(i) = \underset{k \,\mid\, i \in C_k}{\operatorname{argmin}} \; S_k$$
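A brute-force sketch of this argmin, assuming the candidate components are given as boolean masks with precomputed costs (the masks, costs, and function name below are hypothetical toy values):

```python
import numpy as np

def optimal_cover(component_masks, costs):
    """For every pixel i, return k*(i) = argmin of S_k over components C_k containing i.

    component_masks: (K, H, W) boolean, True where pixel belongs to component k
    costs:           (K,) cost S_k of each component
    Assumes every pixel lies in at least one component.
    """
    masked = np.where(component_masks, costs[:, None, None], np.inf)
    return masked.argmin(axis=0)

# 1x4 image with four nested/overlapping components.
masks = np.array([[[True, True, True, True]],      # C_0: whole image
                  [[True, True, False, False]],    # C_1: left half
                  [[False, False, True, True]],    # C_2: right half
                  [[False, False, False, True]]])  # C_3: last pixel only
costs = np.array([0.9, 0.2, 0.5, 0.1])

best = optimal_cover(masks, costs)
print(best)  # [[1 1 2 3]]
```

Different pixels end up explained by components of different sizes, which is the sense in which the observation level is chosen per pixel.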

18 Scene Labeling Strategies: Parameter-free Multilevel Parsing (3)
Producing the confidence costs: construction of the cost function $S_k$ that is minimized. Given a set of components $C_k$ and the set of object descriptors $O_k$, we define a function $c: O_k \to [0,1]^{N_c}$ that predicts the distribution $d_k$ of classes present in component $C_k$.
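The slide does not spell out how a predicted class distribution is turned into a cost. One natural purity measure is the entropy of the distribution, which is low when a single class dominates the component. The sketch below assumes that definition; the function name and the example distributions are hypothetical.

```python
import numpy as np

def purity_cost(class_distribution):
    """Entropy of a class distribution, used here as an assumed confidence cost S_k."""
    d = np.asarray(class_distribution, dtype=float)
    d = d[d > 0]                       # 0 * ln(0) is treated as 0
    return float(-np.sum(d * np.log(d)))

pure = purity_cost([1.0, 0.0, 0.0])    # component containing a single class
mixed = purity_cost([0.5, 0.5, 0.0])   # component straddling two classes
print(pure, mixed)
```

Under this assumption, a component that covers exactly one object gets a cost near zero, while a component that mixes classes is penalized.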

19 Scene Labeling Strategies: Parameter-free Multilevel Parsing (4)
Training procedure used for producing the confidence costs: segmentation collections $(T_\tau)$ are constructed on the entire training set and, for all $T_\tau$, the classifier $c$ is trained to predict the distribution of classes in each component, as well as the costs $S_k$.

21 Experiments
Semantic scene understanding results on three different datasets:
- Stanford Background: contains 715 images of outdoor scenes labeled with 8 classes. All images are 320x240 pixels and contain at least one foreground object. 5-fold cross-validation: 572 images used for training and 143 for testing.
- SIFT Flow: composed of 2,688 images, thoroughly labeled by LabelMe users, split into 2,488 training images and 200 test images. Synonym correction is used to obtain 33 semantic labels.
- Barcelona: has 14,871 training and 279 test images. The test set consists of street scenes from Barcelona, while the training set ranges in scene type but has no street scenes from Barcelona. Synonyms in the label set were manually consolidated to 170 unique labels.
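The 572/143 figures for Stanford Background follow directly from splitting 715 images into 5 equal folds. A short sketch of such a split, with an arbitrary random seed and shuffling chosen only for illustration:

```python
import numpy as np

n_images = 715
indices = np.random.default_rng(0).permutation(n_images)
folds = np.array_split(indices, 5)          # 5 folds of 143 images each

splits = []
for k in range(5):
    test = folds[k]                          # one fold held out for testing
    train = np.concatenate([folds[j] for j in range(5) if j != k])
    splits.append((train, test))

print([(len(train), len(test)) for train, test in splits])
# [(572, 143), (572, 143), (572, 143), (572, 143), (572, 143)]
```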

22 Experiments on the Stanford Background dataset

| System | Pixel Acc. | Class Acc. | CT (sec.) |
|---|---|---|---|
| System based on convolutional network alone | 66.0% | 56.5% | 0.35 |
| Multiscale convolutional network with raw pixel prediction | 78.8% | 72.4% | 0.6 |
| Superpixel-based predictions | 80.4% | 74.56% | 0.7 |
| CRF-based predictions | 80.4% | 75.24% | 61 |
| Cover-based predictions | 81.4% | 76.0% | 60.5 |
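The two accuracy columns measure different things. A sketch under the usual definitions, which the slide does not state explicitly: pixel accuracy is the fraction of correctly labeled pixels, and class accuracy is the mean of the per-class accuracies. The toy labels below are invented to show how the two can diverge.

```python
import numpy as np

def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the ground truth."""
    return float(np.mean(pred == truth))

def class_accuracy(pred, truth, n_classes):
    """Mean over classes of the per-class accuracy (classes absent from truth are skipped)."""
    per_class = [np.mean(pred[truth == c] == c)
                 for c in range(n_classes) if np.any(truth == c)]
    return float(np.mean(per_class))

truth = np.array([0] * 8 + [1] * 2)      # class 0 is frequent, class 1 is rare
pred = np.zeros(10, dtype=int)           # a predictor that always outputs class 0

print(pixel_accuracy(pred, truth))       # 0.8
print(class_accuracy(pred, truth, 2))    # 0.5
```

A system biased toward frequent classes can therefore score well on pixel accuracy while scoring poorly on class accuracy.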

23 Experiments on the Stanford Background dataset (2)
[Figure: example labelings. Legend: Building, Sky, Grass, Mountain, Tree, Object]

24 Experiments on the SIFT Flow dataset

| System | Pixel Acc. | Class Acc. |
|---|---|---|
| raw multiscale net | 67.9% | 45.9% |
| multiscale net + superpixels | 71.9% | 50.08% |
| multiscale net + cover (1) | 72.3% | 50.08% |
| multiscale net + cover (2) | 78.5% | 29.6% |

25 Experiments on the SIFT Flow dataset (2)

26 Experiments on the Barcelona dataset

| System | Pixel Acc. | Class Acc. |
|---|---|---|
| raw multiscale net | 37.8% | 12.1% |
| multiscale net + superpixels | 44.1% | 12.4% |
| multiscale net + cover (1) | 46.4% | 12.5% |
| multiscale net + cover (2) | 67.8% | 9.5% |

27 Real World Experiment
Setup: multiscale features combined with classification using the superpixel strategy, trained on the SIFT Flow dataset. The test movie was built from 4 videos stitched together to form a 360° video stream of 1280x256 images.
Result: the system constitutes the first approach achieving real-time performance, with each frame processed in less than a second on a 4-core Intel i7 (with a dedicated FPGA this can be reduced to 60 ms).

28 Real World Experiment: Video, Real-Time Performance

30 Important Insights on the Experiments
- Using a high-capacity feature-learning system fed with raw pixels yields excellent results compared with systems using engineered features.
- Feeding the system with a wide contextual window is critical to the quality of the results.
- When a wide context is taken into account to produce each pixel label, the role of the post-processing is greatly reduced.
- The use of highly sophisticated post-processing schemes does not seem to improve the results significantly over simple schemes.
- Relying heavily on a highly accurate feed-forward pixel labeling system, while simplifying the post-processing module to its bare minimum, cuts down the inference times considerably.

32 Conclusion
- A feed-forward convolutional network can produce state-of-the-art performance on standard scene parsing datasets without relying on engineered features.
- Even in the absence of any post-processing, by simply labeling each pixel with the highest-scoring category produced by the convolutional network for that location, the system yields near state-of-the-art pixel-wise accuracy and better per-class accuracy than all previously published results.
- Results on datasets with few categories are good, but the accuracy of the best existing scene parsing systems is still low when the number of categories is large.

34 Questions?! Discussion

35 Sources

Clément Farabet, Camille Couprie, Laurent Najman, Yann LeCun: Learning Hierarchical Features for Scene Labeling.