RGBd Image Semantic Labelling for Urban Driving Scenes via a DCNN


1 RGBd Image Semantic Labelling for Urban Driving Scenes via a DCNN
Jason Bolito, Research School of Computer Science, ANU
Supervisors: Yiran Zhong & Hongdong Li

2 Outline
1. Motivation and Background
2. Proposed Method
3. Implementation, Experiment and Results
4. Conclusion and Future Work

3 Motivation: Semantic Segmentation
Understanding road scenes. Useful for autonomous cars and drones. Source: Cityscapes dataset.

4 Semantic Segmentation vs. Object Recognition
[Figure: the same street scene labelled two ways. Object Recognition: Person. Semantic Segmentation: Road, Person, Vegetation, Motorcycle. Source: cityscapes-datasets.com]

5 What we want from our method
Leverage both 3D and colour information. Attain more accurate and robust semantic segmentation.

6 Background: RGB Semantic Labelling
Earlier: CRFs built on low-level vision cues. Recently: deep neural networks.

7 Background: Fully Convolutional Networks
Source: FCNs for semantic segmentation by J. Long et al.
Pixels-to-pixels approach. Builds on VGG16 (encoder). Upsamples using deconvolution to get the label map (decoder).

8 Background: Deconvolution Networks
Source: Learning Deconvolution Network for Semantic Segmentation by H. Noh et al.
Expands VGG16 (encoder). Uses unpooling + deconvolution to get the label map (decoder).

9 Background: SegNet
Source: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation by V. Badrinarayanan et al.
Similar encoder-decoder structure. Removes fully connected layers. Prioritises memory efficiency.

10 Background: RGBd Semantic Labelling
HHA representation (Gupta et al., 2014). Hard mutex constraints (Deng et al., 2015). LSTM-CF (Li et al., 2016). FuseNet (Hazirbas et al., 2016).

11 Background (cont'd)
The methods above use depth as an extra input channel. Depth is treated as generic information; 3D structure is neither considered nor learned.

12 Proposed Method: Ideas
Use depth to partially reconstruct the 3D scene. Use 3D convolution to capture structure. Apply an encoder-decoder design to achieve rich segmentation maps.

13 Proposed Method: S3D
[Figure: S3D architecture, an encoder followed by a decoder operating on feature maps. Legend: Conv3D + ReLU; Conv3D (2x stride) + ReLU; Deconv3D + ReLU; Softmax.]

14 S3D building blocks: Input Layer
Input RGB image I is voxelised via disparity map D:

$$I_{3D}(z, x, y, c) := \begin{cases} I(x, y, c), & \text{for } z = \lfloor D(x, y) \rfloor \\ 0, & \text{otherwise} \end{cases}$$

This gives a 2.5D reconstruction of the environment. Points at infinity have disparity 0.
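The voxelisation rule above can be sketched in NumPy. This is a minimal illustration, not the authors' code: the function name `voxelise`, the `depth_bins` parameter, and the choice to clip out-of-range disparities into the last slice are assumptions made for the example.

```python
import numpy as np

def voxelise(image, disparity, depth_bins):
    """Lift an (X, Y, C) image into a (Z, X, Y, C) volume.

    Each pixel is copied into the slice z = floor(D(x, y)); every other
    voxel stays zero. `depth_bins` (the size Z) is a hypothetical parameter.
    """
    X, Y, C = image.shape
    z = np.floor(disparity).astype(int)
    # Assumption: disparities outside [0, Z-1] are clipped into range.
    z = np.clip(z, 0, depth_bins - 1)
    volume = np.zeros((depth_bins, X, Y, C), dtype=image.dtype)
    xs, ys = np.meshgrid(np.arange(X), np.arange(Y), indexing="ij")
    volume[z, xs, ys, :] = image  # one occupied voxel per pixel
    return volume

# Tiny 2x2 RGB image with a hand-written disparity map.
image = np.arange(1, 13).reshape(2, 2, 3)
disparity = np.array([[0.0, 1.7], [2.2, 0.4]])
volume = voxelise(image, disparity, depth_bins=3)
```

Because exactly one voxel per pixel is filled, the volume is very sparse, which is consistent with the "2.5D" description on the slide.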

15 S3D building blocks: Encoder
Feature extraction via 3D convolution:

$$F_{out}(z, x, y, c_{out}) = \sum_{k, i, j, c_{in}} F_{in}(z + k, x + i, y + j, c_{in}) \, K_{c_{out}}(k, i, j, c_{in})$$

Each 3x3x3 filter is a learnable template. High response = input matches template. 3D structure = 3D input + 3D templates.
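A direct NumPy transcription of the convolution sum may help make the indexing concrete. It is a sketch under stated assumptions: the offsets k, i, j are taken to range over -1, 0, 1, the input is zero-padded so the output keeps the input size, and the kernel layout `(C_out, 3, 3, 3, C_in)` is chosen for the example rather than taken from the slides.

```python
import numpy as np

def conv3d(f_in, kernels):
    """Naive 3x3x3 convolution following the slide's formula.

    f_in:    (Z, X, Y, C_in) feature volume
    kernels: (C_out, 3, 3, 3, C_in); array index 0..2 stands for offset -1..1
    returns: (Z, X, Y, C_out)
    """
    Z, X, Y, _ = f_in.shape
    c_out = kernels.shape[0]
    # Assumption: zero padding of one voxel on each spatial side.
    padded = np.pad(f_in, ((1, 1), (1, 1), (1, 1), (0, 0)))
    out = np.zeros((Z, X, Y, c_out))
    for k in range(3):
        for i in range(3):
            for j in range(3):
                # window[z, x, y] equals F_in(z + k - 1, x + i - 1, y + j - 1)
                window = padded[k:k + Z, i:i + X, j:j + Y, :]
                # Sum over c_in for every output channel at once.
                out += window @ kernels[:, k, i, j, :].T
    return out

# A single occupied voxel convolved with an all-ones template.
f = np.zeros((3, 3, 3, 1))
f[1, 1, 1, 0] = 1.0
box = np.ones((1, 3, 3, 3, 1))
response = conv3d(f, box)
```

Real frameworks compute the same sum far more efficiently; the loop form is only meant to mirror the equation term by term.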

16 S3D building blocks: Encoder (cont'd)
Non-linear activation function:

$$\mathrm{ReLU}(x) = \max(0, x)$$

Good gradients for backprop. Learnable downsampling = strided 3D convolution.
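As a rough illustration of the two ideas on this slide, the sketch below applies ReLU elementwise and evaluates a 3x3x3 convolution only at every second voxel, which halves each spatial dimension. The function names, the zero padding, and the kernel layout `(C_out, 3, 3, 3, C_in)` are assumptions for the example.

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x), applied elementwise."""
    return np.maximum(0, x)

def strided_conv3d(f_in, kernels, stride=2):
    """3x3x3 convolution evaluated only at every `stride`-th voxel.

    f_in: (Z, X, Y, C_in); kernels: (C_out, 3, 3, 3, C_in).
    Assumption: zero padding of one voxel on each spatial side.
    """
    Z, X, Y, _ = f_in.shape
    padded = np.pad(f_in, ((1, 1), (1, 1), (1, 1), (0, 0)))
    zs, xs, ys = (range(0, n, stride) for n in (Z, X, Y))
    out = np.zeros((len(zs), len(xs), len(ys), kernels.shape[0]))
    for a, z in enumerate(zs):
        for b, x in enumerate(xs):
            for c, y in enumerate(ys):
                patch = padded[z:z + 3, x:x + 3, y:y + 3, :]
                # Contract the three spatial axes and c_in in one call.
                out[a, b, c] = np.tensordot(kernels, patch, axes=4)
    return out

# A 4x4x4 volume shrinks to 2x2x2 with stride 2.
f = np.ones((4, 4, 4, 1))
centre = np.zeros((1, 3, 3, 3, 1))
centre[0, 1, 1, 1, 0] = 1.0
down = relu(strided_conv3d(f, centre))
```

Unlike fixed pooling, the kernel weights here are free parameters, which is what makes the downsampling learnable.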

17 S3D building blocks: Decoder
3D deconvolution reverses 3D convolution (it is the transpose of the operation, not an exact inverse):

$$F_{out}(z + k, x + i, y + j, c_{out}) \mathrel{+}= \sum_{c_{in}} F_{in}(z, x, y, c_{in}) \, K_{c_{out}}(k, i, j, c_{in})$$

Already implemented as the backward pass of Conv3D. Learnable upsampling = strided 3D deconvolution.
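The scatter-add form of the deconvolution can be written out in NumPy as follows. This is a sketch only: the name `deconv3d`, the stride handling (input voxel z writes to outputs stride·z − 1 … stride·z + 1), the border cropping, and the kernel layout `(C_out, 3, 3, 3, C_in)` are assumptions, not details given on the slide.

```python
import numpy as np

def deconv3d(f_in, kernels, stride=2):
    """Strided 3x3x3 transposed convolution via scatter-add.

    f_in:    (Z, X, Y, C_in)
    kernels: (C_out, 3, 3, 3, C_in); array index 0..2 stands for offset -1..1
    returns: (stride*Z, stride*X, stride*Y, C_out)
    """
    Z, X, Y, _ = f_in.shape
    c_out = kernels.shape[0]
    # One extra voxel per side catches contributions that fall off the border.
    out = np.zeros((stride * Z + 2, stride * X + 2, stride * Y + 2, c_out))
    for k in range(3):
        for i in range(3):
            for j in range(3):
                # Sum over c_in: each input voxel's contribution at this offset.
                contrib = f_in @ kernels[:, k, i, j, :].T
                out[k:k + stride * Z:stride,
                    i:i + stride * X:stride,
                    j:j + stride * Y:stride] += contrib
    return out[1:-1, 1:-1, 1:-1]  # crop the border back off

# One active voxel spreads into a 3x3x3 block of the upsampled volume.
f = np.zeros((2, 2, 2, 1))
f[1, 1, 1, 0] = 1.0
ones = np.ones((1, 3, 3, 3, 1))
up = deconv3d(f, ones)
```

Each input voxel "stamps" a weighted copy of the kernel into the output, so with stride 2 the volume doubles in every spatial dimension.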

18 S3D building blocks: Decoder (cont'd)
Skip layers (top-down modulation). Shallow = low-level features. Deep = high-level knowledge. Helps with convergence and refines features.
[Figure: skip layers between the Conv3D and Deconv3D stages.]

19 S3D building blocks: Inference
Use softmax to get a probability cube:

$$\hat{P}(z, x, y, c) := \frac{\exp(f(z, x, y, c))}{\sum_{c' \in \text{Classes}} \exp(f(z, x, y, c'))}$$

Argmax over classes to get 3D labels:

$$\hat{L}_{3D}(z, x, y) := \operatorname*{argmax}_{c \in \text{Classes}} \hat{P}(z, x, y, c)$$

Project using D to get 2D labels:

$$\hat{L}(x, y) := \hat{L}_{3D}(\lfloor D(x, y) \rfloor, x, y)$$
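The three inference steps (softmax, argmax, projection through the disparity map) can be sketched in NumPy. The function name `infer_labels`, the max-subtraction for numerical stability, and the clipping of disparities to the volume depth are assumptions added for the example.

```python
import numpy as np

def infer_labels(scores, disparity):
    """Turn per-voxel class scores f into 3D and 2D label maps.

    scores:    (Z, X, Y, num_classes) network outputs f(z, x, y, c)
    disparity: (X, Y) disparity map D
    """
    # Softmax over the class axis; subtracting the max avoids overflow.
    shifted = scores - scores.max(axis=-1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=-1, keepdims=True)

    # Argmax over classes gives one label per voxel.
    labels_3d = probs.argmax(axis=-1)

    # Read each pixel's label from the voxel at z = floor(D(x, y)).
    Z, X, Y = labels_3d.shape
    z = np.clip(np.floor(disparity).astype(int), 0, Z - 1)  # assumption: clip
    xs, ys = np.meshgrid(np.arange(X), np.arange(Y), indexing="ij")
    labels_2d = labels_3d[z, xs, ys]
    return probs, labels_3d, labels_2d

# Two voxels get a strong score for classes 2 and 1 respectively.
scores = np.zeros((2, 2, 2, 3))
scores[1, 0, 0, 2] = 5.0
scores[0, 1, 1, 1] = 5.0
disparity = np.array([[1.5, 0.0], [0.0, 0.2]])
probs, labels_3d, labels_2d = infer_labels(scores, disparity)
```

The projection uses the same floor(D) lookup as the input layer, so each pixel reads its label from the voxel it was originally written into.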

20 Implementation
Implemented using a deep-learning facade API and TensorFlow.

21 Experiment and Results
Dataset: Cityscapes (urban scene dataset). Splits: 2795 training / 500 test images over 50 cities. GPU: Nvidia GeForce Titan X Pascal.

22 Experiment and Results (cont'd)
Image size: 128x64x128. Iterations: around 30k. Results (state of the art has mIoU = 80.1%):
[Table: columns G, mIoU, C, G test, mIoU test, C test; the values are missing from this transcript.]
Learning feature extraction takes a while. Can we do better?
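For reference, mIoU (mean intersection over union) can be computed from a confusion matrix as sketched below. The slides do not define the G and C columns; the code assumes they stand for global pixel accuracy and class-average accuracy, which is a common convention in segmentation papers but is an assumption here. The function name is hypothetical.

```python
import numpy as np

def segmentation_metrics(pred, truth, num_classes):
    """Return (global accuracy, class-average accuracy, mIoU).

    pred, truth: integer label arrays of the same shape.
    Assumption: G = global accuracy and C = class-average accuracy.
    """
    # conf[t, p] counts pixels of true class t predicted as class p.
    conf = np.bincount(
        num_classes * truth.ravel() + pred.ravel(),
        minlength=num_classes ** 2,
    ).reshape(num_classes, num_classes)
    tp = np.diag(conf)
    truth_count = conf.sum(axis=1)
    union = conf.sum(axis=0) + truth_count - tp

    global_acc = tp.sum() / conf.sum()
    seen = truth_count > 0   # skip classes absent from the ground truth
    class_acc = np.mean(tp[seen] / truth_count[seen])
    present = union > 0      # skip classes absent from both maps
    miou = np.mean(tp[present] / union[present])
    return global_acc, class_acc, miou

truth = np.array([0, 0, 1, 1])
pred = np.array([0, 1, 1, 1])
g_acc, c_acc, miou = segmentation_metrics(pred, truth, num_classes=2)
```

In this toy case class 0 has IoU 1/2 and class 1 has IoU 2/3, so the mean is 7/12.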

23 Experiment and Results (cont'd)
Trick: let a pre-trained 2D DCNN do the feature extraction, then use S3D on the extracted features.
[Table: columns Method, G, mIoU, C, G test, mIoU test, C test, time/it (s); three S3D-ResNet rows; the values are missing from this transcript.]
Not state of the art, but matches DeepLab (71.4%)! Depth accuracy/efficiency trade-off!

24 Conclusion and Future Work
Presented a DNN solution for semantic segmentation. The solution fully utilises 3D structure. Achieves good results, especially when used on pre-extracted features. Good results achieved without any extra goodies (CRFs, data augmentation, ...)! There is plenty of room for improvement!

25 Conclusion and Future Work (cont'd)
Need to push S3D to the limit. Can be done with post-processing, balancing, upsampling, ... What happens when we generalise one of the other architectures to 3D?

26 Questions? Thank you!
