Stereo Matching, Optical Flow, Filling the Gaps and more



1 Stereo Matching, Optical Flow, Filling the Gaps and more. Prof. Lior Wolf, The School of Computer Science, Tel-Aviv University. ICRI-CI 2017 Retreat, May 9, 2017.

2 Since last year, ICRI-CI supported projects of 8 students!
Shay Zweig, L. Wolf. InterpoNet, a Brain Inspired Neural Network for Optical Flow Dense Interpolation. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
Aviv Eisenschtat, L. Wolf. Linking Image and Text with 2-Way Nets. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
Amit Shaked, L. Wolf. Improved Stereo Matching with Constant Highway Networks and Reflective Confidence Learning. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
Tal Schuster, L. Wolf, David Gadot. Optical Flow Requires Multiple Strategies (But Only One Network). IEEE Conf. on Computer Vision and Pattern Recognition (CVPR).
Dotan Kaufman, Gil Levi, Tal Hassner, L. Wolf. Temporal Tessellation: A Unified Approach for Video Analysis. In submission.
Ofir Press, L. Wolf. Using the Output Embedding to Improve Language Models. European Chapter of the Association for Computational Linguistics (EACL). Short paper, 2017.

3 Optical flow The problem: Estimating a dense correspondence field between two images - usually consecutive video frames.

4 Optical flow modern pipeline. A two-stage method: sparse matching followed by dense interpolation.

5 PatchBatch - Overall Pipeline

6 PatchBatch Minor Improvements: hinge loss instead of DrLIM; keeping the additional SD component.

7 Optical Flow as a Multifaceted Problem. Methods keep failing on large displacements (MPI-Sintel top results; KITTI 2015 average error, foreground % vs. background %). Possible causes: matching algorithm; descriptor quality. Figure: PatchBatch on KITTI 2012, distance between true matches.

8 Distractors by displacement. #distractors: the number of pixels within a 25px radius whose descriptors are more similar than the true match's descriptor. The count increases with the displacement range. Goal: improve results for large displacements without degrading the other ranges.
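The distractor count can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: descriptors are compared by Euclidean distance, the 25px radius is centred on the true match, and the names `count_distractors` and `target_descs` are hypothetical, not taken from the PatchBatch code.

```python
import math

def descriptor_distance(a, b):
    """Euclidean distance between two descriptors (assumed similarity measure)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def count_distractors(src_desc, target_descs, true_pos, radius=25):
    """Count target pixels near the true match that look more similar than it.

    src_desc     -- descriptor of the source pixel
    target_descs -- dict mapping (x, y) in the second image to a descriptor
    true_pos     -- (x, y) of the ground-truth match in the second image
    radius       -- search radius in pixels (25 on the slide)
    """
    true_dist = descriptor_distance(src_desc, target_descs[true_pos])
    count = 0
    for pos, desc in target_descs.items():
        if pos == true_pos:
            continue
        # Skip pixels outside the search radius around the true match.
        if math.hypot(pos[0] - true_pos[0], pos[1] - true_pos[1]) > radius:
            continue
        if descriptor_distance(src_desc, desc) < true_dist:
            count += 1
    return count

example_targets = {
    (10, 10): [0.5, 0.0],    # true match
    (12, 10): [0.9, 0.0],    # closer to the source descriptor: a distractor
    (10, 20): [0.0, 1.0],    # inside the radius but less similar
    (100, 100): [1.0, 0.0],  # identical descriptor, but outside the radius
}
print(count_distractors([1.0, 0.0], example_targets, (10, 10)))
```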

9 Distractors by displacement. Goal: improve results for large displacements without degrading the other ranges. Is it possible? Expert models: training only on sub-ranges. Improving results for large displacements is possible. This implies the need for different features for different descriptors.

10 Need for varying extraction strategies. Large motions mostly correlate with larger changes in appearance: 1. Background changes 2. Viewpoint changes -> occluded parts 3. Distance and angle to light source -> illumination 4. Scale (when moving along the Z-axis)

11 Learning for Multiple Strategies and Varying Difficulty. Deal with varying difficulty:
Curriculum (Bengio et al.): samples are pre-ordered; curriculum by displacement; curriculum by distance (of false sample).
Self-Paced (Kumar et al.): no need to pre-order; sample hardness increases with time (by loss value).
Hard Sample Mining (Simo-Serra et al.): backpropagate only some ratio of harder samples; used for training local descriptors with triplets.
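Hard sample mining, as summarised on this slide, keeps only a fraction of the hardest samples for backpropagation. The sketch below shows the selection step on a batch of per-sample losses. The function names and the `keep_ratio` parameter are illustrative assumptions; the slide does not say which ratio was used.

```python
def hard_sample_mask(losses, keep_ratio):
    """Return a 0/1 mask that keeps the `keep_ratio` fraction of highest losses."""
    if not losses:
        raise ValueError("losses must not be empty")
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, int(len(losses) * keep_ratio))
    # Indices sorted from hardest (largest loss) to easiest.
    hardest = sorted(range(len(losses)), key=lambda i: losses[i], reverse=True)[:k]
    keep = set(hardest)
    return [1.0 if i in keep else 0.0 for i in range(len(losses))]

def mined_loss(losses, keep_ratio):
    """Average loss over the kept (hard) samples only."""
    mask = hard_sample_mask(losses, keep_ratio)
    return sum(m * l for m, l in zip(mask, losses)) / sum(mask)

batch_losses = [0.1, 0.9, 0.4, 0.7]
print(hard_sample_mask(batch_losses, 0.5))  # keeps the samples with losses 0.9 and 0.7
```

In a training loop the mask would multiply the per-sample losses before the backward pass, so the easy samples contribute no gradient.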


12 Interleaving Learning. Goal: deal with multiple sub-tasks. Classification: painting to artist. Motivated by psychological research (Kornell and Bjork): blocking (massing) vs. spacing (interleaving). <<Unintuitive, and goes against previous work>> Experiments on classification tasks, sports, etc. Learning ML models: sample order is usually random; applying gradual methods can affect randomness. Reference: Learning Concepts and Categories, Kornell and Bjork (2008).

13 Interleaving Learning for Optical Flow Controlling the negative sample to balance difficulty

14 Interleaving Learning for Optical Flow

15 Self-Paced Curriculum Interleaving Learning (SPCI). $l_i$: validation loss at epoch $i$; $l_{init}$: initial loss value (epoch #5); $m$: total number of epochs.

16 MNIST. L = 0..4, H = 5..9. Random noise on the top half of H and the bottom of L. Images from H were rotated by an angle in [0, 45], correlated with the noise amount. A general learning paradigm.

17 KITTI2012, KITTI2015, MPI-Sintel. SOTA on KITTI; what about MPI-Sintel?

18 Optical flow modern pipeline. A two-stage method: sparse matching followed by dense interpolation. Sparse-to-dense interpolation: EpicFlow, edge-preserving interpolation (Revaud et al. 2015).

19 Research goal: construct a CNN-based solution for sparse-to-dense OF interpolation. Motivation: allows more flexibility, increases performance, and gives a faster runtime.


20 Interpolation in the brain: an inspiration. Perceptual filling-in. Neuronal filling-in (Zweig et al. 2015). Spatial propagation: top-down and lateral connections (Zurawel et al. 2014, Zweig et al. 2015, Huang et al. 2008, Poort et al. 2012). Edges as a barrier (von der Heydt et al. 2003). Multilayer process (Poort et al. 2012, Meng et al. 2005).

21 InterpoNet architecture overview: a fully convolutional network with no pooling.

22 InterpoNet input. The edges input boosts the network's performance: the network uses the edges as a boundary for propagation.

23 InterpoNet main branch: 10 layers of 7x7 convolutions; no pooling; ELU non-linearity (Clevert et al. 2015).
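For reference, the ELU non-linearity named on this slide is the identity for positive inputs and a saturating exponential for negative ones. A minimal sketch, assuming the common default `alpha = 1.0` (the slide does not state the value used):

```python
import math

def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

print([elu(v) for v in (-2.0, 0.0, 2.0)])
```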

24 Loss function. Standard EPE (end-point error) loss.

25 Lateral dependency loss. Main concept: include the local context in the training process. Encourages smoothness and edginess.
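The end-point error (EPE) is the Euclidean distance between a predicted flow vector and the ground-truth one, averaged over pixels. A minimal sketch, with flow fields given as flat lists of `(u, v)` pairs; this layout and the function name are illustrative choices, not the InterpoNet implementation:

```python
import math

def endpoint_error(pred_flow, gt_flow):
    """Mean Euclidean distance between predicted and ground-truth flow vectors."""
    if len(pred_flow) != len(gt_flow) or not pred_flow:
        raise ValueError("flow fields must be non-empty and of equal length")
    total = 0.0
    for (pu, pv), (gu, gv) in zip(pred_flow, gt_flow):
        total += math.hypot(pu - gu, pv - gv)
    return total / len(pred_flow)

pred = [(1.0, 0.0), (0.0, 2.0)]
gt = [(1.0, 0.0), (3.0, 6.0)]
print(endpoint_error(pred, gt))  # per-pixel errors are 0 and 5, so the mean is 2.5
```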

26 Detour networks and multi-layer loss Supervision at each layer

27 Detour networks and multi-layer loss

28 Benchmark results. SOTA on Sintel and KITTI 2012. Improving over EpicFlow for all the underlying matching algorithms we checked (4 of the leading algorithms).

29 The network learned to interpolate in a similar manner to the visual system

30 Stereo Matching. Applications: 3D scene reconstruction, robotics, autonomous cars, augmented reality. Major challenges: occlusions, highly reflective regions, sparse texture regions, repetitive patterns.

31 Previous work [Zbontar and LeCun 15] Employ CNN to compute the matching cost for each possible disparity

32 Previous work [Zbontar and LeCun 15]: employ a CNN to compute the matching cost for each possible disparity; apply cost aggregation and smoothness constraints; use the winner-takes-all rule to compute the disparity image; refine the obtained image.
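The winner-takes-all step in this pipeline picks, for every pixel, the disparity with the lowest matching cost. A minimal sketch on a nested-list cost volume indexed as `cost_volume[row][col][disparity]`; the layout and the function name are assumptions made for illustration:

```python
def winner_takes_all(cost_volume):
    """Return the disparity image: per-pixel index of the minimum matching cost."""
    return [
        [min(range(len(costs)), key=costs.__getitem__) for costs in row]
        for row in cost_volume
    ]

# One image row with two pixels and three candidate disparities each.
example_volume = [[[0.9, 0.2, 0.5], [0.1, 0.8, 0.7]]]
print(winner_takes_all(example_volume))
```

The Global Disparity Network described later in the talk is presented as a learned replacement for this argmin rule.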

33 Research questions (table: Motivation / Research question / Our solution)
Motivation: using color information does not improve the quality of the disparity maps; adding more layers does not help; stacking residual layers does not converge to a meaningful solution. Research question: design a residual architecture that is more suitable for metric learning vs. multiclass classification. Our solution: Multilevel Constant Highway Network.
Motivation: poor results on reflective and occluded regions. Research question: providing a solution for reflective and occluded regions. Our solution: apply a learned criterion and replace the WTA approach (Global Disparity Network).
Motivation: occluded pixels and mismatch predictions are still common; they can mostly be assessed from their neighbors. Research question: how to measure certainty of classifiers? Our solution: reflective learning.

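34 Multilevel constant Highway Network. Diagram labels: basic residual block; constant highway skip-connection; outer λ-residual block. Layer sequence shown: Conv 112,112 -> ReLU -> Conv 112,112 -> Add -> Conv 112,112 -> ReLU -> Conv 112,112 -> Add -> Add. Symbols: $f_1$, $f_2$; $\lambda_0$, $\lambda_1$, $\lambda_2$; $y_0$, $y_1$, $y_2$.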
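One way to read the diagram labels is that each skip-connection is scaled by a constant, so an inner block computes $y_{i+1} = f_{i+1}(y_i) + \lambda_{i+1} y_i$, and an outer block wraps two inner blocks in a further skip scaled by $\lambda_0$. The sketch below follows that reading on plain Python lists. It is an interpretation of the slide rather than the authors' implementation: the layer functions are stand-ins for the Conv-ReLU-Conv stacks, and how the λ values are set is not specified here.

```python
def lambda_residual(f, lam):
    """Wrap f so that the block computes f(y) + lam * y element-wise."""
    def block(y):
        return [fy + lam * yi for fy, yi in zip(f(y), y)]
    return block

def outer_block(f1, f2, lam0, lam1, lam2):
    """Two inner lambda-residual blocks inside an outer lambda-scaled skip."""
    inner1 = lambda_residual(f1, lam1)
    inner2 = lambda_residual(f2, lam2)
    return lambda_residual(lambda y: inner2(inner1(y)), lam0)

# Stand-in layer functions (hypothetical; real blocks would be convolutions).
def double(values):
    return [2.0 * v for v in values]

def add_one(values):
    return [v + 1.0 for v in values]

block = outer_block(double, add_one, lam0=0.5, lam1=1.0, lam2=2.0)
print(block([1.0]))
```

With all λ set to 1 this reduces to ordinary nested residual blocks; with λ set to 0 the skip-connections vanish.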


35 Multilevel constant Highway Network.
A: Outer λ-residual block: Conv 112,112 -> ReLU -> Conv 112,112 -> Add -> Conv 112,112 -> ReLU -> Conv 112,112 -> Add -> Add ($f_1$, $f_2$; $\lambda_0$, $\lambda_1$, $\lambda_2$; $y_0$, $y_1$, $y_2$).
B: Description Network: Input 11X11X3; Conv1 3 -> 112 with ReLU; Outerblock1 to Outerblock5; intermediate size 9X9X112; further Conv -> 112 layers with ReLU; Descriptor 1X1X112.
C: Full network in training: Concat 2X112 -> 224; FC 224 -> 384, ReLU; FC 384 -> 384, ReLU; FC 384 -> 384, ReLU; FC 384 -> 384, ReLU; FC 384 -> 1. Hybrid Loss: Sigmoid BCE and Dot Product Hinge.

36 Global Disparity Network. Figure labels: Matching Cost Network, Global Disparity Network; axes: height, width. Training: (figure).

37 Global Disparity Network. Training: (figure; axes: height, width). Criterion [2]. [2] The criterion is similar to W. Luo, A. Schwing, and R. Urtasun: Efficient deep learning for stereo matching.

38 Global Disparity Network. Architecture: (figure). Reflective confidence: $y^{GT}_{ref} = 1$ if $|\arg\max_i y_i - y^{GT}| < \lambda$, and $0$ otherwise. Loss: $loss(y_{ref}, y^{GT}_{ref}) = -(1 - y^{GT}_{ref}) \ln(1 - y_{ref}) - y^{GT}_{ref} \ln(y_{ref})$.
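The reflective-confidence idea on this slide can be sketched as follows: the confidence target is 1 when the network's own top-scoring disparity lies within a threshold λ of the ground truth, and the confidence output is trained with binary cross-entropy against that target. The absolute difference in the label is a reading of the garbled slide formula, and the function names and the `eps` clamp are illustrative assumptions.

```python
import math

def reflective_label(scores, gt_disparity, lam):
    """1.0 if the arg-max disparity is within lam of the ground truth, else 0.0."""
    predicted = max(range(len(scores)), key=scores.__getitem__)
    return 1.0 if abs(predicted - gt_disparity) < lam else 0.0

def reflective_loss(y_ref, y_ref_gt, eps=1e-12):
    """Binary cross-entropy between predicted confidence and reflective label."""
    y = min(max(y_ref, eps), 1.0 - eps)  # clamp to keep the logarithms finite
    return -(1.0 - y_ref_gt) * math.log(1.0 - y) - y_ref_gt * math.log(y)

scores = [0.1, 0.7, 0.2]  # per-disparity scores for one pixel
label = reflective_label(scores, gt_disparity=2, lam=2)
print(label, reflective_loss(0.5, label))
```

The label depends on the network's current prediction, so it changes during training; that self-reference is what the slide calls reflective.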

39 Outlier detection and interpolation. Pixel labeling, where: $C_L(p)$ is the confidence score at position $p$ of the prediction $d = D_L(p)$; $C_L(pd)$ is the confidence score at position $pd$ of the prediction $d = D_L(pd)$. Pixel interpolation: Mismatch: the median of the nearest neighbors labeled as correct from 16 different directions. Occlusion: move left until the first correct pixel and use its value.
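The occlusion rule (move left until the first correct pixel and use its value) can be sketched for a single scanline. The string labels `"correct"`, `"mismatch"` and `"occlusion"` and the function name are hypothetical; the mismatch rule (median over 16 directions) is not covered by this sketch.

```python
def fill_occlusions_row(disparities, labels):
    """Replace occluded disparities with the nearest correct value to the left."""
    filled = list(disparities)
    for x, label in enumerate(labels):
        if label != "occlusion":
            continue
        left = x - 1
        while left >= 0 and labels[left] != "correct":
            left -= 1
        if left >= 0:  # leave the pixel unchanged if no correct pixel exists
            filled[x] = disparities[left]
    return filled

row = [5, 0, 0, 7, 0]
row_labels = ["correct", "occlusion", "occlusion", "correct", "occlusion"]
print(fill_occlusions_row(row, row_labels))
```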

40 Results Benchmark results Fastest methods

41 Results Residual networks comparison

42 Results Confidence measures comparison

43 Theme II: Vision and Language. A camera is viewing a scene and outputs a textual description of the activity over time. The engine learns from pairs of the form image + caption (weak supervision). Example captions: "One child is opening a cabinet while the other kid is talking on the phone"; "The kid is checking the refrigerator"; "The girl is operating the microwave".

44 Goal

45 Goal

46 Model: 2-Way Net. Loss: $L = \|H(x) - y\| + \|\hat{H}(y) - x\| + \|H_j(x) - \hat{H}_j(y)\|$, where $H(x)$ and $\hat{H}(y)$ are reconstruction outputs and $H_j(x)$ and $\hat{H}_j(y)$ are mid-network representations.
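Reading the slide's loss as a sum of three distance terms (two reconstruction terms and one term tying the mid-network representations together), it can be sketched as below. The Euclidean norm and all argument names are assumptions; the slide does not state which norm is used or how the terms are weighted.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

def two_way_loss(h_x, y, h_hat_y, x, h_j_x, h_hat_j_y):
    """Sum of the x->y reconstruction, y->x reconstruction and mid-layer terms.

    h_x       -- output H(x) of the x -> y channel
    h_hat_y   -- output H^(y) of the y -> x channel
    h_j_x     -- mid-network representation H_j(x)
    h_hat_j_y -- mid-network representation H^_j(y)
    """
    return (l2_distance(h_x, y)
            + l2_distance(h_hat_y, x)
            + l2_distance(h_j_x, h_hat_j_y))

print(two_way_loss([1.0, 0.0], [1.0, 0.0],
                   [0.0, 3.0], [4.0, 0.0],
                   [1.0, 1.0], [1.0, 1.0]))
```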

47 Architecture: dense layer with shared weights, followed by Highly Leaky ReLU; Batch Normalization layer with variance injection; Tied Dropout layer; Locally Dense layer for high-dimensional data.

48 Results. Image representation: VGG representation layer. Sentence representation: GMM-HGLMM Fisher vector pooling, from Klein et al., 2014. Recall is measured on top-1 and top-5 ranked matches.
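Recall at top-k, as used for these retrieval results, is the fraction of queries whose correct match appears among the k highest-ranked candidates. A minimal sketch, assuming a square similarity matrix in which the correct candidate for query `i` is candidate `i` (an illustrative convention, not necessarily the evaluation protocol used in the talk):

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose true match (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)

example_similarity = [
    [0.9, 0.1, 0.0],  # query 0: true match ranked first
    [0.8, 0.5, 0.1],  # query 1: true match ranked second
    [0.2, 0.3, 0.1],  # query 2: true match ranked last
]
print(recall_at_k(example_similarity, 1), recall_at_k(example_similarity, 2))
```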

49 Examples - COCO

50 Examples - COCO

51 Task I: Video Annotation. Example: "Now at a restaurant a waitress serves some food".

52 Task II: Video Summary. Input: raw video. Output: video summary.

53 Task III: Video Action Detection. Detecting a baseball pitch. Green: machine detection. Blue: ground truth.

54 Tessellation

55 Local Tessellation

56 Unsupervised Tessellation

57 Supervised Tessellation

58 Examples

59 Results

60 Results Video Captioning

61 Results Video Summary

62 Results - Action Detection


63 State of the Art Stereo Matching (diagram: Description Network, outer λ-residual block, full network in training with Hybrid Loss). InterpoNet: a network-based interpolation. Interleaving Learning for Optical Flow. Tessellation. 2-Way Net: Matching Text with Images. Huge thank you to [name lost in extraction] for enabling our work.
