Learning-based Localization

Size: px

Start display at page:

Download "Learning-based Localization"

Steven Small
5 years ago
Views:

1 Learning-based Localization Eric Brachmann ECCV 2018 Tutorial on Visual Localization - Feature-based vs. Learned Approaches Torsten Sattler, Eric Brachmann

2 Roadmap Machine Learning Basics [10min] Convolutional Neural Networks Random Forests Camera Pose Regression [15min] Pose Parametrization Pose Loss Functions Results Scene Coordinate Regression [35min] Scene Coordinate Regression Forests Scene Coordinate Regression CNNs End-to-End Learning Results 2

3 Roadmap Machine Learning Basics [10min] Convolutional Neural Networks Random Forests Camera Pose Regression [15min] Pose Parametrization Pose Loss Functions Results Scene Coordinate Regression [35min] Scene Coordinate Regression Forests Scene Coordinate Regression CNNs End-to-End Learning Results 3

4 Machine Learning Monday Wednesday Saturday Sunday Wednesday y = f(i, w) Saturday Sunday Training Set 4

5 Convolutional Neural Networks w 1 5

6 Convolutional Neural Networks w 1 6

7 Convolutional Neural Networks w 1 f 1 I, w 1 7

8 Convolutional Neural Networks w w 2 f(x) x f 2 f 1 I, w 1, w 2 8

9 Convolutional Neural Networks Sunday Saturday Wednesday Monday y = f 3 (f 2 f 1 I, w 1, w 2, w 3 ) Sunday l(y) δl δw 1 = δl δf 3 δf 3 δf 2 δf 2 δf 1 δf 1 δw 1 9

10 Convolutional Neural Networks w [Long et al., Fully convolutional networks for semantic segmentation. CVPR15] 10

11 Convolutional Neural Networks w [12 Scenes Dataset: 11

12 Convolutional Neural Networks [Kokkinos, UberNet: Training a Universal Convolutional Neural Network for Low-, Mid-, and High-Level Vision using Diverse Datasets and Limited Memory, CVPR17] 12

13 Convolutional Neural Networks Advantages: Expensive but Parallelizable Powerful due to End-To-End Training Image I Feature Extraction (hand-crafted or learned) Classification (hand-crafted or learned) Output Image I Learned Output Disadvantages: Interpretability Training Data 13

14 Roadmap Machine Learning Basics [10min] Convolutional Neural Networks Random Forests Camera Pose Regression [15min] Pose Parametrization Pose Loss Functions Results Scene Coordinate Regression [35min] Scene Coordinate Regression Forests Scene Coordinate Regression CNNs End-to-End Learning Results 14

15 Random Forests g awake (w 1 ) = 0.5 g smile w 2 = 0.7 g hat (w 3 ) = 0.0 g awake > 0 g smile > 0 Sunday Saturday Monday y = f(i, w) 15

16 Random Forests g awake > 0 Monday Saturday Sunday Saturday Monday Sunday 16

17 Random Forests g awake > 0 Monday Saturday Sunday g smile > 0 Sunday Monday Wednesday Saturday Sunday Saturday Monday Wednesday Saturday Sunday Monday Monday Wednesday Saturday Sunday 17

18 Random Forests [Shotton et al., Real-Time Human Pose Recognition in Parts from Single Depth Images, CVPR11] 18

embeddings of IMAGENET) Rule 34 (of computer science after 2012): There is a paper about

19 Random Forests Advantages Fast Good Performance Need less training data Disadvantages No end-to-end training* Hand-crafted features* Interpretability (Andrej Karpathy, t-sne embeddings of IMAGENET) Rule 34 (of computer science after 2012): There is a paper about it, no exceptions. * Not in [Kontschieder et al., Deep Neural Decision Forests, ICCV15] 19

20 Roadmap Machine Learning Basics [10min] Convolutional Neural Networks Random Forests Camera Pose Regression [15min] Pose Parametrization Pose Loss Functions Results Scene Coordinate Regression [35min] Scene Coordinate Regression Forests Scene Coordinate Regression CNNs End-to-End Learning Results 20

21 Camera Pose Regression r 1 r 2 r 3 t 1 t 2 t 3 = መh Annotated Training Data Input: RGB Image Feed-Forward CNN Output: Pose [7 Scenes Dataset: 21

22 Roadmap Machine Learning Basics [10min] Convolutional Neural Networks Random Forests Camera Pose Regression [15min] Pose Parametrization Pose Loss Functions Results Scene Coordinate Regression [35min] Scene Coordinate Regression Forests Scene Coordinate Regression CNNs End-to-End Learning Results 22

23 Pose Parametrization More infos about the wonderful world of SO(3) in e.g. [Hartley et al., Rotation Averaging, IJCV13] መh SE(3) with SE 3 = { R t 0 1 R SO 3, t R3 } R SO 3 = {R R 3 3 RR T = I, det R = 1} መh Rotation Matrix R = r 1 r 2 r 3 r 4 r 5 r 6 r 7 r 8 r 9 Unit Quaternion e.g. [1] q = (c, v 1, v 2, v 3 ) Axis-Angle e.g. [2] log R = θ u = (u 1, u 2, u 3 ) Log Unit Quaternion e.g. [3] log q = 2 log R [4] Enforce/Map: RR T = I, det R = 1 Enforce/Map: q = 1 Enforce/Map: Enforce/Map: [1] Kendall et al., PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, ICCV15 [2] Brachmann et al., DSAC - Differentiable RANSAC for Camera Localization, CVPR17 [3] Brahmbhatt et al., Geometry-Aware Learning of Maps for Camera Localization, CVPR18 [4] Sola, Quaternion kinematics for the error-state Kalman filter,

24 Roadmap Machine Learning Basics [10min] Convolutional Neural Networks Random Forests Camera Pose Regression [15min] Pose Parametrization Pose Loss Functions Results Scene Coordinate Regression [35min] Scene Coordinate Regression Forests Scene Coordinate Regression CNNs End-to-End Learning Results 24

25 Pose Loss Functions h = (R, t) l t, t = t t θ How to measure rotation error? Angular Distance [1]: θ R, R = log(r R T ) R R q θ Quaternion Distance[2]: q q Angle-Axis Distance, Log Quaternion Distance[3]: log R log R q q q log R log R log TR log TR [4] [1] Brachmann et al., DSAC - Differentiable RANSAC for Camera Localization, CVPR17 [2] Kendall et al., PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, ICCV15 [3] Brahmbhatt et al., Geometry-Aware Learning of Maps for Camera Localization, CVPR18 [4] Hartley et al., Rotation Averaging, IJCV13 25

26 Pose Loss Functions How to combine rotation error and translation error? Implicit/Metric [1]: Hand-Tuned [2]: l β h, h = l t, t + βl R, R Self-Tuned [3]: l h, h = l t, t + l R, R or max(l t, t, l R, R ) l σ 2 h, h = l t, t exp s t + s t + l R, R exp s R + s R [1] Brachmann et al., DSAC - Differentiable RANSAC for Camera Localization, CVPR17 [2] Kendall et al., PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization, ICCV15 [3] Kendall and Cipolla, Geometric Loss Functions for Camera Pose Regression with Deep Learning, CVPR 2017 s = log σ 2 26

27 Pose Loss Functions How to combine rotation error and translation error? Measure the reprojection error [1]: l π h, h = v M π h, v π(h, v) [1] Kendall and Cipolla, Geometric Loss Functions for Camera Pose Regression with Deep Learning, CVPR

28 Roadmap Machine Learning Basics [10min] Convolutional Neural Networks Random Forests Camera Pose Regression [15min] Pose Parametrization Pose Loss Functions Results Scene Coordinate Regression [35min] Scene Coordinate Regression Forests Scene Coordinate Regression CNNs End-to-End Learning Results 28

29 PoseNet Discussion [7 Scenes Dataset: PoseNet [1]: ~45cm, 10 (GoogLeNet, Quaternions, l β ) Spatial LSTM [2]: ~31cm, 10 (GoogLeNet+LSTM, Quaternions, l β ) PoseNet [3]: ~23cm, 8 (GoogLeNet, Quaternions, l σ 2) PoseNet [3]: ~23cm, 8 (GoogLeNet, Quaternions, l π ) PoseNet [4]: ~23cm, 8 (ResNet, Quaternions, l π ) PoseNet [4]: ~22cm, 8 (ResNet, log Quat., l σ 2) MapNet+ [4]: ~19cm, 7 (ResNet, log Quat., l σ 2+l Rel ) [1] PoseNet: A Convolutional Network for Real-Time 6-DoF Camera Localization, Kendall et al., ICCV 2015 [2] Image-Based Localization with Spatial LSTMs, Walch et al., ICCV 2017 [3] Geometric Loss Functions for Camera Pose Regression with Deep Learning, Kendall and Cipolla, CVPR 2017 [4] Geometry-Aware Learning of Maps for Camera Localization, Brahmbhatt et al., CVPR18 [5] Efficient & Effective Prioritized Matching for Large-Scale Image-Based Localization, Sattler et al., PAMI 2017 Sparse Features: ~5cm, 2 Advantages PoseNet: Simple End-To-End Learning Fast 29

estimate of last frame Tracking Accuracy: ~5cm, 4 [1] Deep

30 Side Comment: Visual Odometry Pose Regression seems to work very well for estimating relative poses [1] Additional Input: Pose estimate of last frame Tracking Accuracy: ~5cm, 4 [1] Deep Auxiliary Learning For Visual Localization And Odometry, Valada et al., ICRA

31 Feature-Based Pipeline [7 Scenes Dataset: Image I Feature Extraction Feature Matching Estimate Pose መh = PnP({p i, y i }) 6D Pose መh Image Coordinates p i Scene Coordinates y i 31

32 Feature-Based Pipeline [7 Scenes Dataset: Image I Feature Extraction Feature Matching Estimate Pose መh = PnP({p i, y i }) 6D Pose መh Image Coordinates p i Scene Coordinates y i 32

33 Feature-Based Pipeline [7 Scenes Dataset: Image I Feature Extraction Feature Matching RANSAC Estimate Poses (PnP) Score Poses (Inliers) 6D Pose መh Image Coordinates p i Scene Coordinates y i 33

34 Task Knowledge: Local Invariance Training Image Test Images መh?????? [7 Scenes Dataset: 34

35 Task Knowledge: The Camera Model y i p i Estimate Pose መh = PnP({p i, y i }) C Camera Coordinate System Scene Coordinate System Image Coordinate p i = Chy i Camera Matrix Pose Scene Coordinate 35

36 Roadmap Machine Learning Basics [10min] Convolutional Neural Networks Random Forests Camera Pose Regression [15min] Pose Parametrization Pose Loss Functions Results Scene Coordinate Regression [35min] Scene Coordinate Regression Forests Scene Coordinate Regression CNNs End-to-End Learning Results 36

37 Feature-Based Pipeline Image I Feature Extraction Feature Matching RANSAC Estimate Poses (PnP) Score Poses (Inliers) 6D Pose መh Image Coordinates p i Scene Coordinates y i [7 Scenes Dataset: 37

38 Scene Coordinate Regression [Sho13] Image I (RGB-D) Scene Coordinate Regression RANSAC Estimate Poses (Kabsch) Score Poses (Inliers) 6D Pose መh p i R 2 y i R 3 [Sho13] Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images, Shotton et al., CVPR 13 38

39 Scene Coordinate Regression Forest Features: Depth Adapted Pixel Differences f DA RGB p = I RGB p + δ 1 I D (p), c 1 I RGB (p + δ 2 I D (p), c 2) f p < τ f DA D p = I D p + δ 1 I D (p) I D (p + δ 2 I D (p) ) Split Score: Reduction of Spatial Variance S L S R V S L, S R = S L y തy L + S R y S L y S R y തy R 39

40 Scene Coordinate Regression Forest y pred f S Mean-Shift f S X X 40

41 Scene Coordinate Regression Results 41

42 Scene Coordinate Regression Results RGB %Pose Err. < 5cm5 Sparse Features 38.6% RGB-D [Sho13] 72.6% [Sho13] Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images, Shotton et al., CVPR 13 42

43 Scene Coordinate Regression Results RGB %Pose Err. < 5cm5 Sparse Features 38.6% RGB-D [Sho13] 72.6% [Val15] 91.3% yp(y) pred f S X [Sho13] Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images, Shotton et al., CVPR 13 [Val15] Exploiting Uncertainty in Regression Forests for Accurate Camera Relocalization, Valentin et al., CVPR 15 43

44 Scene Coordinate Regression Results RGB %Pose Err. < 5cm5 Sparse Features 38.6% RGB-D [Sho13] 72.6% [Val15] 91.3% [Cal17] 90.7% f S f S f S X [Sho13] Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images, Shotton et al., CVPR 13 [Val15] Exploiting Uncertainty in Regression Forests for Accurate Camera Relocalization, Valentin et al., CVPR 15 [Cav17] On-the-Fly Adaptation of Regression Forests for Online Camera Localization, Cavallari et al., CVPR 17 X X 44

45 Scene Coordinate Regression Results RGB %Pose Err. < 5cm5 Sparse Features 38.6% [Bra16] 55.2% RGB-D [Sho13] 72.6% [Val15] 91.3% [Cal17] 90.7% [Sho13] Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images, Shotton et al., CVPR 13 [Val15] Exploiting Uncertainty in Regression Forests for Accurate Camera Relocalization, Valentin et al., CVPR 15 [Cav17] On-the-Fly Adaptation of Regression Forests for Online Camera Localization, Cavallari et al., CVPR 17 [Bra16] Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image, Brachmann et al., CVPR 16 45

46 Scene Coordinate Regression Forests Advantages: Fast Good Results Training On-The-Fly V S L, S R = S L y തy L + S R y S L y S R y തy R Disadvantages: Needs 3D Model for Training No End-To-End Training 46

47 Roadmap Machine Learning Basics [10min] Convolutional Neural Networks Random Forests Camera Pose Regression [15min] Pose Parametrization Pose Loss Functions Results Scene Coordinate Regression [35min] Scene Coordinate Regression Forests Scene Coordinate Regression CNNs End-to-End Learning Results 47

48 Scene Coordinate Regression with a CNN Image I Scene Coordinate Regression (CNN) RANSAC Estimate Poses (PnP) Score Poses (Inliers) 6D Pose መh p i R 2 y i R 3 48

49 Random Forest vs. CNN Forest Prediction: CNN Prediction: Ground Truth: Pose Estimation Succeeds (< 5cm, 5 ) Pose Estimation Fails (> 5cm, 5 ) 49

50 Random Forest vs. CNN Frequency CNN Prediction: 0,14 Random Forest CNN 0,12 0,1 Inlier Peaks 0,08 0,06 Inlier Threshold Forest Outliers 0,04 Obj. Corrd. Error < 10cm %Pose Err. < 5cm5 Random Forest ~15% 55.2% CNN ~25% 17.2% 0, Object Coordinate Error (mm) Optimized: l L2 y, y = y y 2 50

51 Scene Coordinate Loss Functions l L2 l L1 %Obj. Coor. Err. < 10cm %Pose Err. < 5cm5 l Huber l Tukey Random Forest ~15% 55.2% CNN (L 2 ) ~25% 17.2% CNN (L 1 ) ~45% 55.9% CNN (Huber) ~45% 54.4% CNN (Tukey) ~45% 52.1% 51

52 Scene Coordinate Loss Functions Frequency Random Forest CNN (L 2 ) CNN (L 1 ) 0,14 0,12 %Obj. Coor. Err. < 10cm %Pose Err. < 5cm5 0,1 Random Forest ~15% 55.2% 0,08 0,06 0,04 0,02 Inlier Threshold CNN (L 2 ) ~25% 17.2% CNN (L 1 ) ~45% 55.9% CNN (Huber) ~45% 54.4% CNN (Tukey) ~45% 52.1% Object Coordinate Error (mm) 52

53 Scene Coordinate Regression Nets Advantages: Fast Good Results Disadvantages: Needs 3D Model for Training No End-To-End Training Advantages: Fast Better Results Disadvantages: Needs 3D Model for Training End-To-End Training? 53

54 Even Better Loss Function? Image I Scene Coordinate Regression (CNN) RANSAC Estimate Poses (PnP) Score Poses (Inliers) 6D Pose መh Optimized: l L1 y, y = y y l መh, h DSAC Differentiable RANSAC [Bra17] [Bra17] DSAC Differentiable RANSAC for Camera Localization Brachmann et al., CVPR

55 Roadmap Machine Learning Basics [10min] Convolutional Neural Networks Random Forests Camera Pose Regression [15min] Pose Parametrization Pose Loss Functions Results Scene Coordinate Regression [35min] Scene Coordinate Regression Forests Scene Coordinate Regression CNNs End-to-End Learning Results 55

56 What is RANSAC? h = f({y i }) h y i 56

57 What is RANSAC? 1) Sample multiple hypotheses h j h 1 = f(y 1, y 4 ) h 2 = f y 2, y 3 h 3 = f y 1, y 5 h 1 2) Score each hypothesis: s h j s h 1 = 2 s h 2 = 2 s h 3 = 4 h 3 h 2 3) Take best one (and refine): መh = argmax h j s(h j ) መh = h 3 57

58 Our Pipeline Image I Scene Coordinate Regression (CNN) RANSAC Estimate Poses (PnP) Score Poses 6D Pose መh w h 1 h 2 h 3 h 4 Reprojection Errors of h 2 s(h 1 ) s(h 2 ) s(h 3 ) s(h 4 ) Input RGB Scene Coordinate Regression Hypothesis Sampling Scoring Hypothesis Selection Result መh 58

59 Our Pipeline Image I Scene Coordinate Regression (CNN) RANSAC Estimate Poses (PnP) Score Poses 6D Pose መh w h 1 h 2 h 3 h 4 Reprojection Errors of h 2 s(h 1 ) s(h 2 ) s(h 3 ) s(h 4 ) Input RGB Scene Coordinate Regression Hypothesis Sampling Scoring Hypothesis Selection Result መh w l( መh, h ) 59

60 End-to-End Learning: How? h 1 s(h 1 ) w h 2 h 3 h 4 Reprojection Errors of h 2 s(h 2 ) s(h 3 ) s(h 4 ) መh argmax Selection h AM = argmax h j s(h j ) hard decision non-differentiable Soft argmax Selection exp(s(h j ))h j h SoftAM = σ k exp(s(h k )) j soft decision differentiable 60

61 End-to-End Learning: How? h 1 s(h 1 ) w h 2 h 3 h 4 Reprojection Errors of h 2 s(h 2 ) s(h 3 ) s(h 4 ) መh argmax Selection h AM = argmax h j s(h j ) hard decision non-differentiable Soft argmax Selection exp(s(h j ))h j h SoftAM = σ k exp(s(h k )) j soft decision differentiable Probabilistic Selection h DSAC = h j, where j~ hard decision differentiable exp(s(h j)) σ k exp(s(h k )) 61

DSAC Differentiable RANSAC Probabilistic Selection h DSAC = h j, where j~ exp(s(h j)) σ k exp(s(h k )) = P j w w Differentiation via the expected task loss: w E j~p(j

62 DSAC Differentiable RANSAC Probabilistic Selection h DSAC = h j, where j~ exp(s(h j)) σ k exp(s(h k )) = P j w w Differentiation via the expected task loss: w E j~p(j w) l(h j, h ) = E j~p(j w) l h j, h log P j w + w w l h j, h Scene Coordinate Regression Scoring Derivative of the selection probability Derivative of the task loss 62

63 Roadmap Machine Learning Basics [10min] Convolutional Neural Networks Random Forests Camera Pose Regression [15min] Pose Parametrization Pose Loss Functions Results Scene Coordinate Regression [35min] Scene Coordinate Regression Forests Scene Coordinate Regression CNNs End-to-End Learning Results 63

64 Accuracy on 7-Scenes [Sho13] Error < 5cm, 5 Sparse Features [Sho13] Brachmann et al. [Bra16] Ours [Bra17] RANSAC Soft argmax DSAC 38.6% 55.2% 61.0% (not end-to-end) 57.8% (end-to-end) 66.2% (end-to-end) [Sho13] Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images, Shotton et al., CVPR 13 [Bra16] Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image, Brachmann et al., CVPR 16 [Bra17] Learning to Predict Dense Correspondences For 6D Pose Estimation, Brachmann, Thesis,

Side Comment: Learning Scores [Bra17] learned s h - hard to

Soft Inlier Count s(h 1 ) s(h 2 ) s(h 3 ) s(h 4 ) s(h 1 ) s(h 2 )

Poses 3D Model Overlay 3D Model Overlay Estimation with Learned

[Sho13] Scene Coordinate Regression Forests for Camera

, CVPR 13 [Bra17] DSAC Differentiable RANSAC for Camera

65 Side Comment: Learning Scores [Bra17] learned s h - hard to regularize, overfits [Sho13] Inlier Count: s h = σ i 1 τ r i h, w - not differentiable [Bra18] Soft In. Count: s h = σ i sig(τ βr i h, w ) differentiable Score Regression Soft Inlier Count s(h 1 ) s(h 2 ) s(h 3 ) s(h 4 ) s(h 1 ) s(h 2 ) s(h 3 ) s(h 4 ) Training Set (2 Images Total) Estimated Camera Poses 3D Model Overlay 3D Model Overlay Estimation with Learned Score [Bra17] Estimation with Soft Inlier Count [Bra18] Test Image [Sho13] Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images, Shotton et al., CVPR 13 [Bra17] DSAC Differentiable RANSAC for Camera Localization, Brachmann et al., CVPR 17 [Bra18] Learning Less is More - 6D Camera Localization via 3D Surface Regression, Brachmann and Rother, CVPR 18 65

Accuracy on 7-Scenes [Sho13] Error < 5cm, 5 Sparse Features [Sho13] Brachmann et al. [Bra16] Ours [Bra17] RANSAC Soft argmax DSAC DSAC++ [Bra18] 38.6% 55.2% 61.0% (not end-to-end) 57.

66 Accuracy on 7-Scenes [Sho13] Error < 5cm, 5 Sparse Features [Sho13] Brachmann et al. [Bra16] Ours [Bra17] RANSAC Soft argmax DSAC DSAC++ [Bra18] 38.6% 55.2% 61.0% (not end-to-end) 57.8% (end-to-end) 66.2% (end-to-end) 76.1% (end-to-end) [Bra17] [Bra18] [Sho13] Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images, Shotton et al., CVPR 13 [Bra16] Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image, Brachmann et al., CVPR 16 [Bra17] Learning to Predict Dense Correspondences For 6D Pose Estimation, Brachmann, Thesis, 2017 [Bra18] Learning Less is More - 6D Camera Localization via 3D Surface Regression, Brachmann and Rother, CVPR 18 66

67 Scene Coordinate Regression Nets Advantages: Fast Good Results Disadvantages: Needs 3D Model for Training No End-To-End Training Advantages: Fast End-To-End Training Better Best Results Disadvantages: Needs 3D Model for Training 67

Training without a 3D Model End-to-end training needs no

Minimize 3D distance to ground truth scene coordinates

reprojection error min i π(y i w, h ) p i [Bra17] DSAC

68 Training without a 3D Model End-to-end training needs no 3D model but good initialization [Bra17] Initialization: Minimize 3D distance to ground truth scene coordinates min i y i w y i [Bra18] Initialization: Minimize 2D reprojection error min i π(y i w, h ) p i [Bra17] DSAC Differentiable RANSAC for Camera Localization, Brachmann et al., CVPR 17 [Bra18] Learning Less is More - 6D Camera Localization via 3D Surface Regression, Brachmann and Rother, CVPR 18 68

Accuracy on 7-Scenes [Sho13] Error < 5cm, 5 Sparse Features [Sho13] Brachmann et al. [Bra16] Ours [Bra17] RANSAC Soft argmax DSAC DSAC++ [Bra18] (w/o M) [Bra18] 38.6% 55.2% 61.0% (not end-to-end) 57.

69 Accuracy on 7-Scenes [Sho13] Error < 5cm, 5 Sparse Features [Sho13] Brachmann et al. [Bra16] Ours [Bra17] RANSAC Soft argmax DSAC DSAC++ [Bra18] (w/o M) [Bra18] 38.6% 55.2% 61.0% (not end-to-end) 57.8% (end-to-end) 66.2% (end-to-end) 76.1% (end-to-end) 60.4% (end-to-end) with 3D model w/o 3D model [Sho13] Scene Coordinate Regression Forests for Camera Relocalization in RGB-D Images, Shotton et al., CVPR 13 [Bra16] Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image, Brachmann et al., CVPR 16 [Bra17] Learning to Predict Dense Correspondences For 6D Pose Estimation, Brachmann, Thesis, 2017 [Bra18] Learning Less is More - 6D Camera Localization via 3D Surface Regression, Brachmann and Rother, CVPR 18 69

70 Scene Coordinate Regression Nets Advantages: Fast Good Results Disadvantages: Needs 3D Model for Training No End-To-End Training Advantages: Fast End-To-End Training Better Best Results Disadvantages: Needs 3D Model for Training 70

Conclusion Learning pose estimation works! High accuracy possible. Learning pose estimation is fun! Interesting mixture of learning and geometry.

http://mi.eng.cam.ac.uk/projects/relocalisation/ MapNet: https://research.nvidia.com/publication/2018-06_geometry-aware-learning-of SCoRF [Bra16]: https://hci.iwr.

71 Conclusion Learning pose estimation works! High accuracy possible. Learning pose estimation is fun! Interesting mixture of learning and geometry. Pose Regression fast coarse estimates Scene Coordinate Regression with RFs: fast and accurate with CNNs: best accuracy, no 3D models Code of many methods online: PoseNet: MapNet: SCoRF [Bra16]: VLocNet: DSAC: DSAC++: 71

Uncertainty-Driven 6D Pose Estimation of Objects and Scenes from a Single RGB Image - Supplementary Material -

Uncertainty-Driven 6D Pose Estimation of s and Scenes from a Single RGB Image - Supplementary Material - Eric Brachmann*, Frank Michel, Alexander Krull, Michael Ying Yang, Stefan Gumhold, Carsten Rother