Depth Cameras. Didier Stricker Oliver Wasenmüller Lecture 3D Computer Vision

1 Depth Cameras. Lecture 3D Computer Vision. Oliver Wasenmüller, Didier Stricker

2 Content: Motivation; Depth Measurement Techniques; Depth Image Enhancement; Applications: Kinect Fusion, Body Reconstruction; Outlook for next semester(s)

3 Motivation

4 What is a depth camera? A depth camera captures depth images. A depth image stores in each pixel (x,y) the distance z from the camera to the observed object, so each pixel corresponds to a 3D point (x,y,z) relative to the camera center. Figure: color image vs. depth image (color encoded). In the following slides: how did we capture depth in the previous lectures?
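
A minimal sketch of how a depth image turns into 3D points. It assumes a pinhole camera with intrinsics fx, fy (focal lengths in pixels) and cx, cy (principal point), which are hypothetical parameter names not given on the slide, and it assumes each depth value is the z coordinate along the optical axis.

```python
import numpy as np


def backproject(depth, fx, fy, cx, cy):
    """Convert an HxW depth image into an HxWx3 array of 3D points (x, y, z).

    Assumes a pinhole camera and that each depth value is the z coordinate
    along the optical axis (not the Euclidean distance to the camera center).
    """
    h, w = depth.shape
    v, u = np.indices((h, w), dtype=float)  # row (v) and column (u) pixel coordinates
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)
```

For example, with fx = fy = 500 and the principal point at (320, 240), the pixel in column 420 and row 240 at depth 2 m maps to the point (0.4, 0, 2).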

5 Depth from Stereo Images. Figure: image 1, image 2, dense disparity map. Parts of this slide are adapted from Derek Hoiem (University of Illinois), Steve Seitz (University of Washington) and Lana Lazebnik (University of Illinois)

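6 Depth from Stereo Images. Goal: recover depth by finding the image coordinate x' that corresponds to x. Figure: the scene point X at depth z projects to x in the first camera (center C) and to x' in the second camera (center C'); both cameras have focal length f and are separated by the baseline B. Parts of this slide are adapted from Derek Hoiem (University of Illinois), Steve Seitz (University of Washington) and Lana Lazebnik (University of Illinois)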

7 Stereo and the Epipolar Constraint. Potential matches for x have to lie on the corresponding epipolar line l'. Potential matches for x' have to lie on the corresponding epipolar line l. Parts of this slide are adapted from Derek Hoiem (University of Illinois), Steve Seitz (University of Washington) and Lana Lazebnik (University of Illinois)

8 Simplest Case: Parallel Images. The image planes of the cameras are parallel to each other and to the baseline, the camera centers are at the same height, and the focal lengths are the same. Then the epipolar lines fall along the horizontal scan lines of the images. Parts of this slide are adapted from Derek Hoiem (University of Illinois), Steve Seitz (University of Washington) and Lana Lazebnik (University of Illinois)

9 Basic Stereo Matching Algorithm. For each pixel in the first image: (1) find the corresponding epipolar line in the right image; (2) examine all pixels on the epipolar line and pick the best match; (3) triangulate the matches to get depth information. Parts of this slide are adapted from Derek Hoiem (University of Illinois), Steve Seitz (University of Washington) and Lana Lazebnik (University of Illinois)
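
The matching loop can be sketched as follows for an already rectified pair, where the epipolar line of a left pixel is simply the same row in the right image. The sum of squared differences (SSD) over a small window as the matching cost, and the names `block_match_scanline`, `max_disp` and `half`, are assumptions of this sketch; the slide only says "pick the best match".

```python
import numpy as np


def block_match_scanline(left, right, max_disp, half=2):
    """Disparity map for a rectified pair by brute-force SSD block matching.

    For rectified images the epipolar line of left pixel (y, x) is row y of the
    right image; the match is searched at (y, x - d) for d = 0..max_disp.
    """
    h, w = left.shape
    disp = np.zeros((h, w), dtype=int)
    for y in range(half, h - half):  # for each pixel in the first image
        for x in range(half, w - half):
            patch = left[y - half:y + half + 1, x - half:x + half + 1]
            best_cost, best_d = np.inf, 0
            # examine all candidates on the epipolar line (same row), keep the best
            for d in range(0, min(max_disp, x - half) + 1):
                cand = right[y - half:y + half + 1, x - d - half:x - d + half + 1]
                cost = float(np.sum((patch - cand) ** 2))
                if cost < best_cost:
                    best_cost, best_d = cost, d
            disp[y, x] = best_d
    return disp
```

The last step, triangulation, reduces for rectified cameras to converting each disparity into depth. The brute-force search costs O(H·W·D·window size), so it only illustrates the steps; practical implementations reuse partial sums across windows.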

10 Depth from Disparity. By similar triangles (scene point X, image points x and x', camera centers O and O', focal length f, baseline B, depth z): $\text{disparity} = x - x' = \frac{B f}{z}$. Disparity is inversely proportional to depth! Parts of this slide are adapted from Derek Hoiem (University of Illinois), Steve Seitz (University of Washington) and Lana Lazebnik (University of Illinois)
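
Solving disparity = x - x' = B·f / z for the depth gives z = f·B / disparity. A minimal sketch of this conversion, assuming the focal length is given in pixels and the baseline in meters (the function name is illustrative):

```python
import numpy as np


def depth_from_disparity(disparity, focal_px, baseline_m):
    """z = f * B / d for a rectified stereo pair; non-positive disparities map to inf."""
    d = np.asarray(disparity, dtype=float)
    with np.errstate(divide="ignore"):  # d == 0 would otherwise warn
        return np.where(d > 0, focal_px * baseline_m / d, np.inf)
```

With f = 500 px and B = 0.1 m, a disparity of 25 px gives 2 m and a disparity of 50 px gives 1 m. Differentiating z = f·B/d gives |dz| = z²/(f·B)·|dd|, so a fixed one-pixel disparity error costs quadratically more depth accuracy at larger distances.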

11 Depth Measurement Techniques

12 Depth Measurement Techniques. Parts of this slide are adapted from Victor Castaneda and Nassir Navab (both University of Munich)

13 Depth Measurement Techniques: Laser Scanner; Structured Light Projection; Time of Flight (ToF)

14 Structured Light Projection. Source: https://www.youtube.com/watch? Parts of this slide are adapted from Derek Hoiem (University of Illinois)

15 Structured Light Projection (see also the lectures about structured light). Figure: projector, surface, sensor. Parts of this slide are adapted from Derek Hoiem (University of Illinois)

16 Structured Light Projection. Figure: projector and camera. Parts of this slide are adapted from Derek Hoiem (University of Illinois)

17 Example: Book vs. No Book. Source:

18 Example: Book vs. No Book. Source:

19 Region-growing Random Dot Matching
1. Detect dots ("speckles") and label them unknown.
2. Randomly select a region anchor, i.e. a dot with unknown depth.
   a. Windowed search via normalized cross-correlation along the scanline. Check that the best match score is greater than a threshold; if not, mark the anchor as invalid and go to 2.
   b. Region growing:
      1. Neighboring pixels are added to a queue.
      2. For each pixel in the queue, initialize with the anchor's shift, then search a small local neighborhood; if matched, add its neighbors to the queue.
      3. Stop when no pixels are left in the queue.
3. Stop when all dots have known depth or are marked invalid.
Parts of this slide are adapted from Derek Hoiem (University of Illinois)
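
The procedure above can be sketched on a dense grid of pixels instead of detected dots, which is a simplification of this sketch. It assumes a reference image `ref` of the projected pattern and an observed image `obs` in which a pixel at column x corresponds to reference column x - s for an unknown shift s; all names, the window size and the threshold are hypothetical. Each neighbor is initialized with the shift of the pixel it was reached from, so the anchor's shift propagates through the region.

```python
import random
from collections import deque

import numpy as np

UNKNOWN, INVALID = -1, -2


def ncc(a, b):
    """Normalized cross-correlation of two equally sized windows (-1 for flat windows)."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else -1.0


def best_shift(ref, obs, y, x, shifts, half):
    """Compare obs around (y, x) with ref around (y, x - s) for each candidate shift s."""
    win = obs[y - half:y + half + 1, x - half:x + half + 1]
    best_s, best_score = None, -1.0
    for s in shifts:
        xr = x - s
        if xr - half < 0 or xr + half >= ref.shape[1]:
            continue  # window would leave the reference image
        score = ncc(win, ref[y - half:y + half + 1, xr - half:xr + half + 1])
        if score > best_score:
            best_s, best_score = s, score
    return best_s, best_score


def region_grow_matching(ref, obs, max_shift, half=3, thresh=0.8, local=1, seed=0):
    """Shift map: >= 0 matched, INVALID failed anchor, UNKNOWN border pixels."""
    h, w = obs.shape
    shift = np.full((h, w), UNKNOWN, dtype=int)

    def inside(y, x):
        return half <= y < h - half and half <= x < w - half

    pixels = [(y, x) for y in range(half, h - half) for x in range(half, w - half)]
    random.Random(seed).shuffle(pixels)
    for ay, ax in pixels:  # step 2: random anchors among pixels with unknown shift
        if shift[ay, ax] != UNKNOWN:
            continue
        s, score = best_shift(ref, obs, ay, ax, range(max_shift + 1), half)  # step 2a
        if s is None or score < thresh:
            shift[ay, ax] = INVALID
            continue
        shift[ay, ax] = s
        queue = deque([(ay, ax)])  # step 2b: region growing
        while queue:
            cy, cx = queue.popleft()
            base = int(shift[cy, cx])
            for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                if not inside(ny, nx) or shift[ny, nx] != UNKNOWN:
                    continue
                # start from the already matched shift, search only a small range
                local_shifts = range(max(0, base - local), base + local + 1)
                s2, score2 = best_shift(ref, obs, ny, nx, local_shifts, half)
                if s2 is not None and score2 >= thresh:
                    shift[ny, nx] = s2
                    queue.append((ny, nx))
    return shift  # step 3: every interior pixel is now matched or invalid
```

The design point is cost: only the anchor pays for the full scanline search over `max_shift + 1` candidates, while grown pixels test just `2 * local + 1` shifts.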

20 Projected IR vs. Natural Light Stereo. What are the advantages of IR? It works in low-light conditions, does not rely on having textured objects, is not confused by repeated scene textures, and the algorithm can be tailored to the projected pattern. What are the advantages of natural light? It works outside, anywhere with sufficient light, uses less energy, and the resolution is limited only by the sensors, not by the projector. Difficulties with both: very dark surfaces may not reflect enough light; specular reflection in mirrors or metal causes trouble. Parts of this slide are adapted from Derek Hoiem (University of Illinois)

21 Example: The Kinect Sensor (v1). Microsoft Kinect (v1) was released in 2010 as a new kind of controller for the Xbox 360. Parts of this slide are adapted from Rob Miles (University of Hull)

22 Example: The Kinect Sensor. The Kinect is able to capture depth and color images. For this purpose it contains two cameras and an infrared projector. It also has four microphones. Parts of this slide are adapted from Rob Miles (University of Hull)

23 Example: The Kinect Sensor. The Kinect sensor contains a color video camera which can provide up to 1280x1024 resolution; at 30 frames per second it delivers 640x480. Parts of this slide are adapted from Rob Miles (University of Hull)

24 Example: The Kinect Sensor. Figure: IR projector, IR camera. The Kinect depth sensor uses an IR projector and an IR camera to measure the depth of objects in the scene in front of the sensor. Parts of this slide are adapted from Rob Miles (University of Hull)

25 Time of Flight (ToF). Time-of-Flight (ToF) imaging refers to the process of measuring the depth of a scene by quantifying the changes that an emitted light signal undergoes when it bounces back from objects in the scene. Two common principles: pulsed modulation and continuous wave modulation.

26 Time of Flight (ToF): Pulsed Modulation. Measure the distance to a 3D object by measuring the absolute time $\Delta t$ a light pulse needs to travel from a source into the 3D scene and back, after reflection. The speed of light is constant and known, $c \approx 3 \cdot 10^8$ m/s, so the distance is $d = c\,\Delta t / 2$. Parts of this slide are adapted from Victor Castaneda and Nassir Navab (both University of Munich)
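
A short sketch of the pulsed principle: the measured distance is half the round-trip path. The helper names are hypothetical. The second function illustrates why high-accuracy time measurement is needed: resolving 1 cm in depth requires resolving roughly 67 ps in time.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s


def pulsed_tof_distance(round_trip_s):
    """Distance for a measured round-trip time: the pulse travels there and back."""
    return SPEED_OF_LIGHT * round_trip_s / 2.0


def timing_resolution_needed(depth_resolution_m):
    """Time resolution required to separate two depths depth_resolution_m apart."""
    return 2.0 * depth_resolution_m / SPEED_OF_LIGHT
```

A round trip of 20 ns corresponds to a distance of about 3 m.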

27 Time of Flight (ToF): Pulsed Modulation. Advantages: direct measurement of the time of flight; high-energy light pulses limit the influence of background illumination; illumination and observation directions are collinear. Disadvantages: high-accuracy time measurement is required; measurement of the returning light pulse is inexact due to light scattering; it is difficult to generate short light pulses with fast rise and fall times; usable light sources (e.g. lasers) suffer from low pulse repetition rates. Parts of this slide are adapted from Victor Castaneda and Nassir Navab (both University of Munich)

28 Time of Flight (ToF): Continuous Wave Modulation (the principle used by Microsoft Kinect v2). Continuous light waves are used instead of short light pulses, with modulation in terms of the frequency of sinusoidal waves. After reflection, the detected wave has a shifted phase; the phase shift is proportional to the distance from the reflecting surface. Parts of this slide are adapted from Victor Castaneda and Nassir Navab (both University of Munich)

29 Time of Flight (ToF): Continuous Wave Modulation (Microsoft Kinect v2). Retrieve the phase shift by demodulating the received signal, i.e. by cross-correlating the received signal with the emitted signal. Emitted sinusoidal signal: $g(t) = 1 + \cos(\omega t)$ with $\omega = 2\pi f$. Received signal after reflection from the 3D surface: $s(t) = b + a\cos(\omega t - \varphi)$. Cross-correlation of both signals: $C(\tau) = (s \otimes g)(\tau) = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} s(t)\, g(t + \tau)\, dt$. Parts of this slide are adapted from Victor Castaneda and Nassir Navab (both University of Munich)

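30 Time of Flight (ToF): Continuous Wave Modulation (Microsoft Kinect v2). The cross-correlation function simplifies to $C(\tau) = b + \frac{a}{2}\cos(\omega\tau + \varphi)$, with amplitude a, offset b, phase shift $\varphi$ and $\omega = 2\pi f$ for modulation frequency f. Sample it at four sequential instants $\tau_i$ with different phase offsets $\omega\tau_i = i\,\pi/2$, $i = 0, 1, 2, 3$, giving $A_i = C(\tau_i)$. The sought parameters are then obtained directly: $\varphi = \operatorname{atan2}(A_3 - A_1,\, A_0 - A_2)$, $a = \sqrt{(A_3 - A_1)^2 + (A_0 - A_2)^2}$, $b = \frac{A_0 + A_1 + A_2 + A_3}{4}$, and the distance $d = \frac{c\,\varphi}{4\pi f}$. Parts of this slide are adapted from Victor Castaneda and Nassir Navab (both University of Munich)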
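
A sketch of four-sample demodulation, assuming the correlation model C(τ) = b + (a/2)·cos(ωτ + φ) sampled at ωτ_i = i·π/2. Then A_3 - A_1 = a·sin φ and A_0 - A_2 = a·cos φ, which gives phase, amplitude and offset in closed form. The function names and the argument `f_mod` (modulation frequency in Hz) are illustrative assumptions.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s


def demodulate_four_phase(a0, a1, a2, a3, f_mod):
    """Recover phase, amplitude, offset and distance from four correlation samples.

    Assumes A_i = C(tau_i) with omega * tau_i = i * pi / 2 and
    C(tau) = b + (a / 2) * cos(omega * tau + phi).
    """
    phase = math.atan2(a3 - a1, a0 - a2) % (2.0 * math.pi)  # wrap to [0, 2*pi)
    amplitude = math.hypot(a3 - a1, a0 - a2)                 # equals a
    offset = (a0 + a1 + a2 + a3) / 4.0                       # equals b
    distance = SPEED_OF_LIGHT * phase / (4.0 * math.pi * f_mod)
    return phase, amplitude, offset, distance


def unambiguous_range(f_mod):
    """The phase wraps every 2*pi, so measured distances repeat every c / (2 f_mod)."""
    return SPEED_OF_LIGHT / (2.0 * f_mod)
```

At a modulation frequency of 30 MHz the unambiguous range is about 5 m: a surface at 6 m yields the same phase as one at roughly 1 m. Lower frequencies extend the range at the cost of depth precision.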

31 Time of Flight (ToF): Continuous Wave Modulation (Microsoft Kinect v2). Advantages: a variety of light sources is available, as no short or strong pulses are required; applicable to different modulation techniques (other than frequency); simultaneous range and amplitude images. Disadvantages: in practice, integration over time is required to reduce noise; frame rates are limited by the integration time; long integration times cause motion blur. Parts of this slide are adapted from Victor Castaneda and Nassir Navab (both University of Munich)

32 Depth Quality, e.g. Kinect v1. Main problems: resolution and noise. Source:

33 Kinect v1 Quality. Source:

34 Depth Image Enhancement

35 Depth Image Enhancement: Overview. Single depth image: patch based. Multiple depth images: static, dynamic. Additional color information: joint filtering, cost volume, Markov random field.

36 Single Depth-Image: Patch-Based. Learning: store normalized (artificial) high-resolution patches. Procedure (Mac Aodha et al. [6]): divide the low-resolution image into patches; for each patch, find nearest-neighbor candidates; solve the resulting labeling problem via an MRF, which finds the best possible match for each patch while minimizing the difference on the border of neighboring patches; de-normalize. Oisin Mac Aodha, Neill D. F. Campbell, Arun Nair, and Gabriel J. Brostow. Patch based synthesis for single depth image super-resolution. In Computer Vision ECCV 2012, volume 7574 of Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2012.
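
A minimal sketch of the dictionary lookup behind this approach. It assumes patches are normalized by subtracting the mean and dividing by the value range of the low-resolution patch, and that low-resolution keys are produced by plain subsampling. Both choices and all function names are illustrative assumptions, not the exact procedure of Mac Aodha et al.; the MRF that chooses among the candidates is replaced here by simply taking the nearest one.

```python
import numpy as np


def normalize(patch, mean=None, scale=None):
    """Subtract the mean and divide by the value range (own statistics by default)."""
    if mean is None:
        mean = patch.mean()
        scale = max(patch.max() - patch.min(), 1e-9)
    return (patch - mean) / scale, mean, scale


def build_dictionary(hr_patches, factor):
    """Pairs of (normalized low-res key, high-res patch normalized with the same stats)."""
    keys, values = [], []
    for hr in hr_patches:
        lr = hr[::factor, ::factor]  # naive subsampling (assumption)
        lr_n, mean, scale = normalize(lr)
        hr_n, _, _ = normalize(hr, mean, scale)
        keys.append(lr_n.ravel())
        values.append(hr_n)
    return np.array(keys), values


def super_resolve_patch(lr_patch, keys, values, k=3):
    """Return the de-normalized best high-res patch and the k candidate indices."""
    lr_n, mean, scale = normalize(lr_patch)
    dist = np.linalg.norm(keys - lr_n.ravel(), axis=1)
    candidates = np.argsort(dist)[:k]  # nearest-neighbor candidates
    best = candidates[0]               # an MRF would choose among the candidates
    return values[best] * scale + mean, candidates
```

Normalizing each high-resolution patch with the statistics of its low-resolution key makes de-normalization exact whenever the query matches a training key, and it lets one stored patch serve surfaces at any depth offset.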

37 Multiple Depth-Images: Static Methods. Only small displacements w.r.t. the starting position, so alignment is possible with simple methods. Example for laser measurements (Kil et al.): after initial registration, iterate realignment via ICP and a local weighted average. Yong Joo Kil, B. Mederos, and N. Amenta. Laser scanner super-resolution. In Proceedings of the 3rd Eurographics / IEEE VGTC Conference on Point-Based Graphics, SPBG'06. Eurographics Association.

38 Multiple Depth-Images: LidarBoost. Schuon et al.: alignment via optical flow; again formulated as an optimization problem; the data term of the energy corresponds to an MRF; smoothness term: multi-scale gradient approximation. S. Schuon, C. Theobalt, J. Davis, and S. Thrun. LidarBoost: Depth superresolution for ToF 3D shape scanning. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

39 Multiple Depth-Images: Dynamic Methods. Yan Cui et al.: initial alignment; LidarBoost for chunks C; probabilistic scan alignment of the resulting point clouds. Yan Cui, S. Schuon, D. Chan, S. Thrun, and C. Theobalt. 3D shape scanning with a time-of-flight camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

40 Multiple Depth-Images: Dynamic Methods. Cui et al.: probabilistic scan alignment of the resulting point clouds. Model rotation, translation and the systematic error of ToF sensors (shift along the projection ray, radially symmetric). Choose a reference cloud; for all other clouds construct a Gaussian mixture model. Maximum likelihood in an EM-like procedure. Yan Cui, S. Schuon, D. Chan, S. Thrun, and C. Theobalt. 3D shape scanning with a time-of-flight camera. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2010.

41 Additional RGB-Image: Overview. Obvious choice for RGB-D sensors; a good complement for noisy, low-resolution depth data. Basic assumption: image consistency. Figure: RGB, RGB (Sobel), depth. Images: Scharstein et al. D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, pages 7-42.

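42 Joint Filtering: Overview. $h_i = \sum_{j \in N(i)} W_{ij}(I, L)\, l_j$: each output depth value $h_i$ is a weighted sum of the input depth values $l_j$ in the neighborhood $N(i)$, with weights $W_{ij}$ computed from the color image I and the depth input L. J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. In ACM SIGGRAPH 2007 Papers, SIGGRAPH '07. ACM. D. Scharstein and R. Szeliski. A taxonomy and evaluation of dense two-frame stereo correspondence algorithms. IJCV, pages 7-42.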

43 Joint Bilateral Filter and Extensions. Bilateral filter: like a Gaussian filter, with an additional factor for the intensity difference. Joint bilateral filter (JBF): the intensity difference is provided by the RGB image (Kopf et al.). Fast approximation by decomposition into linear filters, one per discretized value (Yang et al.). Prevent texture copying by multilateral filters: the Combined Bilateral Filter (CBF) (Wasenmüller et al.) switches between the JBF and the standard bilateral filter; a confidence map based on depth gradients (Garcia et al.). J. Kopf, M. F. Cohen, D. Lischinski, and M. Uyttendaele. Joint bilateral upsampling. In ACM SIGGRAPH 2007 Papers, SIGGRAPH '07. ACM. Qingxiong Yang, Kar-Han Tan, and N. Ahuja. Real-time O(1) bilateral filtering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). F. Garcia, B. Mirbach, B. Ottersten, F. Grandidier, and A. Cuesta. Pixel weighted average strategy for depth sensor data fusion. In IEEE International Conference on Image Processing (ICIP). O. Wasenmüller, G. Bleser, and D. Stricker. Combined Bilateral Filter for Enhanced Real-Time Upsampling of Depth Images. International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP).
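
A deliberately slow sketch of joint bilateral upsampling in the spirit of Kopf et al.: each high-resolution output pixel averages nearby low-resolution depth values, weighted by a spatial Gaussian (in low-resolution pixel units) and by a range Gaussian on a high-resolution grayscale guide image. The function name and the parameters `radius`, `sigma_s` and `sigma_r` are illustrative assumptions; the CBF switching and the O(1) approximation of Yang et al. are not included.

```python
import numpy as np


def joint_bilateral_upsample(depth_lo, guide_hi, scale, radius=2, sigma_s=1.0, sigma_r=0.1):
    """Upsample depth_lo by an integer factor `scale`, guided by guide_hi.

    depth_lo: (h, w) low-resolution depth; guide_hi: (h*scale, w*scale) grayscale image.
    """
    h_lo, w_lo = depth_lo.shape
    H, W = guide_hi.shape
    out = np.empty((H, W))
    for y in range(H):
        for x in range(W):
            # nearest low-resolution sample, clamped to the image
            cy = min(max(int(round(y / scale)), 0), h_lo - 1)
            cx = min(max(int(round(x / scale)), 0), w_lo - 1)
            num = den = 0.0
            for qy in range(max(0, cy - radius), min(h_lo, cy + radius + 1)):
                for qx in range(max(0, cx - radius), min(w_lo, cx + radius + 1)):
                    # spatial term, measured in low-resolution pixel units
                    ds2 = (y / scale - qy) ** 2 + (x / scale - qx) ** 2
                    # range term: guide at the output pixel vs. guide at q's high-res position
                    gy, gx = min(qy * scale, H - 1), min(qx * scale, W - 1)
                    dr = guide_hi[y, x] - guide_hi[gy, gx]
                    wgt = np.exp(-ds2 / (2 * sigma_s ** 2) - dr ** 2 / (2 * sigma_r ** 2))
                    num += wgt * depth_lo[qy, qx]
                    den += wgt
            out[y, x] = num / den
    return out
```

Because the range weight comes from the color image rather than from the depth itself, depth edges snap to color edges. The same property causes texture copying when a color edge has no depth edge behind it, which is what the CBF targets.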

44 Exemplary results of CBF (Wasenmüller et al.). Figure: color input, depth input, CBF output. O. Wasenmüller, G. Bleser, and D. Stricker. Combined Bilateral Filter for Enhanced Real-Time Upsampling of Depth Images. International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications (VISAPP).

45 Cost Volume. Yang et al.: initialize $H_0$ by nearest-neighbor upsampling. Cost volume: one cost image for each depth hypothesis, using the truncated squared difference. Filter each cost image with a joint bilateral filter (JBF). Subpixel refinement: minimize the quadratic polynomial interpolating the costs of three neighboring depth hypotheses. Figure: cost volume $C_k$ and image I, bilateral filter & subpixel refinement, inc(k), $H_k$. Qingxiong Yang, Ruigang Yang, J. Davis, and D. Nister. Spatial-depth super resolution for range images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), pages 1-8.
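
The iteration can be sketched as below. Two simplifications are assumptions of this sketch, not part of the method of Yang et al.: a plain box filter stands in for the color-guided joint bilateral filter, and the depth hypotheses are a uniformly spaced array `depths`. The parameter `truncation` caps the squared difference; the subpixel step places the minimum at the vertex of the parabola through the costs of the best hypothesis and its two neighbors.

```python
import numpy as np


def box_filter(img, r):
    """Mean over a (2r+1)x(2r+1) window with replicated borders (stand-in for the JBF)."""
    h, w = img.shape
    padded = np.pad(img, r, mode="edge")
    out = np.zeros((h, w))
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (2 * r + 1) ** 2


def cost_volume_refine(h0, depths, truncation=4.0, radius=1, iterations=3):
    depths = np.asarray(depths, dtype=float)
    step, n = depths[1] - depths[0], len(depths)
    rows, cols = np.indices(h0.shape)
    h = h0.astype(float)
    for _ in range(iterations):
        # one truncated squared-difference cost image per depth hypothesis
        cost = np.stack([np.minimum((d - h) ** 2, truncation) for d in depths])
        cost = np.stack([box_filter(c, radius) for c in cost])
        k = np.argmin(cost, axis=0)  # winner-takes-all hypothesis per pixel
        km, kp = np.clip(k - 1, 0, n - 1), np.clip(k + 1, 0, n - 1)
        f0, fm, fp = cost[k, rows, cols], cost[km, rows, cols], cost[kp, rows, cols]
        # vertex of the parabola through (-1, fm), (0, f0), (+1, fp)
        denom = 2.0 * (fp + fm - 2.0 * f0)
        interior = (k > 0) & (k < n - 1) & (denom > 1e-12)
        offset = np.zeros(h.shape)
        offset[interior] = (fm - fp)[interior] / denom[interior]
        h = depths[k] + offset * step
    return h
```

The subpixel step is what lets a coarse set of hypotheses (here spaced 1 apart) return a continuous depth such as 2.3 instead of snapping to 2.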

46 Cost Volume. Qingxiong Yang, Ruigang Yang, J. Davis, and D. Nister. Spatial-depth super resolution for range images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR '07), pages 1-8.

47 Markov Random Fields: Overview [Lo et al.]. Hammersley-Clifford: minimize the sum of data terms and weighted smoothness terms, $E(H) = \sum_i U(L_i, H_i) + \lambda \sum_{i,j} w_{ij}\, V(H_i, H_j)$. Kai-Han Lo, Kai-Lung Hua, and Y.-C. F. Wang. Depth map super-resolution via Markov random fields without texture-copying artifacts. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013.
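
To make the energy E(H) = Σ_i U(L_i, H_i) + λ Σ_(i,j) w_ij V(H_i, H_j) concrete, here is a sketch that only evaluates it on a 4-neighborhood (it does not optimize it). The concrete choices are assumptions: a squared difference as data term U on observed pixels (NaN marks missing measurements), color weights w_ij = exp(-(I_i - I_j)² / (2 σ_c²)) from a grayscale guide image, and the truncated absolute difference as smoothness term V.

```python
import numpy as np


def mrf_energy(H, L, I, lam=1.0, sigma_c=0.1, tau=1.0):
    """Evaluate E(H) on a 4-neighborhood, each unordered pair counted once.

    H: candidate depth map, L: observed depth (NaN where missing), I: grayscale guide.
    """
    observed = ~np.isnan(L)
    data = np.sum((H[observed] - L[observed]) ** 2)  # U: squared difference
    smooth = 0.0
    for axis in (0, 1):  # vertical and horizontal neighbor pairs
        dH = np.abs(np.diff(H, axis=axis))
        dI = np.diff(I, axis=axis)
        w = np.exp(-dI ** 2 / (2 * sigma_c ** 2))     # color-based weight w_ij
        smooth += np.sum(w * np.minimum(dH, tau))     # V: truncated absolute difference
    return data + lam * smooth
```

A depth jump of 3 between two pixels of equal color costs λ·min(3, τ) = 1, while the same jump across a strong color edge costs almost nothing. This is how the color image tells the MRF where depth discontinuities are allowed.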

48 Markov Random Field: Example. Crucial points: the optimization method and the energy function. The energy function is often a norm inside a Gaussian kernel. Euclidean: not robust (Diebel et al.). The truncated absolute difference is better (Lu et al.). Example using a more complex distance involving the structure tensor: RGB, LR, HR [Park et al.]. Kai-Han Lo, Kai-Lung Hua, and Y.-C. F. Wang. Depth map super-resolution via Markov random fields without texture-copying artifacts. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013. J. Diebel and S. Thrun. An application of Markov random fields to range sensing. In Proceedings of Conference on Neural Information Processing Systems (NIPS). MIT Press. Jiangbo Lu, Dongbo Min, R. S. Pahwa, and M. N. Do. A revisit to MRF-based depth map super-resolution and enhancement. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011. Jaesik Park, Hyeongwoo Kim, Yu-Wing Tai, M. S. Brown, and Inso Kweon. High quality depth map upsampling for 3D-ToF cameras. In IEEE International Conference on Computer Vision (ICCV), 2011.

49 Additional RGB-Image: Conclusion. Joint bilateral filter: tends to be real-time capable, but over-smooths. MRF and cost volume: both can provide high quality; MRF is probably slower, but better results are possible. All methods: strong dependency on the particular choice of methods, parameters and data set.

50 Evaluation
Single depth image (learning). Advantages: (after training) a single depth image is enough. Disadvantages: requires training, good samples, and learning representations that work well in many cases.
Multiple depth images. Advantages: random noise is removed effectively; dynamic methods incorporate the systematic error. Disadvantages: small structures vanish; only data of the same type.
Additional RGB image. Advantages: random noise is removed effectively; complements the characteristics of depth images; good preservation of detail if image consistency holds; good results for dynamic objects and scenes. Disadvantages: RGB and depth must be registered.

51 Applications: Kinect Fusion; Body Reconstruction

52 Kinect Fusion: Overview

53 Challenges: tracking the camera precisely; fusing and de-noising measurements (depth estimates); avoiding drift; real time; low-cost hardware. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

54 Proposed Solution: fast optimization for tracking, made possible by the high frame rate; a global framework for fusing data; interleaving tracking and mapping; using the Kinect to get depth data (low cost); using the GPU to get real-time performance (low cost). Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

55 Method. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

56 Tracking. Finding the camera position is the same as fitting the depth map of a frame onto the model. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

57 Tracking: ICP Algorithm. ICP = iterative closest point (already explained in the Structured Light lecture). Goal: fit two 3D point sets. Problem: what are the correspondences? Kinect Fusion's chosen solution: 1) Start with $T_0$. 2) Project the model onto the camera. 3) Correspondences are points with the same (pixel) coordinates. 4) Find a new T with least squares. 5) Apply T, and repeat steps 2-5 until convergence. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)
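
A minimal point-to-point ICP sketch of steps 1, 4 and 5 of this loop. Two substitutions are assumptions of the sketch: correspondences come from brute-force nearest neighbors instead of projecting the model onto the camera (steps 2-3), and the least-squares step uses the closed-form SVD (Kabsch) solution for point-to-point distances, whereas KinectFusion itself minimizes a point-to-plane error.

```python
import numpy as np


def best_rigid_transform(src, dst):
    """Least-squares R, t with R @ src_i + t close to dst_i (Kabsch / SVD)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    cov = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(cov)
    # guard against a reflection (det = -1) in the SVD solution
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    return R, cd - R @ cs


def icp(frame, model, iterations=20, tol=1e-12):
    R_total, t_total = np.eye(3), np.zeros(3)  # step 1: start with T0 = identity
    pts = frame.copy()
    prev_err = np.inf
    for _ in range(iterations):
        # correspondences (swapped in): brute-force nearest model point
        d2 = ((pts[:, None, :] - model[None, :, :]) ** 2).sum(axis=-1)
        nn = d2.argmin(axis=1)
        err = d2[np.arange(len(pts)), nn].mean()
        if prev_err - err < tol:  # converged (or no longer improving)
            break
        prev_err = err
        R, t = best_rigid_transform(pts, model[nn])  # step 4: least squares
        pts = pts @ R.T + t                          # step 5: apply T and repeat
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total
```

Like the KinectFusion variant, this converges only when frame and model start roughly aligned: with large motions the initial correspondences are wrong and the loop can settle in a wrong pose.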

58 Tracking: ICP Algorithm. Assumption: frame and model are roughly aligned, which holds because of the high frame rate. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

59 Mapping. Mapping means fusing depth maps when the camera poses are known. Problems: measurements are noisy; depth maps have holes. Solution: use an implicit surface representation; fusion uses the estimates from all relevant frames. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

60 Mapping: Surface Representation. The surface is represented implicitly using a Truncated Signed Distance Function (TSDF) on a voxel grid. The numbers in the cells measure each voxel's distance to the surface. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

61 Mapping. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

62 Mapping. d = [pixel depth] - [distance from sensor to voxel]. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)
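
The per-voxel value d = (pixel depth) - (distance from sensor to voxel) can be sketched as follows. The sketch assumes a pinhole camera with hypothetical intrinsics fx, fy, cx, cy, depth stored as z along the optical axis, and a voxel already transformed into camera coordinates with the tracked pose; because depth is stored as z, the voxel's z coordinate serves as the distance from sensor to voxel. Dividing by the truncation distance `trunc`, clamping at +1 and skipping voxels far behind the surface follow the common KinectFusion-style convention.

```python
import numpy as np


def tsdf_value(voxel_cam, depth, fx, fy, cx, cy, trunc):
    """Truncated signed distance of one voxel (camera coordinates) w.r.t. a depth map.

    Positive in front of the surface, negative behind it; None means no update.
    """
    x, y, z = voxel_cam
    if z <= 0:
        return None  # behind the camera
    u = int(round(fx * x / z + cx))  # project the voxel into the depth image
    v = int(round(fy * y / z + cy))
    if not (0 <= v < depth.shape[0] and 0 <= u < depth.shape[1]) or depth[v, u] <= 0:
        return None  # outside the image or no depth measurement (hole)
    sdf = depth[v, u] - z  # pixel depth minus distance from sensor to voxel
    if sdf < -trunc:
        return None  # far behind the observed surface: occluded, leave unchanged
    return min(1.0, sdf / trunc)
```

For a wall at 2 m and `trunc` = 0.2 m, a voxel at 1.9 m gets +0.5, one at 2.1 m gets -0.5, and the surface itself is the zero crossing between them.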

63 Mapping. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

64 Mapping. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

65 Mapping. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

66 Mapping. Each voxel also has a weight W that depends on the grazing angle of the measurement. The voxel value F(x) is the weighted average of all measurements; for two sensors, $F(x) = \frac{w_1(x)\, d_1(x) + w_2(x)\, d_2(x)}{w_1(x) + w_2(x)}$ and $W(x) = w_1(x) + w_2(x)$. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)
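
The weighted average F(x) = Σ w_i d_i / Σ w_i can be maintained incrementally, so past measurements never need to be stored: F ← (W·F + w·d) / (W + w) and W ← W + w. A minimal sketch, assuming each new measurement comes with a given weight w > 0 (how w is derived from the grazing angle is left open); the function name and the optional cap `w_max` are illustrative assumptions.

```python
import numpy as np


def fuse(F, W, d, w, w_max=None):
    """Fold one new measurement d with weight w > 0 into the running average F (weight W).

    Works elementwise on scalars or NumPy arrays (one entry per voxel).
    """
    F_new = (W * F + w * d) / (W + w)
    W_new = W + w
    if w_max is not None:
        W_new = np.minimum(W_new, w_max)  # cap so that new data can still change the model
    return F_new, W_new
```

Folding in d = 0.2, 0.4 and 0.6 with weights 1, 1 and 2 yields F = 0.45 and W = 4, the same as the batch weighted average.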

67 Handling Drift. Drift would occur if tracking were done from frame to frame. Thus, tracking is done against the built model. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

68 Pros & Cons. Pros: nice results; real-time performance (30 Hz); dense model; no drift with local optimization; elegant solution. Cons: the 3D grid cannot be trivially up-scaled. Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

69 Limitations: doesn't work for large areas (voxel grid); doesn't work far away from objects (active ranging); doesn't work well outdoors (IR); requires a powerful graphics card; uses lots of battery (active ranging). Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

70 Application: Body Reconstruction

71 What comes next? Overview of further teaching activities of the Department Augmented Vision

72 Other Courses. SS 2015: Lecture Computer Vision: Object and People Tracking, 4 CP, 2+1; Seminar 3D Computer Vision & Augmented Reality, 4 CP; Project 3D Computer Vision & Augmented Reality, 8 CP. WS 2015/16: Lecture 3D Computer Vision, 4 CP, 2+1; Seminar Computer Vision: Object and People Tracking, 4 CP; Project Computer Vision: Object and People Tracking, 8 CP. Individual topics and supervisors. We also offer student jobs and master theses in various areas (computer vision, 3D reconstruction, sensor fusion, HCI, etc.)! Just ask us!

3D Computer Vision: Depth Cameras. Prof. Didier Stricker, Oliver Wasenmüller. Kaiserslautern University, http://ags.cs.uni-kl.de/. DFKI Deutsches Forschungszentrum für Künstliche Intelligenz, http://av.dfki.de