3D Computer Vision. Depth Cameras. Prof. Didier Stricker. Oliver Wasenmüller

Kaiserslautern University, http://ags.cs.uni-kl.de/
DFKI, Deutsches Forschungszentrum für Künstliche Intelligenz, http://av.dfki.de

Content
- Motivation
- Depth Measurement Techniques
- Applications
  - Kinect Fusion
  - Body Reconstruction

What is a depth camera?
- A depth camera captures depth images.
- A depth image stores, in each pixel (x,y), the distance from the camera to the observed object.
- [Figure: color image and color-encoded depth image; a pixel (x,y) corresponds to a 3D point (x,y,z) seen from the camera center.]
- In the following slides, z denotes the depth.
- How did we capture depth in the previous lectures?

Depth from Stereo Images
[Figure: image 1, image 2, and the resulting dense disparity map.]
Parts of this slide are adapted from Derek Hoiem (University of Illinois), Steve Seitz (University of Washington) and Lana Lazebnik (University of Illinois)

Depth from Stereo Images
Goal: recover depth by finding the image coordinate x' that corresponds to x.
[Figure: 3D point X at depth z, projected to x and x' in two cameras with centers C and C', focal length f, and baseline B.]

Stereo and the Epipolar constraint
[Figure: candidate 3D points X along the viewing ray of x and their projections x' in the second image.]
- Potential matches for x have to lie on the corresponding epipolar line l'.
- Potential matches for x' have to lie on the corresponding epipolar line l.

Simplest Case: Parallel images
- Image planes of the cameras are parallel to each other and to the baseline.
- Camera centers are at the same height.
- Focal lengths are the same.
- Then, epipolar lines fall along the horizontal scan lines of the images.

Basic stereo matching algorithm
- For each pixel in the first image:
  - Find the corresponding epipolar line in the right image.
  - Examine all pixels on the epipolar line and pick the best match.
  - Triangulate the matches to get depth information.
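The per-pixel search above can be sketched for the rectified case, where the epipolar line is simply the same image row. This is a minimal illustration, not the method of any particular library: the function name `match_scanline`, the sum-of-absolute-differences cost, and the parameters `half_window` and `max_disparity` are assumptions made for the example.

```python
import numpy as np

def match_scanline(left, right, row, col, half_window=2, max_disparity=16):
    """Return the disparity of the best match for left[row, col].

    Assumes a rectified grayscale pair, so candidates lie on the same row
    of the right image. The caller must keep (row, col) at least
    half_window pixels away from the image border.
    """
    r0, r1 = row - half_window, row + half_window + 1
    patch = left[r0:r1, col - half_window:col + half_window + 1].astype(np.float64)
    best_disparity, best_cost = 0, np.inf
    for d in range(max_disparity + 1):
        c = col - d  # the match is shifted to the left in the right image
        if c - half_window < 0:
            break
        candidate = right[r0:r1, c - half_window:c + half_window + 1].astype(np.float64)
        cost = np.abs(patch - candidate).sum()  # sum of absolute differences
        if cost < best_cost:
            best_disparity, best_cost = d, cost
    return best_disparity
```

Picking the minimum cost independently per pixel is the simplest possible strategy; it is sensitive to textureless and repetitive regions, which is one motivation for the active techniques discussed later.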

Depth from disparity
[Figure: point X at depth z observed at x and x' by two cameras with centers O and O', focal length f, and baseline B.]
disparity $= x - x' = \frac{B \cdot f}{z}$
Disparity is inversely proportional to depth!
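Solving the disparity relation for z gives the depth directly. The helper below is a small sketch of that arithmetic; the function name and the choice of units (focal length in pixels, baseline in meters) are assumptions for illustration.

```python
def depth_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Depth z = f * B / disparity for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_length_px * baseline_m / disparity_px
```

Because depth is inversely proportional to disparity, a fixed disparity error of one pixel costs far more depth accuracy for distant points than for near ones.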

Depth Measurement Techniques
- Laser Scanner
- Structured Light Projection
- Time of Flight (ToF)
Parts of this slide are adapted from Victor Castaneda and Nassir Navab (both University of Munich)

Structured Light Projection
Source: https://www.youtube.com/watch?v=28jwgxbqx8w
Parts of this slide are adapted from Derek Hoiem (University of Illinois)

Structured Light Projection (see also the lectures on structured light)
[Figure: projector, surface, and sensor (camera).]

Example: Book vs. No Book
Source: http://www.futurepicture.org/?p=97

Region-growing Random Dot Matching
1. Detect dots ("speckles") and label them as unknown.
2. Randomly select a region anchor: a dot with unknown depth.
   a. Windowed search via normalized cross-correlation along the scanline. Check that the best match score is greater than a threshold; if not, mark the anchor as invalid and go to 2.
   b. Region growing:
      1. Neighboring pixels are added to a queue.
      2. For each pixel in the queue, initialize with the anchor's shift, then search a small local neighborhood; if matched, add its neighbors to the queue.
      3. Stop when no pixels are left in the queue.
3. Stop when all dots have known depth or are marked invalid.
http://www.wipo.int/patentscope/search/en/wo2007043036
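Steps 2a and 2b can be sketched as follows. This is a simplified illustration, not the procedure from the cited patent: it works per pixel rather than per detected dot, grows a single region from one given anchor, and stops at the horizontal shift instead of converting it to depth. All names (`ncc`, `best_shift`, `grow_region`) and parameter values are assumptions.

```python
from collections import deque
import numpy as np

def ncc(a, b):
    """Normalized cross-correlation of two equally sized windows."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else -1.0

def window(img, r, c, k):
    return img[r - k:r + k + 1, c - k:c + k + 1]

def best_shift(obs, ref, r, c, shifts, k):
    """Best (score, shift) for observed pixel (r, c) along its scanline."""
    w = ref.shape[1]
    best = (-1.0, None)
    for s in shifts:
        if k <= c - s < w - k:
            score = ncc(window(obs, r, c, k), window(ref, r, c - s, k))
            if score > best[0]:
                best = (score, s)
    return best

def grow_region(obs, ref, anchor, max_shift=8, k=3, threshold=0.8):
    """Grow one region from an anchor; NaN marks pixels without a shift."""
    h, w = obs.shape
    shift = np.full((h, w), np.nan)
    r, c = anchor  # must lie at least k pixels inside the image
    score, s = best_shift(obs, ref, r, c, range(-max_shift, max_shift + 1), k)
    if s is None or score < threshold:
        return shift  # anchor is invalid, nothing to grow
    shift[r, c] = s
    queue = deque([(r, c)])
    while queue:
        r, c = queue.popleft()
        s0 = int(shift[r, c])
        for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            inside = k <= nr < h - k and k <= nc < w - k
            if not inside or not np.isnan(shift[nr, nc]):
                continue
            # Initialize with the parent's shift and search only nearby shifts.
            score, s = best_shift(obs, ref, nr, nc, range(s0 - 1, s0 + 2), k)
            if s is not None and score >= threshold:
                shift[nr, nc] = s
                queue.append((nr, nc))
    return shift
```

The full scanline search runs only once per region; every other pixel tests just three candidate shifts, which is what makes region growing cheap.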

Projected IR vs. Natural Light Stereo
What are the advantages of IR?
- Works in low-light conditions
- Does not rely on having textured objects
- Not confused by repeated scene textures
- The algorithm can be tailored to the projected pattern
What are the advantages of natural light?
- Works outside, anywhere with sufficient light
- Uses less energy
- Resolution limited only by the sensors, not the projector
Difficulties with both:
- Very dark surfaces may not reflect enough light
- Specular reflection in mirrors or metal causes trouble

Example: The Kinect Sensor (v1)
Microsoft Kinect (v1) was released in 2010 as a new kind of controller for the Xbox 360.
Parts of this slide are adapted from Rob Miles (University of Hull)

Example: The Kinect Sensor
The Kinect is able to capture depth and color images. For this purpose it contains two cameras and an infrared projector. It also has four microphones.

Example: The Kinect Sensor
The Kinect sensor contains a high-quality video camera which can provide up to 1280x1024 resolution at 30 frames per second.

Example: The Kinect Sensor
[Figure: IR projector and IR camera.]
The Kinect depth sensor uses an IR projector and an IR camera to measure the depth of objects in the scene in front of the sensor.

Time of Flight (ToF)
Time-of-Flight (ToF) imaging refers to the process of measuring the depth of a scene by quantifying the changes that an emitted light signal undergoes when it bounces back from objects in the scene.
Two common principles:
- Pulsed Modulation
- Continuous Wave Modulation

Time of Flight (ToF): Pulsed Modulation
- Measure the distance to a 3D object by measuring the absolute time a light pulse needs to travel from a source into the 3D scene and back, after reflection.
- The speed of light is constant and known, $c \approx 3 \times 10^{8}$ m/s.
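Since the pulse travels to the object and back, the distance is half the round-trip time multiplied by the speed of light. The snippet below is a small sketch of that arithmetic; the function name is an assumption for illustration.

```python
SPEED_OF_LIGHT = 299_792_458.0  # m/s

def pulsed_tof_distance(round_trip_time_s):
    """Distance to the reflecting surface from the measured round-trip time."""
    return SPEED_OF_LIGHT * round_trip_time_s / 2.0
```

Inverting the relation shows why timing is the hard part: a depth resolution of 1 mm corresponds to a round-trip time difference of roughly 6.7 picoseconds.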

Time of Flight (ToF): Pulsed Modulation
Advantages:
- Direct measurement of time-of-flight
- High-energy light pulses limit the influence of background illumination
- Illumination and observation directions are collinear
Disadvantages:
- High-accuracy time measurement required
- Measurement of the light pulse return is inexact, due to light scattering
- Difficult to generate short light pulses with fast rise and fall times
- Usable light sources (e.g. lasers) suffer from low pulse repetition rates

Time of Flight (ToF): Continuous Wave Modulation
Microsoft Kinect v2 works with this principle.
- Continuous light waves instead of short light pulses
- Modulation in terms of the frequency of sinusoidal waves
- The detected wave after reflection has a shifted phase
- The phase shift is proportional to the distance from the reflecting surface

Time of Flight (ToF): Continuous Wave Modulation
- Retrieve the phase shift by demodulating the received signal.
- Demodulation is done by cross-correlating the received signal with the emitted signal.
- Emitted sinusoidal signal: [equation]
- Received signal after reflection from the 3D surface: [equation]
- Cross-correlation of both signals: [equation]

Time of Flight (ToF): Continuous Wave Modulation
- The cross-correlation function simplifies to: [equation]
- Sample it at four sequential instants with different phase offsets: [equation]
- The sought parameters are then obtained directly: [equation]
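The equations for this derivation are not reproduced in the text, so the sketch below uses a commonly seen "four-sample" formulation as an assumption rather than the slide's exact notation. It models the correlation as c(tau) = (a/2) cos(omega tau + phi) + b and samples it at omega tau = 0, pi/2, pi, 3pi/2 to obtain A0..A3. Then phi = atan2(A3 - A1, A0 - A2), a = sqrt((A3 - A1)^2 + (A0 - A2)^2), b = (A0 + A1 + A2 + A3) / 4, and the distance is d = c phi / (4 pi f) for modulation frequency f. The function names and the simulation helper are hypothetical.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s

def simulate_samples(distance_m, mod_freq_hz, amplitude=1.0, offset=0.5):
    """Ideal, noise-free correlation samples A0..A3 under the assumed model."""
    phase = 4 * math.pi * mod_freq_hz * distance_m / SPEED_OF_LIGHT
    return [amplitude / 2 * math.cos(i * math.pi / 2 + phase) + offset
            for i in range(4)]

def decode_four_samples(a0, a1, a2, a3, mod_freq_hz):
    """Recover (distance, amplitude, offset) from the four samples."""
    phase = math.atan2(a3 - a1, a0 - a2) % (2 * math.pi)  # wrap to [0, 2*pi)
    amplitude = math.hypot(a3 - a1, a0 - a2)
    offset = (a0 + a1 + a2 + a3) / 4
    distance = SPEED_OF_LIGHT * phase / (4 * math.pi * mod_freq_hz)
    return distance, amplitude, offset
```

Because the phase wraps at 2 pi, distances are only unambiguous up to c / (2 f), which is about 5 m at a modulation frequency of 30 MHz.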

Time of Flight (ToF): Continuous Wave Modulation
Advantages:
- Variety of light sources available, as no short/strong pulses are required
- Applicable to different modulation techniques (other than frequency)
- Simultaneous range and amplitude images
Disadvantages:
- In practice, integration over time is required to reduce noise
- Frame rates are limited by the integration time
- Motion blur caused by long integration times

Depth Quality, e.g. Kinect v1
Source: http://vision.in.tum.de/data/datasets/rgbd-dataset
Main problems:
- Resolution
- Noise

Depth Camera
Disadvantages:
1. High noise (+/-15 mm)
2. Low resolution (176 x 144)
3. High distortion
Advantages:
1. Real-time capture
2. Video frames with 2D and 3D information
[Figure: variance distribution in a depth image taken at approx. 1.5 m average distance from a scene.]
Depth images contain heavy noise near the corners.

Applications
- Kinect Fusion
- Body Reconstruction

Kinect Fusion
Paper: ACM Symposium on User Interface Software and Technology, October 2011

Challenges
- Tracking the camera precisely
- Fusing and de-noising measurements (depth estimates)
- Avoiding drift
- Real-time
- Low-cost hardware
Parts of this slide are adapted from Richard A. Newcombe (Imperial College London) and Boaz Petersil (Israel Institute of Technology)

Proposed Solution
- Fast optimization for tracking, made possible by the high frame rate
- Global framework for fusing data
- Interleaving tracking and mapping
- Using Kinect to get depth data (low cost)
- Using the GPU to get real-time performance (low cost)

Method

KinectFusion: Depth map
- Standard structured-light model: L projects a light spot P onto an object surface, and O observes the spot.
- Given b, α and β, solve the triangle (OPL) for d.
- Result: depth image (VGA, 11-bit)
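Solving the triangle (OPL) can be sketched with the law of sines. The slide's figure is not available, so the geometry here is an assumption: b is the length of the baseline between L and O, α is the angle at L between the baseline and the projected ray, β is the angle at O between the baseline and the viewing ray, and d is the perpendicular distance of P from the baseline. Under these assumptions d = b sin(α) sin(β) / sin(α + β).

```python
import math

def triangulate_depth(baseline, alpha, beta):
    """Perpendicular distance of P from the baseline (angles in radians)."""
    if not 0.0 < alpha + beta < math.pi:
        raise ValueError("the two rays do not intersect in front of the baseline")
    return baseline * math.sin(alpha) * math.sin(beta) / math.sin(alpha + beta)
```

For example, with a baseline of 2 and both angles at 45 degrees the triangle is right-angled at P, and the formula gives d = 1.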

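KinectFusion: Vertex and normal map
- The vertex map is a 3D point cloud: $v(u) = D(u)\, K^{-1} [u, 1]^{T}$, where $v(u)$ is the 3D vertex, $D(u)$ the depth, $u$ the 2D depth point, and $K$ the intrinsic matrix of the IR camera.
- The normal vector indicates the direction of the surface at a vertex. It is the cross product of the vectors to neighboring points: $n(u) = (v(x+1, y) - v(x, y)) \times (v(x, y+1) - v(x, y))$.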
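Both maps can be computed with a few array operations. The sketch below assumes a pinhole intrinsic matrix K and a depth image in metric units; the function names are hypothetical, and the normals are additionally scaled to unit length.

```python
import numpy as np

def vertex_map(depth, K):
    """Back-project every pixel u = (x, y): v(u) = D(u) * K^-1 * [x, y, 1]^T."""
    h, w = depth.shape
    x, y = np.meshgrid(np.arange(w), np.arange(h))
    pixels = np.stack([x, y, np.ones_like(x)], axis=-1).astype(np.float64)
    rays = pixels @ np.linalg.inv(K).T  # one ray per pixel, shape (h, w, 3)
    return depth[..., None] * rays

def normal_map(vertices):
    """Unit normals from the cross product of differences to neighbors."""
    dx = vertices[:-1, 1:] - vertices[:-1, :-1]  # v(x+1, y) - v(x, y)
    dy = vertices[1:, :-1] - vertices[:-1, :-1]  # v(x, y+1) - v(x, y)
    n = np.cross(dx, dy)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.where(norm > 0, norm, 1.0)
```

The normal map is one row and one column smaller than the vertex map, because the last row and column have no neighbor to difference against.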

KinectFusion: Camera tracking
- Small motion between consecutive positions of the Kinect
- Find correspondences using projective data association
- Estimate the camera pose $T_i$ by applying the ICP algorithm to the vertex and normal maps

Tracking
Finding the camera position is the same as fitting the depth map of a frame onto the model.

Tracking: ICP algorithm
- ICP = iterative closest point
- Goal: fit two 3D point sets (already explained in the Structured Light lecture)
- Problem: what are the correspondences?
- KinectFusion's chosen solution:
  1) Start with $T_0$
  2) Project the model onto the camera
  3) Correspondences are points with the same image coordinates
  4) Find a new T with least squares (on the 3D-3D point pairs)
  5) Apply T, and repeat 2-5 until convergence
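Step 4 can be illustrated with the closed-form least-squares alignment of corresponding 3D points (the SVD-based, point-to-point solution). This is a swapped-in textbook technique chosen for brevity, not necessarily the error metric used in KinectFusion itself, and the function name is an assumption.

```python
import numpy as np

def rigid_transform(source, target):
    """Least-squares R, t with R @ source[i] + t close to target[i].

    source and target are (N, 3) arrays of corresponding points.
    """
    src_c = source.mean(axis=0)
    tgt_c = target.mean(axis=0)
    H = (source - src_c).T @ (target - tgt_c)  # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = tgt_c - R @ src_c
    return R, t
```

Inside the ICP loop this solve is repeated: after applying the new transform, correspondences are recomputed by projection and the alignment is estimated again until the update becomes negligible.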

Tracking: ICP algorithm
Assumption: frame and model are roughly aligned. This holds because of the high frame rate.

Mapping
Mapping is fusing depth maps when the camera poses are known.
Problems:
- Measurements are noisy
- Depth maps have holes
Solution:
- Use an implicit surface representation
- Fusing: estimates from all frames contribute

Mapping: surface representation
- The surface is represented implicitly using a Truncated Signed Distance Function (TSDF) on a voxel grid.
- The numbers in the cells measure the voxel's distance to the surface.

KinectFusion: Volumetric integration
- Volumetric representation (3 × 3 × 3 m, 512 voxels per axis)
- TSDF volume grid: tsdf(g) lies in (0,1] outside the surface, tsdf(g) = 0 on the surface, and tsdf(g) lies in [-1,0) inside the surface.

Mapping
d = [pixel depth] - [distance from sensor to voxel]

Mapping
Last step: the voxel value D is the weighted average of all measurements.
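The weighted average can be maintained incrementally, so no past measurement has to be stored. The sketch below assumes a truncation distance (here called `truncation`) to scale d into [-1, 1] and a constant weight of 1 per measurement; both choices and the function names are illustrative assumptions.

```python
import numpy as np

def truncated_sdf(pixel_depth, voxel_distance, truncation):
    """Signed distance d = pixel depth - voxel distance, scaled and clipped."""
    return np.clip((pixel_depth - voxel_distance) / truncation, -1.0, 1.0)

def fuse(tsdf, weight, new_tsdf, new_weight=1.0):
    """Running weighted average: returns the updated (tsdf, weight)."""
    total = weight + new_weight
    fused = (weight * tsdf + new_weight * new_tsdf) / total
    return fused, total
```

Averaging over many frames is what suppresses the sensor noise: each new measurement moves a voxel's value only by its share of the accumulated weight.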

Handling drift
- Drift would occur if tracking were done from frame to frame.
- Thus, tracking is done against the built model.

KinectFusion: Surface rendering
Ray-casting technique:
- Cast a ray through the focal point for each pixel
- Traverse the voxels along the ray
- Find the first surface by observing the sign change of tsdf(g)
- Compute the intersection point using the points around the surface boundary
[Figure: ray traversing the TSDF volume grid and intersecting the surface.]
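The sign-change search along a single ray can be sketched in one dimension. The example assumes the TSDF has already been sampled at equally spaced positions along the ray, and it locates the intersection by linear interpolation between the two samples around the zero crossing; the function name and the interpolation scheme are assumptions for illustration.

```python
def first_surface_crossing(tsdf_samples, step):
    """Distance along the ray to the first outside-to-inside sign change.

    tsdf_samples[i] is the TSDF value at distance i * step along the ray.
    Returns None if the ray never crosses a surface.
    """
    for i in range(len(tsdf_samples) - 1):
        f0, f1 = tsdf_samples[i], tsdf_samples[i + 1]
        if f0 > 0 and f1 <= 0:
            # Linear interpolation of the zero crossing between the samples.
            return (i + f0 / (f0 - f1)) * step
    return None
```

Only positive-to-negative transitions are accepted, so a ray that starts inside an object and exits it does not produce a hit on the back face.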

Pros & Cons
Pros:
- Nice results
- Real-time performance (30 Hz)
- Dense model
- No drift with local optimization
- Elegant solution
Cons:
- The 3D grid cannot be trivially up-scaled

Limitations
- Doesn't work for large areas (voxel grid)
- Doesn't work far away from objects (active ranging)
- Doesn't work well outdoors (IR)
- Requires a powerful graphics card
- Uses a lot of battery (active ranging)

Thank you!
28.01.2015