3D Reconstruction of Dynamic Textures with Crowd Sourced Data Dinghuang Ji, Enrique Dunn and Jan-Michael Frahm 1
Background Large scale scene reconstruction Internet imagery 3D point cloud Dense geometry 2
Motivation No man ever steps in the same river twice. --Heraclitus No local patch ever appears in the same fountain twice 3
Goal Bring static scene reconstruction alive 3D shape of the dynamic scene elements More realistic (dynamic) visualizations 4
Related works Nelson, R., Polana, R.: Qualitative recognition of motion using temporal texture. CVGIP: Image Understanding (1992) Activities Motion Events Dynamic textures 5
Related works Reconstruction and Rendering of Time-Varying Natural Phenomena. PhD thesis, Ivo Ihrke, 2007. Modeling Dynamic Scenes Recorded with Freely Moving Cameras. Taneja et al., ECCV 2010. What Shape are Dolphins? Building 3D Morphable Models from 2D Images. Cashman et al., PAMI 2012. 6
Framework Data acquisition Rough model estimation Closed-loop modelling 7
Image based Scene Reconstruction Generate the static background and obtain camera parameters. Trevi fountain Mooney waterfall Navagio beach Piccadilly circus billboard 9
Video Frame selection Select sequential video frames that contain stable dynamic motion. Example cases: large viewpoint change, good frame sequence, heavy occlusion (frame samples 1, n, m) 10
Selected video sequences 11
Video Frame selection Extract HOG features and use NCC to measure the similarity between frames. HOG: Histogram of Oriented Gradients (for each local cell, a histogram of gradient orientations). NCC: Normalized Cross-Correlation. 12
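The HOG and NCC comparison can be sketched as follows. This is a minimal illustration, not the authors' implementation: the descriptor is a simplified HOG (per-cell orientation histograms without block normalization), and the cell size, bin count, and function names are assumptions made for the example.

```python
import numpy as np

def cell_orientation_histograms(gray, cell=8, bins=9):
    """Simplified HOG: per-cell histograms of unsigned gradient orientation,
    weighted by gradient magnitude (no block normalization)."""
    gray = np.asarray(gray, dtype=np.float64)
    gy, gx = np.gradient(gray)
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # unsigned angle in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)
    rows, cols = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell),
                  slice(c * cell, (c + 1) * cell))
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(),
                                     minlength=bins)
    return hist.ravel()

def ncc(a, b, eps=1e-12):
    """Normalized cross-correlation of two descriptors, in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

# Usage: compare two frames; a score near 1 means similar appearance.
rng = np.random.default_rng(0)
frame_a = rng.random((32, 32))
frame_b = frame_a + 0.01 * rng.random((32, 32))
score = ncc(cell_orientation_histograms(frame_a),
            cell_orientation_histograms(frame_b))
```

A run of consecutive frames whose pairwise scores stay high would be a candidate "good frame sequence"; the acceptance threshold is not given on the slide.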
Framework Data acquisition Rough model estimation Closed-loop modelling 13
Rough model estimation Selected frame sequences Dynamic texture segmentation Shape-from-Silhouettes 14
Foreground mask from videos Input video sequence, split into input video fragments 1 to 5, each producing a final mask 15
Foreground mask from videos Homography-based video stabilization 16
Foreground mask from videos Accumulated frame differencing 17
Foreground mask from videos Otsu thresholding and morphological operations 18
Foreground mask from videos Remove small connected regions (final mask) 19
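The differencing and thresholding steps can be sketched in NumPy as below. This is an illustrative sketch under stated assumptions: the frames are taken to be already stabilized, the morphological cleanup and small-region removal are omitted, and the function names are invented for the example.

```python
import numpy as np

def accumulated_difference(frames):
    """Sum of absolute differences between consecutive (stabilized) frames."""
    frames = np.asarray(frames, dtype=np.float64)
    return np.abs(np.diff(frames, axis=0)).sum(axis=0)

def otsu_threshold(values, bins=256):
    """Threshold that maximizes the between-class variance of the histogram."""
    hist, edges = np.histogram(values.ravel(), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist).astype(np.float64)   # pixels at or below each bin
    w1 = w0[-1] - w0                          # pixels above each bin
    cum_mean = np.cumsum(hist * centers)
    with np.errstate(divide="ignore", invalid="ignore"):
        m0 = cum_mean / w0
        m1 = (cum_mean[-1] - cum_mean) / w1
        between = w0 * w1 * (m0 - m1) ** 2
    between = np.nan_to_num(between)          # empty classes score zero
    k = int(np.argmax(between))
    return edges[k + 1]                       # upper edge of the chosen bin

def foreground_mask(frames):
    """Pixels whose accumulated change exceeds the Otsu threshold."""
    acc = accumulated_difference(frames)
    return acc > otsu_threshold(acc)

# Usage: a static background with one flickering 5x5 region.
frames = np.full((6, 20, 20), 0.5)
frames[::2, 5:10, 5:10] = 0.0
frames[1::2, 5:10, 5:10] = 1.0
mask = foreground_mask(frames)  # True only inside the flickering region
```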
Background mask from videos Feature matches between neighboring video frames Remove feature matches in foreground mask Estimate concave hull mask 20
Background mask estimation Alpha shape method Find the boundary of a set of points 21
Graph-cut segmentation mask refinement (green: static background, red: dynamic foreground). Panels: original image, foreground mask, background mask 22
Graph-cut segmentation Two-label image segmentation, solved with the min-cut/max-flow method 23
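To make the two-label min-cut formulation concrete, here is a small self-contained sketch. The slides do not state the energy terms, so the per-pixel costs and the constant smoothness weight below are assumptions, and the solver is a plain Edmonds-Karp max-flow that is only practical for tiny images.

```python
import numpy as np
from collections import deque

def binary_graph_cut(fg_cost, bg_cost, smoothness=1.0):
    """Label each pixel 1 (foreground) or 0 (background) by min-cut.
    fg_cost / bg_cost: penalty for labelling a pixel foreground / background.
    smoothness: penalty for each pair of 4-neighbours with different labels."""
    h, w = fg_cost.shape
    n = h * w
    s, t = n, n + 1                      # source = foreground, sink = background
    cap = np.zeros((n + 2, n + 2))
    for y in range(h):
        for x in range(w):
            p = y * w + x
            cap[s, p] = bg_cost[y, x]    # cut when p lands on the sink side
            cap[p, t] = fg_cost[y, x]    # cut when p lands on the source side
            for dy, dx in ((0, 1), (1, 0)):
                yy, xx = y + dy, x + dx
                if yy < h and xx < w:
                    q = yy * w + xx
                    cap[p, q] = cap[q, p] = smoothness
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = np.full(n + 2, -1)
        parent[s] = s
        queue = deque([s])
        while queue and parent[t] == -1:
            u = queue.popleft()
            for v in np.nonzero(cap[u] > 1e-12)[0]:
                if parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            break                        # no path left: the flow is maximal
        flow, v = np.inf, t
        while v != s:                    # bottleneck capacity along the path
            flow = min(flow, cap[parent[v], v])
            v = parent[v]
        v = t
        while v != s:                    # push the flow, update residuals
            u = parent[v]
            cap[u, v] -= flow
            cap[v, u] += flow
            v = u
    # Pixels still reachable from the source form the foreground side.
    return (parent[:n] != -1).reshape(h, w).astype(int)

# Usage: a 1x6 strip whose third pixel weakly prefers background.
fg = np.array([[1, 1, 6, 1, 9, 9]])
bg = np.array([[9, 9, 4, 9, 1, 1]])
labels = binary_graph_cut(fg, bg, smoothness=5)
```

With `smoothness=0` each pixel simply takes its cheaper label; a positive weight lets neighbours overrule a weak per-pixel preference, which is the effect the mask refinement relies on.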
Initial model generation Silhouettes from videos Shape-from-Silhouettes 24
Classic Shape from silhouettes 25
Shape from silhouettes Problem: some of the silhouettes are incomplete, which carves away valid parts of the reconstructed object. 29
Shape-from-Silhouettes: accumulative volume 30
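The difference between classic carving and an accumulative volume can be sketched as below. The exact accumulation rule is not spelled out on the slide; this sketch assumes a voxel is kept when the fraction of silhouettes containing its projection reaches a chosen ratio. The 3x4 projection matrices, grid, and function names are illustrative.

```python
import numpy as np

def silhouette_votes(voxels, cameras, masks):
    """For each voxel, count the silhouettes that contain its projection.
    voxels: (N, 3) array; cameras: 3x4 projection matrices; masks: 2D bool arrays."""
    votes = np.zeros(len(voxels), dtype=int)
    homog = np.hstack([voxels, np.ones((len(voxels), 1))])
    for P, mask in zip(cameras, masks):
        proj = homog @ P.T
        uv = np.round(proj[:, :2] / proj[:, 2:3]).astype(int)
        h, w = mask.shape
        inside = ((uv[:, 0] >= 0) & (uv[:, 0] < w) &
                  (uv[:, 1] >= 0) & (uv[:, 1] < h))
        hit = np.zeros(len(voxels), dtype=bool)
        hit[inside] = mask[uv[inside, 1], uv[inside, 0]]
        votes += hit
    return votes

def classic_carving(votes, n_views):
    """Keep a voxel only if every silhouette contains it."""
    return votes == n_views

def accumulative_carving(votes, n_views, ratio):
    """Keep a voxel if at least `ratio` of the silhouettes contain it."""
    return votes >= ratio * n_views

# Usage: two orthographic views see a 4x4 square; a third view has an
# incomplete (here empty) silhouette.
g = np.arange(8)
voxels = np.stack(np.meshgrid(g, g, g, indexing="ij"), axis=-1).reshape(-1, 3).astype(float)
P_z = np.array([[1.0, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1]])  # looks along z
P_x = np.array([[0.0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1]])  # looks along x
full = np.zeros((8, 8), dtype=bool)
full[2:6, 2:6] = True
empty = np.zeros((8, 8), dtype=bool)
votes = silhouette_votes(voxels, [P_z, P_x, P_z], [full, full, empty])
classic = classic_carving(votes, 3)            # the empty mask carves everything
robust = accumulative_carving(votes, 3, 0.6)   # the 4x4x4 block survives
```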
Visualization with texture Static background + rough model 31
Project back to 2D images Classic Shape from silhouettes Shape from silhouettes fusion 32
Framework Data acquisition Rough model estimation Closed-loop modelling 33
Why do we use Flickr images? 1. Their camera parameters, generated in the static reconstruction, can be reused. 2. YouTube videos usually have lower resolution (60% of videos are below 360 × 480). 3. Isolated images widen the camera distribution, which is critical for shape-from-silhouettes methods. 34
Why do we use Flickr images? 1500 registered video frames: 14658 3D points, covering a range of 135 degrees. 800 registered Flickr images: 68392 3D points, covering a range of 287 degrees. 35
Closed-loop modelling Rough model Project to Flickr images Generate a new model iteration 36
Project initial model to photo collections 37
Background mask of images Original image in photo-collection Nearest-neighbor in GIST feature space 38
Closed Loop 3D Shape Refinement Iterations 1 to 9, each shown in frontal view and top view 39-47
Problem: over-segmentation (frontal view, top view) 48-49
Shape-from-Silhouettes: two-way carving Shape-from-silhouettes with the foreground masks, keep only the occupied voxels, then shape-from-silhouettes with the background masks 50
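One possible reading of two-way carving is sketched below: a first pass keeps voxels supported by enough foreground masks, and a second pass removes surviving voxels that fall inside too many background masks. The slide gives no thresholds or exact rule, so both ratios and the function name are hypothetical.

```python
import numpy as np

def two_way_carving(fg_votes, bg_votes, n_views, fg_ratio=0.5, bg_ratio=0.5):
    """fg_votes / bg_votes: per-voxel count of views whose foreground /
    background mask contains the voxel's projection.
    fg_ratio and bg_ratio are hypothetical thresholds for this sketch."""
    fg_votes = np.asarray(fg_votes)
    bg_votes = np.asarray(bg_votes)
    occupied = fg_votes >= fg_ratio * n_views   # pass 1: foreground carving
    rejected = bg_votes > bg_ratio * n_views    # pass 2: background carving
    return occupied & ~rejected

# Usage: four voxels observed in 10 views.
keep = two_way_carving(fg_votes=[10, 9, 2, 8], bg_votes=[0, 8, 0, 1], n_views=10)
```

In this example the second voxel passes the foreground test but is removed because most views also place it in the static background, which is the over-segmentation case the second pass is meant to correct.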
Problem Uneven camera distribution 51
Shape-from-Silhouettes Weighted carving 52
Shape-from-Silhouettes Weighted carving [Figure: histogram of camera count (camera #) per viewing-angle bin: [0,30], [30,60], [60,90], [90,120], [120,150], [150,180]] 53
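A minimal sketch of weighted carving follows. The weighting formula is not given on the slide; this example assumes inverse-frequency weights over 30-degree viewing-angle bins, so that a dense cluster of cameras does not outvote a sparsely covered direction. All names are illustrative.

```python
import numpy as np

def angle_bin_weights(angles_deg, bin_width=30.0, max_angle=180.0):
    """Weight each camera by 1 / (number of cameras in its angle bin),
    then normalize so the weights sum to 1. Assumed scheme, see lead-in."""
    angles = np.asarray(angles_deg, dtype=float)
    n_bins = int(round(max_angle / bin_width))
    idx = np.minimum((angles // bin_width).astype(int), n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    weights = 1.0 / counts[idx]
    return weights / weights.sum()

def weighted_occupancy(hits, weights):
    """hits: (n_views, n_voxels) bool array, True where a view's silhouette
    contains the voxel. Returns the weighted support of each voxel in [0, 1]."""
    return weights @ np.asarray(hits, dtype=float)

# Usage: four cameras cluster in [0,30], one camera sits alone in [90,120].
weights = angle_bin_weights([5, 10, 15, 20, 100])
hits = np.array([[1, 1],
                 [1, 1],
                 [1, 1],
                 [1, 1],
                 [0, 1]], dtype=bool)
support = weighted_occupancy(hits, weights)
```

Unweighted, the first voxel would have support 4/5; with bin weights the lone camera counts as much as the whole cluster, so its support drops to one half.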
Results Piccadilly circus without weighting Piccadilly circus with weighting Navagio beach without weighting Navagio beach with weighting 54
Experiments: Implementation details The first iteration uses an intersection ratio of 0.10, which is increased by a small amount (0.03) in each iteration. To ensure convergence, we use a subset of wide field-of-view images and test how much their segmentation changes. The rough initial model is generated from 15 to 30 video frames. Refinement usually finishes within 10 iterations, in less than 5 hours. 55
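The ratio schedule and the convergence test can be sketched as below. The start value 0.10 and step 0.03 come from the slide; the change measure (fraction of pixels whose label flips) and the tolerance are assumptions for illustration.

```python
import numpy as np

def intersection_ratio(iteration, start=0.10, step=0.03):
    """Intersection ratio used at a given 1-based iteration."""
    return start + step * (iteration - 1)

def segmentation_change(prev_masks, new_masks):
    """Fraction of pixels whose label differs between two sets of masks
    (computed over the wide field-of-view test subset)."""
    changed = sum(int(np.count_nonzero(p != q))
                  for p, q in zip(prev_masks, new_masks))
    total = sum(p.size for p in prev_masks)
    return changed / total

def has_converged(prev_masks, new_masks, tol=0.01):
    """tol is a hypothetical tolerance, not a value from the slides."""
    return segmentation_change(prev_masks, new_masks) < tol

# Usage: ratios for the first ten iterations, from 0.10 up to about 0.37.
ratios = [intersection_ratio(i) for i in range(1, 11)]
```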
Experiments: Comparisons PMVS by Y. Furukawa et al.: multi-view stereo method for rigid structures. CMPMVS by M. Jancosek et al.: multi-view stereo method that shows good results for weakly supported surfaces, e.g. water surfaces. 56
Experiments: Dataset 57
Keyframes sampled every 50 frames.
Dataset                      Videos downloaded  Images downloaded  Keyframes extracted  Images used for model refinement
Trevi Fountain               481                6000               68629                810
Navagio Beach                300                1000               45823                520
Piccadilly Circus Billboard  460                5000               75983                496
Mooney Falls                 200                1000               17850                723
Demos 59
Conclusions Initial trials on exploration of dynamic 3D reconstruction 3D reconstruction framework for Dynamic texture Robust shape-from-silhouettes method Dynamic texture cosegmentation 60