3D Reconstruction of Dynamic Textures with Crowd Sourced Data. Dinghuang Ji, Enrique Dunn and Jan-Michael Frahm

Background: large-scale scene reconstruction. Internet imagery, 3D point cloud, dense geometry 2

Motivation No man ever steps in the same river twice. --Heraclitus No local patch ever appears in the same fountain twice 3

Goal: bring static scene reconstructions to life. 3D shape of the dynamic scene elements; more realistic (dynamic) visualizations 4

Related works: Nelson, R., Polana, R.: Qualitative recognition of motion using temporal texture. CVGIP: Image Understanding (1992). Activities, motion events, dynamic textures 5

Related works: Reconstruction and Rendering of Time-Varying Natural Phenomena. PhD thesis of Ivo Ihrke, 2007. Modeling Dynamic Scenes Recorded with Freely Moving Cameras. Taneja et al. ECCV 2010. What Shape are Dolphins? Building 3D Morphable Models from 2D Images. Cashman et al. PAMI 2012 6

Framework Data acquisition Rough model estimation Closed-loop modelling 7

Image-based scene reconstruction: generate the static background and obtain camera parameters. Scenes: Trevi fountain, Mooney waterfall, Navagio beach, Piccadilly circus billboard 9

Video frame selection: select sequential video frames that contain stable dynamic motion. Figure labels: large viewpoint change, good frame sequence, heavy occlusion; frame sample 1, frame sample n, frame sample m 10

Selected video sequences 11

Video frame selection: extract HOG (Histogram of Oriented Gradients) features and use NCC (Normalized Cross Correlation) to measure frame similarity. Figure labels: local cell, histogram of orientation 12
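
The HOG + NCC comparison can be sketched as follows. This is a minimal illustration, not the authors' implementation: the simplified descriptor (per-cell histograms, no block normalization), the cell size, the bin count, and the function names `hog_descriptor` and `ncc` are all assumptions made for this example.

```python
import numpy as np

def hog_descriptor(gray, cell=8, bins=9):
    """Simplified HOG: one unsigned-orientation histogram per cell (assumed variant)."""
    gray = gray.astype(np.float64)
    gy, gx = np.gradient(gray)                      # derivatives along rows, columns
    mag = np.hypot(gx, gy)                          # gradient magnitude
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0    # unsigned orientation in degrees
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    rows, cols = gray.shape[0] // cell, gray.shape[1] // cell
    hist = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell), slice(c * cell, (c + 1) * cell))
            # Magnitude-weighted histogram of orientations inside this local cell.
            hist[r, c] = np.bincount(bin_idx[sl].ravel(),
                                     weights=mag[sl].ravel(), minlength=bins)
    return hist.ravel()

def ncc(a, b, eps=1e-12):
    """Normalized cross correlation of two descriptor vectors, in [-1, 1]."""
    a = a - a.mean()
    b = b - b.mean()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def frame_similarity(frame_a, frame_b):
    """Similarity score between two grayscale frames."""
    return ncc(hog_descriptor(frame_a), hog_descriptor(frame_b))
```

A frame sequence with consistently high scores between consecutive frames would be a candidate "good frame sequence"; any cut-off on the score is a tuning choice that the slides do not specify.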

Framework Data acquisition Rough model estimation Closed-loop modelling 13

Rough model estimation: selected frame sequences, dynamic texture segmentation, shape-from-silhouettes 14

Foreground mask from videos: input video sequence. Figure panels: input video fragment (panels 1 2 3 4 5), final mask 15

Foreground mask from videos: homography-based video stabilization 16

Foreground mask from videos: accumulated frame differencing 17

Foreground mask from videos: Otsu thresholding and morphological operations 18

Foreground mask from videos: removal of small connected regions (final mask) 19

Background mask from videos: (1) find feature matches between neighboring video frames; (2) remove feature matches inside the foreground mask; (3) estimate the concave hull mask 20

Background mask estimation: alpha shape method, which finds the boundary of a set of points 21

Graph-cut segmentation mask refinement (green: static background, red: dynamic foreground). Figure panels: original image, foreground mask, background mask 22

Graph-cut segmentation: two-label image segmentation, solved with the min-cut/max-flow method 23
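
A two-label min-cut can be sketched on a toy problem. The slides do not give the energy terms, so everything here is an assumption for illustration: the example labels a 1-D row of pixels instead of an image, takes the per-pixel costs as inputs, uses a constant smoothness penalty between neighbors, and solves the max-flow with a plain Edmonds-Karp search rather than whatever solver the authors used.

```python
from collections import deque

def min_cut_labels(fg_cost, bg_cost, smooth):
    """Return True (foreground) or False (background) per pixel of a 1-D row."""
    n = len(fg_cost)
    s, t = n, n + 1                                   # source = foreground, sink = background
    cap = [[0.0] * (n + 2) for _ in range(n + 2)]     # residual capacities
    for p in range(n):
        cap[s][p] = bg_cost[p]    # this edge is cut if p is labelled background
        cap[p][t] = fg_cost[p]    # this edge is cut if p is labelled foreground
        if p + 1 < n:
            cap[p][p + 1] = cap[p + 1][p] = smooth    # penalty for a label change

    def bfs():
        """Breadth-first search from the source over edges with residual capacity."""
        parent = [-1] * (n + 2)
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n + 2):
                if parent[v] == -1 and cap[u][v] > 1e-12:
                    parent[v] = u
                    queue.append(v)
        return parent

    while True:
        parent = bfs()
        if parent[t] == -1:       # no augmenting path left: the flow is maximal
            break
        flow, v = float("inf"), t
        while v != s:             # bottleneck capacity along the path
            flow = min(flow, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:             # push the flow and update residual capacities
            u = parent[v]
            cap[u][v] -= flow
            cap[v][u] += flow
            v = u

    reachable = bfs()             # pixels still connected to the source are foreground
    return [reachable[p] != -1 for p in range(n)]

# Hypothetical foreground evidence on a 0..10 scale for eight pixels.
evidence = [1, 1, 6, 1, 9, 9, 4, 9]
labels = min_cut_labels([10 - v for v in evidence], evidence, smooth=3)
# labels == [False, False, False, False, True, True, True, True]
```

With `smooth=0` the result reduces to per-pixel thresholding; the smoothness term is what removes the isolated foreground pixel at index 2 and fills the gap at index 6.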

Initial model generation Silhouettes from videos Shape-from-Silhouettes 24

Classic Shape from silhouettes 25

Shape from silhouettes, problem: some of the silhouettes are incomplete, which carves away valid parts of the reconstructed object. 29

Shape-from-Silhouettes: accumulative volume 30
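
The difference between classic carving and an accumulative volume can be sketched on a toy voxel grid. This is an illustration under strong assumptions, not the authors' method: cameras are axis-aligned orthographic views, each silhouette is a boolean mask, and `min_ratio` is a hypothetical parameter giving the fraction of views that must agree for a voxel to be kept.

```python
import numpy as np

def accumulate_votes(silhouettes, shape):
    """Count, per voxel, how many silhouettes contain its projection.

    silhouettes: list of (axis, 2-D boolean mask) pairs, one per orthographic view.
    """
    votes = np.zeros(shape, dtype=int)
    for axis, mask in silhouettes:
        # Broadcasting the mask along the viewing axis projects every voxel onto it.
        votes += np.expand_dims(mask, axis).astype(int)
    return votes

def classic_hull(silhouettes, shape):
    """Classic carving: a voxel survives only if every silhouette contains it."""
    return accumulate_votes(silhouettes, shape) == len(silhouettes)

def accumulative_hull(silhouettes, shape, min_ratio):
    """Accumulative volume: a voxel survives if enough silhouettes contain it."""
    return accumulate_votes(silhouettes, shape) >= min_ratio * len(silhouettes)

# Toy scene: a 4x4x4 cube inside an 8x8x8 grid, seen by four views.
full = np.zeros((8, 8), dtype=bool)
full[2:6, 2:6] = True
incomplete = full.copy()
incomplete[2:4, :] = False                # this silhouette misses half of the object
views = [(0, full), (1, full), (2, full), (0, incomplete)]

print(classic_hull(views, (8, 8, 8)).sum())             # 32: half the cube is carved away
print(accumulative_hull(views, (8, 8, 8), 0.75).sum())  # 64: the full cube survives
```

A single incomplete silhouette is enough to remove valid voxels under strict intersection, whereas the vote-based volume tolerates it as long as enough other views agree.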

Visualization with texture Static background + rough model 31

Project back to 2D images: classic shape from silhouettes, shape from silhouettes fusion 32

Framework Data acquisition Rough model estimation Closed-loop modelling 33

Why do we use Flickr images? 1. We reuse the camera parameters generated in the static reconstruction. 2. YouTube videos usually have lower resolution (60% of the videos are below 360*480). 3. Isolated images expand the camera distribution, which is critical for shape-from-silhouettes methods. 34

Why do we use Flickr images? 1500 registered video frames: 14658 3D points, covering a range of 135 degrees. 800 registered Flickr images: 68392 3D points, covering a range of 287 degrees. 35

Closed-loop modelling Rough model Project to Flickr images Generate a new model iteration 36

Project initial model to photo collections 37

Background mask of images Original image in photo-collection Nearest-neighbor in GIST feature space 38

Closed Loop 3D Shape Refinement Iteration 1 Frontal view Top view 39

Closed Loop 3D Shape Refinement Iteration 2 Frontal view Top view 40

Closed Loop 3D Shape Refinement Iteration 3 Frontal view Top view 41

Closed Loop 3D Shape Refinement Iteration 4 Frontal view Top view 42

Closed Loop 3D Shape Refinement Iteration 5 Frontal view Top view 43

Closed Loop 3D Shape Refinement Iteration 6 Frontal view Top view 44

Closed Loop 3D Shape Refinement Iteration 7 Frontal view Top view 45

Closed Loop 3D Shape Refinement Iteration 8 Frontal view Top view 46

Closed Loop 3D Shape Refinement Iteration 9 Frontal view Top view 47

Problem: over-segmentation 48

Problem: over-segmentation (frontal view) (top view) 49

Shape-from-Silhouettes two-way carving: (1) shape-from-silhouettes with the foreground masks; (2) keep only occupied voxels; (3) shape-from-silhouettes with the background masks 50

Problem: uneven camera distribution 51

Shape-from-Silhouettes Weighted carving 52

Shape-from-Silhouettes weighted carving. Figure: histogram of camera count (axis "camera #", ticks 0 to 150) over the angular bins [0,30], [30,60], [60,90], [90,120], [120,150], [150,180] 53
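
One way to compensate for an uneven camera distribution is sketched below. The slides do not preserve the weighting formula, so this is an assumed scheme for illustration only: each camera is weighted by the inverse of the number of cameras in its 30-degree angular bin, so that every populated bin contributes equally to the vote. The function name and the angle convention (degrees in [0, 180]) are hypothetical.

```python
import numpy as np

def angular_bin_weights(angles_deg, bin_width=30.0, max_angle=180.0):
    """Per-camera weights that sum to 1 and equalize populated angular bins."""
    angles = np.asarray(angles_deg, dtype=float)
    n_bins = int(max_angle / bin_width)
    # Bin index per camera; an angle of exactly max_angle goes into the last bin.
    idx = np.minimum((angles // bin_width).astype(int), n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)       # cameras per bin
    weights = 1.0 / counts[idx]                       # inverse bin population
    return weights / weights.sum()

# Four cameras crowded into [0,30] and a single camera in [90,120].
weights = angular_bin_weights([5, 10, 15, 20, 100])
# weights == [0.125, 0.125, 0.125, 0.125, 0.5]

# Weighted vote for a voxel seen as foreground only by the four crowded cameras.
inside = np.array([1, 1, 1, 1, 0])
print(float(weights @ inside))   # 0.5, versus 0.8 with uniform weights
```

Under uniform weights the crowded viewpoint dominates the carving decision; the inverse-count weights give the lone camera as much influence as the whole cluster.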

Results: Piccadilly circus without weighting, Piccadilly circus with weighting, Navagio beach without weighting, Navagio beach with weighting 54

Implementation details (Experiments): The first iteration uses an intersection ratio of 0.10, which is incremented by a small amount (0.03) in each iteration. To ensure convergence, we track the segmentation change on a subset of wide field-of-view images. The rough initial model is generated from 15 to 30 video frames. The process usually finishes within 10 iterations, in less than 5 hours. 55
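
The ratio schedule and the convergence check can be sketched as follows. The values 0.10 and 0.03 come from the slide; the function names and the exact definition of "segmentation change" (fraction of pixels whose label differs between two iterations) are assumptions for this example.

```python
import numpy as np

def intersection_ratio(iteration, start=0.10, step=0.03):
    """Ratio used at a 1-based iteration: 0.10, 0.13, 0.16, ..."""
    return start + step * (iteration - 1)

def segmentation_change(prev_masks, new_masks):
    """Fraction of pixels whose label changed across the monitored images."""
    changed = sum(int((p != n).sum()) for p, n in zip(prev_masks, new_masks))
    total = sum(p.size for p in prev_masks)
    return changed / total

# Ratios for the first ten iterations, rounded for display.
schedule = [round(intersection_ratio(k), 2) for k in range(1, 11)]
# schedule == [0.1, 0.13, 0.16, 0.19, 0.22, 0.25, 0.28, 0.31, 0.34, 0.37]
```

Iteration would stop once `segmentation_change` on the wide field-of-view subset falls below a chosen tolerance; the slides do not state that tolerance.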

Comparisons (Experiments): PMVS by Y. Furukawa et al.: multi-view stereo method for rigid structures. CMPMVS by M. Jancosek et al.: multi-view stereo method that shows good results for weakly supported surfaces, e.g. water surfaces. 56

Dataset (Experiments): keyframes sampled every 50 frames. 57

Dataset                      Videos downloaded   Images downloaded   Keyframes extracted   Images used for model refinement
Trevi Fountain               481                 6000                68629                 810
Navagio Beach                300                 1000                45823                 520
Piccadilly Circus Billboard  460                 5000                75983                 496
Mooney Falls                 200                 1000                17850                 723

Demos 59

Conclusions: an initial exploration of dynamic 3D reconstruction; a 3D reconstruction framework for dynamic textures; a robust shape-from-silhouettes method; dynamic texture co-segmentation 60