3D Shape Segmentation with Projective Convolutional Networks

Size: px

Start display at page:

Download "3D Shape Segmentation with Projective Convolutional Networks"

Jennifer Kelley
6 years ago
Views:

Subhransu Maji 1 Siddhartha Chaudhuri 3 1 University

1 3D Shape Segmentation with Projective Convolutional Networks Evangelos Kalogerakis 1 Melinos Averkiou 2 Subhransu Maji 1 Siddhartha Chaudhuri 3 1 University of Massachusetts Amherst 2 University of Cyprus 3 IIT Bombay

2 Goal: learn to segment & label parts in 3D shapes 3D Geometric representation

3 Goal: learn to segment & label parts in 3D shapes ShapePFCN 3D Geometric representation

4 Goal: learn to segment & label parts in 3D shapes ShapePFCN 3D Geometric representation Labeled 3D shape back seat base armrest

5 Goal: learn to segment & label parts in 3D shapes ShapePFCN 3D Geometric representation Training Dataset... Labeled 3D shape back seat base armrest

6 Motivation: Parsing RGBD data... back seat base armrest RGBD data from A Large Dataset of Object Scans Choi, Zhou, Miller, Koltun 2016 Our result

7 Motivation: 3D Modeling & Animation Ear Head Torso Back Upper arm Lower arm Hand Upper leg Lower leg Foot Tail Animation Texturing 3D Models Kalogerakis, Hertzmann, Singh, SIGGRAPH 2010 Simari, Nowrouzezahrai, Kalogerakis, Singh, SGP 2009

8 Related Work Train classifiers on hand engineered descriptors e.g., Kalogerakis et al. 2010, Guo et al curvature shape contexts geodesic dist.

9 Related Work Train classifiers on hand engineered descriptors e.g., Kalogerakis et al. 2010, Guo et al Concurrent approaches: Volumetric / octree based methods: Riegler et al (OctNet), Wang et al (O CNN), Klokov et al (kd net) Point based networks: Qi et al (PointNet / PointNet++) Graph based / spectral networks: Yi et al (SyncSpecCNN) Surface embedding networks: Maron et al curvature shape contexts geodesic dist.

Related Work Train classifiers on hand engineered descriptors e.g., Kalogerakis et al. 2010, Guo et al. 2015 Concurrent approaches: Volumetric / octree based methods: Riegler et al.

10 Related Work Train classifiers on hand engineered descriptors e.g., Kalogerakis et al. 2010, Guo et al Concurrent approaches: Volumetric / octree based methods: Riegler et al (OctNet), Wang et al (O CNN), Klokov et al (kd net) Point based networks: Qi et al (PointNet / PointNet++) Graph based / spectral networks: Yi et al (SyncSpecCNN) Surface embedding networks: Maron et al curvature shape contexts geodesic dist. Our method: view based network

11 Key Observations 3D models are often designed for viewing.

12 Key Observations 3D models are often designed for viewing.

13 Key Observations 3D models are often designed for viewing. Empty inside!

14 3D scans capture the surface. Key Observations + noise, missing regions etc

15 Key Observations 3D models are often designed for viewing. Parts do not touch! (not easily noticeable to the viewer, yet geometric implications on topology, connectedness...)

Shape renderings can be processed by powerful image based architectures

16 Key Observations Shape renderings can be treated as photos of objects (without texture) Image based network Chair! Airplane! Shape renderings can be processed by powerful image based architectures through transfer learning from massive image datasets. (Su, Maji, Kalogerakis, Learned Miller, ICCV 2015)

17 Key Idea Deep architecture that combines view based convnets for part reasoning on rendered shape images & prob. graphical models for surface processing.

18 Key Idea Deep architecture that combines view based convnets for part reasoning on rendered shape images & prob. graphical models for surface processing. Key challenges: Select views to avoid surface information loss & deal with occlusions

19 Key Idea Deep architecture that combines view based convnets for part reasoning on rendered shape images & prob. graphical models for surface processing. Key challenges: Select views to avoid surface information loss & deal with occlusions Promote invariance under 3D shape rotations

20 Key Idea Deep architecture that combines view based convnets for part reasoning on rendered shape images & prob. graphical models for surface processing. Key challenges: Select views to avoid surface information loss & deal with occlusions Promote invariance under 3D shape rotations Joint reasoning about parts across multiple views +surface

21 Pipeline View selection ShapePFCN fuselage wing vert. stabilizer horiz. stabilizer

22 Pipeline View selection ShapePFCN fuselage wing vert. stabilizer horiz. stabilizer

23 Input: shape as a collection of rendered views For each input shape, infer a set of viewpoints that maximally cover its surface across multiple distances.

24 Input: shape as a collection of rendered views Render shaded images (normal dot view vector) encoding surface normals. Shaded images

25 Input: shape as a collection of rendered views Render also depth images encoding surface position relative to the camera. Shaded images Depth images

26 Input: shape as a collection of rendered views Perform in plane camera rotations for rotational invariance. 0 o,90 o, 180 o, 270 o rotations Shaded images Depth images

27 Projective convnet architecture Each pair of depth & shaded images is processed by a FCN. Views are not ordered (no view correspondence across shapes). [Long, Shelhamer, and Darrell 2015] FCN FCN Surface confidence maps Shaded images Depth images FCN

28 Projective convnet architecture Each pair of depth & shaded images is processed by a FCN. Views are not ordered (no view correspondence across shapes). [Long, Shelhamer, and Darrell 2015] FCN FCN Surface confidence maps Shaded images Depth images FCN shared filters

29 Projective convnet architecture The output of each FCN branch is a view based confidence map per part label. hor. stabilizer FCN FCN Surface confidence maps Shaded images Depth images FCN shared filters View based map

30 Projective convnet architecture The output of each FCN branch is a view based confidence map per part label. wing FCN FCN Shaded images Depth images FCN shared filters View based map

31 Projective convnet architecture Aggregate & project the image confidence maps from all views on the surface. FCN FCN Surface based map Shaded images Depth images FCN shared filters View based map

32 Image2Surface projection layer For each surface element (triangle), find all pixels that include it in all views. Surface confidence: use max of these pixel confidences per label. FCN FCN max Surface based map Shaded images Depth images FCN shared filters View based map

33 Image2Surface projection layer For each surface element (triangle), find all pixels that include it in all views. Surface confidence: use max of these pixel confidences per label. FCN FCN max Surface based map Shaded images Depth images FCN shared filters View based map

34 Image2Surface projection layer For each surface element (triangle), find all pixels that include it in all views. Surface confidence: use max of these pixel confidences per label. FCN FCN max Surface based map Shaded images Depth images FCN shared filters View based map

35 CRF layer for spatially coherent labeling Last layer performs inference in a probabilistic model defined on the surface. R 1 R 2 R 3 R 4 R 1, R 2, R 3, R 4 random variables taking values: fuselage wing vert. stabilizer horiz. stabilizer

36 CRF layer for spatially coherent labeling Conditional Random Field: unary factors based on surface based confidences R 1 R 2 R 3 R 4 1 P( R, R, R, R... shape) P( R views) P( R, R surface) f f f ' Z f 1.. n f, f ' Unary factors (FCN confidences)

37 Projective convnet architecture: CRF layer Pairwise terms favor same label for triangles with: (a) similar surface normals (b) small geodesic distance R 2 R 1 R 3 R 4 1 P( R, R, R, R... shape) P( R views) P( R, R surface) f f f ' Z f 1.. n f, f ' Pairwise factors (geodesic+normal distance)

38 Projective convnet architecture: CRF layer Infer most likely joint assignment to all surface random variables (mean field) max 1 P( R, R, R, R... shape) P( R views) P( R, R surface) f f f ' Z f 1.. n f, f ' MAP assignment (mean field inference) R 2 R 1 R 3 R 4

39 Forward pass inference (convnet+crf) CRF FCN

40 Training The architecture is trained end to end with analytic gradients. CRF FCN Backpropagation / joint training (convnet+crf)

41 Training The architecture is trained end to end with analytic gradients. Training starts from a pretrained image based net (VGG16), then fine tune. CRF FCN Pre trained Backpropagation / joint training (convnet+crf)

42 Dataset used in experiments Evaluation on ShapeNet + LPSB + COSEG (46 classes of shapes) 50% used for training / 50% used for test split per Shapenet category Max 250 shapes for training. No assumption on shape orientation. [Yi et al. 2016]

43 Dataset used in experiments Evaluation on ShapeNet + LPSB + COSEG (46 classes of shapes) 50% used for training / 50% used for test split per Shapenet category Max 250 shapes for training. No assumption on shape orientation. [Yi et al. 2016]

44 Results Labeling accuracy on ShapeNet test dataset: ShapeBoost Guo et al. ShapePFCN

45 Results Ignore easy classes (2 or 3 part labels) Labeling accuracy on ShapeNet test dataset: ShapeBoost Guo et al. ShapePFCN % improvement in labeling accuracy for complex categories (vehicles, furniture)

46 Results Ignore easy classes (2 or 3 part labels) Labeling accuracy on ShapeNet test dataset: ShapeBoost Guo et al. ShapePFCN % improvement in labeling accuracy for complex categories (vehicles, furniture) Labeling accuracy on LPSB+COSEG test dataset: ShapeBoost Guo et al. ShapePFCN

47 ground-truth ShapeBoost ShapePFCN

48 ShapeBoost ShapePFCN Object scans from A Large Dataset of Object Scans Choi et al. 2016

49 Summary Deep architecture combining view based FCN & surface based CRF Multi scale view selection to avoid loss of surface information Transfer learning from massive image datasets Robust to geometric representation artifacts Acknowledgements: NSF (CHS , CHS , IIS ) Experiments were performed in the UMass GPU cluster (400 GPUs!) obtained under a grant by the MassTech Collaborative

50 Project page:

51 Auxiliary slides

52 What are the filters doing? Activated in the presence of certain patterns of surface patches conv4 conv5 fc6

53 What are the filters doing? Activated in the presence of certain patterns of surface patches conv4 conv5 fc6

54 What are the filters doing? Activated in the presence of certain patterns of surface patches conv4 conv5 fc6

55 Input: shape as a collection of rendered views For each input shape, infer a set of viewpoints that maximally cover its surface across multiple distances.

56 Input: shape as a collection of rendered views For each input shape, infer a set of viewpoints that maximally cover its surface across multiple distances.

57 Input: shape as a collection of rendered views For each input shape, infer a set of viewpoints that maximally cover its surface across multiple distances.

58 Input: shape as a collection of rendered views For each input shape, infer a set of viewpoints that maximally cover its surface across multiple distances.

59 Input: shape as a collection of rendered views For each input shape, infer a set of viewpoints that maximally cover its surface across multiple distances.

60 Input: shape as a collection of rendered views For each input shape, infer a set of viewpoints that maximally cover its surface across multiple distances.

61 FCN FCN max Surface based map Shaded images Depth images FCN shared filters View based map

62 Training The architecture is trained end to end with analytic gradients. CRF FCN Backpropagation / joint training (convnet+crf)

63 Challenges 3D models have missing or non photorealistic texture

64 Challenges 3D models have missing or non photorealistic texture (focus on shape instead)

65 ShapeNetCore: 8% improvement in labeling accuracy for complex categories (vehicles, furniture etc)

67 Key Observations 3D models have arbitrary orientation in the wild. Consistent shape orientation only in specific, well engineered datasets (often with manual intervention, no perfect alignment algorithm)

68 Comparisons pitfalls Method A assumes consistent shape alignment (or upright orientation), method B doesn t. Well, you may get much better numbers for B by changing its input! Convnet A has orders of magnitude more parameters than Convnet B. It might also be easy to get better numbers for B, if you increase its number of filters! Convnet A has an architectural trick that Convnet B could also have (e.g., U net, ensemble). Why not apply the same trick to B? Methods are largely tested on training data because of duplicate or near duplicate shapes. The more you overfit, the better! Ouch! 3DShapeNet has many identical models, or models with tiny differences (e.g., same airplane with different rockets)...

3D Shape Analysis with Multi-view Convolutional Networks. Evangelos Kalogerakis

3D Shape Analysis with Multi-view Convolutional Networks Evangelos Kalogerakis 3D model repositories [3D Warehouse - video] 3D geometry acquisition [KinectFusion - video] 3D shapes come in various flavors