3D Shape Segmentation with Projective Convolutional Networks

3D Shape Segmentation with Projective Convolutional Networks. Evangelos Kalogerakis (University of Massachusetts Amherst), Melinos Averkiou (University of Cyprus), Subhransu Maji (University of Massachusetts Amherst), Siddhartha Chaudhuri (IIT Bombay)

Goal: learn to segment & label parts in 3D shapes. ShapePFCN takes a 3D geometric representation as input and produces a labeled 3D shape (e.g., back, seat, base, armrest for a chair), learned from a training dataset of labeled 3D shapes.

Motivation: Parsing RGBD data. Example segmentation (back, seat, base, armrest) of RGBD data from "A Large Dataset of Object Scans" [Choi, Zhou, Miller, Koltun 2016], shown alongside our result.

Motivation: 3D Modeling & Animation. Segmented parts (ear, head, torso, back, upper arm, lower arm, hand, upper leg, lower leg, foot, tail) drive animation and texturing of 3D models [Kalogerakis, Hertzmann, Singh, SIGGRAPH 2010; Simari, Nowrouzezahrai, Kalogerakis, Singh, SGP 2009].

Related Work
Train classifiers on hand-engineered descriptors (curvature, shape contexts, geodesic distances), e.g., Kalogerakis et al. 2010, Guo et al. 2015.
Concurrent approaches:
- Volumetric / octree-based methods: Riegler et al. 2017 (OctNet), Wang et al. 2017 (O-CNN), Klokov et al. 2017 (Kd-Net)
- Point-based networks: Qi et al. 2017 (PointNet / PointNet++)
- Graph-based / spectral networks: Yi et al. 2017 (SyncSpecCNN)
- Surface embedding networks: Maron et al. 2017
Our method: a view-based network.

Key Observations
3D models are often designed for viewing: they are typically empty inside, and their parts often do not touch (not easily noticeable to the viewer, yet with geometric implications on topology and connectedness).
3D scans capture only the surface, plus noise, missing regions, etc.

Key Observations
Shape renderings can be treated as photos of objects (without texture), so an image-based network can recognize them (chair! airplane!). Shape renderings can therefore be processed by powerful image-based architectures through transfer learning from massive image datasets (Su, Maji, Kalogerakis, Learned-Miller, ICCV 2015).

Key Idea
A deep architecture that combines view-based convnets for part reasoning on rendered shape images with probabilistic graphical models for surface processing.
Key challenges:
- Select views to avoid surface information loss & deal with occlusions
- Promote invariance under 3D shape rotations
- Joint reasoning about parts across multiple views + the surface

Pipeline: view selection, then ShapePFCN produces per-part labels (fuselage, wing, vertical stabilizer, horizontal stabilizer).

Input: shape as a collection of rendered views. For each input shape, infer a set of viewpoints that maximally cover its surface across multiple distances.
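
To make the coverage criterion concrete, here is a minimal sketch of greedy set-cover view selection, assuming a precomputed boolean visibility matrix (e.g., from z-buffer renders of candidate viewpoints sampled on spheres of several radii). The function name and inputs are illustrative, not the paper's implementation.

```python
import numpy as np

def select_views(triangle_areas, visibility, max_views=20):
    """Greedy set cover over surface triangles (illustrative sketch).

    triangle_areas: (T,) area of each mesh triangle.
    visibility:     (V, T) boolean; visibility[v, t] is True if candidate
                    viewpoint v sees triangle t (hypothetical z-buffer test).
    """
    covered = np.zeros(visibility.shape[1], dtype=bool)
    selected = []
    for _ in range(max_views):
        # Area of still-uncovered surface each candidate view would add.
        gains = (visibility & ~covered).dot(triangle_areas)
        best = int(np.argmax(gains))
        if gains[best] <= 0.0:
            break  # every visible triangle is already covered
        selected.append(best)
        covered |= visibility[best]
    return selected
```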

Input: shape as a collection of rendered views. Render shaded images (normal dot view vector) encoding surface normals.

Input: shape as a collection of rendered views. Render also depth images encoding surface position relative to the camera.

Input: shape as a collection of rendered views. Perform in-plane camera rotations (0°, 90°, 180°, 270°) for rotational invariance.
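
A minimal sketch of how the two input channels and the in-plane rotations could be produced, assuming a rasterizer (hypothetical) already supplies per-pixel normal and depth buffers; this is not the paper's rendering code.

```python
import numpy as np

def render_view_pair(normal_buffer, depth_buffer, view_dir):
    """Build the shaded/depth input pair for one viewpoint (sketch).

    normal_buffer: (H, W, 3) per-pixel surface normals from an off-screen
                   rasterizer (hypothetical; any renderer works).
    depth_buffer:  (H, W) camera-space depths.
    view_dir:      (3,) unit vector from the surface toward the camera.
    """
    # Shaded image: dot product of surface normal and view vector,
    # clamped so back-facing pixels render dark.
    shaded = np.clip(normal_buffer @ view_dir, 0.0, 1.0)
    # Depth image: normalize to [0, 1] relative to the camera.
    depth = (depth_buffer - depth_buffer.min()) / (np.ptp(depth_buffer) + 1e-8)
    return shaded, depth

def inplane_rotations(image):
    """The 0/90/180/270 degree in-plane camera rolls used for rotational
    invariance amount to simple image rotations."""
    return [np.rot90(image, k) for k in range(4)]
```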

Projective convnet architecture
Each pair of depth & shaded images is processed by an FCN [Long, Shelhamer, and Darrell 2015], with filters shared across all view branches. Views are not ordered (there is no view correspondence across shapes).
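
A toy PyTorch sketch of the weight sharing: one small stand-in FCN is applied to every unordered (shaded, depth) pair. The paper adapts VGG-16 into an FCN; the tiny layer sizes and dummy input tensors below are placeholders only.

```python
import torch
import torch.nn as nn

class ViewBranch(nn.Module):
    """Stand-in for the per-view FCN (placeholder layer sizes)."""
    def __init__(self, num_labels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.score = nn.Conv2d(64, num_labels, 1)  # per-pixel label scores

    def forward(self, shaded, depth):
        x = torch.stack([shaded, depth], dim=1)   # (B, 2, H, W) input
        return self.score(self.features(x))       # (B, labels, H, W)

# Filter sharing across views: apply the *same* module to every view.
# Views are unordered, so no branch is tied to a particular viewpoint.
branch = ViewBranch(num_labels=4)
view_pairs = [(torch.rand(1, 64, 64), torch.rand(1, 64, 64))
              for _ in range(3)]                  # dummy (shaded, depth) views
view_maps = [branch(s, d) for s, d in view_pairs]
```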

Projective convnet architecture
The output of each FCN branch is a view-based confidence map per part label (e.g., wing, horizontal stabilizer).

Projective convnet architecture
Aggregate & project the image confidence maps from all views onto the surface, yielding a surface-based map.

Image2Surface projection layer
For each surface element (triangle), find all pixels that cover it across all views. The surface confidence per label is the max of these pixel confidences.
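
A PyTorch sketch of this projection, assuming a hypothetical ID-rendering pass that records which triangle each pixel shows; the max-pooling over pixels is the operation described above.

```python
import torch

def image2surface(view_maps, pixel2tri, num_triangles):
    """Project per-pixel label confidences onto the mesh (sketch).

    view_maps: list of (L, H, W) confidence maps, one per view.
    pixel2tri: list of (H, W) integer maps from an ID-rendering pass
               (hypothetical helper) giving the triangle index visible
               at each pixel, or -1 for background.
    Returns a (num_triangles, L) surface map: for each triangle and label,
    the max confidence over all pixels covering it in any view.
    """
    L = view_maps[0].shape[0]
    surface = torch.full((num_triangles, L), float('-inf'))
    for conf, tri in zip(view_maps, pixel2tri):
        mask = tri >= 0
        idx = tri[mask].long()            # (P,) triangle index per pixel
        vals = conf[:, mask].t()          # (P, L) confidences per pixel
        # Max-pool pixel confidences into their triangles.
        surface.scatter_reduce_(0, idx.unsqueeze(1).expand_as(vals),
                                vals, reduce='amax')
    return surface  # triangles never seen in any view keep -inf
```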

CRF layer for spatially coherent labeling
The last layer performs inference in a probabilistic model defined on the surface: random variables $R_1, R_2, R_3, R_4, \dots$ (one per surface element) take part-label values such as fuselage, wing, vertical stabilizer, horizontal stabilizer.

CRF layer for spatially coherent labeling
Conditional Random Field with unary factors based on the surface-based confidences:
$P(R_1, R_2, \dots, R_n \mid \text{shape}) = \frac{1}{Z} \prod_{f=1..n} P(R_f \mid \text{views}) \prod_{f,f'} P(R_f, R_{f'} \mid \text{surface})$
Unary factors: the FCN confidences $P(R_f \mid \text{views})$.

Projective convnet architecture: CRF layer
Pairwise factors $P(R_f, R_{f'} \mid \text{surface})$ (based on geodesic and normal distances) favor the same label for triangles with (a) similar surface normals and (b) small geodesic distance:
$P(R_1, R_2, \dots, R_n \mid \text{shape}) = \frac{1}{Z} \prod_{f=1..n} P(R_f \mid \text{views}) \prod_{f,f'} P(R_f, R_{f'} \mid \text{surface})$

Projective convnet architecture: CRF layer
Infer the most likely joint assignment to all surface random variables (MAP assignment via mean-field inference):
$\max P(R_1, R_2, \dots, R_n \mid \text{shape})$
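
A minimal mean-field sketch with a Potts-style smoothness term whose weights stand in for the geodesic/normal pairwise factors; the paper's exact learned potentials differ.

```python
import numpy as np

def mean_field(unary, neighbors, pair_w, iters=10):
    """Mean-field inference for the surface CRF (illustrative sketch).

    unary:     (T, L) log-confidences from the projection layer.
    neighbors: per-triangle arrays of adjacent triangle indices.
    pair_w:    matching arrays of smoothness weights, larger for pairs
               with similar normals / small geodesic distance (a
               Potts-style stand-in for the paper's learned potentials).
    """
    def softmax(x):
        e = np.exp(x - x.max(axis=1, keepdims=True))
        return e / e.sum(axis=1, keepdims=True)

    Q = softmax(unary)
    for _ in range(iters):
        msg = np.zeros_like(unary)
        for f, (nbrs, w) in enumerate(zip(neighbors, pair_w)):
            # Neighbors that agree on a label raise that label's score.
            msg[f] = (w[:, None] * Q[nbrs]).sum(axis=0)
        Q = softmax(unary + msg)
    return Q.argmax(axis=1)  # approximate MAP label per triangle
```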

Forward pass: inference through the FCN branches followed by the CRF (convnet + CRF).

Training
The architecture is trained end to end with analytic gradients (backpropagation / joint training of convnet + CRF). Training starts from a pretrained image-based network (VGG-16), which is then fine-tuned.
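
A sketch of the transfer-learning step with PyTorch/torchvision: start from ImageNet-pretrained VGG-16 features and fine-tune with a small learning rate. The 1x1 scoring head, the loss setup, and the input-channel handling for depth+shaded pairs are simplifications of my own, not the paper's exact FCN conversion.

```python
import torch
import torchvision

NUM_LABELS = 4  # e.g., fuselage, wing, vertical/horizontal stabilizer

# Start from ImageNet-pretrained VGG-16 convolutional filters and attach
# a 1x1 per-pixel scoring layer (a crude stand-in for the FCN conversion).
vgg = torchvision.models.vgg16(weights='IMAGENET1K_V1')
model = torch.nn.Sequential(
    vgg.features,                          # pretrained conv filters
    torch.nn.Conv2d(512, NUM_LABELS, 1),   # per-pixel label scores
)

# Fine-tune with a small learning rate so the pretrained features are
# adapted to rendered shapes rather than destroyed.
optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, momentum=0.9)
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=-1)  # -1 marks background
```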

Dataset used in experiments
Evaluation on ShapeNet + LPSB + COSEG (46 classes of shapes) [Yi et al. 2016]. 50% of shapes used for training / 50% for testing, split per ShapeNet category, with at most 250 shapes for training. No assumption on shape orientation.
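
A tiny sketch of this split protocol (the function name and seeding are illustrative):

```python
import random

def split_category(shapes, max_train=250, seed=0):
    """50/50 train/test split within one category, capping the training
    set at 250 shapes, per the protocol above (sketch only)."""
    rng = random.Random(seed)
    shapes = list(shapes)
    rng.shuffle(shapes)
    half = len(shapes) // 2
    return shapes[:half][:max_train], shapes[half:]
```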

Results
Labeling accuracy (%) on the ShapeNet test dataset:

                                        ShapeBoost   Guo et al.   ShapePFCN
  All classes                               81.2        80.6         87.5
  Ignoring easy classes (2-3 part labels)   76.8        76.8         84.7

~8% improvement in labeling accuracy for complex categories (vehicles, furniture).

Labeling accuracy (%) on the LPSB + COSEG test dataset:

                                        ShapeBoost   Guo et al.   ShapePFCN
                                            84.2        82.1         92.2

Qualitative results: ground truth vs. ShapeBoost vs. ShapePFCN.

ShapeBoost vs. ShapePFCN on object scans from "A Large Dataset of Object Scans" [Choi et al. 2016].

Summary
- Deep architecture combining a view-based FCN & a surface-based CRF
- Multi-scale view selection to avoid loss of surface information
- Transfer learning from massive image datasets
- Robust to geometric representation artifacts
Acknowledgements: NSF (CHS-1422441, CHS-1617333, IIS-1617917). Experiments were performed on the UMass GPU cluster (400 GPUs!) obtained under a grant from the MassTech Collaborative.

Project page: http://people.cs.umass.edu/~kalo/papers/shapepfcn/

Auxiliary slides

What are the filters doing? They are activated in the presence of certain patterns of surface patches (visualized for layers conv4, conv5, fc6).
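
One common way to probe such filters (a sketch of a generic technique, not necessarily the visualization method used for these slides) is to hook an intermediate conv layer and record its peak response per filter:

```python
import torch
import torchvision

# Probe a conv layer with a forward hook and record the peak activation
# per filter channel for a given input image.
vgg = torchvision.models.vgg16(weights=None)  # pretrained weights omitted
activations = {}

def hook(module, inputs, output):
    activations['feat'] = output.detach()

layer = vgg.features[21]  # conv4_3 in torchvision's VGG-16
handle = layer.register_forward_hook(hook)

image = torch.rand(1, 3, 224, 224)  # stand-in for a rendered shape view
vgg(image)
per_filter_peak = activations['feat'].amax(dim=(2, 3))  # (1, 512)
handle.remove()
```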

Challenges
3D models have missing or non-photorealistic texture, so we focus on shape instead.

ShapeNetCore: 8% improvement in labeling accuracy for complex categories (vehicles, furniture, etc.).

Key Observations
3D models have arbitrary orientation in the wild. Consistent shape orientation exists only in specific, well-engineered datasets (often with manual intervention; there is no perfect alignment algorithm).

Comparison pitfalls
- Method A assumes consistent shape alignment (or upright orientation); method B doesn't. You may get much better numbers for B by changing its input!
- Convnet A has orders of magnitude more parameters than convnet B. It might also be easy to get better numbers for B if you increase its number of filters!
- Convnet A has an architectural trick that convnet B could also have (e.g., U-Net, ensembling). Why not apply the same trick to B?
- Methods are largely tested on training data because of duplicate or near-duplicate shapes. The more you overfit, the better! Ouch! 3DShapeNet has many identical models, or models with tiny differences (e.g., the same airplane with different rockets)...