3D Shape Analysis with Multi-view Convolutional Networks. Evangelos Kalogerakis

3D model repositories [3D Warehouse - video]

3D geometry acquisition [KinectFusion - video]

3D shapes come in various flavors: polygon meshes, analytic surfaces, point clouds. They may have different resolutions, non-manifold geometry, arbitrary or no texture and interior, disjoint parts, and noise.

We need algorithms that understand shapes: recognizing an office chair from its geometric representation, labeling its parts (back, seat, base), and finding correspondences (structure, function, style, point-based).

Why shape understanding? Generative models of shapes. Kalogerakis, Chaudhuri, Koller, Koltun, SIGGRAPH 2012

Why shape understanding? Scene design Lun, Kalogerakis, Wang, Sheffer, SIGGRAPH ASIA 2016

Why shape understanding? Texturing (part labels: ear, head, torso, back, upper arm, lower arm, hand, upper leg, lower leg, foot, tail). Kalogerakis, Hertzmann, Singh, SIGGRAPH 2010

Why shape understanding? Character animation (same part labels). Simari, Nowrouzezahrai, Kalogerakis, Singh, SGP 2009

How can we perform shape understanding? It is very hard to do with manually specified rules & hand-engineered descriptors.

The importance of good shape descriptors. [Scatter plots of shapes in a 2D descriptor space (descriptor 1 vs. descriptor 2), with "Motorbike" and "Not Motorbike" examples; a learning algorithm fits a classification boundary to answer "Is this a motorbike?"] Old-style descriptors: surface curvature, spin images, PCA. We need descriptors that capture semantics and function (e.g., "tank?", "engine?").

From shallow mappings… Old-style approach: the output is a direct function of hand-engineered shape descriptors,
y = f(x) = σ(w · x),
with input descriptors x_1, x_2, …, x_d plus a bias unit.

…to neural nets. Introduce intermediate learned functions that yield optimized descriptors:
y = σ(w^(2) · h), with h_1 = σ(w_1^(1) · x), h_2 = σ(w_2^(1) · x), …
where each layer again includes a bias unit.

Stack several layers to get deep neural nets: inputs x_1, …, x_d feed hidden units h_1, …, h_m, which feed further hidden units h'_1, …, h'_n, and finally the output y (each layer with a bias unit).
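To make the shallow-vs-deep contrast concrete, here is a minimal NumPy sketch (not the talk's code; the weights are random, bias units are omitted for brevity, and σ is the logistic function) of a direct mapping y = σ(w · x) versus one with an intermediate learned layer:

```python
import numpy as np

def sigma(z):
    # logistic non-linearity, as on the slides
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.normal(size=4)          # descriptor vector x_1..x_d (d = 4 here)

# Shallow mapping: the output is a direct function of the input descriptors.
w = rng.normal(size=4)
y_shallow = sigma(w @ x)        # y = sigma(w . x)

# One hidden layer: intermediate learned functions h_k = sigma(w_k^(1) . x)
# feed a second stage y = sigma(w^(2) . h).
W1 = rng.normal(size=(3, 4))    # 3 hidden units
w2 = rng.normal(size=3)
h = sigma(W1 @ x)
y_deep = sigma(w2 @ h)

print(y_shallow, y_deep)        # both lie in (0, 1)
```

Stacking more such hidden layers yields the deep nets of the next slide.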

Convolutional neural networks Think of these intermediate functions as convolutional filters acting on small adjacent windows

Convolutional neural networks. Basic idea: interleave several convolutional and pooling (subsampling) layers. Source: http://deeplearning.net/tutorial/lenet.html
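A minimal NumPy sketch of the conv-then-pool idea (illustrative only; real convnets learn many filters over multiple channels):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Naive 'valid' 2D convolution (really cross-correlation, as in convnets)."""
    kh, kw = kernel.shape
    H, W = img.shape
    out = np.empty((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2x2(fmap):
    """2x2 max pooling (subsampling) with stride 2."""
    H, W = fmap.shape
    fmap = fmap[:H - H % 2, :W - W % 2]          # drop odd remainder
    return fmap.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

img = np.random.default_rng(1).normal(size=(8, 8))
edge_filter = np.array([[1.0, -1.0], [1.0, -1.0]])   # crude vertical-edge detector

fmap = np.maximum(conv2d_valid(img, edge_filter), 0)  # convolution + ReLU
pooled = max_pool2x2(fmap)
print(pooled.shape)   # (3, 3): conv gives 7x7, pooling halves (floor) to 3x3
```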

The image-processing success story. The convolution filters capture various hierarchical patterns (edges, sub-parts, parts, …). Convnets have achieved high accuracy in several image-processing tasks. Matthew D. Zeiler and Rob Fergus, Visualizing and Understanding Convolutional Networks, 2014

How can we apply convnets to 3D shapes? Motivated by the success of image-based architectures and the fact that 3D shapes are often designed for viewing.

View-based convnets for 3D shapes. We introduced view-based convnets for 3D shape analysis! Projective convnet output labels: fuselage, wing, vert. stabilizer, horiz. stabilizer. E. Kalogerakis, M. Averkiou, S. Maji, S. Chaudhuri, CVPR 2017 (oral)

Input: shape as a collection of rendered views For each input shape, infer a set of viewpoints that maximally cover its surface across multiple distances.
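One simple way to sketch the idea of inferring viewpoints that maximally cover the surface is a greedy set-cover heuristic over a precomputed visibility matrix. Everything below (the `select_views` helper, the toy visibility matrix) is a hypothetical illustration, not the authors' implementation:

```python
import numpy as np

def select_views(visible, k):
    """Greedy set-cover sketch: pick up to k viewpoints that together
    cover as much of the surface as possible.

    visible: (num_views, num_surface_points) boolean matrix, where
             visible[v, p] is True if point p is seen from viewpoint v
             (candidate cameras could be placed at multiple distances).
    """
    covered = np.zeros(visible.shape[1], dtype=bool)
    chosen = []
    for _ in range(k):
        gains = (visible & ~covered).sum(axis=1)   # newly covered points per view
        best = int(gains.argmax())
        if gains[best] == 0:
            break                                  # nothing new left to cover
        chosen.append(best)
        covered |= visible[best]
    return chosen, covered

# Toy example: 4 candidate views, 6 surface points.
vis = np.array([
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [1, 0, 0, 0, 0, 1],
], dtype=bool)
views, covered = select_views(vis, k=2)
print(views, covered.sum())  # views 0 and 2 already cover all 6 points
```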

Input: shape as a collection of rendered views Render depth & shaded images (normal dot view vector) Shaded images Depth images

Input: shape as a collection of rendered views. Perform in-plane camera rotations for rotational invariance: 0°, 90°, 180°, 270° rotations of the shaded and depth images.
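The shaded images can be sketched as a clamped dot product between per-pixel surface normals and the view vector; the `shade` helper and toy normal map below are illustrative assumptions, not the paper's renderer:

```python
import numpy as np

def shade(normals, view_dir):
    """Per-pixel shading as described on the slides: the dot product of the
    surface normal with the (unit) view vector, clamped to [0, 1].

    normals:  (H, W, 3) unit surface normals per pixel
    view_dir: (3,) vector pointing from the surface toward the camera
    """
    v = np.asarray(view_dir, dtype=float)
    v = v / np.linalg.norm(v)
    return np.clip(normals @ v, 0.0, 1.0)

# Toy 1x2 "image": one pixel faces the camera, one faces sideways.
normals = np.array([[[0.0, 0.0, 1.0],
                     [1.0, 0.0, 0.0]]])
img = shade(normals, view_dir=[0.0, 0.0, 1.0])
print(img)  # [[1. 0.]]

# The in-plane rotations of the slide are then just image rotations:
rotations = [np.rot90(img, k) for k in range(4)]  # 0°, 90°, 180°, 270°
```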

Projective convnet architecture. Each pair of depth & shaded images is processed by a convnet (fully convolutional). Views are not ordered (no view correspondence across shapes), and the convnet branches share parameters.

Projective convnet architecture. The output of each convnet branch is an image-based confidence map per part label (e.g., wing, horizontal stabilizer).

Projective convnet architecture. Since we want our output on the surface, we aggregate the image-based confidences across all views onto the surface.

Projective convnet architecture. For each face / surface point, find all pixels that cover it across all views, and take the max confidence per label (image-based confidence → surface confidence).
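The max-aggregation step can be sketched as follows, assuming a hypothetical `pixel_to_face` map produced during rendering (the actual system works with fully convolutional confidence maps over whole images):

```python
import numpy as np

def aggregate_to_surface(confidences, pixel_to_face, num_faces):
    """Project per-pixel label confidences back onto the surface: for each
    face, take the max confidence per label over all pixels (across all
    views) that cover it.

    confidences:   (num_pixels, num_labels) image-based confidences,
                   flattened over all views
    pixel_to_face: (num_pixels,) index of the face each pixel covers,
                   or -1 for background pixels
    """
    num_labels = confidences.shape[1]
    surf = np.zeros((num_faces, num_labels))
    for pix, face in enumerate(pixel_to_face):
        if face >= 0:
            surf[face] = np.maximum(surf[face], confidences[pix])
    return surf

# Toy example: 4 pixels (from two views), 2 faces, 3 part labels.
conf = np.array([[0.9, 0.1, 0.0],
                 [0.2, 0.7, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.3, 0.3, 0.4]])
p2f = np.array([0, 0, 1, -1])       # last pixel is background
surf = aggregate_to_surface(conf, p2f, num_faces=2)
print(surf)  # face 0 keeps the per-label max of pixels 0 and 1
```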

Projective convnet architecture: CRF layer. The last layer performs inference in a probabilistic model defined on the surface to promote coherent labeling. R_1, R_2, R_3, R_4 are random variables taking values: fuselage, wing, vert. stabilizer, hor. stabilizer.

Projective convnet architecture: CRF layer. It has the form of a Conditional Random Field whose unary term represents the surface-based label confidences:
P(R_1, R_2, …, R_n | shape) = (1/Z) ∏_{f=1..n} P(R_f | views) ∏_{i,j} P(R_i, R_j | surface)
Unary factor (convnet)

Projective convnet architecture: CRF layer. Pairwise terms favor the same label for triangles or points with similar surface normals and small geodesic distance:
P(R_1, R_2, …, R_n | shape) = (1/Z) ∏_{f=1..n} P(R_f | views) ∏_{i,j} P(R_i, R_j | surface)
Pairwise factor (geodesic + normal distance)

Projective convnet architecture: CRF layer. Inference aims to find the most likely joint assignment to all surface random variables (an optimization problem):
max P(R_1, R_2, …, R_n | shape) = (1/Z) ∏_{f=1..n} P(R_f | views) ∏_{i,j} P(R_i, R_j | surface)
MAP assignment (mean-field inference)
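Mean-field inference for such a CRF can be sketched with a simple Potts-style pairwise term. This toy `mean_field` (with a uniform `weight` standing in for the geodesic/normal-based pairwise factors) is an illustration, not the paper's CRF:

```python
import numpy as np

def mean_field(unary, adj, weight=1.0, iters=10):
    """Mean-field sketch for a CRF over mesh faces with a Potts-style
    pairwise term that rewards neighbouring faces sharing a label.

    unary: (num_faces, num_labels) label confidences from the convnet
    adj:   (num_faces, num_faces) pairwise weights (symmetric, zero diag;
           in the paper these depend on geodesic and normal distances)
    """
    log_unary = np.log(unary + 1e-12)
    q = unary / unary.sum(axis=1, keepdims=True)     # init from the unaries
    for _ in range(iters):
        # message: how strongly neighbours currently believe in each label
        msg = weight * (adj @ q)
        logits = log_unary + msg
        logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)
    return q

# Toy chain of 3 faces, 2 labels: the noisy middle face is smoothed
# toward its neighbours' label, i.e. a coherent labeling.
unary = np.array([[0.9, 0.1],
                  [0.4, 0.6],   # weak, conflicting evidence
                  [0.9, 0.1]])
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
q = mean_field(unary, adj, weight=2.0)
print(q.argmax(axis=1))  # all three faces agree on label 0
```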

Training. The architecture is trained end-to-end with analytic gradients. Training starts from a pretrained image-based net (VGG16). Forward pass / joint inference (convnet + CRF); backpropagation / joint training (convnet + CRF).

What are the learned filters doing? They are activated in the presence of certain surface patterns / patches (visualizations of layers conv4, conv5, fc6).

Dataset used in experiments: evaluation on ShapeNetCore (human-labeled shapes) [Yi et al. 2016], with a 50% training / 50% test split per category.

ShapeNetCore: 8% improvement in labeling accuracy for complex categories (vehicles, furniture etc)

[Qualitative comparisons: ground-truth vs. ShapeBoost vs. ShapePFCN segmentations]

Shape recognition with multi-view CNNs. An earlier version of a view-based CNN for shape recognition: per-view CNN branches (CNN_1), followed by view pooling and a second network (CNN_2). Su, Maji, Kalogerakis, Learned-Miller, ICCV 2015

Summary. Inspired by human vision: view-based convnets analyze what can be seen under view projections. They aggregate information from multiple views selected to maximally cover the surface, allow fast processing at high resolutions, are robust to artifacts in the input geometric representation (e.g., irregular tessellation, polygon soups), and are initialized from image-based architectures pretrained on massive image datasets (filters capture shape + texture).

Thank you! Acknowledgements: NSF (CHS-1422441, CHS-1617333, IIS- 1617917), NVidia, Adobe, Facebook, Qualcomm. Experiments were performed in the UMass GPU cluster (400 GPUs!) obtained under a grant by the MassTech Collaborative. Our project web page: http://people.cs.umass.edu/~kalo/papers/shapepfcn/