The Hilbert Problems of Computer Vision. Jitendra Malik UC Berkeley & Google, Inc.


This talk: The computational power of the human brain; Research is the art of the soluble; Hilbert problems, circa 2004; Hilbert problems, circa 2015.

Moravec's argument (1998), ROBOT: Mere Machine to Transcendent Mind. 1 neuron = 1000 instructions/sec; 1 synapse = 1 byte of information. The human brain then processes 10^14 IPS and has 10^14 bytes of storage. In 2000, we had 10^9 IPS and 10^9 bytes on a desktop machine. Assuming Moore's law, we obtain human-level computing power in 2025, or with a cluster of 100 nodes in 2015.
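
The dates on the slide follow from a simple extrapolation. Below is a minimal sketch of that arithmetic, assuming (an assumption not stated on the slide) that available compute doubles every 18 months starting from a 10^9 IPS desktop in 2000:

    import math

    # Moravec-style extrapolation (sketch): how long until available compute
    # matches the brain estimate of 10^14 instructions per second?
    BRAIN_IPS = 1e14          # 10^11 neurons x 10^3 instructions/sec
    DESKTOP_IPS_2000 = 1e9    # desktop machine in 2000
    DOUBLING_YEARS = 1.5      # assumed Moore's-law doubling time

    def year_of_parity(available_ips, start_year=2000):
        doublings = math.log2(BRAIN_IPS / available_ips)
        return start_year + doublings * DOUBLING_YEARS

    print(round(year_of_parity(DESKTOP_IPS_2000)))        # ~2025: one desktop
    print(round(year_of_parity(100 * DESKTOP_IPS_2000)))  # ~2015: 100-node cluster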

Neural network architectures: getting deeper all the time. AlexNet: Krizhevsky, Sutskever & Hinton (2012); Zeiler & Fergus (2013); VGG: Simonyan & Zisserman (2014); GoogLeNet: Szegedy et al. (2014); ResNet: He et al. (2015). My talk will focus on a higher level of abstraction.
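
As a rough illustration of "getting deeper all the time", the following sketch (assuming a recent torchvision) counts the parameterized conv/linear modules in stock implementations of the architectures named above; this counts modules rather than depth along a single path, so GoogLeNet's figure reflects its parallel branches:

    import torch.nn as nn
    import torchvision.models as models

    # Count parameterized conv/linear modules in stock torchvision versions
    # of the architectures named on the slide (illustrative only).
    nets = {
        "AlexNet (2012)": models.alexnet(weights=None),
        "VGG-16 (2014)": models.vgg16(weights=None),
        "GoogLeNet (2014)": models.googlenet(weights=None, init_weights=True),
        "ResNet-50 (2015)": models.resnet50(weights=None),
    }
    for name, net in nets.items():
        n = sum(isinstance(m, (nn.Conv2d, nn.Linear)) for m in net.modules())
        print(f"{name}: {n} conv/linear modules")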

Research is the Art of the Soluble

The Hilbert Problems of Computer Vision. Jitendra Malik. Presented at the Bay Area Vision Meeting, UC Santa Cruz, 2004.

Forty years of computer vision, 1963-2003. 1960s: beginnings in artificial intelligence, image processing and pattern recognition. 1970s: foundational work on image formation: Horn, Koenderink, Longuet-Higgins. 1980s: vision as applied mathematics: geometry, multi-scale analysis, control theory, optimization. 1990s: geometric analysis largely completed; probabilistic/learning approaches in full swing; successful applications in graphics, biometrics, HCI.

And now, back to basics: the classic problem of understanding the scene from its image(s). Central question: the interplay of bottom-up and top-down information.

Early Vision What can we learn from image statistics that we didn't know already? How far can bottom-up image segmentation go? How do we make inferences from shading and texture patterns in natural images?

Static Scene Understanding What is the interaction between segmentation and recognition? What is the interaction between scenes, objects, and parts? What is the role of design vs. learning in recognition systems?

Dynamic Scene Understanding What is the role of high-level knowledge in long range motion correspondence? How do we find and track articulated structures? How do we represent "movemes" and actions?

With the benefit of hindsight

Early Vision. What can we learn from image statistics that we didn't know already? Training a deep multi-layer neural network on a supervised learning task develops general representations. How far can bottom-up image segmentation go? It can produce a small set of object proposals which can then be assigned labels by a classifier; sliding windows are no longer necessary. How do we make inferences from shading and texture patterns in natural images? Instead of trying to model inverse optics, we can use learning. If data is sparse, one needs priors with few parameters; given enough data, one can use nonparametric techniques like neural networks.
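
A minimal sketch of the proposal-then-classify pattern mentioned above, with generate_proposals and score_box as hypothetical stand-ins for a bottom-up proposal method and a learned classifier:

    import numpy as np

    def generate_proposals(image, n=200):
        # Stand-in for bottom-up segmentation / object proposals:
        # here just n random boxes (x0, y0, x1, y1).
        h, w = image.shape[:2]
        rng = np.random.default_rng(0)
        x0 = rng.integers(0, w - 1, n); y0 = rng.integers(0, h - 1, n)
        x1 = rng.integers(x0 + 1, w, n); y1 = rng.integers(y0 + 1, h, n)
        return np.stack([x0, y0, x1, y1], axis=1)

    def score_box(image, box):
        # Stand-in for a learned classifier scoring one cropped region.
        x0, y0, x1, y1 = box
        return image[y0:y1, x0:x1].mean()

    image = np.random.rand(240, 320)
    proposals = generate_proposals(image)                       # a few hundred boxes
    scores = np.array([score_box(image, b) for b in proposals])
    detections = proposals[scores > 0.55]                       # vs. ~10^5 sliding windows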

Static Scene Understanding. What is the interaction between segmentation and recognition? Bidirectional information flow. What is the interaction between scenes, objects, and parts? Still open; context has not yet lived up to its promise. What is the role of design vs. learning in recognition systems? Learn as much as you can from data. Don't design features; design architectures.

The Three R's of Vision: Recognition, Reconstruction, Reorganization. Each of the 6 directed arcs in this diagram is a useful direction of information flow.

Simultaneous Detection & Segmentation. Hariharan, Arbeláez, Girshick & Malik (2014, 2015). We mark the pixels corresponding to an object instance, not just its bounding box.
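
A toy illustration of why an instance mask is the richer output: the box is always recoverable from the mask, but not vice versa (the mask coordinates below are made up):

    import numpy as np

    mask = np.zeros((100, 100), dtype=bool)   # hypothetical instance mask
    mask[30:60, 40:80] = True                 # pixels belonging to the instance

    ys, xs = np.nonzero(mask)
    box = (xs.min(), ys.min(), xs.max(), ys.max())   # (x0, y0, x1, y1) = (40, 30, 79, 59)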

Dynamic Scene Understanding. What is the role of high-level knowledge in long-range motion correspondence? How to find good correspondences can be learnt; an extreme case is FlowNet (Brox), which shows that even optical flow computation can be learnt. How do we find and track articulated structures? Great progress in finding human keypoints. How do we represent "movemes" and actions? Still open; we don't understand the hierarchical structure of activity and events.

Human Pose Estimation with Iterative Error Feedback. Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, Jitendra Malik.
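
A minimal sketch of the iterative error feedback idea: start from a canonical pose and repeatedly predict a bounded correction conditioned on the image and the current estimate. Here predict_correction is a hypothetical stand-in for the trained ConvNet:

    import numpy as np

    def predict_correction(image, keypoints):
        # Toy stand-in: nudge keypoints toward the image centre with a
        # bounded step, as the learned corrector would.
        h, w = image.shape[:2]
        centre = np.array([w / 2.0, h / 2.0])
        return np.clip(centre - keypoints, -20, 20)

    image = np.random.rand(240, 320)
    keypoints = np.tile([[10.0, 10.0]], (17, 1))   # initial canonical guess, 17 joints
    for _ in range(4):                             # a few feedback iterations
        keypoints = keypoints + predict_correction(image, keypoints)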

Results on the MPI dataset, assuming person scale is known (upper) or unknown (lower).

The (new) Hilbert Problems of Computer Vision. Jitendra Malik. Workshop, Santiago, 2015.

People, Places and Things. Model every place in the world; model every object category; model humans and develop algorithms for social perception.

Reconstructing the world. Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality. We can reconstruct scenes automatically from huge collections of photos downloaded from the Internet. Snavely, Seitz, Szeliski. Reconstructing the World from Internet Photo Collections.

Reconstructing the world. Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality. We can reconstruct scenes that vary over time. Matzen & Snavely. Scene Chronology. ECCV 2014.

Reconstructing the world. Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality. We can reconstruct scenes that vary over time. Matzen & Snavely. Scene Chronology. ECCV 2014. Martin-Brualla, Gallup, Seitz. Time-lapse Mining from Internet Photos.

Reconstructing the great indoors using depth cameras: semantic reconstruction of rooms and objects (rendering, 3D mesh, point cloud). Choi, Zhou, Koltun. Robust Reconstruction of Indoor Scenes. CVPR 2015. Ikehata, Yan, Furukawa. Structured Indoor Modeling. ICCV 2015.

ShapeNet (Stanford & Princeton)

3D Reconstruction from a Single Image. Kar, Tulsiani, Carreira & Malik (2015). Pipeline: object detection and instance segmentation (R-CNN, SDS; Girshick et al., Hariharan et al.; CVPR 2014, CVPR 2015) -> viewpoint estimation (Viewpoints and Keypoints; Tulsiani & Malik, CVPR 2015) -> deformable 3D model and high-frequency depth map -> category-specific 3D reconstruction.

Basis Shape Models
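
In a basis shape model, an instance shape is the category mean shape plus a small linear combination of learned deformation bases, S = S_bar + sum_k alpha_k * B_k. A toy numpy sketch with made-up numbers (in the actual system the bases are learned per category):

    import numpy as np

    n_points, n_bases = 30, 5
    mean_shape = np.random.rand(n_points, 3)        # S_bar: N x 3 category mean shape
    bases = np.random.rand(n_bases, n_points, 3)    # B_k: learned deformation bases
    alpha = np.array([0.3, -0.1, 0.0, 0.2, 0.05])   # per-instance coefficients

    # S = S_bar + sum_k alpha_k * B_k
    instance_shape = mean_shape + np.tensordot(alpha, bases, axes=1)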

Social Perception. Computers today have pitifully low social intelligence. We need to understand the internal state of humans as they interact with each other and the external world. Examples: emotional state, body language, current goals.

Towards Human-Level AI. First, let's look at the evidence from child development.

The Development of Embodied Cognition: Six Lessons from Babies. Linda Smith & Michael Gasser.

The Six Lessons: be multi-modal; be incremental; be physical; explore; be social; use language. An example: Learning to See by Moving, P. Agrawal, J. Carreira, J. Malik (ICCV 2015).

Towards Human-Level AI: perceptual robotics; visual grounding of language; acquiring visual commonsense from observation and interaction.

Scene Understanding from RGB-D Images: Object Detection, Instance Segmentation & Pose Estimation. Saurabh Gupta, Ross Girshick, Pablo Arbeláez, Jitendra Malik. UC Berkeley.

Pipeline: input (color and depth image pair) -> reorganization (contour detection, region proposal generation) -> recognition (semantic segmentation, object detection, instance segmentation) -> detailed 3D understanding (pose estimation).

Instance Segmentation

What we would like to infer: will person B put some money into person C's tip bag?

Labeling Gupta & Malik (2015)

Events, e.g. a meal at a restaurant. The classical AI/cognitive science solution: schemas (frames, scripts, etc.). To have a robust, visually grounded solution, we need to learn the equivalent from video plus knowledge-graph-like structures. Perhaps best tackled in particular domains, e.g. team sports, instructional videos, etc.
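
As a toy illustration (all names and fields here are made up, not from the talk), a classical hand-written restaurant-meal script/schema might look like the data structure below; the slide's point is that the equivalent should instead be learned from video plus knowledge-graph-like structures:

    # Hypothetical hand-written schema in the spirit of Schank-style scripts;
    # the roles, props and ordering are illustrative only.
    restaurant_meal = {
        "roles": ["customer", "waiter", "cook"],
        "props": ["table", "menu", "food", "bill", "money"],
        "events": ["enter", "be_seated", "order", "eat", "pay_bill", "leave"],
    }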

Core problems in vision: the relationship between parts, objects and scenes; the hierarchical structure of human behavior: movement, goals, actions and events. ACTION = MOVEMENT + GOAL.

Core problems in learning. Should learning be done end-to-end for specific tasks (no semantics for intermediate layers), or do we want intermediate representations to emerge, driven by the need to solve multiple tasks? Why are neural network solutions so remarkably replicable? Why do we not get stuck in bad local minima? What is a taxonomy of learning problems that is aligned with what we know from child development?

The best is yet to come... Let's meet again in 2025.