The Hilbert Problems of Computer Vision. Jitendra Malik UC Berkeley & Google, Inc.

This talk
- The computational power of the human brain
- Research is the art of the soluble
- Hilbert problems, circa 2004
- Hilbert problems, circa 2015

Moravec's argument (1998), from ROBOT: Mere Machine to Transcendent Mind
- 1 neuron = 1000 instructions/sec
- 1 synapse = 1 byte of information
- The human brain then processes 10^14 IPS and has 10^14 bytes of storage
- In 2000, we have 10^9 IPS and 10^9 bytes on a desktop machine
- Assuming Moore's law, we obtain human-level computing power in 2025, or with a cluster of 100 nodes in 2015
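The dates in this argument can be checked with a few lines of arithmetic. The sketch below takes the 10^14 and 10^9 IPS figures from the slide; the 18-month doubling period is an assumption (the slide only says "Moore's law"), and `years_to_reach` is a hypothetical helper name.

```python
import math

def years_to_reach(target_ips, current_ips, doubling_years=1.5):
    """Years of steady doubling needed to grow current_ips to target_ips.

    doubling_years=1.5 is an assumed Moore's-law doubling period.
    """
    return math.log2(target_ips / current_ips) * doubling_years

BRAIN_IPS = 1e14         # from the slide
DESKTOP_IPS_2000 = 1e9   # from the slide

# A single desktop machine must close a factor of 10^5.
single_machine_year = 2000 + years_to_reach(BRAIN_IPS, DESKTOP_IPS_2000)

# A 100-node cluster starts at 10^11 IPS, so only a factor of 10^3 remains.
cluster_year = 2000 + years_to_reach(BRAIN_IPS, 100 * DESKTOP_IPS_2000)

print(round(single_machine_year), round(cluster_year))
```

Under that assumed doubling period the projection lands on the slide's dates. A 24-month doubling period would push both dates later, so the result is sensitive to this choice.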

Neural network architectures: Getting Deeper All The Time
- AlexNet: Krizhevsky, Sutskever & Hinton (2012)
- Zeiler & Fergus (2013)
- VGG: Simonyan & Zisserman (2014)
- GoogLeNet: Szegedy et al. (2014)
- ResNet: He et al. (2015)
My talk will focus on a higher level of abstraction

Research is the Art of the Soluble

The Hilbert Problems of Computer Vision
Jitendra Malik
Presented at Bay Area Vision Meeting, UC Santa Cruz, 2004

Forty years of computer vision, 1963-2003
- 1960s: Beginnings in artificial intelligence, image processing and pattern recognition
- 1970s: Foundational work on image formation: Horn, Koenderink, Longuet-Higgins
- 1980s: Vision as applied mathematics: geometry, multi-scale analysis, control theory, optimization
- 1990s: Geometric analysis largely completed. Probabilistic/learning approaches in full swing. Successful applications in graphics, biometrics, HCI

And now
- Back to basics: the classic problem of understanding the scene from its image(s)
- Central question: interplay of bottom-up and top-down information

Early Vision
- What can we learn from image statistics that we didn't know already?
- How far can bottom-up image segmentation go?
- How do we make inferences from shading and texture patterns in natural images?

Static Scene Understanding
- What is the interaction between segmentation and recognition?
- What is the interaction between scenes, objects, and parts?
- What is the role of design vs. learning in recognition systems?

Dynamic Scene Understanding
- What is the role of high-level knowledge in long-range motion correspondence?
- How do we find and track articulated structures?
- How do we represent "movemes" and actions?

With the benefit of hindsight

Early Vision
- What can we learn from image statistics that we didn't know already?
  Training a deep multi-layer neural network on a supervised learning task develops general representations.
- How far can bottom-up image segmentation go?
  It can produce a small set of object proposals, which can then be assigned labels by a classifier. Sliding windows are no longer necessary.
- How do we make inferences from shading and texture patterns in natural images?
  Instead of trying to model inverse optics, we can use learning. If data is sparse, one needs to use priors with few parameters; given enough data, one can use nonparametric techniques like neural networks.
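To give a feel for why a small set of proposals can replace sliding windows, the sketch below counts the windows an exhaustive scan would have to classify. Every number in it (image size, window sizes, stride, proposal budget) is an illustrative assumption, not a figure from the talk, and `sliding_window_count` is a hypothetical helper.

```python
def sliding_window_count(width, height, window_sizes, stride):
    """Count window positions over all window sizes at a fixed stride."""
    total = 0
    for win_w, win_h in window_sizes:
        if win_w > width or win_h > height:
            continue  # this window does not fit in the image
        cols = (width - win_w) // stride + 1
        rows = (height - win_h) // stride + 1
        total += cols * rows
    return total

# All values below are illustrative assumptions, not figures from the talk.
IMAGE_W, IMAGE_H = 640, 480
WINDOW_SIZES = [(64, 64), (128, 128), (256, 256)]  # square windows only
STRIDE = 8
NUM_PROPOSALS = 2000  # assumed budget for the "small set" of proposals

windows = sliding_window_count(IMAGE_W, IMAGE_H, WINDOW_SIZES, STRIDE)
print(f"sliding windows: {windows}, proposals: {NUM_PROPOSALS}")
```

Adding more aspect ratios or a finer stride multiplies the window count, while the proposal budget stays fixed, which is what makes an expensive classifier per candidate affordable.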

Static Scene Understanding
- What is the interaction between segmentation and recognition?
  Bidirectional information flow.
- What is the interaction between scenes, objects, and parts?
  Still open. Context has not yet lived up to its promise.
- What is the role of design vs. learning in recognition systems?
  Learn as much as you can from data. Don't design features. Design architectures.

The Three R's of Vision
- Recognition
- Reconstruction
- Reorganization
Each of the 6 directed arcs in this diagram is a useful direction of information flow
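The count of six follows from three nodes, each with an arc to the other two. A minimal sketch that enumerates them (the tuple representation of an arc is only an illustration):

```python
from itertools import permutations

THREE_RS = ("Recognition", "Reconstruction", "Reorganization")

# Each ordered pair (source, target) is one directed arc of information flow.
arcs = list(permutations(THREE_RS, 2))

for source, target in arcs:
    print(f"{source} -> {target}")
```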

Simultaneous Detection & Segmentation
Hariharan, Arbeláez, Girshick & Malik (2014, 2015)
We mark the pixels corresponding to an object instance, not just its bounding box.
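The difference between a pixel mask and a bounding box can be seen on a toy example. The sketch below is not the authors' method; it only shows that a box can be derived from a mask but not the reverse. The 5x6 mask and both helper functions are hypothetical.

```python
def bounding_box(mask):
    """Return (row_min, col_min, row_max, col_max), inclusive, of nonzero pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]

def box_area(box):
    """Number of pixels covered by an inclusive bounding box."""
    row_min, col_min, row_max, col_max = box
    return (row_max - row_min + 1) * (col_max - col_min + 1)

# Hypothetical instance mask: 1 marks pixels that belong to the object.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

box = bounding_box(mask)
mask_pixels = sum(map(sum, mask))
print(box, box_area(box), mask_pixels)
```

In this toy case the box covers 12 pixels but only 7 of them belong to the object, so the box alone says nothing about which 7.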

Dynamic Scene Understanding
- What is the role of high-level knowledge in long-range motion correspondence?
  How to find good correspondences can be learnt. An extreme case is FlowNet (Brox), which shows how even optical flow computation can be learnt.
- How do we find and track articulated structures?
  Great progress in finding human keypoints.
- How do we represent "movemes" and actions?
  Still open. We don't understand the hierarchical structure of activity and events.

Human Pose Estimation with Iterative Error Feedback
Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, Jitendra Malik

Results on MPI dataset assuming person scale is known (upper) or unknown (lower)

The (new) Hilbert Problems of Computer Vision
Jitendra Malik
Workshop, Santiago, 2015

People, Places and Things
- Model every place in the world
- Model every object category
- Model humans and develop algorithms for social perception

Reconstructing the world
Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality.
- We can reconstruct scenes automatically from huge collections of photos downloaded from the Internet.
Snavely, Seitz, Szeliski. Reconstructing the World from Internet Photo Collections.

Reconstructing the world
Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality.
- We can reconstruct scenes that vary over time.
Matzen & Snavely. Scene Chronology. ECCV 2014.

Reconstructing the great indoors
- Using depth cameras: Choi, Zhou, Koltun. Robust Reconstruction of Indoor Scenes. CVPR 2015
- Semantic reconstruction of rooms and objects: Ikehata, Yan, Furukawa. Structured Indoor Modeling. ICCV 2015
[Figure panels: rendering, 3D mesh, point cloud]

ShapeNet (Stanford & Princeton)

3D Reconstruction from a Single Image
Kar, Tulsiani, Carreira & Malik (2015)
[Pipeline diagram, input: Image of a car]
- Object Detection and Instance Segmentation: R-CNN, SDS (Girshick et al., CVPR 2014; Hariharan et al., CVPR 2015)
- Viewpoint Estimation: Viewpoints and Keypoints (Tulsiani & Malik, CVPR 2015)
- Category Specific 3D Reconstruction: Deformable 3D model, High Frequency Depth Map

Basis Shape Models

Social Perception
- Computers today have pitifully low social intelligence
- We need to understand the internal state of humans as they interact with each other and the external world
- Examples: emotional state, body language, current goals

Towards Human-Level AI
First let's look at the evidence from Child Development

The Development of Embodied Cognition: Six Lessons from Babies
Linda Smith & Michael Gasser

The Six Lessons
- Be multi-modal
- Be incremental
- Be physical
- Explore
- Be social
- Use language
An example: Learning to see by moving, P. Agrawal, J. Carreira, J. Malik (ICCV 2015)

Towards Human-Level AI
- Perceptual Robotics
- Visual Grounding of Language
- Acquire Visual Commonsense from Observation and Interaction

Scene Understanding from RGB-D Images: Object Detection, Instance Segmentation & Pose Estimation
Saurabh Gupta, Ross Girshick, Pablo Arbeláez, Jitendra Malik
UC Berkeley

[Pipeline diagram]
- Input: Color and Depth Image Pair
- Re-organization: Contour Detection, Region Proposal Generation
- Recognition: Semantic Segm., Object Detection
- Detailed 3D Understanding: Instance Segm., Pose Estimation

Instance Segmentation

What we would like to infer
Will Person B put some money into Person C's tip bag?

Labeling Gupta & Malik (2015)

Events, e.g. a meal at a restaurant
- Classical AI/Cognitive Science solution: schemas (frames, scripts, etc.)
- To have a robust, visually grounded solution we need to learn the equivalent from video + Knowledge Graph-like structures
- Perhaps best tackled in particular domains, e.g. team sports, instructional videos, etc.

Core problems in vision
- Relationship between parts, objects and scenes
- The hierarchical structure of human behavior: movement, goals, actions and events
ACTION = MOVEMENT + GOAL

Core problems in learning
- Should learning be done end-to-end for specific tasks (no semantics for intermediate layers), or do we want intermediate representations to emerge, driven by the need to solve multiple tasks?
- Why are neural network solutions so remarkably replicable? Why do we not get stuck in bad local minima?
- What is a taxonomy of learning problems that is aligned with what we know from child development?

The best is yet to come... Let's meet again in 2025