The Hilbert Problems of Computer Vision. Jitendra Malik, UC Berkeley & Google, Inc.
This talk: The computational power of the human brain; Research is the art of the soluble; Hilbert problems, circa 2004; Hilbert problems, circa 2015
Moravec's argument (1998), ROBOT: Mere Machine to Transcendent Mind. 1 neuron = 1000 instructions/sec; 1 synapse = 1 byte of information. The human brain then processes 10^14 IPS and has 10^14 bytes of storage. In 2000, we have 10^9 IPS and 10^9 bytes on a desktop machine. Assuming Moore's law, we obtain human-level computing power in 2025, or with a cluster of 100 nodes in 2015.
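The extrapolation on this slide can be checked with a few lines of arithmetic. The sketch below assumes that "Moore's law" means computing power doubles every 18 months; that doubling period is an assumption, not something stated on the slide, and the function name `years_to_reach` is purely illustrative.

```python
import math

def years_to_reach(target_ips, start_ips, doubling_years=1.5):
    """Years needed to grow from start_ips to target_ips,
    assuming capacity doubles every `doubling_years` years (an assumption)."""
    doublings = math.log2(target_ips / start_ips)
    return doublings * doubling_years

BRAIN_IPS = 1e14          # from the slide: human brain, 10^14 IPS
DESKTOP_IPS_2000 = 1e9    # from the slide: desktop machine in 2000, 10^9 IPS

# A single desktop machine must close a gap of 10^5.
single_machine_year = 2000 + years_to_reach(BRAIN_IPS, DESKTOP_IPS_2000)

# A cluster of 100 nodes starts at 10^11 IPS, so the gap is only 10^3.
cluster_year = 2000 + years_to_reach(BRAIN_IPS, 100 * DESKTOP_IPS_2000)

print(round(single_machine_year), round(cluster_year))
```

Under the 18-month assumption the two dates land on roughly 2025 and 2015, matching the slide; a 24-month doubling period would push both dates later.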
Neural network architectures: getting deeper all the time. AlexNet: Krizhevsky, Sutskever & Hinton (2012); Zeiler & Fergus (2013); VGG: Simonyan & Zisserman (2014); GoogLeNet: Szegedy et al. (2014); ResNet: He et al. (2015). My talk will focus on a higher level of abstraction.
Research is the Art of the Soluble
The Hilbert Problems of Computer Vision. Jitendra Malik. Presented at Bay Area Vision Meeting, UC Santa Cruz, 2004
Forty years of computer vision, 1963-2003. 1960s: Beginnings in artificial intelligence, image processing and pattern recognition. 1970s: Foundational work on image formation: Horn, Koenderink, Longuet-Higgins. 1980s: Vision as applied mathematics: geometry, multi-scale analysis, control theory, optimization. 1990s: Geometric analysis largely completed; probabilistic/learning approaches in full swing; successful applications in graphics, biometrics, HCI.
And now: back to basics, the classic problem of understanding the scene from its image(s). Central question: the interplay of bottom-up and top-down information.
Early Vision: What can we learn from image statistics that we didn't know already? How far can bottom-up image segmentation go? How do we make inferences from shading and texture patterns in natural images?
Static Scene Understanding: What is the interaction between segmentation and recognition? What is the interaction between scenes, objects, and parts? What is the role of design vs. learning in recognition systems?
Dynamic Scene Understanding: What is the role of high-level knowledge in long-range motion correspondence? How do we find and track articulated structures? How do we represent "movemes" and actions?
With the benefit of hindsight
Early Vision: What can we learn from image statistics that we didn't know already? Training a deep multi-layer neural network on a supervised learning task develops general representations. How far can bottom-up image segmentation go? It can produce a small set of object proposals, which can then be assigned labels by a classifier; sliding windows are no longer necessary. How do we make inferences from shading and texture patterns in natural images? Instead of trying to model inverse optics, we can use learning. If data is sparse, one needs to use priors with few parameters; given enough data, one can use nonparametric techniques like neural networks.
Static Scene Understanding: What is the interaction between segmentation and recognition? Bidirectional information flow. What is the interaction between scenes, objects, and parts? Still open. Context has not yet lived up to its promise. What is the role of design vs. learning in recognition systems? Learn as much as you can from data. Don't design features. Design architectures.
The Three R's of Vision: Recognition, Reconstruction, Reorganization. Each of the 6 directed arcs in this diagram is a useful direction of information flow.
Simultaneous Detection & Segmentation. Hariharan, Arbeláez, Girshick & Malik (2014, 2015). We mark the pixels corresponding to an object instance, not just its bounding box.
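To make the difference between a bounding box and a per-pixel instance mask concrete, here is a minimal sketch on a hand-made toy mask. It is only an illustration of the two output representations, not the method of Hariharan et al.; the mask values and the helper name `bounding_box` are invented for the example.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right), inclusive, of the nonzero pixels
    in a 2D mask given as a list of equal-length rows."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1], cols[-1]

# Toy instance mask: 1 marks pixels belonging to the object (a small cross).
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
]

top, left, bottom, right = bounding_box(mask)
box_area = (bottom - top + 1) * (right - left + 1)   # pixels inside the box
mask_area = sum(sum(row) for row in mask)            # pixels on the object

print((top, left, bottom, right), box_area, mask_area)
```

The box can always be recovered from the mask, but not the other way round: here the box covers 9 pixels while only 5 of them belong to the object.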
Dynamic Scene Understanding: What is the role of high-level knowledge in long-range motion correspondence? How to find good correspondences can be learnt. An extreme case is FlowNet (Brox), which shows how even optical flow computation can be learnt. How do we find and track articulated structures? Great progress has been made in finding human keypoints. How do we represent "movemes" and actions? Still open. We don't understand the hierarchical structure of activity and events.
Human Pose Estimation with Iterative Error Feedback. Joao Carreira, Pulkit Agrawal, Katerina Fragkiadaki, Jitendra Malik
Results on MPI dataset assuming person scale is known (upper) or unknown (lower)
The (new) Hilbert Problems of Computer Vision. Jitendra Malik. Workshop, Santiago, 2015
People, Places and Things: Model every place in the world; model every object category; model humans and develop algorithms for social perception.
Reconstructing the world: Over the past 10 years, 3D modeling from images has made huge advances in scale, quality, and generality. We can reconstruct scenes automatically from huge collections of photos downloaded from the Internet. Snavely, Seitz, Szeliski. Reconstructing the World from Internet Photo Collections.
Reconstructing the world: We can reconstruct scenes that vary over time. Matzen & Snavely. Scene Chronology. ECCV 2014. Martin-Brualla, Gallup, Seitz. Time-lapse Mining from Internet Photos.
Reconstructing the great indoors. Using depth cameras: Choi, Zhou, Koltun. Robust Reconstruction of Indoor Scenes. CVPR 2015. Using semantic reconstruction of rooms and objects: Ikehata, Yan, Furukawa. Structured Indoor Modeling. ICCV 2015. [Figure panels: point cloud, 3D mesh, rendering]
ShapeNet (Stanford & Princeton)
3D Reconstruction from a Single Image. Kar, Tulsiani, Carreira & Malik (2015). Pipeline: Image; then Object Detection and Instance Segmentation (R-CNN, SDS; Girshick et al., CVPR 2014; Hariharan et al., CVPR 2015); then Viewpoint Estimation (Viewpoints and Keypoints; Tulsiani & Malik, CVPR 2015); then Category-Specific 3D Reconstruction (deformable 3D model, high-frequency depth map).
Basis Shape Models
Social Perception: Computers today have pitifully low social intelligence. We need to understand the internal state of humans as they interact with each other and the external world. Examples: emotional state, body language, current goals.
Towards Human-Level AI: First let's look at the evidence from child development.
The Development of Embodied Cognition: Six Lessons from Babies. Linda Smith & Michael Gasser
The Six Lessons: Be multi-modal; be incremental; be physical; explore; be social; use language. An example: Learning to See by Moving, P. Agrawal, J. Carreira, J. Malik (ICCV 2015).
Towards Human-Level AI: Perceptual Robotics; Visual Grounding of Language; Acquire Visual Commonsense from Observation and Interaction
Scene Understanding from RGB-D Images: Object Detection, Instance Segmentation & Pose Estimation. Saurabh Gupta, Ross Girshick, Pablo Arbeláez, Jitendra Malik, UC Berkeley
[Pipeline diagram] Input: color and depth image pair. Re-organization: contour detection, region proposal generation. Recognition: semantic segmentation, object detection. Detailed 3D understanding: instance segmentation, pose estimation.
Instance Segmentation
What we would like to infer: Will person B put some money into person C's tip bag?
Labeling Gupta & Malik (2015)
Events, e.g. a meal at a restaurant. Classical AI/cognitive science solution: schemas (frames, scripts, etc.). To have a robust, visually grounded solution, we need to learn the equivalent from video plus knowledge-graph-like structures. Perhaps best tackled in particular domains, e.g. team sports, instructional videos, etc.
Core problems in vision: The relationship between parts, objects and scenes. The hierarchical structure of human behavior: movement, goals, actions and events. ACTION = MOVEMENT + GOAL
Core problems in learning: Should learning be done end-to-end for specific tasks (no semantics for intermediate layers), or do we want intermediate representations to emerge, driven by the need to solve multiple tasks? Why are neural network solutions so remarkably replicable? Why do we not get stuck in bad local minima? What is a taxonomy of learning problems that is aligned with what we know from child development?
The best is yet to come... Let's meet again in 2025.