3D Attention-Driven Depth Acquisition for Object Identification


1 3D Attention-Driven Depth Acquisition for Object Identification
Kai Xu, Yifei Shi, Lintao Zheng, Junyu Zhang, Min Liu, Hui Huang, Hao Su, Daniel Cohen-Or, and Baoquan Chen
National University of Defense Technology, Shandong University, Shenzhen University, SIAT, Stanford University, Tel-Aviv University

2 Background & motivation: robotic indoor scene modeling; perception of objects

3 Background & motivation: indoor environment acquisition and modeling. Dense reconstruction [Nießner et al. 2013]; object extraction [Xu et al. 2015]

4 Background & motivation What are these objects?

6 Active object recognition

7 Active object recognition

8 Problem setting: a robot actively acquires new observations to gradually increase the confidence of object recognition. Two key components:
Object classification: estimate the object class based on the observations acquired so far.
View planning: predict the next-best view (NBV) that maximizes information gain.
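The loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: `acquire`, `classify`, and `plan_next_view` are hypothetical callables, and the confidence threshold and view budget are assumed stopping rules that the slides do not specify.

```python
from typing import Callable, Dict, List, Tuple

View = Tuple[float, float]  # hypothetical (theta, phi) view parameters


def active_recognition(
    initial_view: View,
    acquire: Callable[[View], object],
    classify: Callable[[List[object]], Dict[str, float]],
    plan_next_view: Callable[[List[View], Dict[str, float]], View],
    confidence: float = 0.9,  # assumed stopping threshold
    max_views: int = 5,       # assumed view budget
) -> Tuple[str, List[View]]:
    """Acquire views until the top class probability reaches `confidence`
    or the view budget is exhausted."""
    views = [initial_view]
    observations = [acquire(initial_view)]
    belief = classify(observations)  # class -> probability
    while max(belief.values()) < confidence and len(views) < max_views:
        next_view = plan_next_view(views, belief)  # view planning
        views.append(next_view)
        observations.append(acquire(next_view))    # new observation
        belief = classify(observations)            # object classification
    label = max(belief, key=belief.get)
    return label, views
```

The classifier sees all observations gathered so far, which mirrors the "partial and progressive" setting: each new view refines the belief instead of replacing it.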

9 The main challenge: observation is partial and progressive. Shape description/matching with partial data is hard; observations come from varying views.

10 The main challenge: observation is partial and progressive. View planning: given the observed view, how can you know which unobserved view is better without knowing its observation?

11 The main challenge: real indoor scenes are often cluttered. Clutter degrades recognition accuracy and invalidates the viewing policy learned offline.

12 Related work

13 Related work: online scene analysis and modeling. Plane/object extraction [Zhang et al. 2014]; SemanticPaint [Valentin et al. 2015]

14 Related work: active reconstruction and recognition. Next-best-view for reconstruction [Wu et al. 2014]; next-best-view for recognition [Wu et al. 2015]

15 Method

16 The general framework

17 The general framework [Figure labels: Recognition, Goal, Action, View planning, Belief, Observe]

18 An attentional formulation: "Humans focus attention selectively on parts of the visual space to acquire information when and where it is needed, and combine information from different fixations over time to build up an internal representation of the scene." (Ronald Rensink) Attention models in prior work: handwriting recognition [Mnih et al. 2014]; image caption generation [Xu et al. 2015]

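19 Recurrent Attention Model: Recurrent Neural Networks (RNN) aggregate information over time. [Figure: RNN unrolled over steps t-1, t, t+1, with inputs x_{t-1}, x_t, x_{t+1}, hidden states h_{t-1}, h_t, h_{t+1}, outputs y_{t-1}, y_t, y_{t+1}, and weights W_ih (input to hidden), W_hh (hidden to hidden), W_ho (hidden to output)]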
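One step of the recurrence in this figure can be sketched as below. The weight names W_ih, W_hh, and W_ho come from the slide; the tanh activation, the absence of bias terms, and the linear output are assumptions made only to keep the sketch short.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def rnn_step(
    x_t: Vector, h_prev: Vector, w_ih: Matrix, w_hh: Matrix, w_ho: Matrix
) -> Tuple[Vector, Vector]:
    """One recurrent step:
    h_t = tanh(W_ih x_t + W_hh h_{t-1})   (tanh is an assumed activation)
    y_t = W_ho h_t                        (linear output, assumed)
    """
    pre = [a + b for a, b in zip(matvec(w_ih, x_t), matvec(w_hh, h_prev))]
    h_t = [math.tanh(p) for p in pre]
    y_t = matvec(w_ho, h_t)
    return h_t, y_t
```

Because h_t depends on h_{t-1}, calling `rnn_step` repeatedly with each new input carries information from all earlier inputs forward, which is the "aggregate information" role the slide assigns to the RNN.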

20 View-based observation [Figure: initial view v_0 with observation I^(0); view v_t, parameterized by the angles θ_t and φ_t, with observation I^(t)]
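A view given by two angles can be turned into a camera position on a sphere around the object. The sketch below assumes θ is the azimuth and φ is the elevation, both in radians, with the object at the origin; the slide does not state the convention, so these choices and the function name `camera_position` are hypothetical.

```python
import math
from typing import Tuple


def camera_position(theta: float, phi: float, radius: float = 1.0) -> Tuple[float, float, float]:
    """Camera position on a viewing sphere centered on the object.

    Assumed convention: theta = azimuth about the z axis,
    phi = elevation above the xy plane, both in radians.
    """
    x = radius * math.cos(phi) * math.cos(theta)
    y = radius * math.cos(phi) * math.sin(theta)
    z = radius * math.sin(phi)
    return (x, y, z)
```

With this parameterization every view lies at the same distance from the object, so choosing the next view reduces to choosing the pair (θ, φ).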

21 3D Recurrent Attention Model [Figure: the network unrolled over three steps. At each step t, features are extracted from the observation I^(t) acquired at view (θ_t, φ_t), starting from the initial view (θ_0, φ_0). A view aggregation layer h_1^(t) feeds a classifier; a view selection layer h_2^(t) emits the next-best view (NBV) (θ_{t+1}, φ_{t+1}).]

22 3D Recurrent Attention Model [Figure: the same unrolled network with feature extraction done by a Multi-View CNN [Su et al. 2015]. CNN_1 features from each view are combined by view max-pooling; the view aggregation layer h_1^(t) classifies at every step; the view selection layer h_2^(t) emits NBVs l_1, l_2, ..., l_K.]
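The view max-pooling named on this slide can be illustrated with plain lists. This is a simplified sketch: real feature maps are tensors, and `max_pool_views` is a hypothetical helper name, not part of any library.

```python
from typing import List, Sequence


def max_pool_views(view_features: Sequence[Sequence[float]]) -> List[float]:
    """Element-wise maximum over per-view feature vectors.

    Each inner sequence is the feature vector of one view; the result
    has the same length and does not depend on the order of the views.
    """
    if not view_features:
        raise ValueError("at least one view feature vector is required")
    dim = len(view_features[0])
    if any(len(f) != dim for f in view_features):
        raise ValueError("all view feature vectors must have the same length")
    return [max(column) for column in zip(*view_features)]
```

For example, `max_pool_views([[0.1, 0.9, 0.3], [0.4, 0.2, 0.8]])` returns `[0.4, 0.9, 0.8]`. The output size is fixed regardless of how many views have been acquired, which is what lets the same classifier run after every new view.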

23 Network training: the CNN is trained by back-propagation. Rendering the image I^(i) from the view parameters (θ_i, φ_i) is non-differentiable, so view selection is trained with reinforcement learning.
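One common way to train through a non-differentiable step is a score-function (REINFORCE-style) policy gradient. The slide only says "reinforcement learning", so the estimator, the Gaussian policy over a single view parameter, and the mean-reward baseline in this sketch are all assumptions.

```python
from typing import Sequence


def reinforce_update(
    mean: float,
    actions: Sequence[float],
    rewards: Sequence[float],
    sigma: float = 1.0,
    learning_rate: float = 0.1,
) -> float:
    """One REINFORCE step for the mean of a Gaussian policy N(mean, sigma^2).

    The gradient of log N(a; mean, sigma^2) with respect to mean is
    (a - mean) / sigma^2. It is weighted by (reward - baseline), so no
    gradient has to flow through the rendering step that produced the reward.
    """
    if len(actions) != len(rewards) or not actions:
        raise ValueError("actions and rewards must be non-empty and equal in length")
    baseline = sum(rewards) / len(rewards)  # assumed variance-reduction baseline
    grad = sum(
        (r - baseline) * (a - mean) / sigma ** 2
        for a, r in zip(actions, rewards)
    ) / len(actions)
    return mean + learning_rate * grad  # gradient ascent on expected reward
```

Sampled actions that earned more than the baseline pull the policy mean toward themselves, and those that earned less push it away.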

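24 Reinforcement learning [Figure: agent-environment loop. The agent receives a state and a reward and chooses an action (depth acquisition, or stop?); the environment returns a reward that reflects how good the acquired depth is.]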

25 Reward: $r_t = H_t(p_t, p) + I_t(p_t, p_{t-1}) - C_t$, where $H_t$ is the prediction accuracy, $I_t$ the information gain, and $C_t$ the movement cost.
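The three labeled terms of the reward (prediction accuracy, information gain, movement cost) can be combined as in the sketch below. The slide does not define the terms, so every concrete choice here is an assumption: accuracy is the probability assigned to the true class, information gain is the drop in entropy between consecutive predictions, and movement cost is a weighted rotation angle subtracted as a penalty.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)


def step_reward(
    p_t: Sequence[float],
    p_prev: Sequence[float],
    true_class: int,
    move_angle: float,
    cost_weight: float = 0.1,  # assumed trade-off weight
) -> float:
    """Illustrative reward: accuracy + information gain - movement cost."""
    accuracy = p_t[true_class]                  # assumed accuracy term
    info_gain = entropy(p_prev) - entropy(p_t)  # assumed information gain
    cost = cost_weight * abs(move_angle)        # assumed movement cost
    return accuracy + info_gain - cost
```

Under these definitions a view is rewarded when it makes the prediction both more correct and more certain, and penalized in proportion to how far the robot must move to reach it.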

26 Part-level attention: occlusion; informative parts. How to distinguish these two chairs?

27 Attention extraction: Convolutional Neural Network; mid-level kernels in the CNN

28 Attention extraction [Figure labels: one wing; two wings]

29 Results and evaluation

30 Database
ShapeNet: 57,452 models, 57 categories
ModelNet40: 12,311 models, 40 categories
Each model is rendered from 52 sampled views; rendering with jittering yields 260 sampled views.

31 Timing
Database      MV-RNN train   MV-RNN test
ShapeNet      49 hr.         0.1 sec.
ModelNet40    22 hr.         0.1 sec.

32 Visualization of attentions [Figure: part-level attention shown along two view sequences]

33 NBV estimation, 40 classes [Plot: classification accuracy]

34 NBV estimation under occlusion [Plot: classification accuracy]

35 Results on real scenes

36 Results on real scenes

37 Results on real scenes

38 Limitations: recognizable objects; no contextual information

39 Future work: multi-modal recognition. What is this? Image database; shape database

40 Future: multi-robot scene reconstruction & understanding. AscTec Pelican; PR2; Turtlebot

41 Future: multi-robot attention model. Attention based on a shared internal representation?

42 Thank you. Q & A. More details: kevinkaixu.net and yifeishi.net
