AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)
Donggeun Yoo, Sunggyun Park, Joon-Young Lee, Anthony Paek, In So Kweon.

State-of-the-art frameworks for object detection. 1. Region-CNN framework. [Gkioxari et al., CVPR 14]
Pipeline: Object proposal -> CNN -> SVM -> NMS -> BB Reg.
The maximally scored region is prone to focus on a discriminative part (e.g. face) rather than the entire object (e.g. human body).

State-of-the-art frameworks for object detection. 2. Detection by CNN-regression. [Szegedy et al., NIPS 13]
[Figure: a CNN outputs x1, y1, x2, y2, the box corners (x1, y1) and (x2, y2).]
Direct mapping from an image to an exact bounding box is relatively difficult for a CNN.

Idea: Ensemble of weak predictions.
[Figure label: Stop signal.]

Model: Rather than a CNN regression model, use a CNN classification model.
Layers, input to output: Convolution, Normalization, Pooling, Convolution, Normalization, Pooling, Convolution, Convolution, Convolution, Fully connected, Fully connected.
Outputs: Top-left direction prediction, [ 3 directions, stop signal, no object ] in R^5. Bottom-right direction prediction, [ 3 directions, stop signal, no object ] in R^5.

Iterative test: Ensemble of weak directions.
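The iterative test can be sketched as the loop below. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict` is a hypothetical stand-in for the network, the label names, the unit step size, and `max_iters` are invented for the example, and the concrete directions per corner (right, down, diagonal for top-left; left, up, diagonal for bottom-right) are assumed. The slides only state that each corner has three directions, a stop signal, and a no-object class.

```python
# Hypothetical class labels; the slides only say "3 directions, stop signal, no object".
TL_MOVES = {"right": (1, 0), "down": (0, 1), "diag": (1, 1)}     # top-left corner moves inward
BR_MOVES = {"left": (-1, 0), "up": (0, -1), "diag": (-1, -1)}    # bottom-right corner moves inward


def iterative_test(box, predict, step=1, max_iters=1000):
    """Refine box = (x1, y1, x2, y2) until both corners predict "stop".

    predict(box) stands in for the network and returns (tl_label, br_label),
    each a direction name, "stop", or "none".
    Returns the final box, or None if the region is rejected.
    """
    x1, y1, x2, y2 = box
    for _ in range(max_iters):
        tl, br = predict((x1, y1, x2, y2))
        if tl == "none" or br == "none":
            return None                      # either corner reports no object
        if tl == "stop" and br == "stop":
            return (x1, y1, x2, y2)          # both corners agree the box is final
        if tl != "stop":
            dx, dy = TL_MOVES[tl]
            x1, y1 = x1 + step * dx, y1 + step * dy
        if br != "stop":
            dx, dy = BR_MOVES[br]
            x2, y2 = x2 + step * dx, y2 + step * dy
        if x1 >= x2 or y1 >= y2:
            return None                      # degenerate box; treat as failure
    return None                              # did not converge within max_iters


def make_toy_predictor(target):
    """Toy oracle (not a CNN) that steers each corner toward a known target box."""
    tx1, ty1, tx2, ty2 = target

    def label(need_x, need_y, names):
        if need_x and need_y:
            return "diag"
        if need_x:
            return names[0]
        if need_y:
            return names[1]
        return "stop"

    def predict(box):
        x1, y1, x2, y2 = box
        tl = label(x1 < tx1, y1 < ty1, ("right", "down"))
        br = label(x2 > tx2, y2 > ty2, ("left", "up"))
        return tl, br

    return predict


result = iterative_test((0, 0, 20, 20), make_toy_predictor((3, 5, 15, 12)))
print(result)
```

Each step is a weak prediction (a direction only, not a distance); the final box is the accumulation of many such steps, which is the sense in which the test is an ensemble.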

Training AttentionNet.
1. Generating training samples.
2. Minimizing the loss function by back-propagation and stochastic gradient descent:
$L = \frac{1}{2} L_{\text{softmax}}(y_{TL}, t_{TL}) + \frac{1}{2} L_{\text{softmax}}(y_{BR}, t_{BR})$
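The loss for a single training sample can be written out as in the sketch below. It assumes that L_softmax is the usual softmax cross-entropy, that y is the 5-way output vector of a corner, and that t is the index of the target class; the class order and the toy values are made up for illustration.

```python
import math


def softmax_cross_entropy(logits, target):
    """Softmax loss for one sample: -log(softmax(logits)[target])."""
    m = max(logits)                                        # subtract max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def attentionnet_loss(y_tl, t_tl, y_br, t_br):
    """L = 1/2 * L_softmax(y_TL, t_TL) + 1/2 * L_softmax(y_BR, t_BR)."""
    return 0.5 * softmax_cross_entropy(y_tl, t_tl) + 0.5 * softmax_cross_entropy(y_br, t_br)


# Toy 5-way outputs (3 directions, stop signal, no object); the class order is an assumption.
y_tl = [2.0, 0.5, 0.1, -1.0, -2.0]
y_br = [0.0, 0.0, 0.0, 0.0, 0.0]
loss = attentionnet_loss(y_tl, 0, y_br, 3)
print(loss)
```

The equal 1/2 weights mean that neither corner dominates the gradient.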

Result. (Good examples.)

Result. (Bad examples.)

How to detect multiple instances?

Extension to multiple-instance: 1. Fast multi-scale sliding-window search using a fully convolutional network.

*Fast extraction of multi-scale dense activations.
Idea: A fully connected layer can be implemented equivalently as a convolutional layer.
[Figure: the same network (Conv. 1 to Conv. 5, FC 6 to FC 8) applied to a 227 x 227 x 3 input and to a 322 x 322 x 3 input.]
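The equivalence between a fully connected layer and a convolution can be checked on a toy example. The sketch below is an illustration only: it uses a single channel, stride 1, a 3 x 3 patch size, and random integer weights, all of which are assumptions chosen to keep the example small and exact. It shows that sliding the FC weights over a larger image as a convolution kernel gives, at every position, the same output as cropping the patch and applying the FC layer.

```python
import random


def fully_connected(patch, weights, bias):
    """Apply an FC layer to one k x k patch; weights[o] is the k x k weight grid of unit o."""
    outputs = []
    for w, b in zip(weights, bias):
        s = b
        for w_row, p_row in zip(w, patch):
            for w_val, p_val in zip(w_row, p_row):
                s += w_val * p_val
        outputs.append(s)
    return outputs


def fc_as_convolution(image, weights, bias):
    """Reuse the FC weights as k x k convolution kernels over a larger image (stride 1).

    out[i][j] holds the output units for the patch whose top-left corner is (i, j).
    """
    k = len(weights[0])
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            units = []
            for w, b in zip(weights, bias):
                s = b
                for di in range(k):
                    for dj in range(k):
                        s += w[di][dj] * image[i + di][j + dj]
                units.append(s)
            out_row.append(units)
        out.append(out_row)
    return out


random.seed(0)
K, SIZE, UNITS = 3, 5, 2                     # toy sizes, not the network's real dimensions
weights = [[[random.randint(-3, 3) for _ in range(K)] for _ in range(K)] for _ in range(UNITS)]
bias = [random.randint(-3, 3) for _ in range(UNITS)]
image = [[random.randint(0, 9) for _ in range(SIZE)] for _ in range(SIZE)]

dense = fc_as_convolution(image, weights, bias)

# Compare against cropping every patch and running the FC layer on it.
matches = all(
    dense[i][j] == fully_connected([row[j:j + K] for row in image[i:i + K]], weights, bias)
    for i in range(SIZE - K + 1)
    for j in range(SIZE - K + 1)
)
print(len(dense), len(dense[0]), matches)
```

One pass over the larger input therefore yields a grid of outputs, one per patch, without cropping and forwarding each patch separately.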

*Fast extraction of multi-scale dense activations.
[Figure: multi-scale dense activations, labeled 4,096.]
Each activation vector comes from one patch.

Extension to multiple-instance: 2. Early rejection with { TL, BR } constraint.
Satisfying { TL, BR }: Start iterative test.
Not satisfying { TL, BR }: Reject.

Extension to multiple-instance: Overall architecture for sliding window search.

Extension to multiple-instance: Merging multiple bounding boxes.

Evaluation on PASCAL VOC Series.
[Chart: PASCAL VOC 2007 Person. 58.7 RCNN. AttentionNet. AttentionNet+RCNN.]
[Chart: PASCAL VOC 2012 Person. RCNN-based. AttentionNet. AttentionNet+RCNN.]
[Figure: Precision-recall curve on PASCAL VOC 2007 Person.]