Hybrid Learning Framework for Large-Scale Web Image Annotation and Localization

Size: px

Start display at page:

Download "Hybrid Learning Framework for Large-Scale Web Image Annotation and Localization"

Jasper Wiggins
5 years ago
Views:

1 .. Hybrid Learning Framework for Large-Scale Web Image Annotation and Localization Yong Li, Jing Liu, Yuhang Wang, Bingyuan Liu, Jun Fu, Yunze Gao, Hui Wu, Hang Song, Peng Ying, Hanqing lu Image & Video Analysis Group, Institute of Automation, Chinese Academy of Sciences September 9, 2015 Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

2 Outline.1 Introduction to The Image Annotation Task.2 The Proposed Hybrid Learning Framework.3 Experiments and Results.4 Conclusions Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

3 Introduction to The Image Annotation Task Two subtasks in Scalable Concept Image Annotation challenge Image concept detection and localisation: it is to annotate and localize 500,000 web images with 251 concepts. Generation of textual descriptions of images: it is to describe an image with a textual description of the visual content depicted in the image. We focus on the subtask 1. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

4 Introduction to The Image Annotation Task Two subtasks in Scalable Concept Image Annotation challenge Image concept detection and localisation: it is to annotate and localize 500,000 web images with 251 concepts. Generation of textual descriptions of images: it is to describe an image with a textual description of the visual content depicted in the image. We focus on the subtask 1. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

5 The Proposed Hybrid Learning Framework We propose a two-stage hybrid learning framework as follows. Annotation Localization Query Image Webpage Description Annotation By Classification Annotation By Search Localization By Fast RCNN Localization By Search Face Detection Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

6 Flowchart of The Proposed Approach. Yong Li (IVA, NLPR, CASIA) HLF.... September 9, / 25

7 Data Preparation We prefer multiple online resources to perform such task, including the ImageNet database, the Sun database, the WordNet, and the online image sharing website Flicker. There are 175 concepts concurrent in the ImageNet dataset and the ImageCLEF task simultaneously. Meanwhile, there are 217 concepts concurrent in the Sun dataset and the ImageCLEF task. For the concepts not in the ImageNet and Sun database or with very few annotated images, we crawled images from the website Flicker and filtered it with 50 images left for each concept. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

8 The Proposed Hybrid Learning Framework Image Annotation Annotation by classification. Annotation by search. Concept Localization Localization By Fast RCNN. Localization By Search. Localization of Face Related Concepts. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

9 Annotation by classification We train a linear SVM for each concept with one-vs-rest strategy. The visual feature is extracted with the released VGG19 model with dimension Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

10 Annotation by classification The ratio between the positive samples and negative samples is between 1:10 to 1:5. Due to images usually being labelled with multiple concepts in training data, the negative samples for a given concept classifier are selected as the ones whose all labels do not include the concept. The final output of SVM is normalized with the Logistic function 1 f(x) = to make the output value between [0, 1]. 1+exp( w T x) Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

11 The Proposed Hybrid Learning Framework Image Annotation Annotation by classification. Annotation by search. Concept Localization Localization By Fast RCNN. Localization By Search. Localization of Face Related Concepts. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

Annotation by Search The search-based approach for image annotation works on the assumption that visual similar images should reflect similar

12 Annotation by Search The search-based approach for image annotation works on the assumption that visual similar images should reflect similar semantical concepts, and most textual information of web images is relevant to their visual content. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

13 Annotation by Search Annotation by search mainly consists of the following three steps, Search for similar images: The extracted 4096-dimensional deep features are mapped to dimensional binary hash codes leveraging the random projection algorithm to save memory space and reduce retrieval time. Semantic Reranking: To further improve the results of visual similarity search, we explore the textual information of the given image, and perform the semantic similarity search on the top-n A visually similar images to rerank the similar image set. Relevant concept selection: We employ a WordNet-based approach, which is similar to the solution in [1]. [1]: Budikova, P., Botorek, J., Batko, M., Zezula, P.: DISA at imageclef 2014: The search-based solution for scalable image annotation. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

14 Annotation by Search Annotation by search mainly consists of the following three steps, Search for similar images: The extracted 4096-dimensional deep features are mapped to dimensional binary hash codes leveraging the random projection algorithm to save memory space and reduce retrieval time. Semantic Reranking: To further improve the results of visual similarity search, we explore the textual information of the given image, and perform the semantic similarity search on the top-n A visually similar images to rerank the similar image set. Relevant concept selection: We employ a WordNet-based approach, which is similar to the solution in [1]. [1]: Budikova, P., Botorek, J., Batko, M., Zezula, P.: DISA at imageclef 2014: The search-based solution for scalable image annotation. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

15 Annotation by Search Annotation by search mainly consists of the following three steps, Search for similar images: The extracted 4096-dimensional deep features are mapped to dimensional binary hash codes leveraging the random projection algorithm to save memory space and reduce retrieval time. Semantic Reranking: To further improve the results of visual similarity search, we explore the textual information of the given image, and perform the semantic similarity search on the top-n A visually similar images to rerank the similar image set. Relevant concept selection: We employ a WordNet-based approach, which is similar to the solution in [1]. [1]: Budikova, P., Botorek, J., Batko, M., Zezula, P.: DISA at imageclef 2014: The search-based solution for scalable image annotation. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

16 The Proposed Hybrid Learning Framework Image Annotation Annotation by classification. Annotation by search. Concept Localization Localization By Fast RCNN. Localization By Search. Localization of Face Related Concepts. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

17 Localization By Fast RCNN Fast RCNN originates from the method RCNN, which provides classification result and regressed location simultaneously. Deep ConvNet RoI projection Conv feature map RoI pooling layer Outputs: softmax FCs FC RoI feature vector bbox regressor FC For each RoI Comparisons of such two methods will be given. We have used both methods during the competition. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

Localization By Fast RCNN Fast RCNN originates from the method RCNN, which provides classification result and regressed location simultaneously.

18 Localization By Fast RCNN Fast RCNN originates from the method RCNN, which provides classification result and regressed location simultaneously. Deep ConvNet RoI projection Conv feature map RoI pooling layer Outputs: softmax FCs FC RoI feature vector bbox regressor FC For each RoI Comparisons of such two methods will be given. We have used both methods during the competition. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

19 Localization By Fast RCNN RCNN has demonstrated its advantage in object detection. However, it has notable drawbacks. Training is a multi-stage pipeline, which is expensive in space and time. One first finetunes a ConvNet for detection using cross-entropy loss. Then linear SVMs are fit to ConvNet features computed on warped object proposals. Finally, bounding-box regressors are learned. Test-time detection is slow. At test time, features are extracted from each warped object proposal in each test image. Detection with VGG16 takes 47s per image. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

20 Localization By Fast RCNN Compared with RCNN, Fast RCNN has the following advantages. It achieves higher detection quality(map). Training is single-stage, using a multi-task loss. It provides classification result and regressed location simultaneously for each candidate object proposal. No disk storage is required for feature caching. For us, we preferred the RCNN method when decided to take part in the task. However, it is really tough to deal with 500,000 images with RCNN. We met fast RCNN two weeks before the deadline. It helped us a lot. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

21 Localization By Fast RCNN Compared with RCNN, Fast RCNN has the following advantages. It achieves higher detection quality(map). Training is single-stage, using a multi-task loss. It provides classification result and regressed location simultaneously for each candidate object proposal. No disk storage is required for feature caching. For us, we preferred the RCNN method when decided to take part in the task. However, it is really tough to deal with 500,000 images with RCNN. We met fast RCNN two weeks before the deadline. It helped us a lot. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

22 The Proposed Hybrid Learning Framework Image Annotation Annotation by classification. Annotation by search. Concept Localization Localization By Fast RCNN. Localization By Search. Localization of Face Related Concepts. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

23 Localization By Search We give a special consideration to the 25 scene related concepts for concept localization ( e.g., beach, sea, river and valley ). We first find its top-n L visually similar neighbors with the same concept in the localization training data. We use their merged bounding box as the location of the scenery concept. Annotation Query Image Similar Image 1 Similar Image 2 Similar Image 3 Valley Shore Bathroom Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

24 The Proposed Hybrid Learning Framework Image Annotation Annotation by classification. Annotation by search. Concept Localization Localization By Fast RCNN. Localization By Search. Localization of Face Related Concepts. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

25 Localization of Face Related Concepts We give a special consideration to the person and face related concepts. Face detection and facial point localization have been actively studied over the past years and achieve satisfactory performance. Face Beard Hand Head Hair Leg Mouth Tongue Foot Eye Ear Arm Nose Neck Linear classifiers are trained with the SIFT features extracted on the facial points to determine the following concepts. Man Woman Male child Female child Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

26 Localization of Face Related Concepts We give a special consideration to the person and face related concepts. Face detection and facial point localization have been actively studied over the past years and achieve satisfactory performance. Face Beard Hand Head Hair Leg Mouth Tongue Foot Eye Ear Arm Nose Neck Linear classifiers are trained with the SIFT features extracted on the facial points to determine the following concepts. Man Woman Male child Female child Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

27 Experiments and Results We have submitted 8 runs with 5 different settings of combinations with the above model modules, including Annotation By Classification (ABC), Annotation By Search (ABS), localization by Fast R-CNN (FRCN), Localization By Search (LBS) and Concept Extension (CE). Method ABC CE ABS LBS FRCN SVM Threshold Overlap 0.5 Overlap 0 Setting 1 yes no yes yes yes Setting 2 yes yes yes yes yes Setting 3 yes yes no yes yes Setting 4 yes no no yes yes Setting 5 no no no no yes Effectiveness ABS is verified by comparing the result of setting 2 and setting 3 with 2 percent improvement. Effectiveness of FRCN is verified with setting 5, which is unsatisfactory. The proposed two stage process is more suitable to deal with such task. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

28 MAP with Different Recall Rates Mean average precision(%) Setting 1 Setting 2 Setting 3 Setting 4 Setting Overlap with GT labels(%) Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

29 Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25 CEA_LIST CNRS_TPT-1 CNRS_TPT-2 CNRS_TPT-3 CNRS_TPT-4 CNRS_TPT-5 CNRS_TPT-6 CNRS_TPT-7 CNRS_TPT-8 CNRS_TPT-9 CNRS_TPT-10 IRIP-iCC-1 IRIP-iCC-2 IRIP-iCC-3 IRIP-iCC-4 IRIP-iCC-5 IRIP-iCC-6 IRIP-iCC-7 IRIP-iCC-8 IRIP-iCC-9 IRIP-iCC-10 ISIA IVANLPR-1 IVANLPR-2 IVANLPR-3 IVANLPR-4 IVANLPR-5 IVANLPR-6 IVANLPR-7 IVANLPR-8 kdevir-1 kdevir-2 kdevir-3 kdevir-4 kdevir-5 Lip6.fr-1 Lip6.fr-2 Lip6.fr-3 Lip6.fr-4 Lip6.fr-5 Lip6.fr-6 Lip6.fr-7 MLVISP6-1 MLVISP6-2 MLVISP6-3 MLVISP6-4 MLVISP6-5 MLVISP6-6 MLVISP6-7 MLVISP6-8 MLVISP6-9 MLVISP6-10 MLVISP6-11 MLVISP6-12 MLVISP6-13 MLVISP6-14 RUC-1 RUC-2 RUC-3 RUC-4 RUC-5 RUC-6 RUC-7 RUC-8 SMIVA-1 SMIVA-2 SMIVA-3 SMIVA-4 SMIVA-5 SMIVA-6 SMIVA-7 SMIVA-8 SMIVA-9 SMIVA-10 SMIVA-11 SMIVA-12 SMIVA-13 SMIVA-14 SMIVA-15 SMIVA Map with 50% overlap with the GT labels Submission Results of Different Teams

30 Conclusions We propose a two-stage hybrid learning framework for Large-Scale Web Image Annotation and Localization. It is important to analyse the concepts carefully and dealing with different kinds of concepts with different modules is necessary. For the proposed method, the performance is limited by the number of training samples with full annotations. Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

31 Thanks for your attention! More info: Yong Li (IVA, NLPR, CASIA) HLF September 9, / 25

Deep learning for object detection. Slides from Svetlana Lazebnik and many others

Deep learning for object detection Slides from Svetlana Lazebnik and many others Recent developments in object detection 80% PASCAL VOC mean0average0precision0(map) 70% 60% 50% 40% 30% 20% 10% Before deep