Large-Scale Semantics for Image and Video Retrieval

1 Large-Scale Semantics for Image and Video Retrieval
Lexing Xie, with Apostol Natsev, John Smith, Matthew Hill, John Kender, Quoc-Bao Nguyen, *Jelena Tesic, *Rong Yan, *Alex Haubold
IBM T. J. Watson Research Center, Hawthorne
IQR workshop, Sept 20,

2 The Challenge of Large Scale
The data view and the semantics view

3 Outline: Scaling Up Visual Semantics
Multi-modal retrieval with visual semantics, e.g. "Find snow capped mountains". [MM 05+07, CIVR ICME 06+07, TV 04-07]
Crowd-sourced semantics in real-world events.
Construction and detection of multi-faceted visual taxonomy. [MM 10]
The value and accuracy of image tags, e.g. Outdoors? Sky? Ship? Red? Grass? Flower? Water? [CIVR 10]
Tagging vs. browsing for image annotation. [CVPR 08]

4 Retrieval Approaches for Multi-modal Search
Queries: Find shots of Condoleezza Rice; Find shots of soccer game with goalpost visible; Find scenes of snow capped mountains
Retrieval approaches: Text; Image similarity; Semantic concepts
[Table: suitability of each retrieval approach for each query; the marks are not recoverable]
Example semantic concepts: Government Leader, US Flag, Face, Sports, Vegetation, Walking/Running, Mountain, Snow, Skiing
Challenge: how can a limited vocabulary of concepts be leveraged to answer arbitrary queries? [Natsev et al., MM 07]

5 Overview of Concept-Based Retrieval Methods
Columns: Lexical, Statistical. Rows: query input.
Text (lexical and statistical): Haubold et al. [TV 05-06, ICME 06]; Kennedy et al. [TV 05]; Neo et al. [TV 05, CIVR 06]; Snoek et al. [TV 06]; Ontology Map [MM 07]; Term Co-occurrence Map [MM 07]
Visual examples (lexical: ---; statistical): Smith et al. [ICME 03]; Natsev et al. [KDD 04, MM 05]; Rasiwasia et al. [CIVR 06]; Snoek et al. [TV 06]; Tešić et al. [CIVR 07]; Feature Selection [MM 07]
Retrieval results (lexical: ---; statistical): Yan and Hauptmann [TV 06, CMU PhD 06]; Kennedy and Chang [CIVR 07]
TV = TREC Video Retrieval (TRECVID) Workshop

6 Mapping Words to Concepts
Lexical: synonym expansion. Trigger a concept if the query matches one of its (stemmed) synonyms; weight concepts using trigger frequency. Example: C: Airplane; W: Airplane, Plane, Airline, Flight
Lexical: WordNet similarity. Based on graph structure, statistics, etc. (Lesk, Resnik, Jiang-Conrath, Wu, ..)
Lexical: Ontology mapping. Semantically tag queries and concepts [IBM PIQUANT]; incorporate correlations and logical relations
Statistical: co-occurrence with significance. Compute the G^2 score for each word-concept pair; capture both positive and negative relations
$G^2 = 2 \sum_i O_i \ln (O_i / E_i)$, with observed counts $O_i$ taken from the word-concept contingency table and $E_i$ the corresponding expected counts:
            word    ¬word
concept     O_1     O_2
¬concept    O_3     O_4
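
The G^2 statistic on this slide can be computed directly from the four cells of the word-concept table. The following is a minimal sketch, assuming the usual independence model for the expected counts (row total times column total over grand total); the function names are illustrative and do not come from the talk. Because G^2 itself is never negative, a small helper is added here to tell positive from negative associations.

```python
import math


def g2_score(o1, o2, o3, o4):
    """G^2 for a 2x2 table.

    o1 = word & concept, o2 = no word & concept,
    o3 = word & no concept, o4 = neither.
    Expected counts assume independence (an assumption of this sketch).
    """
    observed = [[o1, o2], [o3, o4]]
    total = o1 + o2 + o3 + o4
    row_sums = [o1 + o2, o3 + o4]
    col_sums = [o1 + o3, o2 + o4]
    g2 = 0.0
    for r in range(2):
        for c in range(2):
            o = observed[r][c]
            if o > 0:  # a zero cell contributes nothing (limit of o * ln o)
                expected = row_sums[r] * col_sums[c] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def association_sign(o1, o2, o3, o4):
    """+1 if word and concept co-occur more than expected, -1 if less, 0 if equal."""
    total = o1 + o2 + o3 + o4
    expected_o1 = (o1 + o2) * (o1 + o3) / total
    return (o1 > expected_o1) - (o1 < expected_o1)
```

A word-concept pair would then be scored with `g2_score(...)` and signed with `association_sign(...)`, so that strongly negative pairs can be used to suppress a concept rather than trigger it.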

7 Mapping Images to Concepts
Content-based feature selection: obtain the concept distribution in the query examples; compare it to that in the corpus background
Query modeling with few examples: view queries in different vector spaces (image feature + concept); build bagged SVM classifiers for retrieval
Probabilistic local context analysis (pLCA): select semantic concepts and learn weights using initial retrieval results; use top/bottom retrieved documents as pseudo positive/negative examples
[Figures: visual query examples (H1, H0); pLCA model linking initial results, concepts such as Airplane, Sky and Road, and unknown concept weights w_1 ... w_C]
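
The pseudo-labeling step mentioned for pLCA (top and bottom retrieved documents used as pseudo positive and negative examples) can be illustrated in a few lines. This is only a sketch of that one step, not of pLCA itself; the function name and the cut-off parameters are assumptions made for the example.

```python
def pseudo_labels(ranked_shots, num_pos, num_neg):
    """Split an initial ranking into pseudo-positive and pseudo-negative shots.

    ranked_shots: shot ids ordered from most to least relevant.
    num_pos / num_neg: how many to take from the top / bottom (hypothetical knobs).
    """
    if num_pos + num_neg > len(ranked_shots):
        raise ValueError("top and bottom slices would overlap")
    positives = ranked_shots[:num_pos]
    negatives = ranked_shots[len(ranked_shots) - num_neg:]
    return positives, negatives
```

The two lists can then serve as training data for selecting concepts and learning their weights, which is the part this sketch leaves out.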

8 Results of Concept Mapping
Query: Find shots of soccer game with goalpost visible.
Concepts triggered, by approach (WordNet, ontology, G^2 stats, pLCA, visual): sports: Y for all five; person: Y for one approach; walking_running: Y for all five; vegetation: Y for three approaches
Evaluate on TRECVID benchmark corpora
Multi-lingual news (ENG, CHN, ARB) from 6 channels; 2005: 110 hours (77,816 shots); 2006: 160 hours (119,490 shots)
Speech transcripts (ASR and MT)
39 LSCOM-Lite concept detectors
24 distinct multi-modal queries for 2005 and 2006
Measure AP (average precision) on each query
Present improvement over baseline (in %)
[Figure: precision-recall curve illustrating average precision]
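
For readers unfamiliar with the metric, here is a minimal sketch of non-interpolated average precision and of the relative improvement over a baseline reported on the following slides. It assumes the full ranked list is scored; official TRECVID evaluation truncates the list at a fixed depth, which this sketch does not do. The function names are illustrative.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated AP: mean of precision values at each relevant hit."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    # Relevant items that were never retrieved count as zero precision.
    return precision_sum / len(relevant)


def relative_improvement(ap, baseline_ap):
    """Improvement over the baseline, in percent."""
    return 100.0 * (ap - baseline_ap) / baseline_ap
```

For example, a ranking `["a", "b", "c", "d"]` with relevant shots `{"a", "c"}` has hits at ranks 1 and 3, so its AP is (1/1 + 2/3) / 2.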

9 Lexical Approaches: Evaluation on TRECVID 2005
Concept-based re-ranking is effective for retrieval
Improvements vary for different queries
Each lexical approach has different strengths
All approaches perform comparably overall (~20%+)

10 Multi-modal Approaches: Evaluation on TRECVID 2006
[Figure panels: Text queries; Multimodal queries]
Text-based expansion approaches perform comparably but are complementary
Content-based approaches bring significant further improvements

11 The Need for a Visual Semantic Structure [Xie et al., MM 10]
We need organization beyond just a few visual concepts: for people to remember and keep track; to generate *negative* training examples; to help improve prediction results
Isolated dots (e.g. meeting, male portrait, A. Jolie, party, snow): difficult to keep track of beyond just a few; does not help create training data; ignores inner semantic structure
Concept Web: takes into account statistical structure in training + classification [Naphade 01, Yan 06, Jiang 07, Qi 07]
Concept Tree: hierarchical structure to help training; difficult to author beyond just a few; rigid, relationships often violated [Wu et al., Schmid et al., He+Zemel, ]

12 What Relations To Model?
From word semantics: visually detectable concrete nouns + some verbs and adjectives; hypernyms and hyponyms (parent-children relationships and mutual exclusion); NOT modeled: part-of, entailment, highly view-point and context dependent
From image appearances ("a picture is a sentence" -- JRK): multiple divisions for a concept, e.g. Person: {male, female}, {young, old}, {face, head-n-shoulder, whole body}, {military, civilian}, ...
From corpus and detection statistics: concept co-occurrence (beach, palm tree, sky); detector complementarity (beach via color, beach via texture)
Example words on the slide: building, urban, protest, red, ...; artistic, active, freedom, entail; apple, fruit; walking, jogging

13 Multi-faceted Visual Taxonomy
Augment tree-structure with facets
Peer concepts within facets are mutually exclusive, e.g. cat, dog; apple, banana; urban, rural
Concepts across facets are not mutually exclusive, e.g. baby, female, smiling
Advantages: takes into account multiple descriptions; can be used to infer positive and negative training data
Still to be considered: co-occurring objects/scenes; common-sense relationships
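
The mutual-exclusion rule above is what lets a facet produce negative training labels for free. The sketch below illustrates the idea on a flat set of facets; the facet names and the dictionary layout are hypothetical, and the parent-child part of the taxonomy is deliberately left out. Only the peer concepts themselves (cat/dog, apple/banana, urban/rural) come from the slide.

```python
# Hypothetical facet names; the peer concepts are the slide's own examples.
FACETS = {
    "animal": ["cat", "dog"],
    "fruit": ["apple", "banana"],
    "setting": ["urban", "rural"],
}


def infer_negatives(positive_labels, facets=FACETS):
    """Return concepts that can be treated as negatives for one image.

    If an image is positive for a concept, its peers in the same facet
    are mutually exclusive with it and therefore negative. Concepts in
    other facets stay unknown, since facets are not mutually exclusive.
    """
    positives = set(positive_labels)
    negatives = set()
    for peers in facets.values():
        present = positives.intersection(peers)
        if present:
            negatives.update(set(peers) - present)
    return negatives
```

An image labeled `cat` and `urban` thus yields `dog` and `rural` as negatives, while nothing is inferred about `apple` or `banana`.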

14 Generative Models for Concept Structure
[Figure labels: mutual exclusion; multiple facets; parent-child; co-occurrence; detector uncertainty]

15 Generative Models for Concept Structure
[Figure labels: mutual exclusion; multiple facets; parent-child; co-occurrence; detector uncertainty]

16 Performance Comparisons
222-node taxonomy with six top-level facets (10~50 nodes, depth 3~7)
Evaluate on 60,200 web images
Baseline: ensemble SVM trained on another 240K images
Mutual exclusion relationships are important: they improve classification accuracy
Multi-faceted tree structure (bnet) performs best in classification, and is robust to noise in the concept prior

17 Example Classification Results

18 An Image =?
An image = one thousand words -- Which thousand words? -- Which hundred words do we use first?
Example tags: night, city, building, travel, urban, architecture, sculpture, lights, cloud, sky, reflection, artistic, ring, nikon, seoul, korea, perspective, ring, ...
We would like to use visual categories that are: popular; visually observable and distinct; not available in camera metadata; not subjective or abstract
Highlighted examples: urban, korea, nikon, artistic

19 The Tag Value Problem [Xie et al., CIVR 10]
A typical image tagging algorithm: Outdoors? Sky? Ship? Red? Grass? Flower? Water?
Which k tags for each image: the most accurate ones? The most specific ones? The most useful ones?
Estimate factor of the following: c -- visual tag; u -- user-assigned social tag

20 Four tag re-ranking methods
User-perceived value
Information content
Flickr popularity
Tag posterior: can we take into account both concept prior and tag popularity?
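
The slide names the re-ranking factors without giving formulas, so the following is only a sketch under stated assumptions. Information content is taken here in its standard form, the negative log of the tag prior; Flickr popularity is taken as a raw usage count; and the combined score is a plain product. All numbers, names, and the product rule are hypothetical and are not taken from the talk.

```python
import math

# Hypothetical values for illustration only.
TAG_PRIOR = {"outdoors": 0.5, "sky": 0.25, "ship": 0.03125}
FLICKR_COUNT = {"outdoors": 1000, "sky": 8000, "ship": 4000}


def information_content(tag):
    """Rarer tags carry more information: IC = -log2 p(tag)."""
    return -math.log2(TAG_PRIOR[tag])


def popularity(tag):
    """Assumed popularity measure: how often the tag is used on Flickr."""
    return FLICKR_COUNT[tag]


def popularity_weighted_by_ic(tag):
    """One possible way to combine the two factors (an assumption)."""
    return popularity(tag) * information_content(tag)


def top_k(tags, score, k):
    """Keep the k tags with the highest score."""
    return sorted(tags, key=score, reverse=True)[:k]
```

With these made-up numbers, ranking by information content alone prefers `ship` over `sky`, ranking by popularity alone prefers `sky` over `ship`, and the product puts `ship` first again, which shows how strongly the outcome depends on a reliable estimate of the tag prior.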

21 Tag posterior
Goal: express tag confidence in terms of classifier score, concept prior, and tag popularity.
Conditional independence
Tag posterior factor:
The use of all reranking factors:

22 Evaluation setup
Recruited user group: 24 users in US and UK; diverse professional backgrounds (R&D, marketing, services); from casual photographers to amateur enthusiasts
5,998 user-contributed photos uploaded to Flickr
User evaluations: background questions; assigned subjective score to 60 visual tags; label photos, delete incorrect machine-tags
Subset of 5 users participated in pair-wise tag ranking study

23 User tolerance of tagging precision: 74~82%
Mean: 2.59; Std. Dev.: 0.59; Target Precision Mean: 74%; Target Precision Range: [68%, 80%]
Mean: 4.57; Std. Dev.: 2.06; Target Precision Mean: 82%; Target Precision Range: [72%, 87%]

24 Tag accuracy evaluation
[Plot legend: ImageNet-1000, UIUC-NEC; Social 20 KNN-voting [Li, Snoek 09]; ImageNet-1000, KNN; ImageNet-1000, liblin*]
Aggregated performance over 50 core visual categories (excluding 11 colour categories).

25 User study: comparing tags
Comparative user study to vote on two sets of tags.
Flickr popularity and Posterior popularity are the winners.
IC appears to be less useful due to the difficulty of reliably estimating tag prior.
Weighting Flickr popularity with IC still helps.
Data-driven methods outperform user-perceived popularity.

26 User Score vs Pair-wise Comparison
Some rankings confirm intuition, e.g. specific tags are preferred over general ones (beach > nature, mountain > outdoors)
Some result from other factors of perceived usefulness, e.g. common words are preferred (nature > vegetation)

27 Tags uploaded to Flickr..

28 Two Approaches for Manual Concept Extraction [Yan, Natsev and Campbell, CVPR 08]
Tagging -- widely used: associate a single image / video at a time with multiple keywords; social tagging by millions of users
Browsing -- specific domains: associate multiple images / videos with a single given keyword; suitable for domain experts; IBM EVA [MM 05], CMU Extreme Retriever [MM 06]

29 Models for Annotation Time
Tagging: time for a tag ($t_f$); start-up cost ($t_s$)
Browsing: time for judging a positive ($t_p$); time for judging a negative ($t_n$)
$t_f = 6.8$ sec, $t_s = 5.6$ sec, $t_p = 1.4$ sec, $t_n = 0.2$ sec

30 Frequency-based Hybrid Annotation
Combine the tagging and browsing interfaces to optimize the time needed to manually annotate image collections
Choose the tagging interface for infrequent keywords
Choose the browsing interface for frequent keywords
Formally model the annotation time as a function of word frequency, time per word, and annotation interface
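
The frequency-based choice can be made concrete with the per-action times quoted on the annotation-time slide (6.8 s per tag, 5.6 s start-up, 1.4 s per positive judgment, 0.2 s per negative judgment). The time formulas themselves did not survive in this transcript, so the linear forms below are an assumption based on the term labels, and the per-keyword decision rule (which ignores the tagging start-up cost) is a simplification made for this sketch rather than the published model.

```python
# Per-action times in seconds, as quoted in the talk.
T_F, T_S, T_P, T_N = 6.8, 5.6, 1.4, 0.2


def tagging_time(tags_per_image):
    """Assumed model: each image costs a start-up time plus a time per tag."""
    return sum(k * T_F + T_S for k in tags_per_image)


def browsing_time(num_positive, num_images):
    """Assumed model: every image is judged once for the given keyword."""
    return num_positive * T_P + (num_images - num_positive) * T_N


def choose_interface(num_positive, num_images):
    """Pick the cheaper interface for one keyword (simplified rule).

    Tagging cost for the keyword is approximated as one tag entry per
    positive image; start-up cost is ignored here.
    """
    tag_cost = num_positive * T_F
    browse_cost = browsing_time(num_positive, num_images)
    return "browsing" if browse_cost < tag_cost else "tagging"
```

Under these assumptions browsing wins once a keyword covers more than roughly `T_N / (T_F - T_P + T_N)`, about 3.6%, of the collection, which matches the qualitative rule on the slide: tagging for infrequent keywords, browsing for frequent ones.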

31 Results for Frequency-based Annotation
Results on two large collections: TRECVID and Corel
More accurate than automatic annotation (100% accurate)
More efficient than tagging / browsing annotation (2-x speedup)
More effective than tagging / browsing in a given amount of time
[Figure panels: TRECVID Collection; Corel Collection]

32 Summary
[Figure residue: "Find snow capped mountains"; Outdoors? Sky? Ship? Red? Grass? Flower? Water?]

33 Thank you!
We sincerely thank: US Gov, IBM and client partners; Ambreen Javed for data annotation; our fantastic group of users; all of you for listening!
Questions + more info: Lexing Xie <xlx@us.ibm.com>
Demos + download
