The Stanford/Technicolor/Fraunhofer HHI Video Semantic Indexing System


1 The Stanford/Technicolor/Fraunhofer HHI Video Semantic Indexing System
Our first participation in the TRECVID workshop
A. F. de Araujo (1), F. Silveira (2), H. Lakshman (3), J. Zepeda (2), A. Sheth (2), P. Pérez (2), B. Girod (1)
(1) Stanford University, (2) Technicolor, (3) Fraunhofer HHI
Guest lecture in Berkeley CS294, Nov. 7th, 2012

2 Outline
- What is TRECVID?
- TRECVID Semantic Indexing task
- Overview of our system and how it differs from others
- Experimental results
- Conclusion

3 What is TRECVID?
- Part of the TREC (Text REtrieval Conference) series, sponsored by NIST (National Institute of Standards and Technology)
- Started as a video track of TREC in 2001/2002 and became an independent evaluation in 2003
- Goal of the conference series: encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results

4 What is TRECVID?
- De facto venue for state-of-the-art video analysis research
- TRECVID concludes with a workshop in Nov/Dec; attendance is restricted to teams that submitted a run
- Six tracks/tasks:
  - Semantic Indexing (SIN) (focus of this presentation)
  - Known-Item Search (KIS)
  - Surveillance Event Detection (SED)
  - Instance Search (INS)
  - Multimedia Event Detection (MED)
  - Multimedia Event Recounting (MER)

5 What is TRECVID? Known-Item Search (KIS)
- Use case: you've seen a specific video and want to find it again, but don't know how to go directly to it; you remember some things about it
- System task: given a test collection of short videos and a topic (words and/or phrases describing the target video and the people, places, or things visible in it):
  - Automatically return a list of up to 100 video IDs ranked according to the likelihood that the video is the target one, OR
  - Interactively return a single video ID believed to be the target

6 What is TRECVID? Interactive Surveillance Event Detection (SED)
- Use case: detection of events in large amounts of surveillance video
- System task: given a textual description of an observable event of interest, automatically detect all occurrences of the event in a non-segmented corpus of video

7 What is TRECVID? Instance Search (INS)
- Use case: while browsing a video archive, you find a video of a person, place, or thing of interest to you (known or unknown) and want to find more video containing the same target, but not necessarily in the same context
- System task: given a topic with (a) 2-6 example segmented images of the target and (b) a target type (PERSON, CHARACTER, PLACE, OBJECT), return a list of up to 1000 shots ranked by the likelihood that they contain the topic target

8 What is TRECVID? Multimedia Event Detection (MED)
- Use case: searching for user-defined events over pre-computed metadata
- System task: given an event specified by an event kit (name, textual explanation, video exemplars), search multimedia recordings for the event
- Associated task, Multimedia Event Recounting (MER): produce a textual recounting that summarizes the key evidence of the event

9 What is TRECVID? Semantic Indexing (SIN)
- Use cases: filtering, categorization, browsing, search...
- System task: given the test collection, master shot reference, and concept definitions, return for each concept a list of at most 2000 shot IDs from the test collection ranked according to their likelihood of containing the concept

10 TRECVID SIN task - Motivation
- YouTube reports 72 hours of video uploaded every minute; Flickr reports 1M photo uploads per day; etc.
- Video is increasingly ubiquitous and collections are increasingly large, from broadcasters' archives to your personal collection
- Problems:
  - How can you find a video, or a video segment?
  - How can you organize your video collection?
- Video Semantic Indexing can help: index videos with tags that represent objects, scenes, actions, and events

11 TRECVID SIN task - Background (1/3)
- LSCOM effort [Naphade et al., 2006]: taxonomy of 1000 concepts, realistic use cases, large annotated set of broadcast news. Example use cases:
  - Armed uniformed soldiers walking on city lanes
  - U.S. maps depicting the electoral vote distribution (blue vs. red states)
- Based on these use cases, [Hauptmann et al., 2007] showed that keyword-based search performed poorly, and that including content-based analysis helped a lot

12 TRECVID SIN task - Background (2/3)
- [Hauptmann et al., 2007]
  - Using 320 semantic concepts, performance improved significantly (from 1% to 10% MAP even when the concept detectors have low performance)
  - Extrapolation shows that a few thousand concepts should be enough to take video retrieval's performance to the level of current text-based search engines (65% MAP)
  - Mindset: with a generic framework, detect a large number of concepts with reasonable performance

13 TRECVID SIN task - Background (3/3)
- Video search engine [Snoek et al., 2007]
  - User inputs an information need (query-by-keyword, query-by-image, query-by-concept)
  - System interprets the need, processes it based on metadata, and returns its best estimate of the relevant multimedia documents

14 TRECVID SIN task - how it works (1/5)
- Schedule: roughly 5 months from release of the data to the submission date
- Three submission types:
  - LIGHT: 50 concepts (20 selected for evaluation)
  - FULL: 346 concepts (50 selected for evaluation)
  - PAIR: 10 concept pairs
- Each team can submit up to 4 runs
- Submissions consist of a ranked list of 2000 shots per concept (as if it were the output of a system queried with that concept)

15 TRECVID SIN task - how it works (2/5)
- Videos range from 10 s to 4 min; mostly YouTube-like, user-generated content
- Unit of analysis: the shot (e.g., annotations are given at the shot level); each video is composed of many shots
- Annotations are collected via a collaborative process among the participating teams; initially, keyframes are shown to the annotator, and the video is played only if necessary
- IMPORTANT: not all shots are annotated; an active-learning-based system runs during annotation to select the most useful samples to annotate

16 TRECVID SIN task - how it works (3/5)
- 2012 training data:
  - 19,701 videos (600 hours)
  - 400,289 shots (~20 shots/video)
  - On average, a concept has 1,225 positive and 42,924 negative annotations
- 2012 testing data:
  - 8,263 videos (200 hours)
  - 145,634 shots
- Videos contain some metadata: title, tags, short descriptions, speech-to-text output (but not very consistently)
- Data collected from the Internet Archive and representative of user-generated content

17 TRECVID SIN task - how it works (4/5)
- Example video 1 - Concepts: Indoor, Kitchen, Room
- Example video 2 - Concepts: Trees, Vegetation
- Example video 3 - Concepts: Cityscape, Daytime_Outdoor, Outdoor, Scene_text, Streets, City, Text

18 TRECVID SIN task - how it works (5/5)
- Performance measures (a sketch of plain AP follows this list)
  - Inferred Average Precision (infAP) per concept: a metric conceived to approximate the usual Average Precision without having to judge every single test instance [Yilmaz et al., 2008]
  - Mean infAP (MinfAP) for the final score: simple mean of the individual infAP values per concept
  - Other measures, such as P-R and P@n curves, are provided, but the main one is MinfAP
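As a concrete reference for the metric, here is a minimal Python sketch of standard (non-inferred) Average Precision and its mean over concepts. infAP additionally corrects for the fact that only a sample of the pooled shots is judged (the estimator of [Yilmaz et al., 2008]); that correction is not reproduced here, and the run/ground-truth structures are hypothetical.

```python
from typing import Dict, List, Set

def average_precision(ranked_shots: List[str], relevant: Set[str]) -> float:
    """Standard AP over a ranked list: mean of precision@k at each relevant hit."""
    hits, precision_sum = 0, 0.0
    for k, shot_id in enumerate(ranked_shots, start=1):
        if shot_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_ap(runs: Dict[str, List[str]], ground_truth: Dict[str, Set[str]]) -> float:
    """Mean over concepts, analogous to how MinfAP averages per-concept infAP."""
    aps = [average_precision(runs[c], ground_truth[c]) for c in runs]
    return sum(aps) / len(aps)

# Hypothetical toy example: one concept, 5 ranked shots, 2 of them relevant.
print(average_precision(["s3", "s1", "s7", "s2", "s9"], {"s1", "s2"}))  # 0.5
```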

19 Overview of Semantic Indexing System
- Overall architecture [diagram: annotated shots flow into the system; annotation labels: P = positive, N = negative, M = missing, S = skip]

20 Overview of Semantic Indexing System
- Architecture in more detail [diagram]
  - Each color is a feature channel
  - Input is a shot (in our case, only the keyframe is used)

21 Descriptor extraction
- Modalities:
  - Visual: keyframe-based descriptors shown to provide the most gain
  - Audio: significant gain only for some concepts
  - Tags, short descriptions: sparse, multilingual
  - Speech-to-text transcriptions: English only, even if the videos are not in English
- We used only the keyframe-based visual modality

22 Descriptor extraction
- Local descriptors:
  - Keypoint selection
    - Combination of dense extraction (best) and the Harris-Laplace detector
    - It has been shown that using different keypoint extraction methods provides complementary gains
  - Patch description
    - SIFT descriptor
    - OppSIFT descriptor (SIFT computed on each color component of the Opponent color space)
    - We verified gains when combining these two (see the sketch after this list)
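For illustration, a minimal sketch of dense keypoint extraction followed by SIFT description using OpenCV. The grid step and keypoint size are illustrative choices, not the parameters of the submitted system; OppSIFT is not available in OpenCV and would additionally require converting the image to the opponent color space and describing each channel.

```python
import cv2
import numpy as np

def dense_sift(image_path: str, step: int = 8, size: float = 16.0) -> np.ndarray:
    """Describe a regular grid of keypoints with SIFT (dense sampling)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    h, w = gray.shape
    # Dense grid of keypoints; 'size' controls the scale of the described patch.
    keypoints = [cv2.KeyPoint(float(x), float(y), size)
                 for y in range(step, h - step, step)
                 for x in range(step, w - step, step)]
    sift = cv2.SIFT_create()
    _, descriptors = sift.compute(gray, keypoints)  # one 128-D vector per keypoint
    return descriptors
```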

23 Descriptor extraction
- Global descriptor: CENTRIST/SPACT (sketched below)
  - Binary pattern for each pixel, based on comparisons with its neighbors (Census Transform)
  - Aggregated into histograms according to spatial location in the image: Spatial Principal component Analysis of Census Transform histograms (SPACT)
  - Shown to provide very good results at very low computational cost
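Below is a minimal NumPy sketch of the Census Transform underlying CENTRIST, using the common convention of an 8-bit code from comparisons with the 8 neighbors. The full SPACT descriptor would further split the image into spatial blocks and apply PCA to the concatenated histograms; that part is omitted here.

```python
import numpy as np

def census_transform(gray: np.ndarray) -> np.ndarray:
    """8-bit census code per pixel: bit i is 1 if the i-th neighbor is <= the center."""
    g = gray.astype(np.int32)
    center = g[1:-1, 1:-1]
    codes = np.zeros_like(center, dtype=np.uint8)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = g[1 + dy:g.shape[0] - 1 + dy, 1 + dx:g.shape[1] - 1 + dx]
        codes |= (neighbor <= center).astype(np.uint8) << bit
    return codes

def centrist_histogram(gray: np.ndarray) -> np.ndarray:
    """256-bin histogram of census codes over the whole image (no spatial split, no PCA)."""
    codes = census_transform(gray)
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float32)
    return hist / hist.sum()
```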

24 Local descriptor aggregation
- BoVW + SPM
  - Histogram of visual-word occurrences
  - Pooled over different spatial regions of the image (Spatial Pyramid Matching)
- Residual vectors (sketched below)
  - Mean of the residuals to the centroid in each Voronoi cell
  - Inspired by the Fisher Vector approach [Perronnin et al., 2007]
  - No spatial aggregation due to lack of time
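A minimal sketch of the residual-vector aggregation described above, in the VLAD spirit: assign each local descriptor to its nearest centroid and take the mean residual per Voronoi cell. The codebook would come from k-means on the PCA-reduced descriptors, and the final L2 normalization is an illustrative choice.

```python
import numpy as np

def residual_vector(descriptors: np.ndarray, centroids: np.ndarray) -> np.ndarray:
    """Aggregate local descriptors into one vector of per-cell mean residuals.

    descriptors: (n, d) local descriptors of one keyframe (e.g., PCA-reduced SIFT)
    centroids:   (k, d) visual dictionary learned with k-means
    returns:     (k * d,) L2-normalized residual vector
    """
    # Nearest centroid for each descriptor (squared Euclidean distance).
    dists = ((descriptors[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assignment = dists.argmin(axis=1)

    k, d = centroids.shape
    agg = np.zeros((k, d), dtype=np.float64)
    for cell in range(k):
        members = descriptors[assignment == cell]
        if len(members):
            agg[cell] = (members - centroids[cell]).mean(axis=0)  # mean residual per cell

    vec = agg.ravel()
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```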

25 Classification
- Machine learning: SVMs (see the sketch after this list)
  - One-versus-rest (shown to perform well at large scale in [Perronnin, 2012])
  - HIK and RBF kernels, depending on the feature type
  - Validation experiments based on Average Precision to choose the parameter C
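As a sketch of the classification stage, the following trains a one-vs-rest SVM for a single concept on a precomputed Histogram Intersection Kernel with scikit-learn. The kernel computation shown is the naive dense version and would need to be blocked for hundreds of thousands of shots; the value of C and the label encoding are placeholders.

```python
import numpy as np
from sklearn.svm import SVC

def hik(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Histogram Intersection Kernel: K[i, j] = sum_d min(A[i, d], B[j, d]).

    Naive dense version; at TRECVID scale this has to be computed in blocks.
    """
    return np.minimum(A[:, None, :], B[None, :, :]).sum(axis=2)

def train_concept_svm(train_feats: np.ndarray, labels: np.ndarray, C: float = 1.0) -> SVC:
    """One-vs-rest SVM for a single concept, trained on a precomputed HIK matrix."""
    K_train = hik(train_feats, train_feats)
    clf = SVC(kernel="precomputed", C=C)
    clf.fit(K_train, labels)  # labels: +1 for positive shots, -1 for negative shots
    return clf

def score_concept(clf: SVC, test_feats: np.ndarray, train_feats: np.ndarray) -> np.ndarray:
    """Decision values used to rank the test shots for this concept."""
    K_test = hik(test_feats, train_feats)  # shape: (n_test, n_train)
    return clf.decision_function(K_test)
```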

26 Classification
- Late fusion (a sketch follows)
  - After obtaining classifier scores for each feature channel, they need to be combined
  - We perform a linear combination of the scores, weighted (1) by validation performance or (2) as a simple average
  - Alternative: early fusion, which performed worse in our preliminary experiments
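A minimal sketch of the late-fusion step: per-channel scores for one concept are linearly combined, weighted either by validation AP or uniformly. The z-score normalization of each channel is an assumption for illustration, not a detail reported on the slide.

```python
import numpy as np

def late_fusion(channel_scores: dict, channel_val_ap: dict = None) -> np.ndarray:
    """Linearly combine per-channel classifier scores for one concept.

    channel_scores: {channel_name: (n_test_shots,) decision values}
    channel_val_ap: optional {channel_name: validation AP}; if given, weights are
                    proportional to validation AP, otherwise a simple average is used.
    """
    names = sorted(channel_scores)
    # Z-score each channel so channels with different score ranges are comparable
    # (an illustrative normalization choice, not necessarily the authors').
    normed = []
    for name in names:
        s = np.asarray(channel_scores[name], dtype=np.float64)
        normed.append((s - s.mean()) / (s.std() + 1e-12))

    if channel_val_ap is None:
        weights = np.ones(len(names)) / len(names)
    else:
        ap = np.array([channel_val_ap[name] for name in names], dtype=np.float64)
        weights = ap / ap.sum()

    return sum(w * s for w, s in zip(weights, normed))
```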

27 Classification
- Co-occurrence information
  - Since the SIN task is a multi-label problem, co-occurrence should help
  - Example: the concept News Studio often occurs together with the concept Person
  - Very hard to exploit, since not all shots are fully annotated:
    - Less than half of the shots have more than 10 concept annotations
    - Only 16% of the shots have more than 100 concept annotations
  - Previous work [Qi et al., 2007] shows that training with co-occurrence information is about 25 times more computationally complex

28 Feature channels (summarized in the configuration sketch below)
- BoW channels:
  - Dense keypoint extraction + OppSIFT; Bag-of-Words with Spatial Pyramid pooling in a 1x3 grid and a 4096-word visual dictionary
  - Dense keypoint extraction + SIFT; Bag-of-Words with Spatial Pyramid pooling in a 1x3 grid and a 4096-word visual dictionary
  - Harris-Laplace keypoint extraction + OppSIFT; Bag-of-Words pooling with a 4096-word visual dictionary
  - Harris-Laplace keypoint extraction + SIFT; Bag-of-Words pooling with a 4096-word visual dictionary
- Residual channels:
  - Residual vectors on densely extracted OppSIFT, PCA-reduced to 64 dimensions, with a 256-word visual dictionary
  - Residual vectors on densely extracted SIFT, PCA-reduced to 32 dimensions, with a 512-word visual dictionary
- Global channel:
  - SPACT: Spatial Principal component Analysis of Census Transform histograms (CENTRIST)
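The seven channels above could be captured in a configuration structure along these lines; the field names and values below are a hypothetical mirror of the bullet list, not an actual configuration file of the system.

```python
# Hypothetical configuration mirroring the seven feature channels listed above.
FEATURE_CHANNELS = [
    {"type": "bow", "keypoints": "dense",  "descriptor": "oppsift", "spm_grid": "1x3", "vocab_size": 4096},
    {"type": "bow", "keypoints": "dense",  "descriptor": "sift",    "spm_grid": "1x3", "vocab_size": 4096},
    {"type": "bow", "keypoints": "harlap", "descriptor": "oppsift", "spm_grid": None,  "vocab_size": 4096},
    {"type": "bow", "keypoints": "harlap", "descriptor": "sift",    "spm_grid": None,  "vocab_size": 4096},
    {"type": "residual", "keypoints": "dense", "descriptor": "oppsift", "pca_dim": 64, "vocab_size": 256},
    {"type": "residual", "keypoints": "dense", "descriptor": "sift",    "pca_dim": 32, "vocab_size": 512},
    {"type": "global", "descriptor": "spact"},
]
```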

29 Experimental results
- Inferred Precision-Recall curve [figure]

30 Experimental results
- Inferred curve [figure]

31 Experimental results
- Mean Inferred Average Precision per concept [figure]

32 Timing
- Descriptor extraction and aggregation: 1 to 2 days
- Precomputation of kernel matrices (see the blocking sketch below):
  - Training: 10 hours to 5 days, O(400k^2) kernel entries
  - Testing: 5 hours to 2.5 days, O(400k x 100k) kernel entries
- Classifier training with precomputed kernels was limited by memory loading time: ~1 hour to load into memory, ~10 min to train a classifier
- Late fusion does not take a significant amount of time
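At this scale the kernel matrices cannot be materialized in one pass, so a blocked computation along the lines of the following sketch is typical. The HIK kernel choice, block size, and memory-mapped output file are illustrative assumptions rather than details reported by the authors.

```python
import numpy as np

def precompute_kernel_blocked(A: np.ndarray, B: np.ndarray,
                              out_path: str, block: int = 1000) -> np.memmap:
    """Fill K[i, j] = sum_d min(A[i, d], B[j, d]) tile by tile into a memory-mapped file.

    A: (n_a, d) features of the row set (e.g., training shots)
    B: (n_b, d) features of the column set (training shots again, or test shots)
    """
    n_a, n_b = A.shape[0], B.shape[0]
    K = np.memmap(out_path, dtype=np.float32, mode="w+", shape=(n_a, n_b))
    for i in range(0, n_a, block):
        Ai = A[i:i + block]
        for j in range(0, n_b, block):
            Bj = B[j:j + block]
            tile = np.empty((len(Ai), len(Bj)), dtype=np.float32)
            # One row of A against the whole B tile; keeps intermediates at O(block * d).
            for r, a_row in enumerate(Ai):
                tile[r] = np.minimum(a_row, Bj).sum(axis=1)
            K[i:i + block, j:j + block] = tile
    K.flush()
    return K
```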

33 Visualization [figure]

34 Conclusion
- We built a fairly complex system that performs semantic concept detection in a large-scale multimedia database (training on 400k shots, testing on 100k shots)
- We used advanced computer vision and machine learning tools and had to make them all work together
- In our first participation, we achieved the 6th-best performance in the TRECVID SIN task

35 Thank You
Project website: http://stanford.edu/~afaraujo/trecvid
http://stanford.edu/~afaraujo
