LIG and LIRIS at TRECVID 2008: High Level Feature Extraction and Collaborative Annotation


Stéphane Ayache, Georges Quénot. LIG and LIRIS at TRECVID 2008: High Level Feature Extraction and Collaborative Annotation. TRECVID Workshop, 2008, Gaithersburg, MD, United States. HAL Id: hal-01089427, https://hal.archives-ouvertes.fr/hal-01089427, submitted on 1 Dec 2014.

LIG and LIRIS at TRECVID 2008: High Level Feature Extraction and Collaborative Annotation

Stéphane Ayache and Georges Quénot

Abstract

This paper describes the participations of LIG and LIRIS in the TRECVID 2008 High Level Features detection task. We evaluated several fusion strategies, in particular rank fusion. Results show that including as many low-level and intermediate features as possible is the best strategy, that SIFT features are very important, that the way in which the various low-level and intermediate features are fused matters, and that the type of mean (arithmetic, geometric or harmonic) matters as well. The best LIG and LIRIS runs have a Mean Inferred Average Precision of 0.0833 and 0.0598 respectively, both above the TRECVID 2008 HLF detection task median performance. LIG and LIRIS also co-organized the TRECVID 2008 collaborative annotation, in which 40 teams produced 1235428 annotations. The development collection was annotated at least once for 100% of its keyframe-concept pairs, at least twice for 37.6%, at least three times for 3.99% and at least four times for 0.06%. Thanks to the active learning and active cleaning approach used, the annotations that were done multiple times were those for which the risk of error was maximal.

1 Introduction

This paper describes the participations of LIG and LIRIS in the TRECVID 2008 [12] High Level Features (HLF) detection task [13]. The framework for HLF detection is based on networks of operators [1][3]. We made use of the INRIA LEAR SIFT bag-of-features descriptors [10][15][7]. Our main interest in this year's TRECVID challenge was to compare various ways of combining low-level features and/or system outputs. Our system for video shot indexing is entirely based on keyframe analysis; our generic classification system follows a classical pipeline architecture comprising low-level feature extractors, mid-level semantic classification and fusion modules.
We used one or more keyframes per shot according to the shot and subshot segmentation provided by Fraunhofer HHI [11]. The following sections present the considered features, the fusion strategies and the submitted runs.

2 Visual analysis

We performed visual analysis at several levels of granularity, from global down to fine block-level analysis, as well as at various semantic levels. Our low-level feature extractors first split images into overlapping blocks to form a grid of N x M blocks. For our submissions, we chose N and M so as to obtain a satisfying trade-off between classification performance and computing time; the final values depend on which feature is considered. The analysis first processes each keyframe to extract several feature vectors, then merges them using standard early or late fusion schemes (or a combination of the two), and finally merges the keyframe scores to assign a score to each shot.

2.1 Low-level features

At the global level, we consider classical color and texture features. Color is represented by a 3-dimensional histogram in RGB space; we discretize the color space to form a 4 x 4 x 4 bin histogram. Texture information is described with a bank of Gabor filters using 8 orientations and 5 scales. The global features are normalized and concatenated into a 104-dimensional vector.

We also extracted color and texture features at the block level; the features obtained from each block are then concatenated to form a rich description of the keyframe:

Color (1): a 3 x 3 x 3 3D histogram on a grid of 8 x 6 blocks. The overall local color feature vector has 1296 dimensions.

Color (2): the first two color moments on a grid of 8 x 6 blocks. This local color feature vector has 432 dimensions.

Edge Direction Histogram (EDH): computed on a grid of 4 x 3 blocks. Each bin is defined as the sum of the gradient magnitudes over 50 orientations, so the overall EDH feature has 600 dimensions. The EDH feature is known to be invariant to scale and translation.
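The global 4 x 4 x 4 RGB histogram described above can be sketched as follows (a minimal numpy sketch; the helper name and toy image are illustrative, not the authors' code):

```python
import numpy as np

def rgb_histogram(image, bins=4):
    """Global colour histogram: quantize each RGB channel into `bins`
    levels and count joint occurrences over the bins**3 cells."""
    # image: H x W x 3 array of uint8 values in [0, 255]
    q = (image.astype(np.uint32) * bins) // 256            # per-channel bin index
    flat = (q[..., 0] * bins + q[..., 1]) * bins + q[..., 2]
    hist = np.bincount(flat.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()                               # L1-normalized

# 4 x 4 x 4 gives the 64-dimensional colour part of the global feature.
img = np.zeros((6, 8, 3), dtype=np.uint8)   # all-black toy image
h = rgb_histogram(img, bins=4)
assert h.shape == (64,) and h[0] == 1.0     # every pixel falls in bin (0, 0, 0)
```

The block-level variants (Color (1) and (2)) apply the same idea per block of the 8 x 6 grid and concatenate the results.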
Local Binary Pattern (LBP) [9]: computed on a grid of 2 x 2 blocks, leading to a 1024-dimensional vector. The LBP operator labels the pixels of an image by thresholding the 3 x 3 neighborhood of each pixel with the center value and interpreting the result as a decimal number. LBP is known to be invariant to any monotonic change in gray level.

2.2 Features on interest points

One of the most relevant features for visual indexing is the SIFT descriptor combined with a bag-of-words representation. The SIFT descriptor [8] describes the local shape of interest points using edge histograms. To make the descriptor invariant, the interest region is divided into a 4 x 4 grid and every sector has its own 8-bin edge direction histogram. We used a codebook of 1000 visual words, provided by INRIA LEAR.

2.3 Semantic feature

This feature aims at modeling co-occurrences between high-level features using a bag-of-concepts approach. First, we consider each block of a keyframe that is relevant for a concept as relevant for this concept too. This is a very strong assumption, but it can be reasonable depending on the concept. We thus use the existing concept annotations (from a part of the learning set) at the global level to train SVM classifiers at the block level, where blocks are represented with color moments and edge direction histogram features. The blocks of each keyframe are then classified using the models of all the concepts, leading to nb_blocks x nb_concepts classification scores per keyframe. The final semantic feature is defined as the sum of the scores over the nb_blocks blocks for each concept, leading to an nb_concepts-dimensional feature.

3 HLF extraction framework

3.1 Supervised classification

We use Support Vector Machines [6] as binary classifiers for supervised classification of HLFs, with an RBF kernel and probabilistic output scores. We obtain the SVM parameters by testing all combinations of (σ, C) with 5-fold cross-validation. The models are learned using the standard annotation provided by the LIG Collaborative Annotation, considering up to 800 randomly selected positive examples and twice as many randomly selected negative examples.
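The 5-fold grid search over kernel parameters can be sketched as follows. The paper trains RBF-kernel SVMs with LIBSVM [6]; to keep this sketch self-contained, the SVM is replaced by a simple RBF-kernel nearest-class-mean scorer, so only the model-selection procedure is illustrated. All names and toy data are illustrative:

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Gaussian RBF kernel matrix between row-vector sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def cv_score(X, y, sigma, n_folds=5):
    """5-fold CV accuracy of an RBF nearest-class-mean classifier
    (a stand-in for the SVM; the selection loop is the point here)."""
    idx = np.random.default_rng(0).permutation(len(X))
    folds = np.array_split(idx, n_folds)
    accs = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate(folds[:k] + folds[k + 1:])
        K = rbf_kernel(X[test], X[train], sigma)
        pos = K[:, y[train] == 1].mean(1)   # mean similarity to positives
        neg = K[:, y[train] == 0].mean(1)   # mean similarity to negatives
        accs.append(((pos > neg).astype(int) == y[test]).mean())
    return float(np.mean(accs))

# Toy balanced training set; the paper samples up to 800 positives and
# twice as many negatives from the collaborative annotation.
X = np.vstack([np.random.default_rng(1).normal(0, 1, (40, 2)),
               np.random.default_rng(2).normal(3, 1, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
best_sigma = max([0.1, 1.0, 10.0], key=lambda s: cv_score(X, y, s))
```

In the actual system the grid also runs over the SVM cost parameter C, which has no counterpart in this simplified scorer.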
3.2 Early and late fusion

We merged our various features with combinations of early and late fusion schemes. While early fusion operates in the feature space, late fusion combines classification scores. Combining the two schemes is possible when more than two features are available and yields much flexibility in the way features are merged: for example, we can first combine some features with early fusion and then combine the result with other features with late fusion. We know from [4] that such combinations outperform, for some concepts, classical early and late fusion.

3.2.1 Early fusion

Early fusion is basically defined by a simple concatenation of the features from the various modalities. Since the number of extracted features depends on the modalities and the type of the features, an early fusion scheme based on simple vector concatenation is strongly affected by the source vector with the highest number of components. Such an imbalance impacts classification, especially with an RBF kernel, which is based on the Euclidean distance. We therefore normalize each entry of the concatenated vector so that the average norm of each source vector is about the same. The normalization formula is:

x'_i = (x_i − min_i) / ((max_i − min_i) · Card(x_i))

where x_i is an entry of the feature vector x, min_i and max_i are respectively the minimum and maximum values of the i-th entry among the training samples, and Card(x_i) is the number of dimensions of the source vector that x_i belongs to.

3.2.2 Late fusion

The late fusion scheme first classifies each concept using the individual modalities and then merges the scores on top of those classifiers. A second layer of classifiers can be considered, but it does not always lead to the expected performance, as the training set must be split to learn this classifier while avoiding overfitting.
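Stepping back to Section 3.2.1, the normalization formula can be sketched as follows (a minimal numpy sketch; the function name and toy dimensions are illustrative):

```python
import numpy as np

def normalize_for_early_fusion(vectors, mins, maxs):
    """Early fusion: rescale each source vector before concatenation so
    that high-dimensional sources do not dominate the fused vector.
    Implements x'_i = (x_i - min_i) / ((max_i - min_i) * Card(x_i)),
    where min_i/max_i come from the training samples and Card(x_i) is
    the dimensionality of the source vector x_i belongs to."""
    parts = []
    for x, lo, hi in zip(vectors, mins, maxs):
        card = x.shape[0]                        # Card(x_i)
        span = np.where(hi > lo, hi - lo, 1.0)   # guard constant dimensions
        parts.append((x - lo) / (span * card))
    return np.concatenate(parts)

# Two sources of very different dimensionality, e.g. the 104-d global
# feature and the 1296-d local colour feature from Section 2.1.
a = np.random.default_rng(0).uniform(0, 1, 104)
b = np.random.default_rng(1).uniform(0, 1, 1296)
fused = normalize_for_early_fusion([a, b],
                                   [np.zeros(104), np.zeros(1296)],
                                   [np.ones(104), np.ones(1296)])
assert fused.shape == (1400,)
```

Dividing by Card(x_i) makes the summed magnitude of each source comparable, which is what keeps the RBF kernel's Euclidean distance balanced across sources.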
We chose here to use a simple combination function with neither parameters nor a learning phase:

v = (1/N) Σ_{i=1}^{N} log(v_i)

where N is the number of modalities and v_i is the score for the i-th modality.

3.3 Rank fusion

Rank fusion is a specific case of late fusion. Its principle is to assign to each shot, for each source, a score equal to its rank according to a classifier that uses this source, and then to build a global score for each shot as a mean of these ranks. This global score is then used to re-rank the shots. Three variants of rank fusion correspond to the use of the arithmetic (classical) mean, the geometric mean and the harmonic mean. The harmonic mean is often considered the best choice for rank fusion. All three variants can be implemented with weights for the various sources. We tried the three variants with the expected best results from LIG and LIRIS in order to compare them.

4 Experiments

4.1 Runs description

The LIG and LIRIS runs are described in a Functional Programming (FP) style [5], since FP expressions were observed to be a powerful way of describing networks of operators [14] such as those used in our approach [1].

LIG runs:

LIG 1 = LATE(EARLY(Local-Color, EDH, Global-Features), EARLY(Semantic, EDH, Global-Features), EARLY(Local-Color, SIFT, Global-Features))
LIG 2 = LATE(EARLY(Local-Color, EDH, Global-Features), EARLY(Semantic, Global-Features), EARLY(Local-Color, SIFT))
LIG 3 = EARLY(Local-Color, EDH, Global-Features, Semantic, SIFT, LBP)
LIG 4 = LATE(EARLY(Local-Color, EDH, Global-Features), EARLY(Semantic, Global-Features))
LIG 5 = EARLY(Global-Features, Semantic)
LIG 6 = EARLY(Local-Color, EDH, Global-Features)

These were empirically determined as good combinations from tests on the development collection.

LIRIS runs:

LIRIS 1 = EARLY(Global-Features, Semantic, EDH)
LIRIS 2 = HarmonicRANK(EARLY(Global-Features, Semantic, EDH), (Semantic), (EDH), (LBP))
LIRIS 3 = ArithmeticRANK(EARLY(Global-Features, Semantic, EDH), (Semantic), (EDH), (LBP))
LIRIS 4 = Semantic
LIRIS 5 = EDH
LIRIS 6 = LBP

4.2 Results

Table 1 lists the submitted runs of the LIG and LIRIS groups; the two best LIG runs and the best LIRIS run are shown in bold. The metric is the Mean Inferred Average Precision over the 20 target concepts (HLFs).
Run      InfAP     Run        InfAP
LIG 1    0.0833    LIRIS 1    0.0525
LIG 2    0.0793    LIRIS 2    0.0267
LIG 3    0.0785    LIRIS 3    0.0598
LIG 4    0.0659    LIRIS 4    0.0254
LIG 5    0.0423    LIRIS 5    0.0221
LIG 6    0.0568    LIRIS 6    0.0189

Table 1: Results of the LIG and LIRIS HLF submissions.

4.3 Discussion

LIG compared various combinations of features, both in the choice of the features and in the way they are combined (see above for examples of non-trivial combinations in LIG 1 and LIG 2). Their relative performances were judged on the development set and the run priority was defined accordingly. The run performance ordering was quite well predicted (only LIG 5 and LIG 6 were swapped). LIRIS conducted some runs using only a single feature (LIRIS 4, LIRIS 5 and LIRIS 6). As expected, the performance using single features is low and varies according to the feature. LIRIS also tried arithmetic- and harmonic-mean based rank fusion. Both do better than each single feature but, surprisingly, the arithmetic mean leads to much better results than the harmonic mean (usually considered better).
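The three rank-fusion variants of Section 3.3 can be sketched as follows (an unweighted numpy sketch; the function names and toy scores are illustrative):

```python
import numpy as np

def ranks(scores):
    """Turn classifier scores into ranks: the best score gets rank 1."""
    order = np.argsort(-scores)
    r = np.empty(len(scores))
    r[order] = np.arange(1, len(scores) + 1)
    return r

def rank_fusion(score_lists, mean="harmonic"):
    """Fuse per-source scores by averaging each shot's per-source ranks,
    then re-rank shots by the fused value (lower = better)."""
    R = np.vstack([ranks(s) for s in score_lists])
    if mean == "arithmetic":
        fused = R.mean(axis=0)
    elif mean == "geometric":
        fused = np.exp(np.log(R).mean(axis=0))
    else:  # harmonic
        fused = R.shape[0] / (1.0 / R).sum(axis=0)
    return np.argsort(fused)  # shot indices, best first

# Toy scores for 4 shots from two hypothetical sources.
src1 = np.array([0.9, 0.1, 0.5, 0.3])
src2 = np.array([0.8, 0.2, 0.7, 0.1])
order = rank_fusion([src1, src2], mean="arithmetic")
assert order[0] == 0  # shot 0 is ranked first by both sources
```

The harmonic mean is dominated by a shot's best (smallest) rank, which is why it is often preferred for rank fusion; the paper's results nonetheless favour the arithmetic mean here.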

2007        Annotated  % Annotated  Negative  Skipped  Positive  % Positive
Pass 1      641223     82.7         578299    13163    49761     7.76
Pass 2      46864      6.05         11904     7478     27482     58.6
Pass 3      21987      2.84         9383      4040     8564      39.0
Pass 4      1492       0.19         324       940      228       15.3
Synthesis   641223     82.7         578683    15348    47192     7.36

2008        Annotated  % Annotated  Negative  Skipped  Positive  % Positive
Pass 1      872320     100.0        830087    30608    11625     1.33
Pass 2      327744     37.6         310815    7886     9043      2.76
Pass 3      34810      3.99         26746     4772     3292      9.46
Pass 4      554        0.06         396       50       108       19.5
Synthesis   641223     82.7         578683    15348    47192     7.36

Table 2: Annotation statistics by pass, averaged over all concepts.

5 Collaborative annotation

The active-learning based collaborative annotation system was described in detail in [2] and in an ECIR 2008 paper. The main evolution between the 2007 and 2008 versions is an improvement of the active cleaning feature. Another major difference is that, for a comparable number of keyframe-concept pairs to annotate (21532 keyframes x 36 concepts in 2007 and 22084 keyframes x 20 concepts in 2008), more participants did more annotations, and 100% of the corpus was annotated at least once in 2008, against 82.7% in 2007. 40 teams (respectively 32) participated in the 2008 (respectively 2007) TRECVID collaborative annotation effort and produced a total of 1235428 (respectively 711566) annotations. Table 2 gives some statistics on these annotations. Pass 1, Pass 2, Pass 3 and Pass 4 correspond to the annotations that were done respectively at least once, at least twice, at least three times and at least four times for a given keyframe-concept pair. The Synthesis row corresponds to the global annotation, where a majority rule is applied when there is more than one annotation for a keyframe-concept pair. Figure 1 shows how the collaborative effort was spread over time in 2007 and 2008. Horizontal units correspond to the days of May 2007 or 2008, between 1 and 31 included, and are extrapolated outside that range. The 2008 annotation was more spread over time but was shorter in total.
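The majority-rule synthesis over repeated annotation passes can be sketched as follows (a minimal stdlib sketch; the label names and the tie-breaking rule are illustrative, as the paper does not spell them out):

```python
from collections import Counter

def synthesize(annotations):
    """Majority rule over the repeated annotations of one
    (keyframe, concept) pair. Labels here are 'pos', 'neg', 'skip';
    ties fall back on first-seen order via Counter, an arbitrary but
    deterministic choice in this sketch."""
    return Counter(annotations).most_common(1)[0][0]

assert synthesize(["pos"]) == "pos"                # single pass: kept as-is
assert synthesize(["pos", "neg", "neg"]) == "neg"  # majority of three passes
```

Applied to every keyframe-concept pair with more than one annotation, this yields the Synthesis row of Table 2.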
The peak of daily annotations is a bit less than 100,000 in both cases.

6 Conclusion

LIG and LIRIS evaluated several fusion strategies, including rank fusion for LIRIS. Results show that including as many low-level and intermediate features as possible is the best strategy, that SIFT features are very important, that the way in which the various low-level and intermediate features are fused matters, and that the type of mean (arithmetic, geometric or harmonic) matters as well. The best LIG and LIRIS runs have a Mean Inferred Average Precision of 0.0833 and 0.0598 respectively, both above the TRECVID 2008 HLF detection task median performance. LIG and LIRIS also co-organized the TRECVID 2008 collaborative annotation, in which 40 teams produced 1235428 annotations. The development collection was annotated at least once for 100% of its keyframe-concept pairs, at least twice for 37.6%, at least three times for 3.99% and at least four times for 0.06%. Thanks to the active learning and active cleaning approach used, the annotations that were done multiple times were those for which the risk of error was maximal.

References

[1] S. Ayache and G. Quénot. Image and video indexing using networks of operators. EURASIP Journal on Image and Video Processing, Volume 2007, 2007.

[2] S. Ayache and G. Quénot. TRECVID 2007 Collaborative Annotation using Active Learning. In Proc. of TRECVID Workshop, 5-6 Nov. 2007.

[3] S. Ayache and G. Quénot. LIG and LIRIS at TRECVID 2008: High Level Feature Extraction and Collaborative Annotation. In Proc. of TRECVID Workshop, 17-18 Nov. 2008.

[4] S. Ayache, G. Quénot, J. Gensel, and S. Satoh. Using topic concepts for semantic video shots classification. In International Conference on Image and Video Retrieval (CIVR'06), 13-15 Jul. 2006.

[5] J. Backus. Can programming be liberated from the von Neumann style? A functional style and its algebra of programs. Commun. ACM, 21(8):613-641, August 1978.

Figure 1: Daily annotations in the collaborative annotation project (GMT time, days of May 2007 or 2008).

[6] C.-C. Chang and C.-J. Lin. LIBSVM: A Library for Support Vector Machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

[7] M. Douze, A. Gaidon, H. Jegou, M. Marszalek, and C. Schmid. INRIA-LEAR's Video Copy Detection System. In Proc. of TRECVID Workshop, 17-18 Nov. 2008.

[8] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91-110, 2004.

[9] T. Maenpaa, M. Pietikainen, and T. Ojala. Texture classification by multi-predicate local binary pattern operators. In Proc. 15th International Conference on Pattern Recognition, pages 951-954, 2000.

[10] K. Mikolajczyk and C. Schmid. Scale and affine invariant interest point detectors. International Journal of Computer Vision, 60(1):63-86, 2004.

[11] C. Petersohn. Fraunhofer HHI at TRECVID 2004: Shot Boundary Detection System. In Proc. of TRECVID Workshop, 15-16 Nov. 2004.

[12] A. F. Smeaton, P. Over, and W. Kraaij. Evaluation campaigns and TRECVid. In MIR'06: Proceedings of the 8th ACM International Workshop on Multimedia Information Retrieval, pages 321-330, New York, NY, USA, 2006. ACM Press.

[13] A. F. Smeaton, P. Over, and W. Kraaij. High-Level Feature Detection from Video in TRECVid: a 5-Year Retrospective of Achievements. In A. Divakaran, editor, Multimedia Content Analysis, Theory and Applications. Springer Verlag, Berlin, 2009.

[14] B. Zavidovique, J. Sérot, and G. M. Quénot. Massively parallel data flow computer dedicated to real-time image processing. Integr. Comput.-Aided Eng., 4(1):9-29, 1997.

[15] J. Zhang, M. Marszałek, S. Lazebnik, and C. Schmid. Local features and kernels for classification of texture and object categories: a comprehensive study. International Journal of Computer Vision, 73(2):213-238, Jun. 2007.