Video search requires efficient annotation of video content. To some extent this can be done automatically.

Joan Stephens
5 years ago

1 VIDEO ANNOTATION

2 Market Trends
- Broadband doubling over the next 3-5 years
- Video-enabled devices are emerging rapidly
- Emergence of a mass internet audience
- Mainstream media moving to the Web

3 What do we search for in video?
- a movie scene where the Titanic hits the iceberg
- the CNN business news report from [date]
- everything related to Claudia Schiffer
- a western movie
- all home videos taken in Paris
- all scores of a soccer game
- movies that I like
- ...
Video search requires efficient annotation of video content. To some extent this can be done automatically.

4 Video conceptual hierarchy
- Elements: frames, shots, shot boundaries, scenes, audio
- Hierarchy (diagram): Video -> Episode -> Video Shot -> Frame
- Diagram labels: type, cast, producer; global content, intent; camera motion, angle; objects and their spatial relation

5 Video objects
- Low-level visual features: color, texture, shape, motion, shot boundaries
- Mid-level semantic content: people/objects, location, actions, time
- High-level semantic content: concept, event, episode/story

6 Video retrieval and annotation
Syntactic level
- Retrieval: text-based; visual and audio content-based (limited success)
- Annotation: automatic segmentation into shots; from shot sequences: edit effects; from shot key-frames: color, texture, shape, face, ...; from shot sequences: motion info
Semantic level
- Retrieval: text-based
- Annotation: text tags from manual annotation (subject to the subjectivity of human perception); from video metadata: author, date, ...; text tags for shots, scenes, and sequences from automatic annotation of visual and audio content (limited success)

7 Automatic annotation
Automatic annotation requires automatic partitioning of video into syntactic segments (low-level analysis) and detection of entities and their spatio-temporal relationships (mid/high-level analysis).
- Low-level analysis (diagram): a sequence of shots along the time axis t: Shot 1, Shot 2, ..., Shot 7, ...
- Mid/high-level analysis (diagram): shots grouped into units such as a news report, a movie episode, a sport-event highlight, or a topic segment of a documentary

8 Video processing chain
- Stages (diagram): temporal segmentation, feature extraction, spatial segmentation, motion detection, object tracking, event modeling
- Levels (diagram): syntactic level, semantic level

9 Video processing framework
- Low level: extract color, texture, and motion features and detect shot boundaries (domain and event independent)
- Mid level: determine the object class (e.g. sky, grass, tree, rock, animal) and generate shot descriptors (domain dependent)
- High level: use shot descriptors and a domain-specific inference process to detect events and then episodes (event specific)
- Diagram blocks (top to bottom): events; event detection; shot summaries; motion verification; classification; color & texture analysis, spatio-temporal analysis, motion analysis; video sequence, shot boundary detection

10 Syntactic features
- Color: color moments, color histograms, color autocorrelograms
- Texture: autocorrelation, wavelet transforms, Gabor filters
- Shape: edge detectors, moment invariants, finite element methods, shape from motion
- Segmentation: scene segmentation, shot detection
- OCR: modeling, successful OCR deployments
- Face: face detection algorithms, neural networks, EigenFaces
- ASR: acoustic analysis, HMMs, N-grams, CSR, LVCSR
- Media IR systems: NIST Video TREC starts, Web media search

11 Color
- Robust to background; independent of size and orientation
- Color histogram [Swain & Ballard]
- Color moments
- Color sets: map RGB color space to Hue Saturation Value and quantize [Smith, Chang]
- Color layout: local color features obtained by dividing the image into regions
- Color autocorrelograms
Texture
- One of the earliest image features [Haralick et al., 70s]
- Co-occurrence matrix: orientation and distance on gray-scale pixels
- Contrast, inverse difference moment, and entropy [Gotlieb & Kreyszig]
- Human visual texture properties: coarseness, contrast, directionality, line-likeness, regularity, and roughness [Tamura et al.]
- Wavelet transforms [90s]: [Smith & Chang] extracted mean and variance from wavelet subbands
- Gabor filters ...
Region Segmentation
- Partition the image into regions
- Strong segmentation: object segmentation is difficult
- Weak segmentation: region segmentation based on some homogeneity criteria
Scene Segmentation
- Shot detection, scene detection
- Look for changes in color, texture, brightness
- Context-based scene segmentation applied to certain categories such as broadcast news
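
The co-occurrence texture features named on this slide (contrast, inverse difference moment, entropy) can be illustrated with a small sketch. It is not taken from the slides: the function names, the dictionary representation of the matrix, and the 4x4 test image are assumptions chosen for illustration, and the pixel offset (dx, dy) stands in for the "orientation and distance" parameter.

```python
import math
from collections import Counter


def cooccurrence_matrix(image, dx=1, dy=0):
    """Return normalized co-occurrence probabilities {(gray_a, gray_b): p}.

    `image` is a list of rows of integer gray levels; (dx, dy) is the pixel
    offset that encodes the orientation and distance of the pixel pairs.
    """
    counts = Counter()
    rows, cols = len(image), len(image[0])
    for y in range(rows):
        for x in range(cols):
            y2, x2 = y + dy, x + dx
            if 0 <= y2 < rows and 0 <= x2 < cols:
                counts[(image[y][x], image[y2][x2])] += 1
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def texture_features(p):
    """Contrast, inverse difference moment, and entropy of the matrix p."""
    contrast = sum((i - j) ** 2 * v for (i, j), v in p.items())
    idm = sum(v / (1 + (i - j) ** 2) for (i, j), v in p.items())
    entropy = -sum(v * math.log2(v) for v in p.values())
    return contrast, idm, entropy


# Hypothetical 4x4 image with gray levels 0..3 (illustrative data only).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 2, 2, 2],
    [2, 2, 3, 3],
]
contrast, idm, entropy = texture_features(cooccurrence_matrix(image))
```

A perfectly flat image gives zero contrast and zero entropy, while images with many abrupt gray-level changes push contrast up and the inverse difference moment down.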

12 Shape
- Outer-boundary based vs. region based
- Fourier descriptors
- Moment invariants
- Finite Element Method (stiffness matrix: how each point is connected to the others; eigenvectors of the matrix)
- Wavelet transforms leverage multiresolution [Chuang & Kao]
- Chamfer matching for comparing two shapes (linear dimension rather than area)
- Well-known edge detection algorithms
Face
Face detection is highly reliable; face recognition for video is still a challenging problem.
- Viola and Jones face detection
- Neural networks [Rowley]
- Wavelet-based histograms of facial features
- EigenFaces: extract eigenvectors and use them as the feature space
OCR
- OCR is a fairly successful technology: accurate, especially with good matching vocabularies
ASR
- Automatic speech recognition is fairly accurate for medium- to large-vocabulary broadcast-type data
- Large number of available speech vendors
- Still an open problem for free conversational speech in noisy conditions

13 Semantic features
- Context is useful to recognize a scene
- Scenes are composed of objects and events
- Use proto-concepts to describe context
- Use machine learning to link context to the concepts that define objects and events
- Pipeline (diagram): image -> bag of regions -> compute similarity to proto-concepts (sky, grass, road, ...) -> proto-concept similarity distribution
- Example region labels (diagram): sky, water

14 Semantic segments
- Semantic segment: a concatenation of consecutive video shots that are related to each other in their semantic content
- Semantic segments are parts of a video where content coherence (i.e. continuity of the semantic content from one shot to another) remains high
- Semantic segment boundaries can be detected by measuring the coherence of the semantic content along neighboring video shots: segment boundaries are places of low content coherence
- Semantic segments are suitable for video genres characterized by a clear sequential content structure: broadcast news, movies (episode-based scheme), documentaries
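
The boundary rule on this slide (boundaries are places of low content coherence between neighboring shots) can be sketched as follows. The slide does not say how coherence is measured, so cosine similarity between per-shot feature vectors and the 0.5 threshold are assumptions made only for illustration, as are the function names and the sample vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_boundaries(shot_features, threshold=0.5):
    """Return indices i such that a new semantic segment starts at shot i.

    Coherence is measured between each shot and its successor; a boundary is
    reported wherever that coherence drops below `threshold` (assumed value).
    """
    boundaries = []
    for i in range(len(shot_features) - 1):
        if cosine_similarity(shot_features[i], shot_features[i + 1]) < threshold:
            boundaries.append(i + 1)
    return boundaries


# Hypothetical per-shot descriptors: two shots of one topic, then two of another.
shots = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
boundaries = segment_boundaries(shots)
```

Comparing only adjacent shots keeps the sketch short; a real system would likely compare each shot against a window of preceding shots so that a single outlier shot does not split a segment.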

15 Bridging the Semantic Gap
High-level video analysis is essential for automatic semantic annotation of video.
Four ways of performing high-level analysis:
- Video categorization into genres
- Video partitioning into semantic segments
- Semantic segment extraction
- Video summarization/abstraction
Feature-based algorithmic solutions: different features and algorithms for different analysis processes. Some model assumptions are required.

16 Raw video is processed in three steps:
- STEP 1: Syntactic analysis -> basic statistics
- STEP 2: Derivation of style attributes -> style attributes
- STEP 3: Genre mapping -> genre recognition

17 Step 1: Syntactic analysis
Syntactic analysis is based on simple statistics for the sequence of RGB frames.
- Color statistics: the basis for cut detection (color histogram, standard deviation)
- Audio statistics: record basic audio frequency and amplitude statistics
- Motion detection
  - motion energy: total amount of motion in a scene (block-wise difference of color histograms)
  - motion vector: distinguishes camera motion from object motion
- Object segmentation: the parts of a moving object share the same speed and direction; take the motion vector field and subtract the camera motion to obtain the pure object motion
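
Cut detection from color statistics, as named on this slide, can be sketched with a frame-to-frame histogram comparison. This is a minimal illustration under stated assumptions: frames are non-empty lists of (r, g, b) tuples with 8-bit channels, the histogram uses 4 bins per channel, and a cut is declared when half the L1 distance between consecutive normalized histograms exceeds 0.5. None of these values come from the slides.

```python
def color_histogram(frame, bins_per_channel=4):
    """Normalized joint RGB histogram of a frame given as a list of (r, g, b) pixels."""
    step = 256 // bins_per_channel
    hist = [0] * (bins_per_channel ** 3)
    for r, g, b in frame:
        index = ((r // step) * bins_per_channel + (g // step)) * bins_per_channel + (b // step)
        hist[index] += 1
    return [count / len(frame) for count in hist]


def histogram_distance(h1, h2):
    """Half the L1 distance between two normalized histograms (range 0..1)."""
    return 0.5 * sum(abs(a - b) for a, b in zip(h1, h2))


def detect_cuts(frames, threshold=0.5):
    """Return indices i where a hard cut is assumed between frame i-1 and frame i."""
    hists = [color_histogram(f) for f in frames]
    return [
        i for i in range(1, len(hists))
        if histogram_distance(hists[i - 1], hists[i]) > threshold
    ]


# Hypothetical clip: two all-red frames followed by two all-blue frames.
red = [(255, 0, 0)] * 4
blue = [(0, 0, 255)] * 4
cuts = detect_cuts([red, red, blue, blue])
```

A fixed threshold finds hard cuts but tends to miss gradual transitions such as fades, which spread the histogram change over many frames.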

18 Step 2: Derivation of style attributes
Assign semantics to the scene.
- Scene length and transition: important style attributes
  - scene separator: only hard cuts (>95%)
  - other transitions: artistic style element
- Camera motion: panning, tilting, zooming
  - identify the motion vector direction with the highest frequency
  - normal distribution classification
  - panning: error rate < 10%
- Object recognition: simple objects or patterns in a well-defined environment
  - face recognition
  - TV channel logo: KBS logo in the bottom right corner
  - predefined patterns stored in a database
  - logo recognition algorithm: every 4th frame
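
The camera-motion step (find the motion vector direction with the highest frequency) can be sketched as a direction histogram. The slide gives no details, so the 8 direction bins, the minimum-magnitude filter, and the mapping of horizontal directions to "pan" and vertical ones to "tilt" are assumptions for illustration; zoom detection and the normal-distribution classification mentioned on the slide are not covered.

```python
import math
from collections import Counter

BINS = 8  # assumed number of direction bins, 45 degrees each


def dominant_direction(vectors, min_magnitude=0.5):
    """Return the most frequent direction bin of (dx, dy) motion vectors, or None.

    Bin 0 is centered on the positive x axis; bins advance in 45-degree steps.
    Vectors shorter than `min_magnitude` are ignored as noise.
    """
    width = 2 * math.pi / BINS
    counts = Counter()
    for dx, dy in vectors:
        if math.hypot(dx, dy) < min_magnitude:
            continue
        angle = math.atan2(dy, dx) % (2 * math.pi)
        counts[int((angle + width / 2) // width) % BINS] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def classify_camera_motion(vectors):
    """Map the dominant direction to a coarse camera-motion label."""
    direction = dominant_direction(vectors)
    if direction is None:
        return "static"
    if direction in (0, 4):   # mostly horizontal motion
        return "pan"
    if direction in (2, 6):   # mostly vertical motion
        return "tilt"
    return "other"


# Hypothetical motion vector field dominated by horizontal motion.
field = [(2.0, 0.0), (3.0, 0.1), (2.5, -0.1), (0.0, 1.0)]
label = classify_camera_motion(field)
```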

19 Semantics of Audio
- Amplitude
  - speech: characteristic frequency spectrum
  - music: beat
  - noise: > threshold
  - silence: < threshold
- Frequency
  - distinguishes between speaker and noise (> 95%)
  - human speech has a limited frequency spectrum
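
The amplitude thresholds on this slide (silence below a threshold, noise above one) can be sketched as a per-window classifier. The use of RMS amplitude, the two threshold values, and the sample range of -1.0 to 1.0 are assumptions for illustration; separating speech from music would need the frequency and beat cues the slide mentions and is not attempted here.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a non-empty window of samples in [-1.0, 1.0]."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def classify_window(samples, silence_threshold=0.01, noise_threshold=0.8):
    """Label one audio window by amplitude alone (thresholds are assumed values)."""
    level = rms(samples)
    if level < silence_threshold:
        return "silence"
    if level > noise_threshold:
        return "noise"
    return "speech_or_music"


# Hypothetical windows of 100 samples each.
quiet = [0.0] * 100
loud = [0.9, -0.9] * 50
middle = [0.3, -0.3] * 50
labels = [classify_window(w) for w in (quiet, loud, middle)]
```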

20 Step 3: Genre mapping
- Newscast: characteristic pattern of speaker and non-speaker; logo
- Car race: scene length much shorter than tennis; camera motion, object motion; noise at high amplitude; easy to distinguish from tennis and soccer
- Tennis: very good example of the discriminating power of audio; bouncing of the ball gives a singular peak; speaker phase
- Commercials: separated by up to 5 monochrome frames; fade-in scene transition
- Cartoon: scene length greater than in other genres; less camera motion; zero amplitude (no background noise)
- Result: no single attribute is sufficient to uniquely identify a genre, so a style profile is used
- Classifier (diagram): video -> feature extractor -> video-style attributes -> matching against the video-style attributes of pre-specified genres (..., action movie, news, tennis match, drama, ...) -> genre
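
Matching a clip's style attributes against pre-specified genre profiles can be sketched as a nearest-profile lookup. The slides do not describe the matching function, so squared Euclidean distance is an assumption, and every attribute name and number below is made up for illustration rather than measured.

```python
def match_genre(attributes, genre_profiles):
    """Return the genre whose style profile is closest to `attributes`.

    Both arguments use the same attribute names; values are assumed to be
    normalized to the range 0..1 so that no attribute dominates the distance.
    """
    def squared_distance(profile):
        return sum((attributes[name] - value) ** 2 for name, value in profile.items())

    return min(genre_profiles, key=lambda genre: squared_distance(genre_profiles[genre]))


# Hypothetical style profiles (illustrative numbers, not measured data).
PROFILES = {
    "newscast": {"scene_length": 0.6, "camera_motion": 0.1, "audio_amplitude": 0.4},
    "car race": {"scene_length": 0.2, "camera_motion": 0.8, "audio_amplitude": 0.9},
    "cartoon": {"scene_length": 0.8, "camera_motion": 0.2, "audio_amplitude": 0.3},
}

clip = {"scene_length": 0.25, "camera_motion": 0.7, "audio_amplitude": 0.85}
genre = match_genre(clip, PROFILES)
```

Using the whole profile rather than one attribute reflects the slide's result that no single attribute identifies a genre on its own.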

21 Video Search: Video TREC
1st TREC (2001)
- Mean Average Precision (MAP) ... (in general category)
- Transcript only was better than transcript + image aspects + ASR
- There were different types of test: known items versus general; known items had at least one result
2nd TREC (2002)
- Interactive runs to rephrase queries
- Again, the text-based system using only ASR was the best-performing system
- Mean Average Precision of ...
- The leading system employed multiple systems: TF-IDF variants (MAP 0.093); the other was Boolean with query expansion (MAP 0.101)
- OCR was not particularly applicable; phonetic ASR did not help
3rd TREC (2003)
- Data changes radically (broadcast news added: CNN, ABC, CSPAN)
- Baseline ASR + CC: MAP ...
- ASR + CC + VOCR + image similarity + Person X retrieval: MAP ...
* Successful Approaches in the TREC Video Retrieval Evaluations, Alexander Hauptmann, Michael Christel, ACM Multimedia 2004.
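
Since this slide compares systems by Mean Average Precision, a short sketch of the standard MAP computation may help. The function names and the toy queries are illustrative assumptions; the exact evaluation protocol used at TREC (for example, result-list cutoffs) is not reproduced here.

```python
def average_precision(ranked_items, relevant_items):
    """Average precision of one ranked result list against a set of relevant items."""
    relevant = set(relevant_items)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant)


def mean_average_precision(runs):
    """Mean of average precision over (ranked_items, relevant_items) pairs."""
    if not runs:
        return 0.0
    return sum(average_precision(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical results for two queries (shot identifiers are made up).
query_1 = (["a", "b", "c", "d"], {"a", "c"})  # AP = (1/1 + 2/3) / 2
query_2 = (["x", "y"], {"y"})                 # AP = (1/2) / 1
map_score = mean_average_precision([query_1, query_2])
```

Relevant items that never appear in the ranked list still count in the denominator, so missing them lowers the score.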

22 39 concepts: Aircraft, Animal, Crowd, Desert, Military, Screen, Walking, Mountain, Sky, Water body, Boat, Entertainment, Natural disaster, Sports, Weather news, Building, Explosion, Office, Studio, Bus, Car, Face, Flag USA, Outdoor, Truck, People, Urban, Chart, Corp. leader, Gov. leader, Map, People marching, Vegetation, Court, Meeting, Police / security, Prisoner, Vehicle, Violence

23 System components:
- User interface: friendly, visually rich interface that allows the user to interactively query the database, browse the results, and view the selected video clips
- Query / search engine: responsible for searching the database according to the parameters provided by the user
- Visual summaries: representation of video contents in a concise, typically hierarchical, way
- Digital video archive: repository of digitized, compressed video data
- Digitization and compression: hardware and software necessary to convert the video information into digital compressed format
- Cataloguing: process of extracting meaningful story units from the raw video data and building the corresponding indexes
- Indexes: pointers to video segments or story units
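
The role of indexes as pointers to video segments, queried by the search engine, can be sketched with a small in-memory structure. This is only an illustration: the `Segment` and `SegmentIndex` names, the tag-based lookup, and the sample entries are hypothetical and do not describe any particular system from the slides.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Pointer to a story unit: which video, and where it starts and ends (seconds)."""
    video_id: str
    start_s: float
    end_s: float


class SegmentIndex:
    """Maps annotation tags to the video segments that carry them."""

    def __init__(self):
        self._by_tag = defaultdict(list)

    def add(self, tag, segment):
        """Cataloguing step: register `segment` under `tag` (case-insensitive)."""
        self._by_tag[tag.lower()].append(segment)

    def search(self, *tags):
        """Query step: return segments annotated with all of the given tags."""
        if not tags:
            return []
        matches = [set(self._by_tag.get(tag.lower(), [])) for tag in tags]
        common = set.intersection(*matches)
        return sorted(common, key=lambda s: (s.video_id, s.start_s))


# Hypothetical archive entries.
index = SegmentIndex()
goal = Segment("match_01", 120.0, 135.0)
report = Segment("news_07", 0.0, 60.0)
index.add("soccer", goal)
index.add("score", goal)
index.add("news", report)
results = index.search("Soccer", "score")
```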
