Multimedia Database Systems. Retrieval by Content

MIR Motivation. Large volumes of data world-wide are not only based on text: satellite images (oil spills), deep-space images (NASA), medical images (X-rays, MRI scans), music files (MP3, MIDI), video archives (YouTube), time series (earthquake measurements). Question: how can we organize this data to search for information? E.g., "Give me music files that sound like the file query.mp3", "Give me images that look like the image query.jpg".

MIR Motivation. One approach to handling multimedia objects is to exploit research performed in classic IR: each multimedia object is annotated using free text or a controlled vocabulary, and the similarity between two objects is determined as the similarity between their textual descriptions.

MIR Challenges. Multimedia objects are usually large in size. Objects do not have a common representation (e.g., an image is totally different from a music file). Similarity between two objects is subjective, and therefore hard to assess objectively. Indexing schemes are required to speed up search and avoid scanning the whole collection. The proposed techniques must be effective (achieve high recall and high precision if possible).

MIR Fundamentals. In MIR, the user information need is expressed by an object Q (in classic IR, Q is a set of keywords). Q may be an image, a video segment, or an audio file. The MIR system should determine objects that are similar to Q. Since the notion of similarity is rather subjective, we must have a function S(Q,X), where Q is the query object and X is an object in the collection; the value of S(Q,X) expresses the degree of similarity between Q and X.

MIR Fundamentals. Queries posed to an MIR system are called similarity queries, because the aim is to detect objects that are similar to a given query object; exact match is not very common in multimedia data. There are two basic types of similarity queries: A range query is defined by a query object Q and a distance r; the answer is composed of all objects X within distance r of Q, i.e., all X satisfying D(Q,X) <= r for a dissimilarity function D. A k-nearest-neighbor query is defined by an object Q and an integer k; the answer is composed of the k objects that are closer to Q than any other object.

MIR Fundamentals. [Figure: a range query with radius r around Q and a k-nearest-neighbor query with k = 3, illustrated in 2-D Euclidean space.]

MIR Fundamentals. Given a collection of multimedia objects, the ranking function S( ), the type of query (range or k-nn) and the query object Q, the brute-force method to answer the query is: Brute-Force Query Processing. [Step 1] Select the next object X from the collection. [Step 2] Test if X satisfies the query constraints. [Step 3] If YES, then report X as part of the answer. [Step 4] If objects remain, GOTO Step 1; otherwise stop.
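The brute-force scan above can be sketched in a few lines. Python with NumPy is assumed here, and the function names are illustrative, not from the lecture; the same linear scan answers both query types:

```python
import numpy as np

def range_query(collection, q, dist, r):
    """Brute-force range query: scan every object X and keep those
    whose distance from the query object q is at most r."""
    return [x for x in collection if dist(q, x) <= r]

def knn_query(collection, q, dist, k):
    """Brute-force k-nearest-neighbor query: rank all objects by
    distance to q, then keep the k closest ones."""
    return sorted(collection, key=lambda x: dist(q, x))[:k]

# A toy "collection" of 2-D points with Euclidean dissimilarity.
euclidean = lambda a, b: float(np.linalg.norm(np.asarray(a) - np.asarray(b)))
points = [(0, 0), (1, 1), (3, 3), (10, 10)]

print(range_query(points, (0, 0), euclidean, 2.0))  # [(0, 0), (1, 1)]
print(knn_query(points, (0, 0), euclidean, 2))      # [(0, 0), (1, 1)]
```

Note that both functions evaluate the distance for every object in the collection, which is exactly the O(n) cost the next slide criticizes.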

MIR Fundamentals. Problems with the brute-force method: The whole collection is being accessed, increasing computational as well as I/O costs. The complexity of the processing algorithm is independent of the query (i.e., O(n) objects will be scanned). Since the calculation of the function S( ) is usually time consuming and S( ) is evaluated for ALL objects, the overall running time increases. Objects are being processed in their raw form, without any intermediate representation; since multimedia objects are usually large in size, memory problems arise.

MIR Fundamentals. Multimedia objects are rich in content. To enable efficient query processing, objects are usually transformed to another, more convenient representation. Each object X in the original collection is transformed to another object T(X), which has a simpler representation than X. The transformation used depends on the type of multimedia object; therefore, different transformations are used for images, audio files and videos. The transformation process is related to feature extraction. Features are important object characteristics that have large discriminating power (they can differentiate one object from another).

MIR Fundamentals. Image Retrieval: paintings could be searched by artist, genre, style, color, etc.

MIR Fundamentals. Satellite images for analysis/prediction.

MIR Fundamentals. Audio Retrieval by content: e.g., music information retrieval. [Figure 1: waveforms and spectra of 2642.wav and RatedPG.wav.]

MIR Fundamentals. Each multimedia object (text, image, audio, video) is represented as a point (or set of points) in a multidimensional space.

Image Retrieval Using Annotations. Problem of image annotation: large volumes of databases; annotations are valid only for one language, a limitation that should not exist in image retrieval. Problem of human perception: subjectivity of human perception; too much responsibility is placed on the end-user. Problem of deeper (abstract) needs: queries that cannot be described in words at all, but tap into the visual features of images.

Image Retrieval Using Annotations. Young lady or old lady? Can you annotate these images?

Image Retrieval Using Annotations. Can you annotate this image?

Image Retrieval By Content. There are three important image characteristics: Color, Texture, Shape. Although each can be used by itself, two images that have similar colors, similar textures, and depict similar shapes are considered similar. The determination of similar images based on these characteristics is called Image Retrieval By Content.

Image Retrieval By Content. [Diagram: major modules of a content-based image retrieval system.]

Image Retrieval By Content. Some systems: QBIC, IBM Almaden, 1993. CANDID, Los Alamos National Laboratory, 1995. Berkeley Digital Library Project, University of California @ Berkeley, 1996. FOCUS, University of Massachusetts @ Amherst, 1997. MARS, University of Illinois @ Urbana-Champaign, 1997. C-Bird, Simon Fraser University, 1998. Blobworld, University of California @ Berkeley, 1999.

Color Features. Examining images based on the colors they contain is one of the most widely used techniques, because it does not depend on image size or orientation. Color searches usually involve comparing color histograms, though this is not the only technique in practice. [Figure: an image and its corresponding histogram.]

Color Features. The histogram of an image counts the number of pixels containing each specific color. For truecolor images, the RGB space is mapped to a 1-D space; to avoid a large number of colors (>16M), quantization is used. The dissimilarity between a query image Q and a DB image X corresponds to the distance between the histograms HQ and HX.
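The quantize-then-count step can be sketched as follows. This is a minimal illustration in Python/NumPy (the function name and bin count are choices made here, not part of the lecture): each RGB channel is reduced to a few levels, so the >16M possible colors collapse into a small histogram.

```python
import numpy as np

def color_histogram(image, bins_per_channel=4):
    """Quantize each RGB channel into `bins_per_channel` levels and
    count pixels per quantized color (bins_per_channel**3 buckets)."""
    img = np.asarray(image, dtype=np.uint32)
    q = img * bins_per_channel // 256                     # quantize each channel to 0..bins-1
    codes = (q[..., 0] * bins_per_channel + q[..., 1]) * bins_per_channel + q[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins_per_channel ** 3)
    return hist / hist.sum()                              # normalize to pixel fractions

# A tiny 2x2 "image": two red pixels, one green, one blue.
img = np.array([[[255, 0, 0], [255, 0, 0]],
                [[0, 255, 0], [0, 0, 255]]], dtype=np.uint8)
h = color_histogram(img)
print(h.max())  # 0.5 -> red occupies half of the pixels
```

Normalizing by the pixel count makes histograms of differently sized images comparable.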

Color Features. The distance between two histograms can be calculated by various distance measures, such as:

Euclidean Distance (L2): L2(HQ, HX) = Σ_i (HQ(i) − HX(i))²

Divergence: D(HQ, HX) = Σ_i (HQ(i) − HX(i)) · ln( HQ(i) / HX(i) )

Matusita Distance: M(HQ, HX) = sqrt( Σ_i ( √HQ(i) − √HX(i) )² )

Bhattacharyya Distance: B(HQ, HX) = −ln Σ_i √( HQ(i) · HX(i) )

Histogram Intersection: H∩(HQ, HX) = Σ_i min( HQ(i), HX(i) ) / min( Σ_i HQ(i), Σ_i HX(i) )

Note: HQ(i) is the i-th histogram position.
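The five measures above translate almost directly into NumPy (a sketch; the histograms are assumed to be normalized, and the divergence additionally assumes strictly positive bins):

```python
import numpy as np

def l2(hq, hx):
    """Squared Euclidean distance between histograms."""
    return float(np.sum((hq - hx) ** 2))

def divergence(hq, hx):
    """Symmetric KL-style divergence; bins must be strictly positive."""
    return float(np.sum((hq - hx) * np.log(hq / hx)))

def matusita(hq, hx):
    return float(np.sqrt(np.sum((np.sqrt(hq) - np.sqrt(hx)) ** 2)))

def bhattacharyya(hq, hx):
    return float(-np.log(np.sum(np.sqrt(hq * hx))))

def intersection(hq, hx):
    """Overlap in [0, 1]; 1 means identical normalized histograms."""
    return float(np.sum(np.minimum(hq, hx)) / min(hq.sum(), hx.sum()))

hq = np.array([0.5, 0.3, 0.2])
hx = np.array([0.4, 0.4, 0.2])
print(intersection(hq, hx))  # 0.9 -> the histograms overlap by 90%
```

Note that intersection is a similarity (higher is closer), while the other four are dissimilarities (lower is closer).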

Color Features. The previous measures assume that two colors are independent; therefore, small color differences may increase the dissimilarity of two images. To tackle this problem, Quadratic-Form distance functions have been proposed: D(Q,X) = (HQ − HX)^T A (HQ − HX), where A is a color-similarity matrix whose entry A(i,j) expresses the similarity between colors i and j.
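The effect of the similarity matrix A can be seen in a tiny example (Python/NumPy; the 3-bin histograms and the entries of A are made up for illustration): when two bins hold perceptually similar colors, a large A(i,j) shrinks the distance between histograms that differ only across those bins.

```python
import numpy as np

def quadratic_form_distance(hq, hx, a):
    """D(Q,X) = (HQ - HX)^T A (HQ - HX); A[i,j] encodes the
    perceptual similarity between color bins i and j."""
    d = hq - hx
    return float(d @ a @ d)

hq = np.array([1.0, 0.0, 0.0])          # all pixels in bin 0
hx = np.array([0.0, 1.0, 0.0])          # all pixels in bin 1
a_identity = np.eye(3)                  # colors treated as independent
a_similar = np.array([[1.0, 0.9, 0.0],  # bins 0 and 1 hold similar colors
                      [0.9, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])

print(quadratic_form_distance(hq, hx, a_identity))  # 2.0
print(quadratic_form_distance(hq, hx, a_similar))   # 0.2 -> much closer
```

With A = I the quadratic form degenerates to the squared Euclidean distance, which is exactly the "independent colors" assumption being criticized.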

Texture Features. Texture measures look for visual patterns in images and how they are spatially defined. Textures are represented by texels, which are then placed into a number of sets depending on how many textures are detected in the image. These sets not only define the texture, but also the texture location in the image. [Figure: examples of different textures.]

Texture Features. Texture is an innate property of all surfaces (clouds, trees, bricks, hair, etc.). It refers to visual patterns of homogeneity and does not result from the presence of a single color. The most widely accepted classification of textures is based on psychology studies, the Tamura representation: Coarseness, Contrast, Directionality (the three most important), Line-likeness, Regularity, Roughness.

Texture Features The most commonly used texture representation is the texture histogram or texture description vector, which is a vector of numbers that summarizes the texture in a given image or image region. Usually, the first three texture features (i.e., Coarseness, Contrast, Directionality) are used to generate the texture histogram. Texture dissimilarity is calculated by applying an appropriate distance function (e.g., divergence).
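As an illustration of one entry of such a description vector, here is a minimal sketch of the Tamura contrast measure (standard deviation normalized by the fourth root of the kurtosis), assuming a grayscale NumPy array as input; this is a simplification, not the full Tamura formulation:

```python
import numpy as np

def tamura_contrast(gray):
    """Contrast = sigma / kurtosis^(1/4): high for regions whose gray
    levels are spread out and polarized, low (zero) for flat regions."""
    g = np.asarray(gray, dtype=float)
    sigma = g.std()
    if sigma == 0:
        return 0.0                                     # perfectly flat region
    kurtosis = np.mean((g - g.mean()) ** 4) / sigma ** 4
    return float(sigma / kurtosis ** 0.25)

flat = np.full((16, 16), 128)
checker = np.indices((16, 16)).sum(axis=0) % 2 * 255   # 0/255 checkerboard
print(tamura_contrast(flat))      # 0.0
print(tamura_contrast(checker))   # 127.5 -> strong contrast
```

Computing a few such scalars (coarseness, contrast, directionality) per region yields the texture description vector the slide refers to.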

Shape Features Shape does not refer to the shape of an image but to the shape of a particular region. Shapes will often be determined by first applying segmentation or edge detection to an image. In some cases accurate shape detection will require human intervention, but this should be avoided. Popular methods: projection matching, boundary matching.

Shape Features. Boundary Matching algorithms require the extraction and representation of the boundaries of the query shape and the image shape. The boundary can be represented as a sequence of pixels, or may be approximated by a polygon. For a sequence of pixels, one classical matching technique uses Fourier descriptors to compare two shapes.

Shape Features. From this sequence of boundary points z_1, ..., z_m, a sequence of unit vectors v_k = (z_{k+1} − z_k) / |z_{k+1} − z_k| and a sequence of cumulative differences (arc lengths) l_k = Σ_{i=1}^{k} |z_i − z_{i−1}| can be computed.

Shape Features. The Fourier descriptors {a_{−M}, ..., a_0, ..., a_M} are then approximated from these sequences. These descriptors can be used to define a shape distance measure.
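A common simplified variant (not necessarily the exact formulation on the slide) computes the descriptors directly as the DFT of the complex boundary sequence z_k = x_k + j·y_k; dropping the DC term removes translation and dividing by the first magnitude removes scale, which is enough to illustrate the idea (Python/NumPy assumed):

```python
import numpy as np

def fourier_descriptors(boundary, m=8):
    """Treat boundary points (x, y) as complex numbers z = x + jy,
    take the DFT, and keep the first m coefficient magnitudes.
    Skipping a0 removes translation; dividing by |a1| removes scale;
    using magnitudes makes the descriptor rotation invariant."""
    z = boundary[:, 0] + 1j * boundary[:, 1]
    coeffs = np.fft.fft(z)
    descr = np.abs(coeffs[1:m + 1])
    return descr / descr[0]

def shape_distance(b1, b2, m=8):
    """Euclidean distance between the two shapes' descriptors."""
    return float(np.linalg.norm(fourier_descriptors(b1, m) - fourier_descriptors(b2, m)))

t = np.linspace(0, 2 * np.pi, 64, endpoint=False)
circle = np.stack([np.cos(t), np.sin(t)], axis=1)
big_circle = 5 * circle + 10               # scaled and translated copy
print(shape_distance(circle, big_circle))  # ~0 -> recognized as the same shape
```

Because the descriptor is invariant to translation and scale, the scaled, shifted circle matches the original almost exactly.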

Image Similarity Computation. From the color, texture and shape features, a single vector VX can be produced for image X, and a single function S(VQ,VX) is used to compute similarity. Alternatively, each image X is scored separately for color, texture and shape similarity with respect to the query Q, and the partial scores are combined to compute the final score. Potential problem: feature vectors may be large (many dimensions), causing problems in indexing.
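The second option, combining partial scores, is often just a weighted average. A minimal sketch (the scores and weights below are invented for illustration; the lecture does not prescribe a combination rule):

```python
def combined_score(scores, weights):
    """Weighted combination of per-feature similarity scores
    (color, texture, shape) into one final score in [0, 1]."""
    total = sum(weights.values())
    return sum(weights[f] * scores[f] for f in scores) / total

scores = {"color": 0.9, "texture": 0.6, "shape": 0.3}
weights = {"color": 0.5, "texture": 0.3, "shape": 0.2}
print(combined_score(scores, weights))  # 0.69
```

Tuning the weights (here favoring color) is one way to adapt the system to a particular image domain.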

Audio Retrieval By Content. Deals with the retrieval of similar audio pieces (e.g., music). A feature vector is constructed by extracting acoustic and subjective features from the audio files in the database; the same features are extracted from the queries. The relevant audio files in the database are ranked according to the feature match between the query and the database.

Audio Retrieval By Content. [Figure: waveforms of 2642.wav and RatedPG.wav over time.]

Audio Retrieval By Content. As in the image case, characteristic features should be extracted from audio files. Audio features are separated into two categories: Acoustic features (objective) and Subjective/Semantic features.

Acoustic Features. Acoustic features describe an audio signal in terms of commonly understood acoustical characteristics and can be computed directly from the audio file: Loudness, Spectrum Powers, Brightness, Bandwidth, Pitch, Cepstrum.

Acoustic Features. Loudness is approximated by the square root of the energy of the signal computed from the Short-Time Fourier Transform (STFT), in decibels. Spectrum Powers include the total spectrum power and sub-band powers; they are all represented in logarithmic form. Brightness is computed as the centroid of the STFT and is stored as a log frequency. Bandwidth is computed as the power-weighted average of the difference between the spectral components and the centroid.
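These three features can be sketched for a single analysis frame (Python/NumPy; a simplified sketch that uses one FFT frame rather than a full STFT, and plain rather than log frequency for the centroid):

```python
import numpy as np

def frame_features(frame, sr):
    """Loudness (dB), brightness (spectral centroid, Hz) and bandwidth
    (power-weighted spread around the centroid, Hz) of one audio frame."""
    spec = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    power = spec ** 2
    loudness_db = 10 * np.log10(power.sum() + 1e-12)        # frame energy in dB
    centroid = (freqs * power).sum() / power.sum()          # brightness
    bandwidth = np.sqrt(((freqs - centroid) ** 2 * power).sum() / power.sum())
    return loudness_db, centroid, bandwidth

sr = 8000
t = np.arange(1024) / sr
tone = np.sin(2 * np.pi * 1000 * t)       # pure 1 kHz tone
loud, cent, bw = frame_features(tone, sr)
print(round(cent))   # ~1000 -> the centroid sits at the tone frequency
```

For a pure tone the centroid coincides with the tone frequency and the bandwidth is near zero; noisy or percussive frames spread power across frequencies and push the bandwidth up.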

Acoustic Features. Pitch is the fundamental period of a human speech waveform and is an important parameter in the analysis and synthesis of speech signals. We can also use pitch as a low-level feature to characterize changes in the periodicity of waveforms in different audio signals. The Cepstrum has proven highly effective in automatic speech recognition and in modeling the subjective pitch and frequency content of audio signals; it can be illustrated by the Mel-Frequency Cepstral Coefficients (MFCC), which are computed from FFT power coefficients.
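The link between cepstrum and pitch can be shown with the plain real cepstrum (a simpler relative of MFCC, without the mel filter bank; Python/NumPy assumed, and the test signal is synthetic):

```python
import numpy as np

def real_cepstrum(frame):
    """Real cepstrum: inverse FFT of the log magnitude spectrum.
    A peak at quefrency q means the spectrum has harmonics spaced
    sr/q Hz apart, i.e. the frame is periodic with pitch sr/q."""
    spectrum = np.abs(np.fft.fft(frame)) + 1e-12   # avoid log(0)
    return np.fft.ifft(np.log(spectrum)).real

sr = 8000
t = np.arange(2000) / sr
# A crude "voiced" sound: 200 Hz fundamental plus four weaker harmonics.
signal = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in range(1, 6))
ceps = real_cepstrum(signal)
period = 20 + int(np.argmax(ceps[20:60]))   # search near the expected pitch period
print(sr / period)                          # ~200 -> recovered fundamental
```

MFCCs follow the same recipe but warp the spectrum onto the mel scale and apply a DCT, which is why they model perceived pitch better than the raw cepstrum.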

Subjective/Semantic Features. Subjective features describe sounds using personal descriptive language; the system must be trained to understand the meaning of these descriptive terms. Semantic features are high-level features summarized from the low-level features; compared with low-level features, they reflect the characteristics of audio content more accurately.

Subjective/Semantic Features. Major features: Timbre: determined by the harmonic profile of the sound source; also called tone color. Rhythm: represents changes in patterns of timbre and energy in a sound clip. Events: typically characterized by a starting time, a duration, and a series of parameters such as pitch, loudness, articulation, vibrato, etc. Instruments: instrument identification can be accomplished using a histogram classification system; this requires that the system has been trained on all possible instruments.

Audio Similarity Computation. Given a query audio file Q, the retrieval process is applied to the feature vectors generated from the audio signals. As in the image case, we need a similarity function S(Q,X) to compute the similarity between Q and any audio file X. To enable sub-pattern similarity (e.g., find audio files containing a particular audio piece), additional processing is required (e.g., segmentation).

Video Retrieval By Content. Video is the most demanding and challenging multimedia type, containing: Text (e.g., subtitles), Audio (e.g., music, speech), Images (frames). All of these change over time!

Video Retrieval By Content. There has been amazing growth in the amount of digital video data in recent years, but a lack of tools to classify and retrieve video content. A gap exists between low-level features and high-level semantic content. Making machines understand video is important and challenging.

Video Features. Decomposition: Scene: a single dramatic event taken by a small number of related cameras. Shot: a sequence taken by a single camera. Frame: a still image.

Video Features. [Diagram: a video decomposes into scenes, scenes into shots (found by shot boundary analysis of obvious cuts), and shots into frames (examined by key frame analysis).]

Shot Detection. A shot is a contiguous recording of one or more video frames depicting a continuous action in time and space. During a shot, the camera may remain fixed or may exhibit motions such as panning, tilting, zooming, tracking, etc. Segmentation is the process of dividing a video sequence into shots. Consecutive frames on either side of a camera break generally display a significant quantitative change in content, so we need a suitable quantitative measure that captures the difference between two frames.

Shot Detection. Use of pixel differences: tends to be very sensitive to camera motion and minor illumination changes. Global histogram comparisons: produce relatively accurate results compared to other methods. Local histogram comparisons: produce the most accurate results. Use of motion vectors: produces more false positives than histogram-based methods. Use of DCT coefficients from MPEG files: also produces more false positives than histogram-based methods.
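The global-histogram approach can be sketched in a few lines (Python/NumPy; grayscale frames are assumed, and the bin count and threshold are illustrative choices):

```python
import numpy as np

def detect_cuts(frames, bins=16, threshold=0.5):
    """Compare global gray-level histograms of consecutive frames;
    a large histogram change signals a camera break (hard cut)."""
    hists = [np.histogram(f, bins=bins, range=(0, 256))[0] / f.size
             for f in frames]
    cuts = []
    for i in range(1, len(hists)):
        diff = np.abs(hists[i] - hists[i - 1]).sum() / 2   # in [0, 1]
        if diff > threshold:
            cuts.append(i)                                 # cut starts at frame i
    return cuts

dark = np.full((32, 32), 20)       # frames from shot 1
bright = np.full((32, 32), 230)    # frames from shot 2
print(detect_cuts([dark, dark, bright, bright]))  # [2] -> cut at frame 2
```

Because the histogram ignores where pixels are, moderate camera motion within a shot barely changes it, which is exactly why this measure is more robust than raw pixel differences.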

Shot Representation Three major ways: Based on representative frames Based on motion information Based on objects We focus on the first one.

Representative Frames. The most common way of creating a shot index is to use a representative frame to represent each shot. Features of this frame are extracted and indexed based on color, shape and texture (as in image retrieval). If shots are quite static, any frame within the shot can be used as the representative; otherwise, more effective methods should be used to select the representative frame. Two issues: (i) how many frames must be selected from each shot, and (ii) how to select these frames.

Representative Frames. How many frames per shot? Three methods: (i) One frame per shot; this does not consider the length of the shot or its content changes. (ii) The number of selected representatives depends on the length of the video shot; content is still not handled properly. (iii) Divide shots into subshots and select one representative frame from each subshot; both length and content are taken into account.

Representative Frames. How are the frames selected? (i) Select the first frame of each segment (shot or subshot). (ii) Define an average frame, in which each pixel is the average of the pixel values at the same grid point over all frames of the segment; the frame within the segment that is most similar to this average frame is selected as the representative. (iii) Average the histograms of all frames in the segment; the frame whose histogram is closest to this average histogram is selected as the representative frame of the segment.
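The average-histogram method can be sketched as follows (Python/NumPy; grayscale frames and an L2 comparison are assumptions made here):

```python
import numpy as np

def representative_frame(frames, bins=16):
    """Pick the frame whose gray-level histogram is closest (L2)
    to the average histogram of all frames in the segment."""
    hists = np.array([np.histogram(f, bins=bins, range=(0, 256))[0] / f.size
                      for f in frames])
    avg = hists.mean(axis=0)                      # average histogram of the segment
    dists = np.linalg.norm(hists - avg, axis=1)   # each frame's distance to it
    return int(np.argmin(dists))

# A segment whose brightness drifts: the mid-brightness frames are typical.
shot = [np.full((8, 8), v) for v in (10, 100, 110, 120, 250)]
print(representative_frame(shot))  # 1 -> a frame close to the segment's average
```

Working on histograms rather than on an average frame avoids the ghosting artifacts that pixel averaging produces when objects move during the segment.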

Representation of Multimedia Objects Each multimedia object (text, image, audio, video) is represented as a point (or set of points) in a multidimensional space.