CHAPTER 3 FEATURE EXTRACTION


3.1. INTRODUCTION

This chapter presents the feature representation of the information contained in a frame of a video: the feature representation for image objects, the feature representation for text, and the feature representation for audio.

Image analysis is the quantitative or qualitative characterization of two-dimensional (2D) or three-dimensional (3D) digital images to extract meaningful information. The characterization of an image is based upon visual features extracted from the image. These can then be used to classify images with similar characteristics for applications such as content-based image retrieval (CBIR), also known as query by image content (QBIC). Applications may require the classification and retrieval of the entire image as a whole; however, images may also be segmented into sub-regions that represent distinct objects within the image.

FEATURE EXTRACTION IN IMAGE

There are four main descriptors for the visual content of an image:
1. Color features.
2. Textural features.
3. Geometrical or shape-based features.
4. Topological features.

These features can be either global or local. Global image analysis considers the image as a whole, whereas local analysis first segments the image into several regions of interest (ROI) and then computes the properties and features of each ROI.

Color features

Color spaces: A color model is an abstract mathematical model describing the way colors can be represented as tuples of numbers, typically as three or four values or color components. When this model is associated with a precise description of how the components are to be interpreted (viewing conditions, etc.), the resulting set of colors is called a color space. The choice of a color space depends on the information to be extracted or on the treatment to be applied.

Color histograms: These encode the frequency distribution of pixel values either over a whole image or over a region of interest (ROI). Given a finite set of colors, the histogram associates with each color its frequency in the image. It is invariant under geometrical transformations such as translation and rotation. When comparing two images or ROIs using histograms, it is necessary to compute the distance between the histograms using dissimilarity measures such as the Euclidean, chi-square, Kolmogorov-Smirnov and Kuiper distances. Classical histograms and most of their derivatives do not take the spatial distribution of pixels into account; blob histograms, however, are able to differentiate pictures having the same color distribution but containing objects of different sizes. In order to reduce the histogram size, a few representative colors can be selected from the color space, either using some generic heuristic or by analyzing the image. This color quantization can be used as a basic descriptor of the image.

Color moments: These have been shown to be both efficient and effective for representing the color distribution of images [Stricker and Orengo, 1995]. They include the first-order moment (mean), the second-order moment (variance) and the third-order moment (skewness); thus an image can be described by only nine values (three moments per color component).
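As an illustration, a minimal NumPy sketch of these color descriptors is given below: a quantized color histogram, a chi-square histogram distance, and the nine color moments. The function names, the number of bins, and the use of standard deviation and cube-root skewness for the second and third moments are implementation choices for this sketch, not prescriptions from the text.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Quantize each RGB channel into `bins` levels and count joint colours.

    `image` is an (H, W, 3) uint8 array; the result is a normalized
    histogram with bins**3 entries, invariant to translation and rotation.
    """
    quantized = (image.astype(np.int64) * bins) // 256            # 0 .. bins-1 per channel
    codes = (quantized[..., 0] * bins + quantized[..., 1]) * bins + quantized[..., 2]
    hist = np.bincount(codes.ravel(), minlength=bins ** 3).astype(np.float64)
    return hist / hist.sum()

def chi_square_distance(h1, h2, eps=1e-10):
    """Chi-square dissimilarity between two normalized histograms."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

def color_moments(image):
    """First three moments per channel (mean, standard deviation, cube-root
    skewness), giving a nine-value colour descriptor."""
    pixels = image.reshape(-1, 3).astype(np.float64)
    mean = pixels.mean(axis=0)
    std = pixels.std(axis=0)
    skew = np.cbrt(((pixels - mean) ** 3).mean(axis=0))
    return np.concatenate([mean, std, skew])
```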

Textural features

From a perceptual point of view, a texture may be defined by its coarseness, repetitiveness, directionality and granularity. In terms of digital images, however, the texture of an image or region is defined as a function of the spatial variation in pixel intensities (grey values) [Tuceryan and Jain]. Texture analysis is used to determine regions of homogeneous texture; the boundaries between these regions can then be used to segment the image. Textural classification is also used to associate a region with a textural class, e.g. the material being represented (cotton, sand, etc.) or a property of that material (smooth, coarse, etc.). The image analysis applied in the modeling of texture can be divided into three general methods:

Statistical methods: These characterize image texture according to measures of the spatial distribution of grey values (e.g. moments of different orders, correlation functions, related covariance functions).

Structural methods: These assume that textures are composed of primitives (called texels). The texture is produced by the placement of these primitives according to certain placement rules. This class of algorithms is, in general, limited in power unless one is dealing with very regular textures. Structural texture analysis consists of two major steps: (a) extraction of the texture elements (texels), and (b) inference of the placement rule. A texture may then be characterized through properties of its texels (average intensity, area, perimeter, etc.) or through the texel pattern as defined by the placement rules.

Model-based methods: These study texture as a linear combination of a set of basis functions. The two main difficulties of such methods are first to find a suitable model to represent the texture (e.g. fractal model, Markov model, Fourier filter, multi-channel Gabor filter, wavelet transform) and then to compute the accurate parameters which capture the essential perceived characterization of the texture.
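The statistical route is the one used later in this chapter, in the form of grey-level co-occurrence values such as contrast, energy, correlation and homogeneity. The sketch below, in plain NumPy, builds a co-occurrence matrix for a single displacement and derives those four measures; the quantization to eight grey levels and the single-displacement choice are illustrative assumptions.

```python
import numpy as np

def glcm(gray, levels=8, dx=1, dy=0):
    """Grey-level co-occurrence matrix for one displacement (dx, dy).

    `gray` is a 2-D uint8 image; intensities are quantized to `levels`
    grey levels and the matrix is normalized to a joint probability.
    """
    q = (gray.astype(np.int64) * levels) // 256
    a = q[:q.shape[0] - dy, :q.shape[1] - dx]            # reference pixels
    b = q[dy:, dx:]                                      # neighbours at offset (dx, dy)
    codes = a.ravel() * levels + b.ravel()
    counts = np.bincount(codes, minlength=levels * levels).astype(np.float64)
    p = counts.reshape(levels, levels)
    return p / p.sum()

def glcm_features(p):
    """Contrast, energy, homogeneity and correlation of a normalized GLCM."""
    levels = p.shape[0]
    i, j = np.indices((levels, levels))
    contrast = np.sum(p * (i - j) ** 2)
    energy = np.sum(p ** 2)
    homogeneity = np.sum(p / (1.0 + np.abs(i - j)))
    mu_i, mu_j = np.sum(i * p), np.sum(j * p)
    sd_i = np.sqrt(np.sum(p * (i - mu_i) ** 2))
    sd_j = np.sqrt(np.sum(p * (j - mu_j) ** 2))
    correlation = np.sum(p * (i - mu_i) * (j - mu_j)) / (sd_i * sd_j + 1e-10)
    return {"contrast": contrast, "energy": energy,
            "homogeneity": homogeneity, "correlation": correlation}
```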

Geometrical or shape-based features

Using shape descriptors implies being able to extract accurate shapes from an image. Shape descriptors may be based on contour or edge detection together with statistical tools. Such methods are particularly suitable for simple images containing one shape that is easily distinguishable from the background, but better results may be obtained after a segmentation process, which is necessary when dealing with complex images. Shapes can be described either by their contour or by the region they contain. Moreover, they can be seen from either a global or a local point of view. The former approach, which has been chosen for many shape descriptors, aims at capturing some overall property either of the shape itself or of its contour (e.g. Fourier descriptors). The latter approach is based on local observations of the region or, more often, of its contour (e.g. inflexion points). Global shape descriptors may be misled when occlusions occur, whereas local ones are very sensitive to noise.

Region descriptors: Simple geometrical attributes such as area, eccentricity, bounding box, elongation, convexity, compactness, and circular or elliptic variance are often used to describe shapes. Although simple to compute, and easily gathered into an attribute vector that may be compared through a suitable distance, their characterization power is generally too weak for them to be used in isolation, and they are often combined with more complex shape descriptors, such as those provided by geometrical moments.

Contour descriptors: Fourier descriptors are one of the most popular tools for characterizing and comparing contours. A contour is first sampled into a given number of points. A shape signature function is then applied to the representative points of the contour (e.g. complex shape signature, distance to centroid, area, cumulative angular function, curvature). Such a function produces a set of values, which are encoded through a Fourier transform and then normalized. Other methods include autoregressive models and wavelet transforms (particularly suitable for describing high-curvature points).
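A minimal sketch of such a contour descriptor, assuming the boundary has already been extracted as an ordered list of points, might look as follows; the complex shape signature and the normalization scheme shown here are one common choice among several.

```python
import numpy as np

def fourier_descriptors(contour, num_coeffs=16):
    """Translation-, scale- and rotation-invariant Fourier descriptor.

    `contour` is an (N, 2) array of boundary points sampled in order.
    Points are treated as complex numbers (the 'complex shape signature'),
    transformed with the FFT, and normalized.
    """
    z = contour[:, 0] + 1j * contour[:, 1]        # complex shape signature
    coeffs = np.fft.fft(z)
    coeffs[0] = 0.0                               # drop DC term -> translation invariance
    mags = np.abs(coeffs)                         # drop phase -> rotation / start-point invariance
    if mags[1] > 0:
        mags = mags / mags[1]                     # scale by first harmonic -> scale invariance
    return mags[1:num_coeffs + 1]                 # keep the low-order harmonics
```

Two contours can then be compared by, for instance, the Euclidean distance between their descriptor vectors.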

Topological features

Digital topology deals with properties and features of two-dimensional (2D) or three-dimensional (3D) digital images that correspond to topological properties (e.g. connectedness) or topological features (e.g. boundaries) of objects. Concepts and results of digital topology are used to specify and justify important (low-level) image analysis algorithms, including algorithms for thinning, border or surface tracing, counting of components or tunnels, and region filling.

IMAGE SEGMENTATION

In order to analyze an image at the level of the objects it contains, it is necessary to segment the image so that the image features can be related to the region representing each object. A segmentation process aims at accurately identifying the different areas of an image, either by computing an accurate partition of the image through the detection of coherent regions or by detecting the boundaries between regions. There are three broad approaches applied in ROI detection. The first uses affine region detectors, which detect regions covariant with a class of affine transformations. The second is based on extracting a per-pixel salience measure; after grouping pixels of similar saliency, a hierarchical representation of salient regions may be obtained. Finally, clustering can be applied to ROI detection; as is usual with clustering, three basic methods are possible: generating the clustering bottom-up (starting from a set of seed regions and combining regions until some stop criterion is reached), top-down (splitting the image into smaller regions), or a combination of both bottom-up and top-down. The main difficulty in applying such clustering methods is in choosing accurate criteria to characterize regions and in determining a stopping condition for the algorithm.
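As a toy illustration of the bottom-up, similarity-based route, the sketch below binarizes a grey-level image with a global threshold and then groups 4-connected pixels into regions; practical systems replace both steps with the richer criteria discussed above.

```python
import numpy as np
from collections import deque

def threshold_segment(gray):
    """Binarize a grey-level image with a simple global (mean-based) threshold."""
    return gray > gray.mean()

def label_regions(binary):
    """4-connected component labelling of a binary mask (a simple bottom-up
    grouping of similar pixels). Returns a label image and the region count."""
    labels = np.zeros(binary.shape, dtype=np.int32)
    current = 0
    h, w = binary.shape
    for y in range(h):
        for x in range(w):
            if binary[y, x] and labels[y, x] == 0:
                current += 1                       # start a new region
                labels[y, x] = current
                queue = deque([(y, x)])
                while queue:                       # breadth-first flood fill
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny, nx] and labels[ny, nx] == 0:
                            labels[ny, nx] = current
                            queue.append((ny, nx))
    return labels, current
```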

VIDEO

One of the distinctive features of video analysis is that it brings together a number of media types (image, audio and, via ASR, text) in a single connected setting. Video analysis therefore has the opportunity of exploiting the data from these correlated, simultaneous channels to extract information. In addition, there are other features specific to the medium of video: those that involve the way in which video frames are linked together using various editing effects (cuts, fades, dissolves, etc.). The general video analysis process involves:

Boundary detection: segmenting the video stream into shots.
Key-frame extraction: characterizing the content of a shot or video.
Determining what objects are in the shot or video.

The primary application of such a process is to allow the indexing of video in order to make it searchable by content-based retrieval systems; the ultimate goal, however, is to recognize the events portrayed and to understand the narrative of the video.

Feature extraction in video

By analyzing a video stream in terms of a structured sequence of shots, and then characterizing the shots in terms of key-frames, the modeling of video content is reduced to extracting the content of structured still images. This means that the visual features extracted from video are mainly derived from the frame images described above. In addition, videos have features that describe the motion of objects between frames, as well as features relating to the audio channel.

Boundary detection: The identification of shot boundaries is an essential step prior to performing shot-level feature extraction and any subsequent scene-level analysis. Shot transitions can be classified into two types: abrupt transitions (cuts) and gradual transitions (fades, wipes, dissolves, etc.). The approaches to detecting these shot transitions either make use of some statistical measure of the change in frame features that indicates a transition, or use some form of machine learning (ML). In general, visual features are used to identify the boundaries. A number of ML approaches have been applied to boundary detection, including nearest neighbor, neural networks, HMMs (for both shot boundary detection and higher-level topic/story boundary detection) and SVMs.

Key-frame extraction: The usual approach to providing a higher-level description of a video stream is to extract a set of key-frames which represent a summarization of the content of the whole stream. The general technique employed is frame clustering, with each cluster centred on a key-frame, so that the key-frames are maximally distinct from one another. The results of applying the clustering technique depend upon which features are used, the distance metric employed, and the method for determining the number of key-frames (clusters) which sufficiently describe the video. Although clustering is the main key-frame extraction technique, other ML approaches, such as genetic algorithms, have been applied to the problem.

Object extraction: The extraction of objects from video applies the techniques described above for image object identification. As objects can be found in a number of sequential or disparate frames, they can also be used as features in key-frame extraction.
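A minimal sketch of the statistical route, using the colour histograms introduced earlier, is given below: consecutive-frame histogram differencing for cut detection, followed by a very small stand-in for clustering-based key-frame selection. The bin count and decision threshold are illustrative assumptions, and gradual transitions need windowed or learned detectors.

```python
import numpy as np

def frame_histogram(frame, bins=16):
    """Per-channel intensity histogram of an (H, W, 3) frame, normalized."""
    hist = np.concatenate([np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
                           for c in range(3)]).astype(np.float64)
    return hist / hist.sum()

def detect_cuts(frames, threshold=0.4):
    """Flag abrupt transitions (cuts) where the L1 histogram distance between
    consecutive frames exceeds `threshold`.  `frames` is a list of RGB arrays."""
    hists = [frame_histogram(f) for f in frames]
    cuts = []
    for i in range(1, len(hists)):
        distance = 0.5 * np.sum(np.abs(hists[i] - hists[i - 1]))
        if distance > threshold:
            cuts.append(i)               # shot boundary between frame i-1 and frame i
    return cuts

def pick_keyframes(frames, cuts):
    """For each shot delimited by the detected cuts, pick as key-frame the
    frame whose histogram is closest to the shot's mean histogram."""
    boundaries = [0] + cuts + [len(frames)]
    keyframes = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        if end > start:
            hists = np.array([frame_histogram(f) for f in frames[start:end]])
            mean = hists.mean(axis=0)
            best = start + int(np.argmin(np.abs(hists - mean).sum(axis=1)))
            keyframes.append(best)
    return keyframes
```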

FEATURE REPRESENTATION

Feature representation for objects in an image

Step 1: A query image is read from the user.
Step 2: The three planes of the input image are separated, or the image is converted into a grey-scale image and a color map. The average color map (red, green, blue) is obtained.
Step 3: The contents are segmented.
Step 4: The number of objects is found.
Step 5: Using the region properties function of MATLAB, 14 properties of each object are obtained. Regular shapes of the objects are defined; this includes the text matter in the image. If it is a texture image (e.g. a garden scene), then grey-level co-occurrence matrix values such as contrast, energy, entropy and homogeneity are calculated.
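A rough Python counterpart of these steps is sketched below; it assumes SciPy is available for connected-component labelling and computes only a couple of region properties per object (the thesis itself relies on MATLAB's region properties for the full set of 14).

```python
import numpy as np
from scipy import ndimage   # assumed available; provides connected-component labelling

def frame_features(frame):
    """Rough NumPy/SciPy counterpart of Steps 1-5 above: average colour map,
    a crude segmentation, the object count, and simple per-object properties."""
    avg_rgb = frame.reshape(-1, 3).mean(axis=0)        # Step 2: average R, G, B values
    gray = frame.astype(np.float64).mean(axis=2)       # grey-scale conversion
    binary = gray > gray.mean()                        # Step 3: crude global-threshold segmentation
    labels, num_objects = ndimage.label(binary)        # Step 4: number of objects
    regions = []                                       # Step 5: simple region properties
    for idx, region_slice in enumerate(ndimage.find_objects(labels), start=1):
        mask = labels[region_slice] == idx
        regions.append({
            "area": int(mask.sum()),
            "bounding_box": tuple((s.start, s.stop) for s in region_slice),
        })
    return {"avg_rgb": avg_rgb, "num_objects": num_objects, "regions": regions}
```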

In a video, each representative frame is preprocessed to extract information from the frame. Digital image processing is used to enhance image features of interest while attenuating irrelevant detail. Subsequent to enhancement of the frame image, segmentation of the image is carried out. Segmentation subdivides an image into its constituent parts or objects; the level to which this subdivision is carried depends on the problem to be solved. That is, segmentation should stop when the objects of interest in an application are isolated. Segmentation algorithms for monochrome images are generally based on one of two basic properties of grey-level values: discontinuity and similarity. In the first category, the approach is to partition an image based on abrupt changes in grey level; the principal areas of interest within this category are the detection of isolated points and the detection of lines and edges in an image. The principal approaches in the second category are based on thresholding, region growing, and region splitting and merging. The concept of segmenting an image based on discontinuity or similarity of the grey-level values of its pixels is applicable to both static and dynamic (time-varying) images.

Table 3.1 presents sample videos and their properties. The first column presents a frame of each video. The second column presents the RGB values of each frame; the number of frames shown in each plot varies for different videos. All of these videos were obtained from internet resources. The range of average color map values is different for each video, and this is one factor used to uniquely identify a video. The number of objects in each frame of a video is also presented; the number of objects can be used instead of specifying a particular object. The grey-level co-occurrence matrix values (contrast, energy, correlation and homogeneity) are also presented; these values differ from video to video.

Table 3.1. Average red, green and blue color maps, number of objects, and grey-level co-occurrence matrix values (contrast, energy, correlation, homogeneity) for a representative frame of each sample video: evening cloud (video 1), ball oscillation (video 2), CT scanner (video 3), fMRI experiment (video 4), Indian cricket (video 5), horse in motion (video 6) and fMRI (video 7).

Feature representation for texts in a frame

The different texts that appear in a video can be explicitly listed as keywords in a text document that can be used for the retrieval of videos. Alternatively, the frame can be preprocessed and segmented, followed by labeling of the objects and obtaining the bounding box of each object.

The original contents corresponding to each bounding box can then be used to detect the presence of text by template matching.

Step 1: A frame is read.
Step 2: The image is preprocessed and segmented suitably.
Step 3: The segmented image is labeled.
Step 4: Using region properties, the bounding box of each segmented object is obtained.
Step 5: The intensities corresponding to each bounding box are compared with character templates to determine the presence of characters in the image. The detected characters are combined in sequence to obtain a word.

The appearance of text in a frame is presented in Table 3.2.
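Step 5, the comparison of bounding-box intensities against character templates, can be sketched with a normalized cross-correlation score. The helper names, the acceptance threshold and the assumption that templates have already been resized to the patch size are illustrative.

```python
import numpy as np

def match_template(patch, template):
    """Normalized cross-correlation score between a candidate character
    patch and a character template of the same size (1.0 = identical)."""
    p = patch.astype(np.float64) - patch.mean()
    t = template.astype(np.float64) - template.mean()
    denom = np.sqrt((p ** 2).sum() * (t ** 2).sum())
    return float((p * t).sum() / denom) if denom > 0 else 0.0

def recognize_characters(box_images, templates, accept=0.6):
    """Compare each bounding-box sub-image against every character template
    and keep the best match; `templates` maps a character to its template
    image (already resized to the patch size).  Boxes are assumed to be in
    left-to-right order, so the matches concatenate into a word."""
    word = []
    for patch in box_images:
        scores = {ch: match_template(patch, tmpl) for ch, tmpl in templates.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= accept:
            word.append(best)
    return "".join(word)
```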

Table 3.2. Text appearing in frames of different videos: SRILANKA, OILSHAN, IDBI FORTIS, NUMERALS, Cricket ball in motion, SERVO, DAIKIN, EMIRATES, MONEY GRAM, CAUTION, KOREAN, VODAFONE, DAILY, FLOOD, NASA, NDTV, AIRCEL, SAHARA, IPL CHAMPIONS, SAHARA INDIA, THE WEEK THAT WASN'T, ACJ'S FIGHT, KARNATAKA CAUVERY WATER.

Feature representation for the audio track in a video

The audio is extracted from the video using a video-to-audio conversion process. The stereo track is converted into mono and the words in the track are separated. The words present in the thirteen videos are shown in Table 3.3.
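A minimal sketch of the stereo-to-mono conversion and a simple energy-based separation of words from silence is given below; the frame length and the energy ratio are illustrative assumptions rather than the values used in the thesis.

```python
import numpy as np

def to_mono(stereo):
    """Average the two channels of an (N, 2) stereo track into mono."""
    return stereo.astype(np.float64).mean(axis=1)

def split_words(mono, rate, frame_ms=25, energy_ratio=0.1):
    """Separate spoken words from silence with a short-time energy gate.

    Frames whose energy falls below `energy_ratio` times the maximum frame
    energy are treated as silence; runs of voiced frames are returned as
    (start_sample, end_sample) word segments.
    """
    frame_len = int(rate * frame_ms / 1000)
    n_frames = len(mono) // frame_len
    frames = mono[:n_frames * frame_len].reshape(n_frames, frame_len)
    energy = (frames ** 2).sum(axis=1)
    voiced = energy > energy_ratio * energy.max()
    segments, start = [], None
    for i, v in enumerate(voiced):
        if v and start is None:
            start = i * frame_len                      # word begins
        elif not v and start is not None:
            segments.append((start, i * frame_len))    # word ends
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_len))
    return segments
```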

Table 3.3. Words in videos

V1  - 11 (V1): Noisy
V2  - V (V2): Noisy
V3  - Ct-Scanner-How It Works (V3): Scanner, Medical, Receiver, Patients, X-Ray
V4  - How Does Mri-Works (V4): Hydrogen, Radio Frequency, Magnetic
V5  - Mri Animation (V5): Tissue, Bodies, Bone Fractures
V6  - Fmri-Experimentation With Virtual Reality Application (V6): Silence
V7  - Introduction To Ct-Imaging (V7): Silence
V8  - Lung Cancer Video (V8): Silence
V9  - Mri Of Brain (V9): Silence
V10 - Real Time 3d Geometry (V10): Geometry, Video, Facial Expression, Beautiful Smile
V11 - Simple Demonstration Of Magnetic (V11): Silence
V12 - The Ct-Scan Process (V12): Silence
V13 - Voxel Mri Dataset Rendering (V13): Silence

Table 3.4 presents, for each important word, the videos in which it occurs. This table helps us retrieve all the videos that contain a particular spoken word.

Table 3.4. Occurrence of important words across videos V1-V13: scanner, medical, receiver, patients, X-ray, bodies, radio frequency, tissue, bone fractures, magnetic, geometry, facial expression, beautiful smile, noisy, silence.
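The same word-to-video mapping can be represented as an inverted index, which is what makes retrieving all videos that contain a particular spoken word immediate. The snippet below builds such an index from a few of the entries in Table 3.3; the video identifiers are the V1-V13 labels used in the tables.

```python
from collections import defaultdict

# Word lists for the videos that contain speech (from Table 3.3).
words_in_video = {
    "V3": ["scanner", "medical", "receiver", "patients", "x-ray"],
    "V4": ["hydrogen", "radio frequency", "magnetic"],
    "V5": ["tissue", "bodies", "bone fractures"],
    "V10": ["geometry", "video", "facial expression", "beautiful smile"],
}

# Invert the mapping: word -> set of videos containing that word.
index = defaultdict(set)
for video, words in words_in_video.items():
    for word in words:
        index[word].add(video)

def videos_containing(word):
    """All videos whose audio track contains the given spoken word."""
    return sorted(index.get(word.lower(), set()))

print(videos_containing("Magnetic"))   # -> ['V4']
```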

Table 3.5. Sample speech waveforms for the words scanner, medical, receiver, patients, X-ray, bodies, noisy and cross-sectional.

Fig. 3.1. Cepstrum values for words.

Figure 3.1 presents the cepstrum values for each word. The x-axis represents the eight cepstrum coefficients and the y-axis represents the magnitude of the cepstrum values. It can be noted that the cepstrum values differ from word to word, although in some cases one of the cepstrum values is very close for the words shown. The cepstrum values for the noise are much higher than those of the remaining words.
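A minimal sketch of how such cepstrum values can be computed from a word segment is given below, using the real cepstrum (the inverse FFT of the log magnitude spectrum) truncated to eight coefficients; whether the thesis uses exactly this variant is not stated, so treat it as an assumption. Words can then be compared through the distance between their cepstrum vectors.

```python
import numpy as np

def real_cepstrum(signal, n_coeffs=8):
    """Real cepstrum of a word segment: inverse FFT of the log magnitude
    spectrum, keeping the first `n_coeffs` values (the eight values
    plotted along the x-axis of Figure 3.1)."""
    spectrum = np.fft.fft(signal)
    log_mag = np.log(np.abs(spectrum) + 1e-10)     # small epsilon avoids log(0)
    cepstrum = np.fft.ifft(log_mag).real
    return cepstrum[:n_coeffs]
```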

SUMMARY

This chapter has presented the different feature representations for image, text and audio. These features are generated as part of the indexing procedure when a video is submitted to the internet resource. During the query process, depending upon the requirements, these features are extracted and the retrieval process is carried out. Chapter 4 presents the training and testing of the ANN algorithms for video retrieval.
