Video Aesthetic Quality Assessment by Temporal Integration of Photo- and Motion-Based Features
Wei-Ta Chu
H.-H. Yeh, C.-Y. Yang, M.-S. Lee, and C.-S. Chen, "Video Aesthetic Quality Assessment by Temporal Integration of Photo- and Motion-Based Features," IEEE Transactions on Multimedia, vol. 15, no. 8, pp. 1944-1957, 2013.
Introduction
Six factors contribute to visual aesthetic impression: color, form (shape), motion, spatial layout, depth, and the human body. Goal: design an automatic video aesthetic quality (VAQ) assessor by integrating aesthetic features in the temporal domain.
Overview of the Approach
Let $\{(V_i, y_i)\}_{i=1}^{n}$ be a set of videos, where each video $V_i$ is associated with a label $y_i$ representing its aesthetic value: 1 for appealing and 0 for unappealing. A video is divided into several nonoverlapping snippets, each containing the consecutive frames within one second (a sketch follows). Two-stage approach:
1. Compute snippet-based features (SBFs).
2. Apply temporal integration to combine the SBFs.
Finally, perform binary classification with an SVM.
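A minimal sketch of the snippet segmentation, assuming OpenCV for decoding; the function name is ours and the one-second snippet length follows the slide:

```python
import cv2

def split_into_snippets(video_path):
    """Split a video into nonoverlapping one-second snippets (lists of frames)."""
    cap = cv2.VideoCapture(video_path)
    fps = int(round(cap.get(cv2.CAP_PROP_FPS)))
    snippets, current = [], []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        current.append(frame)
        if len(current) == fps:      # one second of frames collected
            snippets.append(current)
            current = []
    cap.release()
    return snippets                  # any trailing partial second is dropped
```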
Preprocessing for Motion-Related Features
Suppose each video is divided into $k$ snippets and each snippet is composed of $T$ frames.
- Motion-extraction preprocessing: use optical flow techniques to perform pixel-based motion estimation.
- Saliency-based preprocessing: use a video-based saliency region detection method.
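The paper only states that an optical-flow technique is applied; as one concrete choice, a dense-flow sketch using OpenCV's Farneback method:

```python
import cv2

def dense_flow(prev_gray, curr_gray):
    """Pixel-based motion estimation between two consecutive grayscale frames."""
    flow = cv2.calcOpticalFlowFarneback(
        prev_gray, curr_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow  # H x W x 2 array of per-pixel (dx, dy) displacements
```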
Preprocessing for Motion-Related Features
We binarize the video-based saliency map obtained for each frame, then extract the largest connected component of the binarized map and refer to it as the foreground region.
Motion-related features:
- Motion space
- Motion direction entropy
- Hand-shaking
- Shooting type
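A sketch of the foreground extraction, assuming OpenCV; the binarization threshold is our assumption, as the slide does not specify one:

```python
import cv2
import numpy as np

def foreground_region(saliency_map, thresh=0.5):
    """Binarize a saliency map and keep its largest connected component."""
    binary = (saliency_map >= thresh).astype(np.uint8)  # thresh is assumed
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(binary)
    if n <= 1:                       # no foreground component found
        return None, None
    # stats[0] is the background; pick the largest remaining component.
    largest = 1 + int(np.argmax(stats[1:, cv2.CC_STAT_AREA]))
    mask = (labels == largest)
    return mask, centroids[largest]  # boolean mask and (x, y) center
```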
Motion-Related Features
Motion Space (MS). When videotaping a moving subject, the cameraman should reserve space in front of the subject's moving direction (lead room) for better VAQ. Let $\mathbf{v}$ be the direction of the averaged optical flow of the entire image and $\mathbf{d}$ the vector between the foreground center and the image center. The MS feature measures the agreement between the two, essentially the cosine of the angle between them: $f_{\mathrm{MS}} = \frac{\mathbf{v} \cdot \mathbf{d}}{\|\mathbf{v}\|\,\|\mathbf{d}\|}$.
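A sketch of this feature under the cosine reading above; the exact normalization in the paper may differ:

```python
import numpy as np

def motion_space(flow, fg_center):
    """Motion-space feature: cosine between the averaged optical-flow
    direction and the foreground-to-image-center vector."""
    h, w = flow.shape[:2]
    v = flow.reshape(-1, 2).mean(axis=0)              # averaged optical flow
    d = np.array([w / 2.0, h / 2.0]) - np.asarray(fg_center)
    denom = np.linalg.norm(v) * np.linalg.norm(d)
    return float(v @ d / denom) if denom > 1e-8 else 0.0
```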
Motion-Related Features
Motion Direction Entropy (MDE). Compute the velocity direction of each pixel and vote it into one of five predefined bins: upward, rightward, downward, leftward, and quasi-steady. A motion histogram $h = (h_1, \dots, h_5)$ is formed, where $h_i$ is the fraction of pixels falling into the $i$th bin, and the feature is the entropy $f_{\mathrm{MDE}} = -\sum_{i=1}^{5} h_i \log h_i$.
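A sketch of MDE; the magnitude threshold for the quasi-steady bin is an assumed parameter, not a value from the paper:

```python
import numpy as np

def motion_direction_entropy(flow, steady_thresh=0.5):
    """Entropy of the 5-bin motion-direction histogram
    (up / right / down / left / quasi-steady)."""
    dx, dy = flow[..., 0].ravel(), flow[..., 1].ravel()
    steady = np.hypot(dx, dy) < steady_thresh     # threshold is assumed
    # Quantize moving pixels into 4 directions by the dominant axis and sign.
    horiz = np.abs(dx) >= np.abs(dy)
    bins = np.where(horiz, np.where(dx >= 0, 0, 1), np.where(dy >= 0, 2, 3))
    bins = np.where(steady, 4, bins)
    h = np.bincount(bins, minlength=5).astype(float)
    h /= h.sum()
    nz = h[h > 0]                                 # skip empty bins in the log
    return float(-(nz * np.log(nz)).sum())
```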
Motion-Related Features
Hand Shaking (HS). The shaking detection area is set at the frame border, which better distinguishes a subject's self-motion from camera hand-shaking. The horizontal motion $u_j^x$ of the border region in frame $j$ is defined as the average of the horizontal projections of the optical flow in that region, and its direction is $\mathrm{sgn}(u_j^x)$. The horizontal unstableness feature measures how often this direction reverses across consecutive frames, e.g., $f_{\mathrm{HS}}^{x} = \frac{1}{T-1} \sum_{j=2}^{T} \mathbb{1}\big[\mathrm{sgn}(u_j^x) \ne \mathrm{sgn}(u_{j-1}^x)\big]$. The vertical unstableness $f_{\mathrm{HS}}^{y}$ is defined similarly from the vertical projections.
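A sketch of the unstableness measure; counting sign flips is one plausible reading of the definition, and the paper's exact formula may differ:

```python
import numpy as np

def unstableness(border_motions):
    """Fraction of frame-to-frame direction reversals in the border motion.

    border_motions: 1-D array whose j-th entry is the average horizontal
    (or vertical) optical-flow component in the border region of frame j.
    """
    signs = np.sign(np.asarray(border_motions, dtype=float))
    flips = signs[1:] * signs[:-1] < 0     # True where the direction reverses
    return float(flips.mean())
```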
Motion-Related Features
Shooting Type (ST). Define the border-to-central unstableness ratio for horizontal motion, $r^x = u^x_b / u^x_c$, where $u^x_b$ is the horizontal motion of the frame border and $u^x_c$ is the horizontal motion of the frame excluding the border. Three cases:
- Border moving faster than the center: likely a tracking shot that follows and focuses on the subject.
- Border moving at about the same speed as the center: likely a panorama (panning) shot.
- Border moving slower than the center: likely a static shot of a deformable (moving) subject.
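A sketch of the three-way classification; the border width and the ratio tolerance are assumed parameters for illustration only:

```python
import numpy as np

def shooting_type(flow, border=0.1, tol=0.25):
    """Classify a frame's shooting type from the border/central motion ratio."""
    h, w = flow.shape[:2]
    bh, bw = int(h * border), int(w * border)    # border width is assumed
    mask = np.zeros((h, w), dtype=bool)
    mask[:bh, :] = True; mask[-bh:, :] = True
    mask[:, :bw] = True; mask[:, -bw:] = True
    mag = np.hypot(flow[..., 0], flow[..., 1])
    ratio = mag[mask].mean() / (mag[~mask].mean() + 1e-8)
    if ratio > 1 + tol:
        return "tracking"      # border faster: camera follows the subject
    if ratio < 1 - tol:
        return "static"        # border slower: static shot, moving subject
    return "panorama"          # comparable motion: panning shot
```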
Adoption of Photo-Based Features
- Color harmonization
- Color saturation and value
- Composition
- Lightness
Snippet-Based Features
The frame-based features are pooled over each video snippet. Six pooling operations: $\mathcal{P} = \{\min, \text{1st quartile}, \text{mean}, \text{median}, \text{3rd quartile}, \max\}$. Combining these six pooling operators with the 22 frame-based features gives a total of $22 \times 6 = 132$ SBFs available for selection. For example, the SBF $(f_2, \min)$ means that the snippet feature is built by first extracting $f_2$ (the motion direction entropy) from each frame and then pooling by $\min$ (taking the minimum over the snippet). We denote the Cartesian product of the feature set and the pooling-operator set as $\mathcal{F} \times \mathcal{P}$.
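A sketch of the pooling step; the dictionary keys and helper name are ours:

```python
import numpy as np

# The six pooling operators applied to a per-frame feature sequence.
POOLS = {
    "min":    np.min,
    "q1":     lambda x: np.percentile(x, 25),
    "mean":   np.mean,
    "median": np.median,
    "q3":     lambda x: np.percentile(x, 75),
    "max":    np.max,
}

def snippet_features(frame_features):
    """Pool per-frame features (frames x 22 array) into SBF values.

    Returns a dict keyed by (feature_index, pool_name): 22 features times
    6 pooling operators = 132 entries.
    """
    frame_features = np.asarray(frame_features)
    return {
        (i, name): float(pool(frame_features[:, i]))
        for i in range(frame_features.shape[1])
        for name, pool in POOLS.items()
    }
```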
Temporal-Variation-Aware Integration
How the SBFs change over time affects the audience's visual preference. We propose a temporal-variation-aware (TVA) method to integrate the SBFs: divide the video into non-overlapping segments and use the correlation between segments to perform the integration. Let the $k$ snippets in a video be divided into $m$ segments (groups of snippets); each segment contains $d$ successive snippets, so $k = md$. Once a snippet feature type has been selected from $\mathcal{F} \times \mathcal{P}$, each group can be characterized by a $d$-dimensional feature vector.
Temporal-Variation-Aware Integration
When the type of SBF is given, there are $m$ feature vectors $x_1, \dots, x_m$, where $x_i \in \mathbb{R}^d$. Let $\kappa(\cdot,\cdot)$ be a kernel function that measures similarity in some implicit mapping space. Each group is compared with the others via the kernel to reflect the $i$th group's relevance; by doing so, we obtain $m$ integration operators. The kernel is constructed using $\mathrm{sgn}(\cdot)$, the sign function.
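A minimal sketch of the idea, assuming an RBF kernel and using each segment's mean kernel similarity to all segments as its integration operator; the paper's exact operator construction, including its use of the sign function, is not reproduced here:

```python
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel between two d-dimensional segment vectors (our choice)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def tva_integrate(segments, gamma=1.0):
    """Integrate m segment vectors (m x d array) into m TVA values.

    The i-th output reflects the i-th segment's relevance, measured here as
    its average kernel similarity to all segments.
    """
    segments = np.asarray(segments, dtype=float)
    K = np.array([[rbf_kernel(a, b, gamma) for b in segments]
                  for a in segments])
    return K.mean(axis=1)           # one integrated value per segment
```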
Temporal-Variation-Aware Integration
We further include the $(m+1)$th and $(m+2)$th operators, which are respectively the mean and the standard deviation of the SBF values over the whole video. We have implemented three versions of the proposed method for comparison.
Selection of Key Dimensions
- The first version, AFSim, integrates the SBFs using only the two simple averaging operators, mean and standard deviation.
- The second version, AFTVA, integrates the SBFs using only the $m$ TVA operators.
- The third version, TVASim, allows all $m+2$ operators to be used for temporal integration.
The space of SBF/integration-operator combinations is very high-dimensional, so we employ a sequential forward-search method, a greedy approach that picks one combination at a time.
Selection of Key Dimensions
The selection procedure (a code sketch follows):
1. Find the SBF-operator pair that best classifies all videos.
2. Pick another SBF-operator pair that has the best accuracy jointly with the previously selected pairs.
3. Repeat Step 2 until the number of selections reaches a predefined number.
Combining the SBF $(f_1, \text{mean})$ with a temporal-integration operator means that we build the video-based feature by first extracting $f_1$ from each frame, pooling the frame features within each snippet by mean to obtain the SBF, and then applying the temporal operator to integrate the SBF values over the whole video.
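A sketch of the greedy forward search, assuming scikit-learn; SVC with default settings stands in for the paper's SVM configuration:

```python
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_select(X, y, n_select):
    """Greedy sequential forward search over candidate feature dimensions.

    X: videos x candidates matrix, where each column is one
    SBF/integration-operator combination; y: 0/1 aesthetic labels.
    """
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < n_select:
        def score(j):
            cols = selected + [j]   # candidate joined with previous picks
            return cross_val_score(SVC(), X[:, cols], y, cv=5).mean()
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```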
Experimental Results
Telefonica dataset: 160 rated videos, each cropped to a 15-second clip. 33 participants rated the videos, with rating values in $[-2, 2]$. We sort the mean opinion scores (MOSs) and choose the median value as a threshold to separate the 160 videos into positive (high-quality) and negative (low-quality) classes.
Experimental Results
Comparison of the designed features. We perform 5-fold cross-validation and repeat it 200 times to obtain the averaged accuracy. The assessment accuracies of AFSim and of the baseline MoorthySim are compared. Motion-based features such as ST and HS are chosen within the first three selection rounds.
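A sketch of the evaluation protocol, assuming scikit-learn; the fold-shuffling scheme is our assumption, and SVC again stands in for the paper's SVM:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def repeated_cv_accuracy(X, y, repeats=200):
    """5-fold cross-validation repeated 200 times with reshuffled folds."""
    scores = []
    for r in range(repeats):
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=r)
        scores.extend(cross_val_score(SVC(), X, y, cv=cv))
    return float(np.mean(scores))   # accuracy averaged over all runs
```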
Experimental Results
Comparison of temporal integrations. AFTVA and TVASim perform better than AFSim and MoorthySim. As the number of selections increases, AFTVA further boosts performance by continuing to find more discriminative SBFs; it turns more video-based features into distinctive clues.
Experimental Results
By considering both the TVA operators and the averaged long-term operators (i.e., mean and std), TVASim achieves even better performance. The top-ranked feature types in TVASim are ST, MS, and HS, showing again that motion-related features are highly effective in determining VAQ.