Semantic Video Indexing and Summarization Using Subtitles

Haoran Yi, Deepu Rajan, and Liang-Tien Chia
Center for Multimedia and Network Technology, School of Computer Engineering
Nanyang Technological University, Singapore 639798
{pg03763623, asdrajan, asltchia}@ntu.edu.sg

Abstract. Building a semantic index for multimedia data is an important and challenging problem for multimedia information systems. In this paper, we present a novel approach to building a semantic index for digital videos by analyzing the subtitle files of DVD/DivX videos. The proposed approach consists of three stages: script extraction, script partitioning and script vector representation. First, the scripts are extracted from the subtitle files that accompany DVD/DivX videos. Then, the extracted scripts are partitioned into segments. Finally, the partitioned script segments are converted into a tfidf vector representation, which acts as the semantic index. The usefulness of the semantic index is demonstrated through video retrieval and summarization applications. Experimental results show that the proposed approach is very promising.

Keywords: Subtitles, retrieval, video summarization, script vector.

1 Introduction

As the size of multimedia databases increases, it becomes critical to develop methods for the efficient and effective management and analysis of such data. The data include documents, audio-visual presentations, home-made videos and professionally created content such as sitcoms, TV shows and movies. Movies and TV shows constitute a large portion of the entertainment industry. Every year around 4,500 motion pictures are released around the world, spanning approximately 9,000 hours of video [8]. With the development of digital video and networking technology, more and more multimedia content is being delivered live or on demand over the Internet. Such a vast amount of content calls for efficient and effective methods to analyze, index and organize multimedia data.

Most of the previous methods for video analysis and indexing are based on low-level visual or motion information, such as color histograms [6] or motion activity [7]. However, when humans deal with multimedia data, they prefer to describe, query and browse its content in terms of semantic keywords rather than low-level features. Thus, extracting semantic information from digital multimedia is a very important, albeit challenging, task. The most popular method for extracting semantic information is to combine human annotation with machine learning [3]. But such methods are semi-automatic and complex, because the initial training set needs to be labelled by humans and the learned classifiers may also need to be tuned for different videos. The subtitle files of a video provide direct access to the semantic aspect of the video content, because the semantic information is captured very well in the subtitles. Thus, it seems prudent to exploit this fact to extract semantic information from videos, instead of developing complex video processing algorithms.

In this paper, we provide a new approach to building a semantic index for video content by analyzing the subtitle file. The approach is illustrated in Figure 1(a). First, the scripts with time stamps are extracted from the subtitle file associated with the video. Such subtitle files are available in all DVD/DivX videos. The second step is to partition the scripts into segments. Each segment of the script is then converted into a vector-based representation, which is used as the semantic index. The vector-based indexes can be used for retrieval and summarization.

[Figure 1 appears here. Panel (a) shows the proposed pipeline: Subtitle File -> Script Extraction -> Script Partitioning -> Script Vector Representation -> Retrieval / Clustering and Summarization. Panel (b) shows an excerpt of a script file:

14
00:00:26,465 --> 00:00:28,368
You didn't tell anybody I was, did you?

15
00:00:28,368 --> 00:00:29,682
No.

16
00:00:30,810 --> 00:00:32,510
I'll be right back.

17
00:00:33,347 --> 00:00:35,824
Now, why don't we get a shot of just Monica and the bloody soldier.]

Fig. 1. (a) The proposed approach to building a semantic index. (b) Example of a script file.

The organization of the paper is as follows: Section 2 describes in detail the process of building a semantic index from a script file. Section 3 describes two applications of the script vector representation, retrieval and summarization. Section 4 presents the experimental results, and concluding remarks are given in Section 5.

2 Semantic Video Indexing

In this section, we describe in detail the three stages of the proposed technique for indexing video sequences based on semantic information extracted from the script file of a DVD/DivX video. The stages are script extraction, script partitioning and script-to-vector mapping.

2.1 Script Extraction

DVD/DivX videos come with separate subtitle or script files for the video sequence. There are two types of subtitle files: one in which the scripts are recorded as bitmap pictures that are drawn directly on the screen when the video plays, and another in which the scripts are recorded as strings in a text file. Text-based subtitle files are much smaller and more flexible than bitmap-based ones. The advantage of a text-based subtitle file is that it is not only human-readable, but also lets the user easily change the appearance of the displayed text. Moreover, bitmap subtitles can be converted to text using readily available software such as VOBSUB [1]. Hence, we focus on script extraction from text-based subtitle files.

An example of a text-based subtitle file is shown in Figure 1(b). Each script in the file consists of an index, the times of appearance and disappearance of the script with respect to the beginning of the video, and the text of the script. The subtitle file is parsed into ScriptElements, where each ScriptElement has three attributes: Start Time, End Time and Text. We use the information in the ScriptElements to partition them in the next step.

2.2 Script Partitioning

The objective of script partitioning is to group together those ScriptElements that have a common semantic thread running through them. Naturally, it is the temporally adjacent ScriptElements that are grouped together, because they tend to convey a semantic notion when read together. At the same time, some ScriptElements may contain only a few words, which by themselves do not convey any semantic meaning. This leads to the question of how to determine which ScriptElements should be grouped together to create a partitioning of the entire script.

We use the time gap between ScriptElements as the cue for script partitioning. This time gap, which we call the ScriptElement gap, is defined as the gap between the End Time of the previous ScriptElement and the Start Time of the current ScriptElement. In a video, when there is a dialogue or a long narration that extends over several frames, the ScriptElement gap is very small. At the same time, ScriptElements that constitute an extended narration will also have a high semantic correlation among themselves. Hence, the ScriptElement gap is a useful parameter by which to group together semantically related ScriptElements, thereby creating a partition of the scripts. In the proposed method, the ScriptElements are partitioned by thresholding the ScriptElement gap. We call each resulting partition a ScriptSegment.
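To make the first two stages concrete, here is a minimal sketch of parsing an SRT-style subtitle file into ScriptElements and partitioning them by the gap threshold. It assumes the index/timestamp/text layout of Figure 1(b); the names ScriptElement, parse_srt and partition are illustrative, not from the paper.

```python
import re
from dataclasses import dataclass

@dataclass
class ScriptElement:
    start: float  # Start Time, in seconds from the beginning of the video
    end: float    # End Time
    text: str

def _to_seconds(ts: str) -> float:
    # "00:00:26,465" -> 26.465
    h, m, s = ts.replace(",", ".").split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

def parse_srt(content: str) -> list[ScriptElement]:
    elements = []
    for block in re.split(r"\n\s*\n", content.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip malformed blocks
        start, _, end = lines[1].partition(" --> ")
        elements.append(ScriptElement(_to_seconds(start),
                                      _to_seconds(end),
                                      " ".join(lines[2:])))
    return elements

def partition(elements: list[ScriptElement], gap: float = 2.0) -> list[list[ScriptElement]]:
    # Start a new ScriptSegment whenever the ScriptElement gap (previous
    # End Time to current Start Time) exceeds the threshold.
    segments = [[elements[0]]]
    for prev, cur in zip(elements, elements[1:]):
        if cur.start - prev.end > gap:
            segments.append([])
        segments[-1].append(cur)
    return segments
```

The default threshold of 2 seconds matches the value used in the experiments of Section 4.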

2.3 Script Vector Representation

After partitioning the scripts into segments, we build an index for each script segment. We adopt the term frequency-inverse document frequency (tfidf) vector space model [4], which is widely used in information retrieval, as the semantic index for the segments. The first step is the removal of stop words, e.g., "about", "I", etc. The Porter stemming algorithm [5] is then used to obtain the stem of each word; e.g., the stem of the word "families" is "family". The stems are collected into a dictionary, which is then used to construct the script vector for each segment. Just as the vector space model represents a document with a single column vector, we represent a script segment using the tfidf function [2] given by

    \mathrm{tfidf}(t_k, d_j) = \#(t_k, d_j) \cdot \log \frac{|S_s|}{\#S_s(t_k)}    (1)

where \#(t_k, d_j) denotes the number of times the word t_k occurs in segment d_j, |S_s| is the cardinality of the set S_s of all segments, and \#S_s(t_k) denotes the number of segments in which the word t_k occurs. This function states that (a) the more often a term occurs in a segment, the more representative it is of the segment's content, and (b) the more segments a term occurs in, the less discriminating it is.

The tfidf values for a particular segment are converted into a set of normalized weights for each word belonging to the segment according to

    w_{k,j} = \frac{\mathrm{tfidf}(t_k, d_j)}{\sqrt{\sum_{i=1}^{T} \mathrm{tfidf}(t_i, d_j)^2}}    (2)

Here, w_{k,j} is the weight of the word t_k in segment d_j and T is the total number of words in the dictionary. This ensures that every segment extracted from the subtitle file is represented by a vector of the same length and that the weights lie in [0, 1]. These weights are collected into a vector for a particular segment, and this vector acts as the semantic index for that segment. We call this vector the tfidf vector in the following discussion.
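As an illustration, the weighting of equations (1) and (2) can be computed directly. This sketch assumes stop-word removal and stemming have already been applied, and the helper names build_vocabulary and tfidf_matrix are ours, not the paper's.

```python
import math
from collections import Counter

def build_vocabulary(segments: list[list[str]]) -> dict[str, int]:
    # segments: each segment is a list of stemmed, stop-word-filtered words
    return {term: i for i, term in enumerate(sorted({w for seg in segments for w in seg}))}

def tfidf_matrix(segments: list[list[str]], vocab: dict[str, int]) -> list[list[float]]:
    n_seg = len(segments)
    # #S_s(t_k): number of segments containing each term
    df = Counter(term for seg in segments for term in set(seg))
    matrix = []
    for seg in segments:
        tf = Counter(seg)
        vec = [0.0] * len(vocab)
        for term, count in tf.items():
            vec[vocab[term]] = count * math.log(n_seg / df[term])  # equation (1)
        norm = math.sqrt(sum(v * v for v in vec))
        matrix.append([v / norm if norm else 0.0 for v in vec])    # equation (2)
    return matrix
```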

3 Applications

We now illustrate two applications of the semantic index extracted from the script files of DVD/DivX videos using the proposed method. In the first application, we retrieve a video sequence using a keyword or a sentence as the query. The second application is video summarization, wherein the semantic index is used to create a summary of the entire video; the summary can be expressed as a set of keywords or as a video.

3.1 Video Retrieval

In this subsection, we illustrate video retrieval with the script vector representation acting as a semantic index. As described in the previous section, each script segment is represented as a tfidf vector. We collect all the column script vectors into a matrix of order T x |S_s|, called the script matrix. The query can be in the form of a single word, in which case the query vector (which has the same dimensions as a tfidf vector) consists of a single nonzero element. For example, a query with the word "bride" results in a query vector like [0, ..., 1, ..., 0], where only the entry corresponding to the word "bride" is set to 1. The query can also take the form of a sentence like "The bride and groom are dancing"; here the query vector looks like [0, ..., 1/\sqrt{3}, ..., 1/\sqrt{3}, ...], with one entry of 1/\sqrt{3} for each query word. As we see, the words that are present in the query have higher values in the query vector.

The result of the querying process is the return of script segments that are geometrically close to the query vector; here, we use the cosine of the angle between the query vector and the columns of the script matrix,

    \cos\theta_j = \frac{a_j^T q}{\|a_j\|_2 \|q\|_2} = \frac{\sum_{i=1}^{T} a_{ij} q_i}{\sqrt{\sum_{i=1}^{T} a_{ij}^2}\,\sqrt{\sum_{i=1}^{T} q_i^2}}    (3)

for j = 1, ..., |S_s|, where a_j is a column vector of the script matrix, q is the query vector and T is the number of words. Those script vectors for which equation (3) exceeds a certain threshold are considered relevant. Alternatively, we can sort the values of \cos\theta_j and present the top n results. We could also use other similarity/distance measures, such as the norm of the difference between the query vector and the script vector; since, for normalized vectors, the two measures are monotonically related, they yield the same ranking. In both cases, we have to normalize the vectors.

We observe that the sparsity of the vectors, especially the query vector, is a key feature of the model. Consider what happens when we compute the similarity of a very sparse query vector with a dense script vector. To compute the Euclidean distance between such vectors, we would have to subtract each entry of the query from each entry of the script vector, and then square and add each difference. Even precomputing the norms of the script vectors is not feasible, since it is computationally expensive and, furthermore, large storage would be required to hold the values when dealing with videos with thousands of script segments. Using cosines, however, we can take advantage of the sparsity of the query vector and compute only those products (in the numerator of the equation) for which the query entry is nonzero. The number of additions is then also limited. The time saved by exploiting sparsity is significant when searching through long videos.

Another observation about the script matrix is that it is very sparse, because most of its elements are zero. At the same time, the dimensionality of the script vectors is very high. Hence, it is desirable to reduce the rank of the matrix. This is viable because, if we assume that the most representative words in the script matrix are present in many basis vectors, then deleting a few basis vectors removes only the least important information in the script matrix, resulting in a more effective index and search. We use the Singular Value Decomposition (SVD) to reduce the rank of the script matrix. SVD factors the T x |S_s| script matrix A into three matrices: (i) a T x T orthogonal matrix U with the left singular vectors of A in its columns, (ii) an |S_s| x |S_s| orthogonal matrix V with the right singular vectors of A as its columns, and (iii) a T x |S_s| diagonal matrix E with the singular values in descending order along its diagonal, i.e., A = U E V^T. If we retain only the k largest singular values in the diagonal matrix E, we get the rank-k matrix A_k, which is the best approximation of the original matrix A in terms of the Frobenius norm [4]. Hence,

    \|A - A_k\|_F = \min_{\mathrm{rank}(X) \le k} \|A - X\|_F = \sqrt{\sigma_{k+1}^2 + \cdots + \sigma_{r_A}^2}    (4)

where r_A = rank(A). Here A_k = U_k E_k V_k^T, where U_k is a T x k matrix, V_k is an |S_s| x k matrix, and E_k is a k x k diagonal matrix whose elements are the k largest singular values of A in order. The \sigma's in equation (4) are the singular values, i.e., the diagonal entries of E. Using the rank-k approximation of the script matrix given by the SVD, we can recompute equation (3) as [4]

    \cos\theta_j = \frac{(A_k e_j)^T q}{\|A_k e_j\|_2 \|q\|_2} = \frac{(U_k E_k V_k^T e_j)^T q}{\|U_k E_k V_k^T e_j\|_2 \|q\|_2} = \frac{e_j^T V_k E_k (U_k^T q)}{\|E_k V_k^T e_j\|_2 \|q\|_2}    (5)

    \cos\theta_j = \frac{s_j^T (U_k^T q)}{\|s_j\|_2 \|q\|_2}, \quad j = 1, \ldots, |S_s|    (6)

where s_j = E_k V_k^T e_j and e_j is the j-th canonical basis vector of dimension |S_s|. The SVD factorization of the script matrix not only helps to reduce the noise in the script matrix, but also improves the recall rate of retrieval.
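To make the retrieval computation concrete, here is a minimal NumPy sketch of equations (3), (5) and (6). It assumes the script matrix A (T x |S_s|) and the query vector q are dense arrays; the function names are ours, and a production system would exploit sparsity as discussed above.

```python
import numpy as np

def cosine_scores(A: np.ndarray, q: np.ndarray) -> np.ndarray:
    # Equation (3): cosine of the angle between q and every column of A.
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(q)
    return (A.T @ q) / np.where(denom == 0.0, 1.0, denom)

def reduced_cosine_scores(A: np.ndarray, q: np.ndarray, k: int) -> np.ndarray:
    # Equations (5)-(6): score against the rank-k approximation
    # A_k = U_k E_k V_k^T without ever forming A_k explicitly.
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
    S = Vt[:k, :].T * sigma[:k]   # row j is s_j^T = (E_k V_k^T e_j)^T
    proj_q = U[:, :k].T @ q       # U_k^T q
    denom = np.linalg.norm(S, axis=1) * np.linalg.norm(q)
    return (S @ proj_q) / np.where(denom == 0.0, 1.0, denom)

# Usage sketch: a one-word query, as in the "bride" example.
# q = np.zeros(A.shape[0]); q[vocab["bride"]] = 1.0
# top5 = np.argsort(reduced_cosine_scores(A, q, k=50))[::-1][:5]
```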

3.2 Summarization

In this subsection, we propose a new video summarization method based on the script matrix. Recall that the columns of the script matrix are the script vectors of the script segments. The script vectors can be viewed as points in a high-dimensional vector space. Principal Component Analysis (PCA) is used to reduce the dimensionality of the script vectors; the PCA used here has the same effect on the script vector representation as the SVD used in the retrieval application. The script vectors are then clustered in the reduced space using the K-means algorithm. After clustering, the script segments whose script vectors are geometrically closest to the centroids of the clusters are concatenated to form the video summary. In addition, the script text of the selected segments can be used as a text abstract of the video. The number of clusters is determined by the desired length of the video summary; e.g., if the desired length of the summary is 5% of the original video, then the number of clusters should be one twentieth of the total number of script segments.
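The paper does not name an implementation; as a sketch under that caveat, scikit-learn's PCA and KMeans can reproduce the summarization pipeline: reduce the script vectors, cluster them, and keep the segment closest to each centroid.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def summarize(script_vectors: np.ndarray, summary_fraction: float = 0.10) -> list[int]:
    # script_vectors: |S_s| x T array, one tfidf vector per segment.
    n_segments = len(script_vectors)
    # Number of clusters follows the desired summary length, e.g. 5% -> n/20.
    n_clusters = max(1, round(n_segments * summary_fraction))
    reduced = PCA(n_components=min(50, *script_vectors.shape)).fit_transform(script_vectors)
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(reduced)
    # For each cluster, keep the segment geometrically closest to the centroid.
    chosen = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(reduced[members] - km.cluster_centers_[c], axis=1)
        chosen.append(int(members[np.argmin(dists)]))
    return sorted(chosen)  # concatenate these segments in temporal order
```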

4 Experimental Results

In this section, we present experimental results to demonstrate the efficacy of the proposed semantic indexing method. Our test data consist of a single episode of the popular TV sitcom Friends (season 8, episode 1). Figure 2(a) shows the distribution of the time gaps between ScriptElements for a total of 450 ScriptElements. In our implementation, we use 2 seconds as the threshold for partitioning the scripts into segments. With this threshold, the 450 ScriptElements are partitioned into 71 script segments. For each script segment, a script vector is extracted from the text as described in subsection 2.3.

[Figure 2 appears here. Panel (a) plots the gap length in seconds against the script start time in seconds; panel (b) plots the energy ratio against the rank of the reduced representation.]

Fig. 2. (a) Time gap between the ScriptElements of the Friends video. (b) Energy ratio vs. dimension of the script vector reduced with PCA.

Several queries by keyword were performed on the script matrix to retrieve the corresponding script segments, as described in subsection 3.1. The retrieval results for the keywords "bride", "dance", "groom" and "wedding" are shown in Figures 3(a), (b), (c) and (d), respectively. As can be seen, the proposed method successfully retrieves the relevant scripts as well as the associated video sequences. Thus, a high-level semantic notion like "bride" can be easily modelled using the technique described in this paper.

(a) "bride" query:
Script 1: Well then, why don't we see the bride and the groom and the bridesmaids.
Script 2: we'll get Chandler and the bridesmaids.
Script 3: How about just the bridesmaids?

(b) "dance" query:
Script 1: You can dance with her first.
Script 2: You ready to get back on the dance floor?
Script 3: embarassed to be seen on the dance floor with some
Script 4: So, I'm gonna dance on my wedding night with my husband.

(c) "groom" query:
Script 1: Well then, why don't we see the bride and the groom and the bridesmaids.
Script 2: You know I'm the groom, right?

(d) "wedding" query:
Script 1: Sure! But come on! As big as your wedding?
Script 2: Come on! It's my wedding!
Script 3: So, I'm gonna dance on my wedding night with my husband.

Fig. 3. Example retrieval results: (a) "bride" query, (b) "dance" query, (c) "groom" query, (d) "wedding" query.

In order to illustrate the results of video summarization, we use Principal Component Analysis (PCA) to reduce the script vectors from 454 to 50 dimensions. Figure 2(b) plots the percentage of the total energy of the script vectors retained as their dimensionality is reduced with PCA. The first 50 dimensions capture more than 98% of the total energy. We extract ten keywords from the five principal components with the largest eigenvalues: we examine the absolute value of each entry of these principal component vectors and pick the two largest entries of each as keywords. The ten extracted keywords are "happy", "Chandler", "marry", "Joey", "Monica", "bride", "husband", "pregnant", "baby" and "dance". This episode is about two friends, Chandler and Monica, getting married, Rachel being pregnant (with a "baby"), and Ross dancing with the bridesmaids at the wedding party. The extracted keywords thus give a good summary of the content of this video. We also extracted video summaries with lengths of 5%, 10% and 20% of the original video. We find that the video summaries capture most of the content of the video, and we observe that the 10% summary is the optimal one: while the 5% summary is too concise and a little difficult to understand, the 20% summary contains quite a few redundancies. (The resulting summary videos are available at ftp://user:123456@155.69.103.62/.)

5 Conclusion and Future Work

In this paper, we present a new approach to the semantic video indexing problem. The semantic video index is extracted by analyzing the subtitle file of a DVD/DivX video and is represented by a vector-based model. Experimental results on video retrieval and summarization demonstrate the effectiveness of the proposed approach. In future work, we will consider other information retrieval models and incorporate the extracted video summary into the MPEG-7 standard representation.

References

1. http://www.doom9.org/dvobsub.htm
2. M. W. Berry, Z. Drmač, and E. R. Jessup. Matrices, vector spaces, and information retrieval. SIAM Review, 41(2):335-362, June 1999.
3. C.-Y. Lin, B. L. Tseng, and J. R. Smith. VideoAnnEx: IBM MPEG-7 annotation tool for multimedia indexing and concept learning. In IEEE International Conference on Multimedia & Expo, Baltimore, USA, July 2003.
4. M. W. Berry, S. T. Dumais, and G. W. O'Brien. Using linear algebra for intelligent information retrieval. SIAM Review, 37:301-328, 1995.
5. M. F. Porter. An algorithm for suffix stripping. Program, 14(3):130-137, July 1980.
6. S. Smoliar and H. Zhang. Content-based video indexing and retrieval. IEEE Multimedia, 1:62-72, 1994.
7. X. Sun, B. S. Manjunath, and A. Divakaran. Representation of motion activity in hierarchical levels for video indexing and filtering. In IEEE International Conference on Image Processing, volume 1, pages 149-152, September 2002.
8. H. D. Wactlar. The challenges of continuous capture, contemporaneous analysis and customized summarization of video content. Carnegie Mellon University.