Exploring video content structure for hierarchical summarization


Multimedia Systems 10 (2004)
Digital Object Identifier (DOI) /s

Exploring video content structure for hierarchical summarization

Xingquan Zhu 1, Xindong Wu 1, Jianping Fan 2, Ahmed K. Elmagarmid 3, Walid G. Aref 3

1 Department of Computer Science, University of Vermont, Burlington, VT 05405, USA
2 Department of Computer Science, University of North Carolina, Charlotte, NC 28223, USA
3 Department of Computer Science, Purdue University, W. Lafayette, IN 47907, USA

Published online: 15 September 2004
© Springer-Verlag 2004

Abstract. In this paper, we propose a hierarchical video summarization strategy that explores video content structure to provide users with a scalable, multilevel video summary. First, video-shot-segmentation and keyframe-extraction algorithms are applied to parse video sequences into physical shots and discrete keyframes. Next, an affinity (self-correlation) matrix is constructed to merge visually similar shots into clusters (supergroups). Since high visual similarity between video shots does not necessarily imply that they belong to the same story unit, temporal information is then used to separate each supergroup into video groups of temporally adjacent shots (within a specified distance). A video-scene-detection algorithm is then proposed to merge temporally or spatially correlated video groups into scenario units. This is followed by a scene-clustering algorithm that eliminates visual redundancy among the units. A hierarchical video content structure with increasing granularity is constructed, from clustered scenes and video scenes through video groups to keyframes. Finally, we introduce a hierarchical video summarization scheme that executes various approaches at different levels of the video content hierarchy to statically or dynamically construct the video summary. Extensive experiments based on real-world videos have been performed to validate the effectiveness of the proposed approach.

Keywords: Hierarchical video summarization – Video content hierarchy – Video group detection – Video scene detection – Hierarchical clustering

Correspondence to: X. Zhu (xqzhu@cs.uvm.edu). This research has been supported by the NSF under grants EIA, IIS, EIA, and IIS, a grant from the state of Indiana 21st Century Fund, and by the U.S. Army Research Laboratory and the U.S. Army Research Office under grant DAAD.

1 Introduction

Owing to the decreased cost of storage devices, higher transmission rates, and improved compression techniques, digital videos are becoming available at an ever-increasing rate. However, the manner in which video content is presented for access (such as browsing and retrieval) has become a challenging task, both for video application systems and for users. Some approaches have been described in [1–3] for presenting the visual content of a video through hierarchical browsing, storyboard posters, etc. Users can quickly browse through a video sequence, navigate from one segment to another to rapidly get an overview of the video content, and zoom to different levels of detail to locate segments of interest. Research in the literature [3] has shown that, on average, there are about 200 shots in a 30-min video clip across various program types, such as news and drama. Assuming that one keyframe is selected to represent each shot, the presentation of 200 frames will impose a significant burden in terms of bandwidth and time.
Using spatially reduced images, commonly known as thumbnail images, one can reduce the size, but it will still be expensive if all shots must be shown for a quick browse of the content. Hence, a video summarization strategy is needed to provide users with a compact video digest that shows only parts of the video shots. Generally, a video summary is defined as a sequence of still or moving pictures (with or without audio) presenting the content of a video in such a way that the respective target group is rapidly provided with concise information about the content, while the essential message of the original is preserved [4]. Three kinds of video summary styles are commonly used:

A pictorial summary [3,7–13,18,19,25,30,31] is a collection of still images (possibly varying in size) arranged in time. The use of compact images provides the user with an overview of the video directly and efficiently; however, the motion and audio information of the video is lost.

A video skimming [14–16,21–24,30] is a collection of video clips (e.g., video shots) arranged in a time series. Compared with still pictorial summaries, video skimming is more attractive to users, since both motion and audio signals are preserved. However, the amount of time required for viewing a skimming suggests that skimmed video is not appropriate for quick browsing.

A data distribution map [20–22] is a synthetic picture that illustrates the distribution of some specific data in the database. Due to the inherent complexity of images and videos, using pictorial frames or video skimming to abstract a video database may be impractical. The data distribution map, therefore, appears to be an efficient mechanism for providing users with an overview of the whole dataset. Unfortunately, such a distribution map may not be as friendly and intuitive as pictorial frames or skimming, since synthetic points or curves may not make sense for naïve users.

A survey of video summarization is given in [5]. When users are provided with a compact video digest, they can obtain video content quickly and comprehensively. Moreover, the power of visual summaries can be helpful in many applications such as multimedia archives, video retrieval, home entertainment, and digital magazines. To achieve a suitable video summarization for the various applications above, we introduce a hierarchical video summarization strategy that uses pictorial frames to present the video summaries. In comparison with most approaches, which use only visual and temporal clustering, we integrate the video content structure and the visual redundancy among videos. Moreover, much emphasis has been put on scalable and hierarchical summarization. The final results consist of various levels of static or dynamic summaries that address the video content at different granularities (Sect. 2, Definition 1).

The rest of the paper is organized as follows. In the next section, we review related work on video summarization. Our overall system architecture is presented in Sect. 3. In Sect. 4, we introduce techniques for video content hierarchy detection that are used to construct the video content structure from keyframes, groups, and scenes to clustered scenes. With the constructed content structure as a basis, in Sect. 5 we discuss a hierarchical video summarization scheme that uses a hierarchical clustering algorithm to construct video summaries at different levels. In addition to supporting the passive browsing of statically constructed video summaries, a dynamic summarization scheme has also been integrated to allow users to specify their preferred summary length. In Sect. 6, the effectiveness of the proposed approach is validated using real-world videos. Concluding remarks are given in Sect. 7.

2 Related work

In practical terms, video summarization is an inseparable research area for many video applications, including video indexing, browsing, and retrieval [1,2,29]. The earliest research on video summarization is perhaps best exemplified by Video Magnifier [6], where video summaries were created from a series of low-resolution frames that were uniformly sampled from the original video sequence at certain time intervals. The obvious drawback of such an arrangement is the loss of content in short but important segments. Accordingly, various research efforts have addressed the generation of video overviews using more suitable approaches [7–25]. Basically, the quality of a generated video summary is influenced by the following factors:

1. How the selected summary units are visualized.
2. What kinds of units are selected to generate the summary.
3. Whether the summary is static or dynamic (Definitions 1 and 2 below).

Since most research assumes the video summary is generated for efficient browsing, indexing, and navigation, the pictorial summary is popularly used to visualize the video summary via icon frames [3,7,11–13,25,30,31], mosaic images [18,19,34], video objects [8–10], or a data distribution map [20–22].
A curve-simplification video-summarization strategy is introduced in [12], which maps each frame into a vector of high-dimensional features and segments the feature curve into units. The video summary is extracted according to the relationships between the units. In Video Manga [7], a pictorial video summary is presented with keyframes of various sizes, where the importance of each keyframe determines its size. Given a camera panning/tilting shot, however, none of the keyframes selected by the above strategies captures the underlying dynamics of the shot. Hence, the mosaic-based approach is utilized to generate a synthesized panoramic image that can represent the entire content in an intuitive manner [18,19,34]. Unfortunately, this approach works well only when specific camera motions are contained in the shot, and this critical drawback prevents it from being applied effectively to real-world videos. Instead of using features from the whole image for summarization, object-based strategies have been proposed [8–10]; however, given the challenges of object segmentation [17], it might be a long time before impressive results on general videos can be acquired.

Instead of abstracting videos with pictorial images, some approaches provide video skimming [14–16,21–24,30], which trims the original video into a short highlight stream. In [4,15], the most representative movie segments are extracted to produce a movie trailer. The approach in [24] applies different sampling rates to the video to generate a summary, where the sampling rate is controlled by the motion information in each shot. In addition to visual features, the strategies in [21–23] utilize closed-caption and speech signals to select segments of interest for skimming. Video skimming may be useful for some purposes, since a skimmed video stream is a more appealing browsing method for users. However, the amount of time required for viewing a skim suggests that it is not appropriate for a quick overview or for transmission.

Defining what kinds of units should be selected is the most challenging and subjective process in creating video summaries. Many algorithms assume that scenario units (such as story units) are more suitable for summarization [3,30,31,42] than other units, where video scenarios are detected using time-constrained clustering [3], speaker or subject change detection [30,31,42], etc. The summary is constructed using the representative frames (or shots) of the story units in the form of a storyboard or a skimming. A summarization scheme that integrates video scenarios outperforms most other strategies, since presenting scenarios in the summary may be the most effective way to tell the user what the video is about. Unfortunately, although many videos from our daily life have scenario evolution (i.e., it is possible to detect story units), a significant number of videos have no scenario structure, such as surveillance videos and sports videos. In these cases, a selection of video highlights might be more suitable for summarization. Generally, video highlights are selected by predefined rules, such as selecting frames with the highest contrast [4], close-up shots of leading actors, special events like explosions and gunfire [15], frames with faces, camera motions, or text [21–23], or by considering the time and data information of the video [14]. On the other hand, since domain knowledge may assist in acquiring semantic cues, some abstracting strategies have been designed for specific domains, such as home videos [14], stereoscopic videos [8], online presentation videos [16], and medical videos [25,30,42], where the rules and knowledge followed by the videos are used to analyze semantics [26–28]. However, another problem emerges: rather than abstracting the video, highlight strategies usually present selected important samples that address the video details but not the topic and theme; as a result, a glance at the constructed summary imparts very limited information about the video content. Moreover, since most highlight selection strategies are domain specific, they cannot easily be extended to other videos. Therefore, compared with scenario-based strategies, the generality of highlight-based strategies is limited.

For the generation of static or dynamic summaries, many approaches supply a dynamic scheme that integrates user specifications to generate a summary online. Usually, a dynamic strategy requires the user to directly or indirectly specify the summary parameters, e.g., the length of the summary [4,8,9,13–15,24] (number of summary frames or time duration) or the types of interesting features for determining the summary frames (e.g., slides, close-ups of people) [44]. In this way, the scalability of the summary is preserved, since each user is supplied with the summary he/she prefers. Unfortunately, a user's specification for the summary is not always reasonable for addressing video content, especially if the user is unfamiliar with the videos in the database. Conversely, the approaches in [3,7,25,30,42] construct static summaries beforehand and supply all users with the same summary (the one that has been optimized by the system). However, static methods have the drawback of depriving the user of scalable summaries, which may be desirable in some cases. A dynamic scheme maintains the scalability of the summary but may not find the best summary point for naive users, and a static scheme may supply users with an optimal point in generating the summary, but the scalability is lost. Obviously, neither is perfect, and a new approach that integrates both may provide a better solution.

Ideally, a video summary should briefly and concisely present the content of the input video source. It should be shorter than the original, focus on the content, and give users an appropriate overview of the whole video. However, the problem is that appropriateness varies from user to user, depending on the user's familiarity with the source and genre, and on a given user's particular goal in watching the summary. Hence, hierarchical summarization strategies [13,20,30,42] have been introduced to support various levels of summary and assist users in determining what is appropriate. In [13], a keyframe-based hierarchical strategy is presented, where keyframes are organized in a hierarchical manner from coarse to fine temporal resolution using a pairwise clustering method. Unfortunately, this strategy is unable to merge frames that are visually similar but temporally separated. By considering video content structure and domain knowledge, a multilevel static summary [30,42] has been constructed for medical video data.
This strategy, however, still limits the scalability of the summary by supplying only static summaries, and, moreover, it works only on medical video data.

All the reviews above support the following observations:

1. For efficient presentation and browsing purposes, a pictorial summary may outperform video skimming.
2. Since defining video highlights is a relatively subjective and domain-dependent process, if the video content structure is available, a scenario-based summarization strategy might be a better choice.
3. Due to the obvious problems in static and dynamic summarization, an efficient scheme should integrate the two strategies for maximum benefit.
4. To capture an appropriate summary for a video overview, a hierarchical strategy should be adopted.

Based on these observations, we propose a hierarchical video summarization strategy. We use pictorial keyframes to visualize the video content and introduce a hierarchical strategy to construct multilevel summaries, increasing in granularity from top to bottom. Moreover, instead of selecting video highlights, we use the semantically richer video units (scenes and groups) for summarization. In this paper, we concentrate on summarizing videos with content structures. For those having no story lines, a highlight selection scheme might be the best choice for summarization.

Before we explore video summarization techniques, it is worthwhile to define the terminology that will be used in our analysis.

Definition 1. A static summarization scheme is an approach that requires no user participation for summary generation. For all users of the same video, this strategy will generate the same summary, which is called a static summary in this paper.

Definition 2. A dynamic summarization scheme is an approach where user actions may result in differences in the generated summary. The summary generated by this scheme is called a dynamic summary in this paper.

Definition 3. The video content structure is defined as a hierarchy of clustered scenes, video scenes, video groups, and video shots (keyframes), increasing in granularity from top to bottom in addressing video content, as shown in Fig. 1. A content structure can be found in most videos in our daily life.

Definition 4. A video shot (denoted by S_i) is the basic element of a video; it records the frames resulting from a single continuous running of the camera, from the moment it is turned on to the moment it is turned off.

Definition 5. A keyframe (denoted by K_i) is a frame that is representative of the content of a given shot. Depending on the content variance of the shot, one or multiple keyframes can be extracted.

Definition 6. A video supergroup (denoted by SG_i) consists of visually similar shots within an entire video.

Definition 7. A video group (denoted by G_i) is an intermediate entity between physical shots and semantic scenes; it consists of visually similar shots that belong to the same story unit.

Definition 8. A video scene (denoted by SE_i) is a collection of semantically related and temporally adjacent groups depicting and conveying a high-level concept or story.

Definition 9. A clustered scene (denoted by CSE_i) is a collection of visually similar video scenes.
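To make Definitions 4–9 concrete, the hierarchy can be mirrored in a few simple container types. The Python sketch below is purely illustrative; the class and field names are ours and are not part of the original system:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Shot:               # Definition 4: a single continuous camera run
    shot_id: int          # temporal index order (ID) within the video
    keyframes: List[int] = field(default_factory=list)  # Definition 5: frame indices

@dataclass
class Group:              # Definition 7: visually similar, temporally close shots
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Scene:              # Definition 8: semantically related, adjacent groups
    groups: List[Group] = field(default_factory=list)

@dataclass
class ClusteredScene:     # Definition 9: visually similar scenes merged together
    scenes: List[Scene] = field(default_factory=list)
```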

Fig. 1. Pictorial video content structure: from video frames and video shots (keyframes), through video groups (visually similar shots within a certain temporal distance) and video scenes (temporally interlaced video groups), up to clustered scenes (visually similar scenes).

Fig. 2. System architecture: video content structure detection (shot segmentation and keyframe extraction, keyframe grouping into supergroups, temporally adjacent video group detection, scene detection, and scene clustering) feeds a hierarchical video summarization component that builds summary levels 1, 2, 2+i (i = 1, 2, ..., X), 3+X, 4+X, and 5+X, stored in a summary database.

3 System architecture

Figure 2 shows that our system consists of two relatively independent components: video content structure detection and hierarchical video summarization. To explore the video content structure, shot segmentation and keyframe extraction are first applied to parse continuous video sequences into independent shots and discrete keyframes. For all extracted keyframes, an affinity matrix is constructed to explore the correlations among them. With the transformed affinity matrix, visually similar shots are merged into supergroups. For each supergroup, the video shots with temporal distances smaller than a predefined threshold are then taken as one video group. Generally, a video scene consists of temporally interlaced (or visually similar and neighboring) video groups that convey a relatively independent story unit. Accordingly, the temporal and visual information among groups is integrated to cluster video groups into scenes. To eliminate the visual redundancy caused by repeated scenes in different parts of the video, a scene-clustering strategy is employed to merge visually similar scenes into unique clusters. As a result, a video content hierarchy from clustered scenes, scenes, and groups to keyframes is constructed.

Given the detected video content structure, an (X+5)-level video summary (X indicating the number of dynamic summary levels, which is determined by the user's specification of the summary length) is constructed, where the summary becomes more concise and compact from lower to higher levels. The lowest summary level consists of the keyframes of all shots. By utilizing the affinity matrix, visually similar shots are clustered into supergroups. Any supergroup containing only one shot is likely to convey very limited content information. Hence, by eliminating supergroups with only one shot, the summary at the second level consists of all remaining keyframes. For each supergroup, a predefined threshold is applied to segment the temporally adjacent member shots into groups. Given any video group, the Minimal Spanning Tree algorithm [50] is utilized to cluster its keyframes into a hierarchical structure, and a dynamic summarization scheme is employed to select a certain number of representative frames from each group to form the user-specified dynamic summary (levels 2+i, i = 1, 2, ..., X). Beyond the dynamic summary levels, the summary at level X+3 is constructed by eliminating groups that contain fewer than three shots and do not interlace with any other groups, since they probably contain less scenario information. With the scene-clustering scheme, visual redundancy among similar scenes is partially removed. Consequently, the summary at level X+4 consists of representative frames of all clustered scenes. It is not unusual for different video scenes to contain visually similar video groups, and eliminating similar groups provides the user with a visually compact summary. Hence, the summary at the highest level is constructed by merging all representative frames at level X+4 that belong to the same supergroup.

4 Video content structure detection

Most videos in our daily life have their own content structure, as shown in Fig. 1. While browsing videos, the hidden structure behind the frames (scenes, groups, etc.) conveys themes and scenarios to us. Hence, the video summary should also fit this structure by presenting overviews at various levels and with various granularities.
To achieve this goal, we first explore the video content structure.

The simplest method of parsing video data for efficient browsing, retrieval, and navigation is to segment the continuous video sequence into physical shots [35,37,38] and then select a constant number of keyframes for each shot to depict its content [1,2,32,43,45,47]. Unfortunately, since the video shot is a physical unit, it is incapable of conveying independent scenario information. To this end, many solutions have been proposed to determine clusters of video shots that convey relatively higher-level video semantics. Zhong et al. [33] propose a strategy that clusters visually similar shots and supplies the user with a hierarchical structure for browsing. Similar strategies are found in [11,32,43,44]. However, spatial shot clustering considers only visual similarities, and the video context information is lost. To address this problem, Rui et al. [39] present a method that merges visually similar shots into groups and then constructs a video content table by considering temporal relationships among groups. A similar approach has been reported in [36]. In [41], a time-constrained shot-clustering strategy is proposed to parse temporally adjacent shots into clusters, and a scene transition graph is constructed to detect video scenes by utilizing the acquired cluster information. A temporally constrained shot-grouping strategy has also been discussed in [40,42] that considers the similarity between the current shot and the shots preceding and following it (within a small window) to determine the boundaries of video scenes.

In a narrow sense, both the temporal and the spatial clustering schemes above have disadvantages in exploring video content. Spatial clustering ignores temporal information, so video context information is not available in the final result. On the other hand, temporal clustering depends crucially on a predefined distance function that integrates the temporal window size and the visual similarity between shots, and this function may not fit user perceptions very well [36,39–42].

In this section, a new method is proposed to detect video content structure. We first discuss techniques to determine shots and keyframes. Then, an affinity matrix is constructed to cluster visually similar shots into supergroups. Although visually similar shots have a high probability of belonging to one scene, similar shots with a large temporal distance often belong to similar scenes in different parts of the video; e.g., similar dialogs between two actors may appear in different parts of a movie. Therefore, in each supergroup, adjacent shots whose temporal distance is less than a predefined threshold are segmented into video groups. Because video scenes usually consist of interlaced video groups (or visually similar and neighboring groups), a video-scene-detection algorithm is proposed that considers the visual and temporal information among video groups. Finally, we apply a scene-clustering scheme that merges visually similar scenes to remove visual redundancy and produce a compact summary.

4.1 Video shot detection and keyframe extraction

To support shot-based video content access, we have developed a shot-detection technique [38] that works on MPEG-compressed videos. It adaptively adjusts the threshold for shot detection by considering the activities in different video sequences. Unfortunately, such techniques are not able to adapt the threshold for different video shots within the same sequence. To adapt the thresholds to local activities within the same sequence, we adopt a small window (e.g., 2–3 s), and the threshold for each window is adapted to local visual activity by applying our automatic threshold detection and local activity analysis techniques. Figure 3 demonstrates the results for one video source in our system. As we can see, the threshold has been adapted to small changes between adjacent shots for successful shot segmentation, such as the changes between eyeballs in various shots in Fig. 3a.

Fig. 3. Video shot detection results from a medical video. a Detected video shots. b Corresponding frame differences and the determined threshold for different video shots, where the small window shows the local properties of the frame differences.

After shot segmentation, the keyframes for each shot are extracted using the following strategy:

1. Given shot S_i, its tenth frame is always taken as the first keyframe of the shot, denoted by K_1. (In this way, we may avoid selecting the gradual-effect frames between shots as keyframes.)

2. From the position of the most recently selected keyframe, sequentially calculate the similarity between each succeeding frame F_i and all formerly selected keyframes K_l (l = 1, 2, ..., N) using Eq. 1. If this value is less than a threshold T_keyframe (which is determined by the threshold used to detect the current shot), the current frame F_i is taken as a new keyframe and added to the keyframe set:

KFmSelect(F_i) = \max\{FmSim(F_i, K_l);\ l = 1, 2, \ldots, N\},   (1)

where FmSim(F_i, K_l) indicates the visual similarity between F_i and K_l, as defined by Eq. 2.

3. Iteratively execute step 2 until all frames in shot S_i have been processed.

To calculate the visual similarity between frames, three sets of visual features (a 256-dimensional HSV color histogram, 10-dimensional Tamura coarseness, and 8-dimensional Tamura directionality texture) are extracted. Suppose H_{i,j}, j ∈ [0, 255], TC_{i,j}, j ∈ [0, 9], and TD_{i,j}, j ∈ [0, 7] are the normalized color histogram, coarseness, and directionality texture of frame F_i. The similarity between F_i and F_j is defined by Eq. 2:

FmSim(F_i, F_j) = W_C \sum_{k=0}^{255} \min(H_{i,k}, H_{j,k}) + W_{TC}\left(1 - \sqrt{\sum_{k=0}^{9} (TC_{i,k} - TC_{j,k})^2}\right) + W_{TD}\left(1 - \sqrt{\sum_{k=0}^{7} (TD_{i,k} - TD_{j,k})^2}\right),   (2)

where W_C, W_TC, and W_TD indicate the weight of each feature. We set W_C = 0.6, W_TC = 0.2, and W_TD = 0.2 in our system.
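To make the similarity measure and the selection loop concrete, here is a minimal Python sketch. It assumes per-frame features (normalized HSV histogram plus Tamura coarseness and directionality vectors) have already been computed; the function and field names are ours, not the original implementation:

```python
import numpy as np

W_C, W_TC, W_TD = 0.6, 0.2, 0.2  # feature weights as given in the text

def fmsim(f1, f2):
    """Eq. 2: weighted color-histogram intersection plus Tamura-texture closeness.
    f1, f2 are dicts with 'hist' (256-d), 'coarse' (10-d), 'direct' (8-d) arrays."""
    color = np.minimum(f1["hist"], f2["hist"]).sum()
    tc = 1.0 - np.sqrt(((f1["coarse"] - f2["coarse"]) ** 2).sum())
    td = 1.0 - np.sqrt(((f1["direct"] - f2["direct"]) ** 2).sum())
    return W_C * color + W_TC * tc + W_TD * td

def extract_keyframes(shot_features, t_keyframe):
    """Steps 1-3: seed with the tenth frame, then add any frame whose maximum
    similarity to all selected keyframes (Eq. 1) falls below t_keyframe."""
    if not shot_features:
        return []
    first = min(9, len(shot_features) - 1)  # tenth frame, zero-based; the
    keyframes = [first]                     # short-shot guard is our addition
    for i in range(first + 1, len(shot_features)):
        kfm_select = max(fmsim(shot_features[i], shot_features[k]) for k in keyframes)
        if kfm_select < t_keyframe:         # dissimilar to every selected keyframe
            keyframes.append(i)             # step 2: F_i becomes a new keyframe
    return keyframes
```

Because each candidate is compared against all previously selected keyframes rather than only the most recent one, a frame that resembles any earlier keyframe is rejected, which is exactly the redundancy-avoiding behavior described below.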

Figure 4 presents the keyframes detected from a single shot, where Fig. 4a shows constantly sampled frames in the shot (a doctor examines a patient's face) and Fig. 4b shows the keyframes detected with our strategy. In some cases, similar frames may appear repeatedly in different locations of a shot. A strategy that considers only the similarity between the current frame F_i and the most recently selected keyframe [47] may result in redundancies among the keyframes. With our strategy, the selected keyframes are those that exhibit enough dissimilarity (low redundancy) and capture most of the visual variance of the shot.

Fig. 4. Keyframe extraction results. a Constantly sampled frames in a given shot, where similar frames appear repeatedly (from left to right, top to bottom). b Keyframes extracted with our strategy.

4.2 Video supergroup detection

After keyframe extraction, we apply a new strategy to merge visually similar shots into clusters (supergroups). The steps that follow will then segment the temporally adjacent shots within each supergroup into semantically richer units.

Assume N keyframes K_l (l = 1, ..., N) have been extracted from a given video. To explore the correlations among them, an affinity matrix M [48,49,52,53] is constructed, with the entries of the matrix given by Eq. 3:

M(i, j) = M(j, i) = \exp[-(1 - FmSim(K_i, K_j))^2 / \sigma];\quad i, j \in [1, N],   (3)

where FmSim(K_i, K_j) is the visual similarity between K_i and K_j (defined by Eq. 2) and σ is a constant scaling factor for stretching (σ = 0.5 in our system). In fact, M(i, j) is a self-correlation matrix, where the relationships among the keyframes of the current video can be visualized and analyzed directly, as shown in Fig. 7c, where the transformed affinity matrix of 105 keyframes is utilized to explore the correlations. The gray intensity in the figure indicates the similarity between elements: the higher the intensity, the greater the similarity.

To address the correlations within the affinity matrix M (in the following sections we use M as an abbreviation of M(i, j)), we adopt the Scott and Longuet-Higgins (SLH) [49] relocalization algorithm, since it has been successfully utilized in image segmentation [52,53] and video event detection [46]. Given an affinity matrix M and a number k, SLH outputs a newly transformed matrix Q that is calculated in the following steps [48]:

1. Calculate the eigenvectors of the affinity matrix M, rank the eigenvectors in descending order of their eigenvalues, and denote the ranked eigenvectors by E_1, E_2, ..., E_N.
2. Construct the matrix W whose columns are the first k eigenvectors of M, i.e., W = [E_1, E_2, ..., E_k, 0, ..., 0].
3. Normalize the rows of W with Eq. 4 so that they have unit Euclidean norm:

W(i, \cdot) = W(i, \cdot) / \|W(i, \cdot)\|.   (4)

4. Construct the matrix Q with Q = W W^T, i.e., Q_{i,j} = \sum_{r=1}^{k} W_{ir} W_{jr}. Each element Q_{i,j} can then be interpreted as an enhanced measurement of the interaction between keyframes K_i and K_j.
5. Segment the points by looking at the elements of Q. Ideally, the larger Q_{i,j} is, the higher the probability that keyframes K_i and K_j belong to one group.

Figure 7c represents the constructed Q matrix of the 105 keyframes with the top 10 eigenvectors.
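A compact sketch of this transform, combining Eq. 3 with SLH steps 1-5 and reusing the fmsim function from the previous sketch, is given below; treat it as illustrative rather than as the authors' code:

```python
import numpy as np

def slh_transform(keyframe_feats, k, sigma=0.5, fmsim=None):
    """Build the affinity matrix M (Eq. 3) and apply the SLH relocalization
    to obtain the enhanced interaction matrix Q."""
    n = len(keyframe_feats)
    M = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            s = fmsim(keyframe_feats[i], keyframe_feats[j])
            M[i, j] = np.exp(-(1.0 - s) ** 2 / sigma)          # Eq. 3
    # Step 1: eigenvectors of the symmetric matrix M, ranked by
    # descending eigenvalue (eigh returns them in ascending order).
    eigvals, eigvecs = np.linalg.eigh(M)
    order = np.argsort(eigvals)[::-1]
    # Step 2: keep the top-k eigenvectors as the columns of W
    # (dropping the zero columns changes nothing in Q = W W^T).
    W = eigvecs[:, order[:k]]
    # Step 3 (Eq. 4): normalize each row of W to unit Euclidean norm.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W = W / np.where(norms == 0, 1.0, norms)
    # Step 4: Q = W W^T; a larger Q[i, j] suggests K_i and K_j share a group.
    return W @ W.T
```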
4.2.1 Q matrix processing

As indicated above, each element of matrix Q provides information about whether a pair of keyframes belongs to the same group. The problem of grouping the keyframes has thus been reduced to that of sorting the entries of matrix Q by swapping pairs of rows and columns until it becomes block diagonal, with each diagonal block indicating one cluster. Hence, the Q matrix processing for keyframe grouping consists of two steps: (1) sorting and (2) diagonal-block detection, as shown in Fig. 5.

By swapping pairs of rows and columns of matrix Q, highly correlated keyframes are merged together. To do this, we apply an energy-maximal resorting process that rearranges the matrix into diagonal-block style, as shown in Fig. 6. Given the Q matrix Q_{i,j} (i = 1, ..., N; j = 1, ..., N), select one row as the seed for sorting. At each iteration, say k, the current state is represented by a k × k submatrix Q^k that contains the keyframes sorted so far. To select a candidate keyframe for resorting, a cost function (Eq. 5) is given by the energy of the first k elements,

E_h^k = \sum_{l=1}^{k} Q_{l,h}^2 \quad \text{for } h = k+1, \ldots, N,   (5)

which represents the total energy of interaction between each candidate keyframe and the set of already sorted keyframes. By maximizing the cost function E_h^k, our search strategy selects the keyframe whose global energy of interaction with the current segmentation is the largest.

Fig. 5. Q matrix processing: sorting followed by diagonal-block detection.

Fig. 6. Q matrix sorting algorithm: at iteration k, columns k+1 to N are permuted, and the column with the highest norm is selected to form Q^{k+1}.

The updated state Q^{k+1} is obtained by augmenting Q^k with the column and row of the best selected keyframe. The column corresponding to this keyframe is first permuted with column k+1, followed by a permutation of the rows with the same indices. Matrix Q^{k+1} is then formed from the first (k+1) × (k+1) elements of the permuted matrix. As a result of this maximization strategy, submatrix Q^{k+1} has the maximal energy among all possible (k+1) × (k+1) submatrices of Q. The resorting steps above are executed recursively until all rows have been selected. The rows and columns of matrix Q are then sorted in descending order of similarity. Because visually similar keyframes have similar interactions with all other keyframes, the resorted matrix Q takes a roughly block-diagonal form, as shown in Fig. 5b, with each diagonal block representing one cluster of keyframes (supergroup). The successfully detected diagonal blocks will help us determine the final keyframe clusters.

To sort the Q matrix, a critical row (the seed row) must be selected before sorting. Since it is not possible to optimize the selection of such a seed, we randomly select several rows as seeds and determine the corresponding sorted Q matrices. Those that have distinct diagonal-block results are retained for further processing.

After the Q matrix has been sorted into diagonal blocks, various approaches might be utilized to detect the boundaries (clusters) of the blocks [52]. However, since we do not actually know how many diagonal blocks the Q matrix contains, the performance of these strategies might be degraded, because most of them need information about the number of clusters in advance. Therefore, a simple threshold scheme is applied to eliminate elements below a predefined threshold and detect the blocks along the diagonal. The elements detected in one block are taken as one supergroup. As indicated above, instead of using only one keyframe as the seed, we execute the algorithm several times using different seeds. The union of all detected supergroups is taken as the detection result. For example, if K_2, K_9, K_14, K_35 are detected as one supergroup in the first iteration, and K_2, K_9, K_22 are detected as one supergroup in the next iteration, their union, {K_2, K_9, K_14, K_22, K_35}, will be taken as the final detected supergroup.
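The energy-maximal resorting loop (Eq. 5) and a threshold-based block detection can be sketched as follows. The paper specifies a threshold on the Q entries but not an exact block-cutting rule, so the criterion in detect_blocks (minimum interaction with the current block) is our illustrative choice, and all names are ours:

```python
import numpy as np

def sort_q_matrix(Q, seed=0):
    """Greedy energy-maximal resorting (Eq. 5): repeatedly append the
    candidate with the largest interaction energy against the sorted
    prefix. Returns the permutation of keyframe indices."""
    n = Q.shape[0]
    order = [seed]
    remaining = [i for i in range(n) if i != seed]
    while remaining:
        # E_h^k = sum over sorted rows l of Q[l, h]^2 for each candidate h
        energies = [sum(Q[l, h] ** 2 for l in order) for h in remaining]
        order.append(remaining.pop(int(np.argmax(energies))))
    return order

def detect_blocks(Q, order, threshold):
    """Walk the sorted diagonal and start a new block whenever a keyframe's
    interaction with the current block drops below the threshold."""
    groups, current = [], [order[0]]
    for idx in order[1:]:
        if min(Q[idx, j] for j in current) >= threshold:
            current.append(idx)
        else:
            groups.append(current)
            current = [idx]
    groups.append(current)
    return groups
```

In the full procedure, sort_q_matrix would be run from several random seeds and the union of the resulting supergroups kept, exactly as described above.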

4.2.2 Keyframe grouping for supergroup detection

After the diagonal blocks have been detected, the keyframes belonging to the same block are taken as one supergroup, denoted by SG_i. A keyframe that does not belong to any diagonal block is taken as a separate supergroup. Given a video shot S_i, assume there are T keyframes K_l (l = 1, ..., T) contained in S_i. In the ideal situation, all T keyframes should be merged into one supergroup. Unfortunately, this assumption does not always hold, because a shot may exhibit significant visual variance. Suppose all T keyframes in S_i have been clustered into NSG supergroups SG_l (l = 1, ..., NSG), with T_1 keyframes contained in SG_1, T_2 keyframes contained in SG_2, and so on. Obviously, T_1 + T_2 + ... + T_NSG = T. The final supergroup to which S_i belongs is selected using Eq. 6:

S_i \in SG_t,\quad t = \arg\max_l \{T_l;\ l = 1, 2, \ldots, NSG\},   (6)

i.e., the supergroup with the maximal number of keyframes from S_i is selected as the target to which S_i belongs.

Figure 7 shows the pictorial results of our supergroup detection strategy applied to a real-world video with 105 keyframes. The sample frames are shown in Fig. 7a in their temporal order, from left to right and top to bottom. Figures 7c and d show the constructed Q matrix with the top ten eigenvectors and the sorted result. The sorted diagonal blocks can be used directly for supergroup detection. The representative frames of the ten detected supergroups are shown in Fig. 7b.

Fig. 7. Video supergroup detection. a Sampled frames from 105 keyframes (a kidney transplant surgery), in their original temporal order from left to right, top to bottom. b Ten detected supergroups. c Q matrix of the corresponding 105 keyframes. d Sorted Q matrix.

4.3 Video group detection

In the strategy above, the temporal order information among shots has not been considered. Although visually similar shots have a high probability of belonging to one scene, shots with a large temporal distance are more likely to belong to different scenarios. Accordingly, given any supergroup SG_i, adjacent shots in SG_i whose temporal distance is larger than a predefined threshold are segmented into different video groups.

Given two shots S_l and S_h in video V_i, assume their temporal index orders (shot IDs) are denoted by ID_{S_l} and ID_{S_h}, respectively. The temporal index order distance (IOD) between S_l and S_h is defined by Eq. 7:

IOD(S_l, S_h) = |ID_{S_l} - ID_{S_h}| - 1.   (7)

That is, the IOD between S_l and S_h is the number of shots between S_l and S_h in temporal order. Given a supergroup SG_i with NT > 1 shots and a specific threshold T_Group, the strategy below will segment all NT shots into corresponding groups:

1. Rank all shots S_l (l = 1, ..., NT) of SG_i in increasing order of their temporal index order (that is, the shot with the smallest index number (ID) is assigned to the first place, as shown in Fig. 8).
2. Initially, assume there are NT groups G_l (l = 1, ..., NT) in SG_i, with each group containing one shot.
3. Given any two neighboring groups G_i (with NT_i shots) and G_j (with NT_j shots) in SG_i, if the IOD between G_i and G_j is no larger than the threshold T_Group, G_i and G_j are merged into one group. Accordingly, we define the IOD between G_i and G_j, denoted by IOD(G_i, G_j), using Eq. 8:

IOD(G_i, G_j) = \min\{IOD(S_l, S_h);\ l = 1, \ldots, NT_i,\ S_l \in G_i;\ h = 1, \ldots, NT_j,\ S_h \in G_j\}.   (8)

If IOD(G_i, G_j) ≤ T_Group, G_i and G_j are merged into a new video group; otherwise they remain separate.

4. Iteratively execute step 3 until there are no more groups to be merged. Assume that NG groups (G_i, i = 1, ..., NG) remain; then for any two remaining groups G_l, G_h ∈ SG_i, IOD(G_l, G_h) > T_Group. All remaining units are taken as the video groups of SG_i.

Figure 8 presents an example of our group segmentation algorithm. Given a supergroup SG_i containing 12 shots, the ranked temporal index orders of all shots are 8, 14, 15, 17, ..., 41. Initially, each shot is taken as one group, and any two neighboring groups with an IOD no larger than T_Group are merged into one group (we set T_Group = 4 in our system; a detailed analysis of the selection of T_Group is given in Sect. 6.1). Iterative execution finally segments all shots in SG_i into four groups.

Fig. 8. Video group detection by temporal shot segmentation.
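Because each video group ends up holding a contiguous run of the ranked shot IDs, the iterative merging of steps 1-4 collapses to a single left-to-right pass; the sketch below rests on that observation, and its names are ours:

```python
T_GROUP = 4  # temporal-distance threshold used in the paper

def iod(id_a, id_b):
    """Eq. 7: number of shots lying between two shots in temporal order."""
    return abs(id_a - id_b) - 1

def segment_supergroup(shot_ids, t_group=T_GROUP):
    """Split a supergroup's shots (given by temporal index order) into video
    groups: neighboring shots with IOD <= t_group stay in one group."""
    if not shot_ids:
        return []
    ids = sorted(shot_ids)           # step 1: rank by temporal index order
    groups = [[ids[0]]]              # step 2: start with the first shot
    for sid in ids[1:]:
        if iod(groups[-1][-1], sid) <= t_group:
            groups[-1].append(sid)   # steps 3-4: extend the current group
        else:
            groups.append([sid])     # distance too large: open a new group
    return groups

# Fig. 8's example: shots with IDs 8, 14, 15, 17, ..., 41 are cut into a new
# group wherever more than four shots intervene between temporal neighbors.
```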
4.4 Video scene detection

Using the strategy above, all shots can be parsed into groups, with each group containing visually similar and temporally adjacent shots. However, as Definition 7 in Sect. 2 indicates, a video group is an intermediate entity between physical shots and semantic scenes. This means that a single video group may be incapable of acting as an independent story unit that conveys a video scenario. In practice, a video scene usually consists of temporally interlaced groups, or of visually similar and neighboring groups. In this section, a group-merging algorithm is proposed for video scene detection.

Before describing our scene-detection algorithm, we discuss the characterization of video scenes. Following the definitions in [51], we conclude that three types of scenes are widely used by filmmakers when they capture or edit videos:

N-type scenes: These scenes (also known as normal scenes) are characterized by a long-term consistency of chromatic composition, lighting conditions, sound, etc. All shots in the scene are assumed to exhibit some similarity in their visual or audio features. An example of this type of scene is shown in Fig. 9a, where various shots captured by different cameras are organized to report a news story on a flight show at an international aviation exhibition. Since all shots are captured in the same scene, they are somewhat similar in visual features, and all shots in the scene might be merged into one or several visually similar video groups.

M-type scenes: These scenes (also known as montage scenes) are characterized by widely different visuals (different in location, time of creation, and lighting conditions) that create a unity of theme through the manner in which they are juxtaposed. An example of this type of scene is shown in Fig. 9b, where three actors are involved in a dialog scene. The shots in the scene have a large visual variance and have been classified into four distinct groups (G_1, G_2, G_3, and G_4). However, these four groups are temporally interlaced to convey the same scenario information.

Hybrid scenes: Hybrid scenes consist of both N- and M-type parts, where both visually similar neighboring groups and temporally interlaced groups are utilized to convey a scenario unit. This is shown in Fig. 9c, where the beginning is a typical N-type scene and the end is an M-type scene.

Fig. 9. Examples of various types of scenes. a N-type scenes. b M-type scenes. c Hybrid scenes, where both N- and M-type scenes exist in the same story unit.

Using the above observations and definitions, we now introduce a temporally and spatially integrated strategy for parsing video groups into semantic scenes. Given groups G_i and G_j, Fig. 10 presents the three possible temporal relationships between them: interlaced, uninterlaced, and neighboring. To parse video semantic scenes, temporally interlaced video groups and visually similar neighboring groups should be merged into one scene, since they have a high probability of conveying the same scenario. Let ID_{G_i}^{b} and ID_{G_i}^{e} denote the minimal and maximal temporal index orders of the shots in G_i, as shown in Fig. 10a. To classify G_i and G_j as interlaced, uninterlaced, or neighboring, the temporal group distance (TGD) between G_i and G_j, TGD(G_i, G_j), is defined by Eq. 9:

TGD(G_i, G_j) = \begin{cases} ID_{G_j}^{b} - ID_{G_i}^{e}, & \text{if } ID_{G_j}^{b} \ge ID_{G_i}^{b} \\ ID_{G_i}^{b} - ID_{G_j}^{e}, & \text{if } ID_{G_j}^{b} < ID_{G_i}^{b} \end{cases}   (9)

If TGD(G_i, G_j) > 1, there is no interlacing between the shots in G_i and G_j, and G_i and G_j are classified as uninterlaced groups, as shown in Fig. 10a. If TGD(G_i, G_j) = 1, we classify G_i and G_j as neighboring groups, as shown in Fig. 10c. If TGD(G_i, G_j) < 0, G_i and G_j are classified as interlaced groups, as shown in Fig. 10b. Since a shot can be assigned to only one group, TGD(G_i, G_j) never equals 0.

Fig. 10. Group merging for scene detection. a Uninterlaced groups are kept unmerged. b Interlaced groups are merged into one unit. c Neighboring groups with similarity larger than the threshold T_Scene are merged into one unit.

After classifying any two video groups into the appropriate category, our scene detection follows the steps below.
1. Given any two video groups G_i and G_j, if TGD(G_i, G_j) < 0, G_i and G_j are merged into a new group by combining all shots in G_i and G_j. As shown in Fig. 10b, the interlaced groups G_i and G_j are merged to form a new group G_k, with numbers in different styles (bold italic vs. roman) indicating shots from different groups.

2. If TGD(G_i, G_j) = 1 and the similarity between G_i and G_j (given by Eq. 11) is larger than a given threshold T_Scene, G_i and G_j are merged into a new group. Otherwise, they remain unmerged, as shown in Fig. 10c.

3. Iteratively execute the steps above until no more groups can be merged. All remaining and newly generated groups are taken as video scenes.

To measure the similarity between groups for scene detection, we first consider the similarity between a shot and a group. Based on Eq. 2, given shot S_i and group G_j, the similarity between them is defined by Eq. 10:

StGpSim(S_i, G_j) = \max_{K_l \in S_i,\ K_h \in G_j} \{FmSim(K_l, K_h)\}.   (10)

That is, the similarity between S_i and G_j is the similarity of the most similar pair of keyframes drawn from S_i and G_j. In general, when we evaluate the similarity between two video groups with the naked eye, we usually take the group with fewer shots as the benchmark and then determine whether any shots in the other group are similar to the shots in the benchmark group. If most shots in the benchmark group are similar enough to the other group, the two groups are treated as highly similar [30,42]. Given groups G_i and G_j, let \hat{G}_{i,j} represent the group containing fewer shots and \tilde{G}_{i,j} denote the other group. Let NT(x) denote the number of shots in group x; then the similarity between G_i and G_j is given by Eq. 11:

GpSim(G_i, G_j) = \frac{1}{NT(\hat{G}_{i,j})} \sum_{i=1;\ S_i \in \hat{G}_{i,j}}^{NT(\hat{G}_{i,j})} StGpSim(S_i, \tilde{G}_{i,j}).   (11)

That is, the similarity between G_i and G_j is the average similarity between the shots in the benchmark group and their most similar shots in the other group.

Some typical scene detection results on real-world videos (movies, news, and medical videos) are presented in Fig. 11, where the video shots in the same story units are successfully detected. In some situations, the scene boundaries are too vague to be distinguished even by the naked eye; we do not believe any state-of-the-art method can attain satisfactory results on such videos. However, for scenes with relatively clear boundaries, our method obtains attractive results (as demonstrated in Sect. 6.3), and these results will guide us in constructing hierarchical video summaries.

Fig. 11. Typical examples of successfully detected video scenes in various types of videos (movies, news, and medical videos), with each row denoting one scene.

4.5 Video-scene clustering

Thus far, a video content structure from video scenes and groups to keyframes has been constructed. Nevertheless, since our final goal is to use the video content structure to construct video summaries, a more compact level beyond the scenes is necessary for creating an overview of the video. This is accomplished by clustering visually similar scenes that appear at different places in the video. Given a scene SE_i in video V_k, assume there are NT shots contained in SE_i, and assume further that all shots in SE_i have been merged into NSG_i supergroups SG_l (l = 1, ..., NSG_i). The union of all these supergroups is denoted by Eq. 12:

USG(SE_i) = SG_1 \cup SG_2 \cup \cdots \cup SG_{NSG_i}.   (12)

Obviously, given any two scenes SE_i and SE_j, if USG(SE_i) ⊆ USG(SE_j), all shots in SE_i come from the same supergroups as those of SE_j. Hence, the visual redundancy between SE_i and SE_j can be eliminated by merging them into one clustered scene.
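As a closing illustration, the scene-detection rules of Eqs. 9-11 and steps 1-3 can be sketched as follows; the data layout (groups as lists of shot IDs, a keyframes_of lookup, and the fmsim measure of Sect. 4.1) is our assumption, not the original implementation:

```python
def tgd(g1, g2):
    """Eq. 9: temporal group distance between two groups of shot IDs."""
    b1, e1, b2, e2 = min(g1), max(g1), min(g2), max(g2)
    return b2 - e1 if b2 >= b1 else b1 - e2

def st_gp_sim(shot_kfs, group_kfs, fmsim):
    """Eq. 10: best keyframe match between a shot and a group."""
    return max(fmsim(k1, k2) for k1 in shot_kfs for k2 in group_kfs)

def gp_sim(g1, g2, keyframes_of, fmsim):
    """Eq. 11: average best-match similarity, benchmarked on the smaller group.
    keyframes_of maps a shot ID to that shot's keyframe features."""
    bench, other = (g1, g2) if len(g1) <= len(g2) else (g2, g1)
    other_kfs = [kf for s in other for kf in keyframes_of(s)]
    return sum(st_gp_sim(keyframes_of(s), other_kfs, fmsim) for s in bench) / len(bench)

def detect_scenes(groups, keyframes_of, fmsim, t_scene):
    """Steps 1-3: merge interlaced groups (TGD < 0) and visually similar
    neighboring groups (TGD == 1 and GpSim > t_scene) until a fixpoint."""
    groups = [list(g) for g in groups]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = tgd(groups[i], groups[j])
                if d < 0 or (d == 1 and gp_sim(groups[i], groups[j],
                                               keyframes_of, fmsim) > t_scene):
                    groups[i] = sorted(groups[i] + groups[j])
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return groups  # the remaining units are the video scenes
```

Under the same layout, the scene-clustering test of Eq. 12 reduces to a set comparison: if the set of supergroups covered by SE_i is a subset of those covered by SE_j, SE_i adds no new visual content and the two scenes can be merged.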


More information

ELEC Dr Reji Mathew Electrical Engineering UNSW

ELEC Dr Reji Mathew Electrical Engineering UNSW ELEC 4622 Dr Reji Mathew Electrical Engineering UNSW Review of Motion Modelling and Estimation Introduction to Motion Modelling & Estimation Forward Motion Backward Motion Block Motion Estimation Motion

More information

CS 664 Segmentation. Daniel Huttenlocher

CS 664 Segmentation. Daniel Huttenlocher CS 664 Segmentation Daniel Huttenlocher Grouping Perceptual Organization Structural relationships between tokens Parallelism, symmetry, alignment Similarity of token properties Often strong psychophysical

More information

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam

Using Subspace Constraints to Improve Feature Tracking Presented by Bryan Poling. Based on work by Bryan Poling, Gilad Lerman, and Arthur Szlam Presented by Based on work by, Gilad Lerman, and Arthur Szlam What is Tracking? Broad Definition Tracking, or Object tracking, is a general term for following some thing through multiple frames of a video

More information

Module 7 VIDEO CODING AND MOTION ESTIMATION

Module 7 VIDEO CODING AND MOTION ESTIMATION Module 7 VIDEO CODING AND MOTION ESTIMATION Lesson 20 Basic Building Blocks & Temporal Redundancy Instructional Objectives At the end of this lesson, the students should be able to: 1. Name at least five

More information

Video Representation. Video Analysis

Video Representation. Video Analysis BROWSING AND RETRIEVING VIDEO CONTENT IN A UNIFIED FRAMEWORK Yong Rui, Thomas S. Huang and Sharad Mehrotra Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign

More information

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami to MPEG Prof. Pratikgiri Goswami Electronics & Communication Department, Shree Swami Atmanand Saraswati Institute of Technology, Surat. Outline of Topics 1 2 Coding 3 Video Object Representation Outline

More information

Visual Representations for Machine Learning

Visual Representations for Machine Learning Visual Representations for Machine Learning Spectral Clustering and Channel Representations Lecture 1 Spectral Clustering: introduction and confusion Michael Felsberg Klas Nordberg The Spectral Clustering

More information

The ToCAI Description Scheme for Indexing and Retrieval of Multimedia Documents 1

The ToCAI Description Scheme for Indexing and Retrieval of Multimedia Documents 1 The ToCAI Description Scheme for Indexing and Retrieval of Multimedia Documents 1 N. Adami, A. Bugatti, A. Corghi, R. Leonardi, P. Migliorati, Lorenzo A. Rossi, C. Saraceno 2 Department of Electronics

More information

CIE L*a*b* color model

CIE L*a*b* color model CIE L*a*b* color model To further strengthen the correlation between the color model and human perception, we apply the following non-linear transformation: with where (X n,y n,z n ) are the tristimulus

More information

Latest development in image feature representation and extraction

Latest development in image feature representation and extraction International Journal of Advanced Research and Development ISSN: 2455-4030, Impact Factor: RJIF 5.24 www.advancedjournal.com Volume 2; Issue 1; January 2017; Page No. 05-09 Latest development in image

More information

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM

CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM 96 CHAPTER 6 IDENTIFICATION OF CLUSTERS USING VISUAL VALIDATION VAT ALGORITHM Clustering is the process of combining a set of relevant information in the same group. In this process KM algorithm plays

More information

SOM+EOF for Finding Missing Values

SOM+EOF for Finding Missing Values SOM+EOF for Finding Missing Values Antti Sorjamaa 1, Paul Merlin 2, Bertrand Maillet 2 and Amaury Lendasse 1 1- Helsinki University of Technology - CIS P.O. Box 5400, 02015 HUT - Finland 2- Variances and

More information

Video Alignment. Final Report. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin

Video Alignment. Final Report. Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Final Report Spring 2005 Prof. Brian Evans Multidimensional Digital Signal Processing Project The University of Texas at Austin Omer Shakil Abstract This report describes a method to align two videos.

More information

A 12-STEP SORTING NETWORK FOR 22 ELEMENTS

A 12-STEP SORTING NETWORK FOR 22 ELEMENTS A 12-STEP SORTING NETWORK FOR 22 ELEMENTS SHERENAZ W. AL-HAJ BADDAR Department of Computer Science, Kent State University Kent, Ohio 44240, USA KENNETH E. BATCHER Department of Computer Science, Kent State

More information

Video scenes clustering based on representative shots

Video scenes clustering based on representative shots ISSN 746-7233, England, UK World Journal of Modelling and Simulation Vol., No. 2, 2005, pp. -6 Video scenes clustering based on representative shots Jun Ye, Jian-liang Li 2 +, C. M. Mak 3 Dept. Applied

More information

CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM

CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM CORRELATION BASED CAR NUMBER PLATE EXTRACTION SYSTEM 1 PHYO THET KHIN, 2 LAI LAI WIN KYI 1,2 Department of Information Technology, Mandalay Technological University The Republic of the Union of Myanmar

More information

Integration of Global and Local Information in Videos for Key Frame Extraction

Integration of Global and Local Information in Videos for Key Frame Extraction Integration of Global and Local Information in Videos for Key Frame Extraction Dianting Liu 1, Mei-Ling Shyu 1, Chao Chen 1, Shu-Ching Chen 2 1 Department of Electrical and Computer Engineering University

More information

AUTOMATIC VIDEO INDEXING

AUTOMATIC VIDEO INDEXING AUTOMATIC VIDEO INDEXING Itxaso Bustos Maite Frutos TABLE OF CONTENTS Introduction Methods Key-frame extraction Automatic visual indexing Shot boundary detection Video OCR Index in motion Image processing

More information

9/8/2016. Characteristics of multimedia Various media types

9/8/2016. Characteristics of multimedia Various media types Chapter 1 Introduction to Multimedia Networking CLO1: Define fundamentals of multimedia networking Upon completion of this chapter students should be able to define: 1- Multimedia 2- Multimedia types and

More information

DATA and signal modeling for images and video sequences. Region-Based Representations of Image and Video: Segmentation Tools for Multimedia Services

DATA and signal modeling for images and video sequences. Region-Based Representations of Image and Video: Segmentation Tools for Multimedia Services IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 9, NO. 8, DECEMBER 1999 1147 Region-Based Representations of Image and Video: Segmentation Tools for Multimedia Services P. Salembier,

More information

Fast Non-Linear Video Synopsis

Fast Non-Linear Video Synopsis Fast Non-Linear Video Synopsis Alparslan YILDIZ, Adem OZGUR and Yusuf Sinan AKGUL {yildiz, akgul}@bilmuh.gyte.edu.tr, aozgur@gyte.edu.tr GIT Vision Lab - http://vision.gyte.edu.tr Gebze Institute of Technology

More information

A new predictive image compression scheme using histogram analysis and pattern matching

A new predictive image compression scheme using histogram analysis and pattern matching University of Wollongong Research Online University of Wollongong in Dubai - Papers University of Wollongong in Dubai 00 A new predictive image compression scheme using histogram analysis and pattern matching

More information

Video Summarization Using MPEG-7 Motion Activity and Audio Descriptors

Video Summarization Using MPEG-7 Motion Activity and Audio Descriptors Video Summarization Using MPEG-7 Motion Activity and Audio Descriptors Ajay Divakaran, Kadir A. Peker, Regunathan Radhakrishnan, Ziyou Xiong and Romain Cabasson Presented by Giulia Fanti 1 Overview Motivation

More information

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong)

Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) Biometrics Technology: Image Processing & Pattern Recognition (by Dr. Dickson Tong) References: [1] http://homepages.inf.ed.ac.uk/rbf/hipr2/index.htm [2] http://www.cs.wisc.edu/~dyer/cs540/notes/vision.html

More information

Hierarchical Video Summarization Based on Video Structure and Highlight

Hierarchical Video Summarization Based on Video Structure and Highlight Hierarchical Video Summarization Based on Video Structure and Highlight Yuliang Geng, De Xu, and Songhe Feng Institute of Computer Science and Technology, Beijing Jiaotong University, Beijing, 100044,

More information

Scalable storyboards in handheld devices: applications and evaluation metrics

Scalable storyboards in handheld devices: applications and evaluation metrics Multimedia Tools and Applications manuscript No. (will be inserted by the editor) Scalable storyboards in handheld devices: applications and evaluation metrics Luis Herranz Shuqiang Jiang Received: date

More information

Browsing News and TAlk Video on a Consumer Electronics Platform Using face Detection

Browsing News and TAlk Video on a Consumer Electronics Platform Using face Detection MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Browsing News and TAlk Video on a Consumer Electronics Platform Using face Detection Kadir A. Peker, Ajay Divakaran, Tom Lanning TR2005-155

More information

Multimedia Database Systems. Retrieval by Content

Multimedia Database Systems. Retrieval by Content Multimedia Database Systems Retrieval by Content MIR Motivation Large volumes of data world-wide are not only based on text: Satellite images (oil spill), deep space images (NASA) Medical images (X-rays,

More information

Unsupervised learning in Vision

Unsupervised learning in Vision Chapter 7 Unsupervised learning in Vision The fields of Computer Vision and Machine Learning complement each other in a very natural way: the aim of the former is to extract useful information from visual

More information

Segmentation Computer Vision Spring 2018, Lecture 27

Segmentation Computer Vision Spring 2018, Lecture 27 Segmentation http://www.cs.cmu.edu/~16385/ 16-385 Computer Vision Spring 218, Lecture 27 Course announcements Homework 7 is due on Sunday 6 th. - Any questions about homework 7? - How many of you have

More information

A Miniature-Based Image Retrieval System

A Miniature-Based Image Retrieval System A Miniature-Based Image Retrieval System Md. Saiful Islam 1 and Md. Haider Ali 2 Institute of Information Technology 1, Dept. of Computer Science and Engineering 2, University of Dhaka 1, 2, Dhaka-1000,

More information

CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT

CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT CHAPTER 2 TEXTURE CLASSIFICATION METHODS GRAY LEVEL CO-OCCURRENCE MATRIX AND TEXTURE UNIT 2.1 BRIEF OUTLINE The classification of digital imagery is to extract useful thematic information which is one

More information

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Ewa Kusmierek and David H.C. Du Digital Technology Center and Department of Computer Science and Engineering University of Minnesota

More information

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009

Learning and Inferring Depth from Monocular Images. Jiyan Pan April 1, 2009 Learning and Inferring Depth from Monocular Images Jiyan Pan April 1, 2009 Traditional ways of inferring depth Binocular disparity Structure from motion Defocus Given a single monocular image, how to infer

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

Efficient Acquisition of Human Existence Priors from Motion Trajectories

Efficient Acquisition of Human Existence Priors from Motion Trajectories Efficient Acquisition of Human Existence Priors from Motion Trajectories Hitoshi Habe Hidehito Nakagawa Masatsugu Kidode Graduate School of Information Science, Nara Institute of Science and Technology

More information

Scene Change Detection Based on Twice Difference of Luminance Histograms

Scene Change Detection Based on Twice Difference of Luminance Histograms Scene Change Detection Based on Twice Difference of Luminance Histograms Xinying Wang 1, K.N.Plataniotis 2, A. N. Venetsanopoulos 1 1 Department of Electrical & Computer Engineering University of Toronto

More information

One image is worth 1,000 words

One image is worth 1,000 words Image Databases Prof. Paolo Ciaccia http://www-db. db.deis.unibo.it/courses/si-ls/ 07_ImageDBs.pdf Sistemi Informativi LS One image is worth 1,000 words Undoubtedly, images are the most wide-spread MM

More information

CSE 258 Lecture 5. Web Mining and Recommender Systems. Dimensionality Reduction

CSE 258 Lecture 5. Web Mining and Recommender Systems. Dimensionality Reduction CSE 258 Lecture 5 Web Mining and Recommender Systems Dimensionality Reduction This week How can we build low dimensional representations of high dimensional data? e.g. how might we (compactly!) represent

More information

CONTENT MODEL FOR MOBILE ADAPTATION OF MULTIMEDIA INFORMATION

CONTENT MODEL FOR MOBILE ADAPTATION OF MULTIMEDIA INFORMATION CONTENT MODEL FOR MOBILE ADAPTATION OF MULTIMEDIA INFORMATION Maija Metso, Antti Koivisto and Jaakko Sauvola MediaTeam, MVMP Unit Infotech Oulu, University of Oulu e-mail: {maija.metso, antti.koivisto,

More information

CHAPTER 4: CLUSTER ANALYSIS

CHAPTER 4: CLUSTER ANALYSIS CHAPTER 4: CLUSTER ANALYSIS WHAT IS CLUSTER ANALYSIS? A cluster is a collection of data-objects similar to one another within the same group & dissimilar to the objects in other groups. Cluster analysis

More information

Research on Construction of Road Network Database Based on Video Retrieval Technology

Research on Construction of Road Network Database Based on Video Retrieval Technology Research on Construction of Road Network Database Based on Video Retrieval Technology Fengling Wang 1 1 Hezhou University, School of Mathematics and Computer Hezhou Guangxi 542899, China Abstract. Based

More information

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction

Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Multi-View Image Coding in 3-D Space Based on 3-D Reconstruction Yongying Gao and Hayder Radha Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48823 email:

More information

A nodal based evolutionary structural optimisation algorithm

A nodal based evolutionary structural optimisation algorithm Computer Aided Optimum Design in Engineering IX 55 A dal based evolutionary structural optimisation algorithm Y.-M. Chen 1, A. J. Keane 2 & C. Hsiao 1 1 ational Space Program Office (SPO), Taiwan 2 Computational

More information

Short Survey on Static Hand Gesture Recognition

Short Survey on Static Hand Gesture Recognition Short Survey on Static Hand Gesture Recognition Huu-Hung Huynh University of Science and Technology The University of Danang, Vietnam Duc-Hoang Vo University of Science and Technology The University of

More information

Introduzione alle Biblioteche Digitali Audio/Video

Introduzione alle Biblioteche Digitali Audio/Video Introduzione alle Biblioteche Digitali Audio/Video Biblioteche Digitali 1 Gestione del video Perchè è importante poter gestire biblioteche digitali di audiovisivi Caratteristiche specifiche dell audio/video

More information

Mobile Human Detection Systems based on Sliding Windows Approach-A Review

Mobile Human Detection Systems based on Sliding Windows Approach-A Review Mobile Human Detection Systems based on Sliding Windows Approach-A Review Seminar: Mobile Human detection systems Njieutcheu Tassi cedrique Rovile Department of Computer Engineering University of Heidelberg

More information

Multimedia Databases. Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig

Multimedia Databases. Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig Multimedia Databases Wolf-Tilo Balke Younès Ghammad Institut für Informationssysteme Technische Universität Braunschweig http://www.ifis.cs.tu-bs.de Previous Lecture Audio Retrieval - Query by Humming

More information

28 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 1, JANUARY 2010

28 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 1, JANUARY 2010 28 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 12, NO. 1, JANUARY 2010 Camera Motion-Based Analysis of User Generated Video Golnaz Abdollahian, Student Member, IEEE, Cuneyt M. Taskiran, Member, IEEE, Zygmunt

More information

A Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection

A Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection A Semi-Automatic 2D-to-3D Video Conversion with Adaptive Key-Frame Selection Kuanyu Ju and Hongkai Xiong Department of Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China ABSTRACT To

More information

Available online at ScienceDirect. Procedia Computer Science 87 (2016 ) 12 17

Available online at  ScienceDirect. Procedia Computer Science 87 (2016 ) 12 17 Available online at www.sciencedirect.com ScienceDirect Procedia Computer Science 87 (2016 ) 12 17 4th International Conference on Recent Trends in Computer Science & Engineering Segment Based Indexing

More information

At the end of the chapter, you will learn to: Present data in textual form. Construct different types of table and graphs

At the end of the chapter, you will learn to: Present data in textual form. Construct different types of table and graphs DATA PRESENTATION At the end of the chapter, you will learn to: Present data in textual form Construct different types of table and graphs Identify the characteristics of a good table and graph Identify

More information

Video Key-Frame Extraction using Entropy value as Global and Local Feature

Video Key-Frame Extraction using Entropy value as Global and Local Feature Video Key-Frame Extraction using Entropy value as Global and Local Feature Siddu. P Algur #1, Vivek. R *2 # Department of Information Science Engineering, B.V. Bhoomraddi College of Engineering and Technology

More information

CHAPTER 3 WAVELET DECOMPOSITION USING HAAR WAVELET

CHAPTER 3 WAVELET DECOMPOSITION USING HAAR WAVELET 69 CHAPTER 3 WAVELET DECOMPOSITION USING HAAR WAVELET 3.1 WAVELET Wavelet as a subject is highly interdisciplinary and it draws in crucial ways on ideas from the outside world. The working of wavelet in

More information

Visual interfaces for a semantic content-based image retrieval system

Visual interfaces for a semantic content-based image retrieval system Visual interfaces for a semantic content-based image retrieval system Hagit Hel-Or a and Dov Dori b a Dept of Computer Science, University of Haifa, Haifa, Israel b Faculty of Industrial Engineering, Technion,

More information

Clustering Methods for Video Browsing and Annotation

Clustering Methods for Video Browsing and Annotation Clustering Methods for Video Browsing and Annotation Di Zhong, HongJiang Zhang 2 and Shih-Fu Chang* Institute of System Science, National University of Singapore Kent Ridge, Singapore 05 *Center for Telecommunication

More information

A Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks

A Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks IEEE/ACM TRANSACTIONS ON NETWORKING, VOL. 8, NO. 6, DECEMBER 2000 747 A Path Decomposition Approach for Computing Blocking Probabilities in Wavelength-Routing Networks Yuhong Zhu, George N. Rouskas, Member,

More information

MATRIX BASED INDEXING TECHNIQUE FOR VIDEO DATA

MATRIX BASED INDEXING TECHNIQUE FOR VIDEO DATA Journal of Computer Science, 9 (5): 534-542, 2013 ISSN 1549-3636 2013 doi:10.3844/jcssp.2013.534.542 Published Online 9 (5) 2013 (http://www.thescipub.com/jcs.toc) MATRIX BASED INDEXING TECHNIQUE FOR VIDEO

More information

An explicit feature control approach in structural topology optimization

An explicit feature control approach in structural topology optimization th World Congress on Structural and Multidisciplinary Optimisation 07 th -2 th, June 205, Sydney Australia An explicit feature control approach in structural topology optimization Weisheng Zhang, Xu Guo

More information

HIERARCHICAL VISUAL DESCRIPTION SCHEMES FOR STILL IMAGES AND VIDEO SEQUENCES

HIERARCHICAL VISUAL DESCRIPTION SCHEMES FOR STILL IMAGES AND VIDEO SEQUENCES HIERARCHICAL VISUAL DESCRIPTION SCHEMES FOR STILL IMAGES AND VIDEO SEQUENCES Universitat Politècnica de Catalunya Barcelona, SPAIN philippe@gps.tsc.upc.es P. Salembier, N. O Connor 2, P. Correia 3 and

More information

Short Run length Descriptor for Image Retrieval

Short Run length Descriptor for Image Retrieval CHAPTER -6 Short Run length Descriptor for Image Retrieval 6.1 Introduction In the recent years, growth of multimedia information from various sources has increased many folds. This has created the demand

More information

CSE 255 Lecture 5. Data Mining and Predictive Analytics. Dimensionality Reduction

CSE 255 Lecture 5. Data Mining and Predictive Analytics. Dimensionality Reduction CSE 255 Lecture 5 Data Mining and Predictive Analytics Dimensionality Reduction Course outline Week 4: I ll cover homework 1, and get started on Recommender Systems Week 5: I ll cover homework 2 (at the

More information

Feature Selection Using Principal Feature Analysis

Feature Selection Using Principal Feature Analysis Feature Selection Using Principal Feature Analysis Ira Cohen Qi Tian Xiang Sean Zhou Thomas S. Huang Beckman Institute for Advanced Science and Technology University of Illinois at Urbana-Champaign Urbana,

More information

Clustering and Visualisation of Data

Clustering and Visualisation of Data Clustering and Visualisation of Data Hiroshi Shimodaira January-March 28 Cluster analysis aims to partition a data set into meaningful or useful groups, based on distances between data points. In some

More information

AN EFFICIENT BATIK IMAGE RETRIEVAL SYSTEM BASED ON COLOR AND TEXTURE FEATURES

AN EFFICIENT BATIK IMAGE RETRIEVAL SYSTEM BASED ON COLOR AND TEXTURE FEATURES AN EFFICIENT BATIK IMAGE RETRIEVAL SYSTEM BASED ON COLOR AND TEXTURE FEATURES 1 RIMA TRI WAHYUNINGRUM, 2 INDAH AGUSTIEN SIRADJUDDIN 1, 2 Department of Informatics Engineering, University of Trunojoyo Madura,

More information

Unsupervised Learning and Clustering

Unsupervised Learning and Clustering Unsupervised Learning and Clustering Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

6. Dicretization methods 6.1 The purpose of discretization

6. Dicretization methods 6.1 The purpose of discretization 6. Dicretization methods 6.1 The purpose of discretization Often data are given in the form of continuous values. If their number is huge, model building for such data can be difficult. Moreover, many

More information

Introduction to digital image classification

Introduction to digital image classification Introduction to digital image classification Dr. Norman Kerle, Wan Bakx MSc a.o. INTERNATIONAL INSTITUTE FOR GEO-INFORMATION SCIENCE AND EARTH OBSERVATION Purpose of lecture Main lecture topics Review

More information

Self-Organizing Maps for cyclic and unbounded graphs

Self-Organizing Maps for cyclic and unbounded graphs Self-Organizing Maps for cyclic and unbounded graphs M. Hagenbuchner 1, A. Sperduti 2, A.C. Tsoi 3 1- University of Wollongong, Wollongong, Australia. 2- University of Padova, Padova, Italy. 3- Hong Kong

More information

View Synthesis for Multiview Video Compression

View Synthesis for Multiview Video Compression View Synthesis for Multiview Video Compression Emin Martinian, Alexander Behrens, Jun Xin, and Anthony Vetro email:{martinian,jxin,avetro}@merl.com, behrens@tnt.uni-hannover.de Mitsubishi Electric Research

More information

STUDYING THE FEASIBILITY AND IMPORTANCE OF GRAPH-BASED IMAGE SEGMENTATION TECHNIQUES

STUDYING THE FEASIBILITY AND IMPORTANCE OF GRAPH-BASED IMAGE SEGMENTATION TECHNIQUES 25-29 JATIT. All rights reserved. STUDYING THE FEASIBILITY AND IMPORTANCE OF GRAPH-BASED IMAGE SEGMENTATION TECHNIQUES DR.S.V.KASMIR RAJA, 2 A.SHAIK ABDUL KHADIR, 3 DR.S.S.RIAZ AHAMED. Dean (Research),

More information

Chapter 3 Image Registration. Chapter 3 Image Registration

Chapter 3 Image Registration. Chapter 3 Image Registration Chapter 3 Image Registration Distributed Algorithms for Introduction (1) Definition: Image Registration Input: 2 images of the same scene but taken from different perspectives Goal: Identify transformation

More information