What is in that video anyway? : In Search of Better Browsing


Savitha Srinivasan, Dulce Ponceleon, Arnon Amir, Dragutin Petkovic
IBM Almaden Research Center, 650 Harry Road, San Jose, CA, USA
savitha, duke, arnon, petkovic@almaden.ibm.com

Abstract

Effective use of digital video can be greatly improved by a combination of two technologies: computer vision for automated video analysis and information visualization for data visualization. The unstructured, spatio-temporal nature of video poses tough challenges in the extraction of semantics using fully automated techniques. In the CueVideo project, we combine these automated technologies with a user interface designed for rapid filtering and comprehension of video content. Our interface introduces two new techniques for viewing video and builds upon existing techniques to provide synergistic views of the video content. We also report on a preliminary user study that compares the efficacy of these views in providing comprehension of video content.

1. Introduction

Ultimately, information is only valuable if it can be found, accessed, and shared. With advances in computing power and high-speed networks, digital video is becoming increasingly popular. Large collections of multimedia documents can be found in diverse application domains such as the broadcast industry, education, medical imaging, and geographic information systems. As digital video libraries become pervasive, finding the right video is becoming a challenge. The problem with video is that it tends to exist in a very linear space, one frame after the other. Therefore, cataloging and indexing of video has been universally accepted [1,2,3,8,12] as a step in the right direction towards enabling intelligent navigation, search, browsing and viewing of digital video. These technologies work towards the goal of searching video with the same ease with which we search text documents today. However, we are not there yet.
The spatio-temporal nature of video hinders even the convergence on a single definition of a video summary [5,14]. In the context of digital video, different search levels and search modalities have been identified [4]. The search and browse patterns of users have been classified into two broad categories: subject navigation, where an initial search generates a result set which must then be browsed, and interactive browsing, where a collection of videos is browsed without any prior search. Oftentimes, the search/browse phase is followed by refining search parameters or using the search results in a new query, as in "Show me more like this". In both usage patterns, browsing plays a significant role. While we recognize the importance of a seamless integration of querying, browsing and exploration of data in a digital library collection [6], this paper is focused on the challenges associated with browsing digital video.

In this paper, we define video data to mean video as a rich medium: not only the video images but also the associated audio. We use video content to denote the information contained in the video data that is potentially of interest to the user, such as objects, people, motion, static charts, drawings, maps, equations, transparencies and auditory information. In video production, a shot corresponds to the segment of video captured by a continuous camera recording. Shot-boundary detection algorithms [1] are used to partition the video into elemental units called shots. We define metadata to be any descriptor that tells us something about the video content, which can then be used as an index for video browsing to help locate the desired material and deliver it in a manageable format.

2. Related Work

Efforts to support video browsing date back to the early 1990s. The early systems extracted keyframes at evenly-spaced time intervals and displayed them in chronological order.
By 1993 [14], content-based solutions started appearing, which segmented the video using shot-boundary detection algorithms and selected one or more frames from each shot. These approaches typically resulted in one particular view of the video content, namely a sequence of still images in a two-dimensional array: a video storyboard. Subsequently, there have been efforts in semantic grouping and visualization of video content at different levels [2,8,12,13].

The Informedia [12] project combines speech recognition, image processing, and natural language understanding techniques for processing video automatically in a digital library system. A basic element of their interface design is the provision of alternate browsing options in response to a query, such as headlines, thumbnails, filmstrips and skims. The headlines, thumbnails and filmstrips are viewed statically, whereas the skim is played back to communicate the content of the video. The filmstrip view reduces the need to view each video paragraph in its entirety by providing a storyboard for quick viewing. PanoramaExcerpts [11] goes beyond keyframes by creating a storyboard which combines mosaics and keyframes. The MoCA project [7] has developed automatic techniques to create video skims that act as movie trailers: short versions of a longer video intended to attract the viewer's attention. Similar to these efforts, our objective is to automatically detect and convey the semantics of the video content to the user. This is a difficult problem because semantics are subjective and very much application dependent and content specific. For example, a goal may be an interesting event in a hockey video, whereas emphasized speech may be an interesting event in a talk video. Therefore, fully automated tools are not capable of creating a video summary in a domain-independent manner. Our challenge, therefore, is to combine browsing interfaces that deal with partial, potentially noisy data together with new techniques to extract semantics from the video content. In this context, we present the CueVideo user interface and the underlying algorithms that support these browsing methods.

3. CueVideo browsing interface

Figure 1 shows the first screen of the browsing interface in our system.
The image occupying most of the screen represents the metaphor we embody in our design, which is intended to convey that the computer processes the digitized video and produces different visualizations of its content. Each of the visualizations is browsable, and the user has the option of browsing one or more views by clicking on the Storyboard, Animation, Audio or Statistics icons. In addition, each view has contextual links to other views to assist with navigation between views. The Storyboard view is one of the most widely prevalent [1,2,3,14] means of browsing video. The technology to automatically generate the Statistics view based on the shot boundaries in the video also exists, although it has not been explicitly considered as a view of the video content. We combine these existing techniques with additional means of viewing the video content to provide multiple ways of getting the story. We have classified our new techniques into the category of Animation, where the rationale was to create a useful "movie of a movie" given the constraints of the underlying technology and bandwidth. Existing ideas for audio summarization [9] gave rise to our Audio Events category of viewing the video.

3.1. Video storyboard

The video storyboard is composed of representative still frames called keyframes, which can be automatically selected based on time intervals or on a segmentation of the video into shots. The bottom strip of the screen shot in figure 1 is an example of a one-dimensional video storyboard: a sequence of horizontally positioned keyframes. In addition, clicking on the Storyboard icon in figure 1 displays a two-dimensional storyboard. The numbers below each keyframe represent the range of frame numbers corresponding to that particular shot in the original video. Clicking on any image plays the video corresponding to that shot. The slider below the images provides feedback on the position of the keyframe in the context of the whole video.
Scrolling the slider horizontally provides a visual preview of the video content at a fraction of the time taken to watch the entire video or fast forward it.

Figure 1: CueVideo Browsing Interface

Compressed video differs from uncompressed video in several respects: it contains quantization errors and motion compensation errors, and sometimes the block and macro-block boundaries are apparent. A corrupted frame might be detected by a shot-boundary detection algorithm as a false shot boundary. Such errors are typically introduced either by a low quality encoder, a low quality video editor, or by communication errors; all of them appear as errors in the reconstructed image, and a robust shot-boundary detection algorithm should not be sensitive to them. Our algorithm addresses this issue by including a state machine with special states to identify and handle some of these cases, and the use of a coarse color histogram provides robustness against these problems.

Our shot-boundary detection algorithm works in a single pass, processing one frame at a time across the video. We first calculate a three-dimensional color histogram of the image in RGB space. The image pixels are sub-sampled to speed up the histogram calculation. Histograms of several frames are stored in a buffer to allow comparison between multiple pairs of frames. In addition to the histogram, we calculate certain global image characteristics such as the color mean and variance; these characteristics are used to determine whether a frame is black, monochrome or colorful. As each frame is processed, the statistics related to the differences between pairs of frames are updated. These statistics are used to evaluate the adaptive thresholds that are used in the state machine. At each frame, the state machine advances from its old state to a new state, and actions are taken accordingly. For example, when the end of a shot boundary is detected, a record of the ended shot is stored, and a near-middle keyframe is stored as a representative keyframe in a JPEG file. The shot record contains shot information and shot-boundary information, such as the type of effect.

The near-middle keyframe is selected from a sparse buffer which keeps a small number of frames (8 frames). The buffer is maintained in such a way that it always contains a frame near the middle of the shot, as long as the end of the shot has not been found. When the end of the shot is found, the shot length is calculated, and the frame in the buffer closest to the middle of the shot is selected. This frame is guaranteed to be close to the middle of the shot, within a fixed percentage of the shot length (3%). This avoids the need for a second pass to extract middle keyframes.

An example of the operation of the algorithm over a video sequence of 400 frames is shown in figures 2a and 2b. Figure 2a shows the measured distance between frames and the adaptive thresholds. Figure 2b shows the state of the state machine at each frame.

Figure 2b: Processing states in the above example

Our shot-boundary detection algorithm produces as few as 100 keyframes for a video clip of about 3 minutes and as many as 1000 keyframes for an hour-long video. These numbers can vary considerably depending on the type of video content. Therefore, while this view provides a rapid visual preview of the video content, it does not scale up for long videos, since scrolling through hundreds of still images in an attempt to get the story is time consuming, tedious and not effective [5]. It is also unsuitable for certain domains like music or education where most of the information is in the audio track.
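The single-pass, histogram-based detection described above can be sketched as follows. This is a simplified illustration rather than our production algorithm: the bin count, sub-sampling step, the k factor in the adaptive threshold, and the synthetic test clip are assumptions, and the state machine that handles fades and dissolves is omitted.

```python
# Sketch of a single-pass shot-boundary detector using a coarse 3-D RGB
# histogram over sub-sampled pixels and an adaptive threshold derived
# from the running statistics of frame-pair distances.
import numpy as np

def coarse_histogram(frame, bins=4, step=4):
    """Normalized coarse 3-D RGB histogram over sub-sampled pixels."""
    pixels = frame[::step, ::step].reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins,) * 3, range=[(0, 256)] * 3)
    return hist.ravel() / pixels.shape[0]

def detect_shots(frames, k=3.0, warmup=8):
    """Return frame indices where a cut is declared: the histogram
    distance exceeds mean + k * std of the distances seen so far."""
    cuts, dists = [], []
    prev = coarse_histogram(frames[0])
    for i in range(1, len(frames)):
        cur = coarse_histogram(frames[i])
        d = np.abs(cur - prev).sum()          # L1 histogram distance
        if len(dists) >= warmup:              # need some history first
            thresh = np.mean(dists) + k * np.std(dists)
            if d > thresh:
                cuts.append(i)
        dists.append(d)
        prev = cur
    return cuts

# Synthetic clip: 40 dark frames, then 40 bright frames -> one cut at 40.
rng = np.random.default_rng(0)
dark = rng.integers(0, 40, (40, 32, 32, 3), dtype=np.uint8)
bright = rng.integers(200, 255, (40, 32, 32, 3), dtype=np.uint8)
cuts = detect_shots(np.concatenate([dark, bright]))
print(cuts)  # -> [40]
```

The coarse bins are what make the comparison tolerant of quantization and block-boundary noise: small pixel-level errors rarely move a pixel into a different 64-wide bin.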
Figure 2a: Algorithm operation: this includes a fade-in, followed by three cuts and five dissolves

3.2. Animation: Motion storyboard (MSB)

We address the issue of viewing too many still images, and the missing audio information, by introducing the motion storyboard view. This view consists of animated images from the still video storyboard that are fully synchronized with the original audio track. The animation together with the audio conveys a sense of motion, as compared to the video storyboard. The MSB is played as a video where each keyframe has the duration of the associated shot. If more than one keyframe is used to represent the shot, all keyframes are animated within the duration of the associated shot, thereby preserving their temporal relationship. The audio track is synchronized with the animated images. The duration of the MSB is the same as the original video; however, it can be reduced by using techniques to speed up the audio. It also takes less screen real estate than the storyboard. As a concept the MSB is straightforward and qualifies as a low-bandwidth video summarization that retains the audio content. We generate the MSB using the following steps:

- Demultiplex the video and audio layers/tracks
- Process the video layer to generate the shot-list and associated indexing information
- Generate the video containing the selected keyframes; this constitutes the video layer/track of the MSB
- Generate a different version of the audio layer/track that is a reasonable compromise between quality and compression. Our experience suggests that we need a compression scheme beyond MPEG-1 audio layer 1 and/or 2 to really achieve a compact MSB
- Multiplex the video and audio layers/tracks to create the MSB

Our implementation of the MSB can process several video formats: MPEG-1, AVI, H.263, and QuickTime. We have selected QuickTime (QT) as the format for the MSB video because of the versatility offered by the QT architecture. The MSB video track may be encoded as a set of JPEGs, as Motion JPEG, or in any video format supported within QT. Our implementation can generate the MSB as a separate movie or as a new track on the original movie.
The latter constitutes one movie with two video windows side by side together with a single audio track, where both video windows are synchronized with the audio. This also serves as a useful tool for the visual evaluation of the shot-boundary detection algorithm. For rapid browsing, the QT interface allows the user to step through the keyframes one at a time. Figure 3 shows an example of a motion storyboard, where a representative keyframe is displayed together with the original audio track. The slider below the image provides feedback on the position of the audio in the context of the whole audio track. The slider may be moved back and forth to change the starting point of the playback, and the audio continues to be synchronized with the animated images. This view takes 2-3% of the bandwidth required by the original MPEG-1 video at 1.5 Mbit/sec and is best suited for news, education or commercial clips, where the combination of audio with still images conveys most of the content. However, it is not suitable for summarizing high-motion events such as a tennis match, a car race or a dance performance, where the amount of motion is relevant in conveying content.

Figure 3: Example of Motion Storyboard

3.3. Animation: Fast video playback

We address the need for rapid viewing of high-action videos by introducing the fast video playback concept. This view comprises a new video stream composed of sub-sampled frames from the original video, taking the amount of motion into account. It appears like fast-forward play and contains no audio. However, unlike traditional fast forward techniques, it composes the fast video using an adaptive frame rate: it runs faster (sparser frame samples) within long shots which do not contain much motion, and slower (denser frame samples) within short shots or high-action scenes. The result is a much shorter video which preserves all the fast, short events while cutting out most of the long, low-action scenes.
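The adaptive sampling idea behind fast video playback can be sketched in a few lines. This is an illustrative simplification of the scheme described above, not our exact algorithm: the speed-up factor and the per-shot minimum frame count are assumed parameters, and motion-based refinement within a shot is omitted.

```python
# Content-adaptive frame sampling for fast playback: long shots are
# sampled sparsely, short shots densely, and no shot is skipped entirely.
def fast_playback_frames(shot_boundaries, speedup=10, min_per_shot=3):
    """shot_boundaries: sorted frame indices delimiting shots.
    Returns the frame indices selected for the fast video stream."""
    selected = []
    for start, end in zip(shot_boundaries, shot_boundaries[1:]):
        length = end - start
        # Aim for length/speedup frames, but never fewer than
        # min_per_shot (or the whole shot, if it is shorter than that).
        n = min(max(min_per_shot, length // speedup), length)
        step = length / n
        selected.extend(start + round(i * step) for i in range(n))
    return selected

# One long, low-action shot (300 frames) and two short ones (20 each):
frames = fast_playback_frames([0, 300, 320, 340])
print(len(frames))  # 30 + 3 + 3 = 36 frames instead of 340
```

Note the time warping this produces: the long shot is played at 10x while the short shots are played at roughly 7x, which is exactly the trade-off discussed below.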
A non-linear sub-sampling approach has been introduced in earlier work, where an adaptive frame rate is selected based on the amount of motion in the frame. That approach keeps the spatio-temporal changes in the image constant; however, the perception of the actual motion is lost (e.g., a video containing a slow-moving car and a video containing a fast-moving car show a similar driving speed in the fast playback). Our algorithm selects the frames for fast playback in a content-based, nonlinear fashion: we neither miss a fast event nor skip a short shot. The selection of frames is based on the detected shot boundaries and the detection of other relevant video content. This allows the eye to fixate on the scene before the shot starts to play at a higher speed. The average frame rate can be ten to fifteen times faster than the original frame rate.

Figure 4 shows the difference between fast forward and fast video in terms of frame sampling rate. The first line shows the shot boundaries in the video clip; the shot between boundaries 1 and 2 is the longest shot, and the shots between boundaries 5, 6 and 7 are the shortest. The second line shows the sampling rate of the frames for regular fast forward: the frames are evenly sampled without regard to shots and shot durations. The third line shows the sampling rate for the fast video stream, where the long shot has fewer frames and the short shots have a greater number of frames. Therefore, the long, low-action shot plays faster and the short, high-action shots play slower.

Figure 4: Fast Forward and Fast Video Playback

The fixed-speed fast forward (of the same duration) typically looks jittery within high-action scenes and preserves the relatively long duration of the longer, low-action shots. While fast video playback alleviates the jarring visual effect of the high-action scenes, an undesirable side effect of this view is the introduction of time warping: the fact that fewer frames represent the longer, low-action shots and more frames represent the high-action shots modifies the amount of time spent in each shot as compared to the original video. This could be significant for applications where the amount of time spent in each shot is relevant in selecting the video. Fast video playback is particularly well suited for summarizing long sport events, action movies and interviews. However, it misses the audio channel, which cannot be synchronized with the faster video track.

3.4. Audio events

In order to convey some semantics derived from the audio track, we introduce the audio event view. The audio analysis algorithm classifies the audio track into silence, music and speech, and segments the video based on interesting audio events. What comprises an interesting audio event is entirely domain specific and must be defined for each application. Our experimental data consists of education and training videos; we therefore define an interesting audio event to be a speech segment.
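A minimal sketch of the silence/music/speech classification step might use per-window energy and zero-crossing rate. This is a generic illustration, not our audio analysis algorithm: the window length, feature choice and thresholds are assumptions, and real classifiers use considerably richer features.

```python
# Window-level audio classification into silence, speech and music,
# using short-time energy and zero-crossing rate (ZCR) as features.
import numpy as np

def classify_audio(samples, sr=16000, win=0.5,
                   silence_energy=0.01, speech_zcr=0.08):
    """Label each `win`-second window of a mono signal in [-1, 1]."""
    labels = []
    n = int(sr * win)
    for i in range(0, len(samples) - n + 1, n):
        w = samples[i:i + n]
        energy = float(np.mean(w ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(w))) > 0))
        if energy < silence_energy:
            labels.append("silence")
        elif zcr > speech_zcr:          # noisy, consonant-rich signal
            labels.append("speech")
        else:                           # sustained periodic signal
            labels.append("music")
    return labels

# Synthetic second of silence, then a second of a 220 Hz tone ("music").
sr = 16000
t = np.arange(sr) / sr
silence = np.zeros(sr)
tone = 0.5 * np.sin(2 * np.pi * 220 * t)
print(classify_audio(np.concatenate([silence, tone]), sr))
# -> ['silence', 'silence', 'music', 'music']
```

Adjacent windows with the same label would then be merged into segments, each of which becomes a browsable audio event.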
The basis for the video segmentation is different; however, we use the same viewing options, such as the motion storyboard or the full-motion video player, to play back each speech segment. This view provides a visual/aural representation of the world of action [10], the action in this case being an interesting audio event such as speech. While the concept of detecting audio events and browsing the events for a rapid aural summary is useful, the caveat associated with this view is that each application domain must develop its own specific audio event filter. It is not a general-purpose view that will produce reasonably useful summaries in all domains: for example, the definition of an interesting audio event in a music video, a sports video or an education video may be completely different.

3.5. Video statistics

The video statistics view consists of global statistics that are computed during the automatic video analysis. This includes a detailed shots table, the total number of shots, the average duration of the shots, the types of shot-boundary effects and their counts, and the number and length of speech and non-speech audio segments. This global metadata is potentially of interest to technical users such as film editors and producers. Figure 5 shows an example of such automatically generated statistics. This view of the video content requires less than 1% of the bandwidth required by the original video. The statistics are linked to contextual information: clicking on a shot number will play the video from that particular shot, and clicking on a keyframe number will display the corresponding image.

Figure 5: Example of Video Statistics

4. User study and findings

We identified a specific professional user group to work closely with: the corporate marketing group at the IBM Almaden Research Center. This group is involved in research on technologies that will be available in the year 2000 and beyond.
They have given hundreds of multimedia presentations and interviews on research activities and technology trends in the science, technology and application of computers and computing. Finding relevant videos from the archives is an important facet of putting together these presentations. They identified retrieval of possibly relevant videos and rapid visual comprehension of the video content as key problems in their workflow. We arrived at the following approach for browsing video: we decided to provide multiple, tightly coupled, synergistic views of the video content by processing the information in the different tracks, such as the video, audio and closed captions, using state-of-the-art enabling technologies. Each view attempts to bring out some distinct characteristic of the video content, with varying bandwidth requirements. Certainly, not all views are appropriate for a specific video domain. However, our rationale is that a combination of tightly coupled views [10] may succeed in providing rapid visual/aural comprehension where a single view, however informative, may fail. We believe that any view that quickly rules out a video as irrelevant is as important as a view that effectively conveys video content.

After designing the interface, we asked the user group to browse a video collection using our system. Subsequently, we conducted informal interviews with the users regarding the usefulness of the technology and interface in their workflow. We summarize their feedback as follows:

- A global, unified view of the browsing interface, as represented by our metaphor in figure 1, was greeted favorably. The ability to access the different visualizations from the first screen was reported to simplify the browsing process.
- Despite the lack of audio information and the fact that the storyboard view does not scale up for very long videos, the storyboard view was the most popular. However, the users felt a need for viewing the top 10 keyframes that best summarize the video, rather than representative keyframes of each shot.
- The tight coupling between the different views for contextual switching was found to be helpful for content comprehension.
- Finally, browsing digital video was perceived as one component of digital video systems; the cataloging, retrieval and reuse aspects of digital video are essential components as well.

5. Conclusion and future work

We have described the basic CueVideo system and the underlying algorithms that support the different browsing options. The contribution of this work is a unified browsing interface for digital video specifically targeted towards rapid visual comprehension of the content by providing multiple, synergistic views. We have introduced two new browsing methods towards this goal: the motion storyboard and fast video playback. This work establishes a framework in which we continue our work in video indexing and retrieval. We have integrated speech recognition technology to decode the audio track into words and have indexed the video using keywords. We have ongoing efforts in audio analysis, topic-based segmentation of video and other forms of video summarization. We would also like to analyze speech segments in the video to explore patterns in the frequency or temporal domain, to make reasonable interpretations at a level higher than the word level.

Acknowledgments

We acknowledge the contribution of Laurence Arcadias in creating the graphic art for the CueVideo project.

References

[1] Aigrain, Zhang and Petkovic. Content-Based Representation and Retrieval of Visual Media: A State-of-the-Art Review. In Multimedia Tools and Applications, Vol. 3, Kluwer Academic Publishers.
[2] Bach, J.R. et al. Virage image search engine: An open framework for image management. In Proceedings of SPIE Storage and Retrieval for Still Images and Video Databases IV, Vol. 2670, IS&T/SPIE, February.
[3] Chang, S.F., Chen, W., Meng, H.J., Sundaram, H. and Zhong, D. VideoQ: An Automated Content Based Video Search System Using Visual Cues. In Proceedings of ACM Multimedia 97, ACM Press, November.
[4] Chang, S.F., Eleftheriadis, A. and McClintock, R. Next-Generation Content Representation, Creation, and Searching for New-Media Applications in Education. In Proceedings of the IEEE, Vol. 86, No. 5, IEEE Inc, May.
[5] Dimitrova, N. The Myth of Semantic Video Retrieval. In ACM Computing Surveys, Vol. 27, No. 4, ACM Press, December.
[6] Furnas, G. Effective View Navigation. In Proceedings of CHI 97, Atlanta, GA, March.
[7] Lienhart, R., Pfeiffer, S. and Effelsberg, W. Video Abstracting. In Communications of the ACM, December 1997.
[8] Meng, H.J. and Chang, S.F. CVEPS: A Compressed Video Editing and Parsing System. In Proceedings of ACM Multimedia 96, pp. 43, ACM Press, November.
[9] Pfeiffer, S., Fischer, S. and Effelsberg, W. Automatic Audio Content Analysis. In Proceedings of ACM Multimedia 96, pp. 21, ACM Press.
[10] Shneiderman, B. Designing the User Interface: Strategies for Effective Human-Computer Interaction, Second Edition, Addison-Wesley Publ. Co., Reading, MA.
[11] Taniguchi, Y., Akutsu, A. and Tonomura, Y. PanoramaExcerpts: Extracting and Packing Panoramas for Video Browsing. In Proceedings of ACM Multimedia 97, pp. 427, ACM Press, November.
[12] Wactlar, H., Christel, M., Gong, Y. and Hauptmann, A. Lessons Learned from Building a Terabyte Digital Video Library. In IEEE Computer, Vol. 32, No. 2, February.
[13] Yeung, M., Yeo, B.L., Wolf, W. and Liu, B. Video Browsing using Clustering and Scene Transitions on Compressed Sequences. In Multimedia Computing and Networking, Proc. SPIE, February.
[14] Zhang, H.J., Kankanhalli, A. and Smoliar, S.W. Automatic partitioning of full-motion video. In ACM/Springer Multimedia Systems, Vol. 1, No. 1.


More information

Iterative Image Based Video Summarization by Node Segmentation

Iterative Image Based Video Summarization by Node Segmentation Iterative Image Based Video Summarization by Node Segmentation Nalini Vasudevan Arjun Jain Himanshu Agrawal Abstract In this paper, we propose a simple video summarization system based on removal of similar

More information

Multimedia Databases, Lecture 9: Video Retrieval (9.1 Hidden Markov Models, continued; 9.2 Introduction to Video Retrieval). Wolf-Tilo Balke, Silviu Homoceanu, Institut für Informationssysteme, Technische Universität Braunschweig. December 18, 2009.

Real-Time Content-Based Adaptive Streaming of Sports Videos. Shih-Fu Chang, Di Zhong, Raj Kumar, Digital Video and Multimedia Group, ADVENT University/Industry Consortium, Columbia University.

Browsing News and Talk Video on a Consumer Electronics Platform Using Face Detection. Kadir A. Peker, Ajay Divakaran, Tom Lanning, Mitsubishi Electric Research Laboratories, TR2005-155. http://www.merl.com

Video Shot Segmentation Using Late Fusion Technique. C. Krishna Mohan, N. Dhananjaya, B. Yegnanarayana, in Proc. Seventh International Conference on Machine Learning and Applications, San Diego, 2008.

PixSO: A System for Video Shot Detection. Chengcui Zhang, Shu-Ching Chen (School of Computer Science, Florida International University, Miami, FL 33199, USA), Mei-Ling Shyu.

Video Analysis for Browsing and Printing. Qian Lin, Tong Zhang, Mei Chen, Yining Deng, Brian Atkins, HP Laboratories, HPL-2008-215. Keywords: video mining, video printing, user intent, video panorama.

Video Abstracting. Rainer Lienhart, Silvia Pfeiffer, Wolfgang Effelsberg, University of Mannheim, 68131 Mannheim, Germany. Communications of the ACM, December.

Introduction to Audio/Video Digital Libraries (Introduzione alle Biblioteche Digitali Audio/Video). Why it is important to manage digital audiovisual libraries, and the specific characteristics of audio/video.

Video Shot Boundary Detection Using Singular Value Decomposition. Z. Černeková, C. Kotropoulos, I. Pitas, Aristotle University of Thessaloniki, Box 451, Thessaloniki 541 24, Greece.

Semantic Extraction and Semantics-Based Annotation and Retrieval for Video Databases. Yan Liu (liuyan@cs.columbia.edu), Fei Li (fl200@cs.columbia.edu), Department of Computer Science, Columbia University.

Tips on DVD Authoring and DVD Duplication. Maxell Professional Media.

A Miniature-Based Image Retrieval System. Md. Saiful Islam, Md. Haider Ali, Institute of Information Technology and Dept. of Computer Science and Engineering, University of Dhaka, Dhaka-1000.

Video Summarization Using MPEG-7 Motion Activity and Audio Descriptors. Ajay Divakaran, Kadir A. Peker, Regunathan Radhakrishnan, Ziyou Xiong, Romain Cabasson; presented by Giulia Fanti.

Story Unit Segmentation with Friendly Acoustic Perception. Longchuan Yan, Jun Du, Qingming Huang, Shuqiang Jiang, Institute of Computing Technology, Chinese Academy of Sciences, Beijing.

Elimination of Duplicate Videos in Video Sharing Sites. Narendra Kumar S, Murugan S, Krishnaveni R.

Automatic Video Indexing. Itxaso Bustos, Maite Frutos. Covers key-frame extraction, automatic visual indexing, shot boundary detection, and video OCR.

The Físchlár Digital Video Recording, Analysis and Browsing System. Hyowon Lee, Alan F. Smeaton, Colin O'Toole, Noel Murphy, Seán Marlow, Noel E. O'Connor.

Digital Video Segmentation. A. Hampapur, R. Jain, T. Weymouth, Proc. ACM Multimedia 94, San Francisco, CA, October 1994, pp. 357-364.

Region Feature Based Similarity Searching of Semantic Video Objects. Di Zhong, Shih-Fu Chang, Image and Advanced TV Lab, Department of Electrical Engineering, Columbia University, New York, NY 10027, USA.

Research on Construction of Road Network Database Based on Video Retrieval Technology. Fengling Wang, Hezhou University, School of Mathematics and Computer, Hezhou, Guangxi 542899, China.

Portfolio Summary. Concert Technology, July 2014. www.concerttechnology.com, bizdev@concerttechnology.com.

Text, Speech, and Vision for Video Segmentation: The Informedia Project. Alexander G. Hauptmann, School of Computer Science; Michael A. Smith, Dept. of Electrical and Computer Engineering, Carnegie Mellon University.

Video Syntax Analysis. Wei-Ta Chu, October 9, 2008. Covers scene boundary detection and key frame selection.

Correlation Based Car Number Plate Extraction System. Phyo Thet Khin, Lai Lai Win Kyi, Department of Information Technology, Mandalay Technological University, Myanmar.

Presentation by Svetla Boytcheva, State University of Library Studies and Information Technologies, Bulgaria, on work in progress for a research project.

Digital Video Projects (Creating). Tim Stack, (801) 585-3054, tim@uen.org, www.uen.org. Explores educational uses for digital video and the skills needed to teach students to film, capture, and edit.

About MPEG Compression; More About Long-GOP Video. HD video requires significantly more data than SD video; a single HD frame can require up to six times more data than an SD frame.

Integrating Low-Level and Semantic Visual Cues for Improved Image-to-Video Experiences. Pedro Pinho, Joel Baltazar, Fernando Pereira, Instituto Superior Técnico, Instituto de Telecomunicações, IST.

COMP126-2006: Practical 11, Video. Flash is designed to transmit animated and interactive documents compactly and quickly over the Internet.

Module 10: Multimedia Synchronization. Lesson 33: Basic Definitions and Requirements.

Interactive Video Retrieval System Integrating Visual Search with Textual Search. Shuichi Shiitani. AAAI Technical Report SS-03-08, 2003.

MPEG-4 Authoring Tool for the Composition of 3D Audiovisual Scenes. P. Daras, I. Kompatsiaris, T. Raptis, M. G. Strintzis, Informatics and Telematics Institute, 1 Kyvernidou Str., 546 39 Thessaloniki, Greece.

Integration of Global and Local Information in Videos for Key Frame Extraction. Dianting Liu, Mei-Ling Shyu, Chao Chen, Department of Electrical and Computer Engineering; Shu-Ching Chen.

Lesson 11: Media Retrieval. Covers information retrieval, image retrieval, video retrieval, and audio retrieval; retrieval = query + search.


A Framework for Multi-Agent Multimedia Indexing. Bernard Merialdo, Multimedia Communications Department, Institut Eurecom, BP 193, 06904 Sophia-Antipolis, France. merialdo@eurecom.fr. March 31, 1995.

Semantic Visual Templates: Linking Visual Features to Semantics. Shih-Fu Chang, William Chen, Hari Sundaram, Dept. of Electrical Engineering, Columbia University, New York, NY 10027. {sfchang, bchen, sundaram}@ctr.columbia.edu

Final Study Guide, Arts & Communications. Programs used in multimedia: developing a multimedia production requires software to create, edit, and combine text, sounds, and images.

Audio-Visual Content Indexing, Filtering, and Adaptation. Shih-Fu Chang, Digital Video and Multimedia Group, ADVENT University-Industry Consortium, Columbia University, October 12, 2001. http://www.ee.columbia.edu/dvmm

Windows Movie Maker. Used for editing video footage together; footage can be imported using a USB/1394, 1394/1394, or FireWire/i.LINK connection, and imported clips, video effects, and transitions are displayed in the Collections view.

ClassView: Hierarchical Video Shot Classification, Indexing, and Accessing. Jianping Fan, Ahmed K. Elmagarmid, Xingquan. IEEE Transactions on Multimedia, Vol. 6, No. 1, February 2004, p. 70.

A Rapid Scheme for Slow-Motion Replay Segment Detection. Wei-Hong Chuang, Dun-Yu Hsiao, Soo-Chang Pei, Homer Chen, Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan 10617.


Algorithms and System for High-Level Structure Analysis and Event Detection in Soccer Video. Peng Xu, Shih-Fu Chang, Columbia University; Ajay Divakaran, Anthony Vetro, Huifang Sun, Mitsubishi Electric.

A Digital Library Framework for Reusing e-Learning Video Documents. Paolo Bolettieri, Fabrizio Falchi, Claudio Gennaro, Fausto Rabitti, ISTI-CNR, via G. Moruzzi 1, 56124 Pisa, Italy.

Hierarchical Video Summarization Based on Video Structure and Highlight. Yuliang Geng, De Xu, Songhe Feng, Institute of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044.

Interoperable Content-Based Access of Multimedia in Digital Libraries. John R. Smith, IBM T. J. Watson Research Center, 30 Saw Mill River Road, Hawthorne, NY 10532, USA.

MPEG-7: Multimedia Content Description Standard. A presentation on the objectives and components of MPEG-7.

Video Summarization Using R-Sequences. Real-Time Imaging 6, 449-459 (2000), doi:10.1006/rtim.1999.0197. Proposes a new method of temporal summarization of digital video.

Xedio (EVS). A modular application suite for the acquisition, production, and media management of end-to-end news and highlights, articulated around three main groups of tools.

Creating Book Trailers Using Photo Story 3. Photo Story 3 is a free program anyone can download; before starting, create a folder titled Book Trailer.

Segment Based Indexing. Procedia Computer Science 87 (2016), 12-17, 4th International Conference on Recent Trends in Computer Science & Engineering. www.sciencedirect.com

On Video SNR Scalability. Lisimachos P. Kondi, Faisal Ishtiaq, Aggelos K. Katsaggelos, Dept. of Electrical and Computer Engineering, Northwestern University, 2145 Sheridan Road, Evanston, IL 60208.

Adaptive Streaming: Improve Retention for Live Content. Adaptive streaming technologies make multiple video streams available to the end viewer; true adaptive bitrate dynamically switches between qualities.

Unit 6, Multimedia Element: Animation. Introduction to Multimedia, 2017-18 Semester 1. Covers animation guidelines, flipbooks, sampling and playback rates, cel animation, frame-based and path-based animation.

Video Searching and Browsing Using ViewFinder. Dan E. Albertson, Javed Mostafa, John Fieber, Information Science.

Content-Based Representative Frame Extraction for Digital Video. Xinding Sun, Mohan S. Kankanhalli, Yongwei Zhu, Jiankang Wu, Institute of Systems Science; Real World Computing Partnership, Japan.

MpegRepair Software Encoding and Repair Utility. PixelTools. Integrates encoding, analysis, decoding, demuxing, transcoding, and stream manipulation into one application.

How to Add Video Effects. Effects can add creative flair to a movie, fix exposure or color problems, edit sound, or manipulate images; Adobe Premiere Elements comes with preset effects.

Standardized Multimedia Elements in HTML5: Position Paper for the W3C Video on the Web Workshop. Kevin Calhoun, Eric Carlson, Adele Peterson, Antti Koivisto, Apple Inc., November 2007.

IST MPEG-4 Video Compliant Framework. João Valentim, Paulo Nunes, Fernando Pereira, Instituto de Telecomunicações, Instituto Superior Técnico, Av. Rovisco Pais, 1049-001 Lisboa, Portugal.

Motion in 2D Image Sequences. Motion analysis is used in object detection and tracking, navigation and obstacle avoidance, analysis of actions or activities, and segmentation and understanding of video sequences.

Chapter 3: Shot Detection and Key Frame Extraction.

Browsing a Video with Simple Constrained Queries over Fuzzy Annotations. M. Detyniecki, Proceedings of the International Conference on Flexible Query Answering Systems (FQAS 2000), Warsaw, Poland.

COALA: Content-Oriented Audiovisual Library Access. Nastaran Fatemi, Swiss Federal Institute of Technology (EPFL), Nastaran.Fatemi@epfl.ch; Omar Abou Khaled, University of Applied Sciences.

Columbia University High-Level Feature Detection: Parts-Based Concept Detectors. Dong-Qing Zhang, Shih-Fu Chang, Winston Hsu, Lexin Xie, Eric Zavesky, Digital Video and Multimedia Lab. TRECVID 2005 Workshop.

MPEG-4. Wolfgang Leister, Knut Holmqvist, INF5081 Multimedia Coding and Applications, spring semester 2007, Ifi, UiO. MPEG-4 (ISO/IEC 14496) is more than a new audio/video codec.

The Físchlár Digital Video Recording, Analysis and Browsing System. Hyowon Lee, Alan F. Smeaton, Colin O'Toole, School of Computer Applications, Dublin City University, Glasnevin, Dublin 9, Ireland.

Welcome Back to Fundamentals of Multimedia (MR412), Fall 2012. Zhu Yongxin (Winson), zhuyongxin@sjtu.edu.cn. Content-based retrieval in digital libraries.

Video Annotation. Video search requires efficient annotation of video content; to some extent this can be done automatically. Market trends: broadband doubling over the next 3-5 years, video-enabled devices emerging rapidly, a mass internet audience, and mainstream media moving to the Web.


Multi-Search of Video Segments Indexed by Time-Aligned Annotations of Video Content. Anni Coden, Norman Haas, Robert Mack, IBM Research Report RC21444 (96156), 18 November 1998.

New Media Production, Week 3: Multimedia. ponpong@gmail.com. Multimedia = multi (many, multiple) + media (distribution tool and information presentation: text, graphics, voice).

What You See Is (Almost) What You Hear: Design Principles for User Interfaces for Accessing Speech Archives. 5th International Conference on Spoken Language Processing (ICSLP 98), Sydney, Australia, November 30 to December 4, 1998. ISCA Archive, http://www.isca-speech.org/archive

A MPEG-4/7 Based Internet Video and Still Image Browsing System. Miroslaw Bober, Mitsubishi Electric Information Technology Center Europe VIL, Guildford, Surrey; Kohtaro Asai; Ajay Divakaran.

Chapter 15: Data Compression. Data compression implies sending or storing a smaller number of bits; the many methods used for this purpose fall into two broad categories, beginning with lossless compression (Section 15-1).

Text Extraction in Video. Ankur Srivastava, Dhananjay Kumar, Om Prakash Gupta, Amit Maurya, Sanjay Kumar Srivastava. International Journal of Computational Engineering Research, Vol. 03, Issue 5.

Appendix B: Producing for Multimedia and the Web. In addition to enabling regular music production, SONAR includes features to help create music for multimedia.

Video Summarization. Ben Wing, CS 395T, Spring 2008, April 11, 2008. Video summarization methods attempt to abstract the main occurrences, scenes, or objects in a clip.