Text, Speech, and Vision for Video Segmentation: The Informedia™ Project

Alexander G. Hauptmann, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
Michael A. Smith, Dept. of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA

Abstract

We describe three technologies involved in creating a digital video library suitable for full-content search and retrieval. Image processing analyzes scenes, speech processing transcribes the audio signal, and natural language processing determines word relevance. The integration of these technologies enables us to include vast amounts of video data in the library.

1 Introduction

The Informedia Digital Video Library Project at Carnegie Mellon University is creating a digital library of text, images, videos and audio data available for full content retrieval [Stevens94][Christel94]. The initial testbed will be installed in several K-12 schools and students will use the system to explore multi-media data for educational purposes. The Informedia system for video libraries goes far beyond the current paradigm of video-on-demand by retrieving a short video paragraph in response to a user's query. The project can be divided into two phases: library creation and library exploration (see Figure 1).

1.1 Library creation

The Informedia project is creating intelligent, automatic mechanisms for populating a video library and allowing for its full-content and knowledge-based search and segment retrieval. The material is obtained from the video assets of WQED/Pittsburgh as well as British Open University video courses. The project uses the Sphinx-II speech recognition system to transcribe and align narratives and dialogues automatically. The resulting transcript is then processed through methods of natural language understanding to extract subjective descriptions and mark relevant key words. Acoustic signal analysis identifies potential segment boundaries of paragraph size.
Within a paragraph, scenes are isolated and clustered into video segments through the use of various image understanding techniques. These components are described in Figure 2.

1.2 Library exploration

Users are able to explore the Informedia library through an interface that allows them to search using typed or spoken natural language queries, select relevant documents retrieved from the library and display the material on their PC workstations. The library retrieval system can effectively process spoken queries and deliver relevant video data in a compact format, based on information embedded with the video during library creation. Video and other data may be explored in depth for related content. During retrieval based on keyword searches by a user, only the relevant video segments are displayed. Prototype exploration systems have been implemented on both Macintosh and PC platforms.

[Figure 1: Overview of the Informedia Digital Video Library System, showing offline library creation (speech and language interpretation and indexing; video segmentation and description) and online library exploration (interactive video search with visual, spoken, and natural language queries).]

In this paper we will focus on the library creation aspect of the Informedia Project. In particular, we will describe how to segment a video meaningfully using the integration of different technologies. Through the combined efforts of Carnegie Mellon's speech, image and natural language processing groups, this system provides a robust tool for segmenting many types of video data in order to utilize them within a digital video library.

[Figure 2: Combined technology to select a representative frame (icon). Panels: video paragraph (speech SNR); speech transcript (keywords); scene isolation (image analysis); representative frame icon (keyword search).]

2 Video Segmentation

Generally, the videos in the Informedia Library are full one-hour feature broadcast videos based on educational documentaries. To allow efficient access to the relevant content of the videos, we need to separate them into small pieces. To answer a user query by showing an hour-long video is rarely a reasonable response. The Informedia library creation phase uses three different levels of segmentation for a video. The first and generally largest segment is a video paragraph, which consists of a series of related scenes with a common content. The second level of segmentation identifies a single scene of the video within the video paragraph. Finally, within a single scene we also need to be able to select a representative frame icon for static displays.

2.1 Video paragraphs

When a user receives the response to a query, the system needs to determine how much content and context to display. Where should the video clip start and where does it end? The answer to this is partly determined by the content of the user query. But the answer is also dependent on the natural segments within the video, which we call video paragraphs. In the ideal case, a video paragraph starts at the natural boundary of the relevant content and ends wherever the video moves to a different context.
2.2 Individual scenes

Segment breaks produced by image processing are examined along with the boundaries identified by speech and natural language processing of the transcript, and an improved set of segment boundaries is heuristically derived to partition the video paragraphs into scenes. All frames from each new scene will be used to select the frame icon. This technique will allow for the inclusion of all relevant image information in the video and the elimination of redundant data.
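The text does not spell out the heuristic used to combine the two sets of boundaries. As a minimal sketch of one plausible rule, the code below moves each speech or text boundary to the nearest image-derived scene break when one lies within a tolerance, and otherwise keeps the boundary where it is. The function names, the use of seconds as the time unit, and the `max_shift` tolerance are assumptions made for illustration, not part of the Informedia system.

```python
from bisect import bisect_left


def snap_to_nearest(boundary, scene_breaks):
    """Return the entry of the sorted list `scene_breaks` closest to `boundary`."""
    i = bisect_left(scene_breaks, boundary)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = scene_breaks[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda b: abs(b - boundary))


def refine_boundaries(text_boundaries, scene_breaks, max_shift=2.0):
    """Snap speech/text boundaries to nearby scene breaks (hypothetical rule).

    A boundary is moved only if a scene break lies within `max_shift`
    seconds; the tolerance is an assumed parameter, not a published value.
    """
    scene_breaks = sorted(scene_breaks)
    refined = set()
    for t in text_boundaries:
        if scene_breaks:
            nearest = snap_to_nearest(t, scene_breaks)
            refined.add(nearest if abs(nearest - t) <= max_shift else t)
        else:
            refined.add(t)
    return sorted(refined)


if __name__ == "__main__":
    print(refine_boundaries([9.2, 21.0, 50.0], [10.0, 20.0, 30.0]))
```

Keeping unmatched boundaries rather than discarding them is a design choice of this sketch: it avoids losing a transcript-derived break in stretches where image analysis finds no scene change.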

2.3 Frame icons

For the purposes of static displays, the most characteristic frame of a scene is included in static (non-animated) representations of the user's selection. A single frame is displayed as representative for the whole video segment. This is used in an outlined display showing the results of a user query. Showing frame icons allows the user to simultaneously look at a static representation of multiple video paragraphs and to obtain some information about their content and possible relevance to the user's query, before selecting any one paragraph for playback. Frame icons are also important as encapsulations of the video paragraph for printed reports and viewgraphs. In order to create these various levels of segmentation, we integrate a number of different technologies which will be described in the next section.

3 Component Technologies

There are 3 broad categories of technologies we can bring to bear on the problem of identifying video segments from broadcast video materials.

a. Text processing looks at the textual (ASCII) representation of the words that were spoken, as well as other annotations derived from the transcript, the production notes or the closed-captioning that may be available.
b. Speech signal analysis provides the basis for analyzing the audio component of the material.
c. Image analysis looks at the images in the video-only portion.

Currently in the library creation phase of the Informedia Digital Video Library the following specific approaches are used to create segmentation information.

3.1 Text Analysis

Text analysis can work on an existing ASCII transcript to help segment the text into paragraphs. An analysis of keyword prominence allows us to identify important sections in the transcript [Mauldin89]. Other more sophisticated language-based criteria are under investigation. The notion of semantic connections between text portions might be exploited for segmentation as well. Currently we use two main techniques in natural language analysis.

a. If we have a complete time-aligned transcript available from closed-captioning or through a human-generated transcription, we can exploit natural structural text markers such as punctuation to identify segments of video paragraph granularity.
b. To identify and rank the contents of the various segments, we use the well-known technique of TF/IDF (term frequency/inverse document frequency) to identify critical keywords and their relative importance for the video document [Salton83].

3.2 Speech Analysis

Speech analysis operates only on the audio portion of the video. Using speech recognition we can obtain a transcript, although it may contain errors. We can also detect transitions between speakers and topics, which are usually marked by silence or low energy areas in the acoustic signal.

Recognition

To transcribe the content of the video material, we recognize spoken words with the Sphinx-II speech recognizer. The CMU Sphinx-II system uses semi-continuous Hidden Markov Models to model context-dependent phones (triphones), including between-word context [Hwang94]. The recognizer processes an utterance in 3 steps. It first makes a forward time-synchronous pass using full between-word models, Viterbi scoring and a trigram language model. This produces a word lattice where words may have only one begin time but several end times. The recognizer then makes a backward pass which uses the end times from the words in the first pass and produces a second lattice which contains multiple begin times for words. Finally, an A* algorithm is used to generate the best hypothesis from these two lattices.

The language model consists of words (with probabilities) and bigrams/trigrams, which are word pairs/triplets with conditional probabilities for the last word given the previous word(s). The language model was constructed from a corpus of news stories from the Wall Street Journal from 1989 to 1994 and the Associated Press news service stories from 1988 to 199. Only trigrams that were encountered more than once were included in the model, but all bigrams and the most frequent 588 words in the corpus were included [Rudnicky95].
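The pruning rule for the language model (keep a trigram only if it occurs more than once, keep every bigram, and restrict the vocabulary to the most frequent words) can be illustrated with a few lines of counting code. This is a rough sketch over a toy token list, not the Sphinx-II training procedure: the function names are hypothetical, and the probability is a plain maximum-likelihood ratio with no smoothing or back-off, which a real recognizer would need.

```python
from collections import Counter


def count_ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def build_pruned_model(tokens, vocab_size, min_trigram_count=2):
    """Build (vocab, bigram counts, pruned trigram counts).

    Trigrams seen fewer than `min_trigram_count` times are dropped;
    all bigrams are kept; the vocabulary is the `vocab_size` most
    frequent words.
    """
    vocab = {w for w, _ in Counter(tokens).most_common(vocab_size)}
    bigrams = count_ngrams(tokens, 2)
    trigrams = {g: c for g, c in count_ngrams(tokens, 3).items()
                if c >= min_trigram_count}
    return vocab, bigrams, trigrams


def trigram_probability(model, w1, w2, w3):
    """Unsmoothed estimate of P(w3 | w1, w2); 0.0 if the trigram was pruned."""
    _, bigrams, trigrams = model
    history = bigrams.get((w1, w2), 0)
    return trigrams.get((w1, w2, w3), 0) / history if history else 0.0


if __name__ == "__main__":
    toy = "the cat sat on the cat sat down".split()
    model = build_pruned_model(toy, vocab_size=3)
    print(trigram_probability(model, "the", "cat", "sat"))
```

In the toy corpus only the trigram "the cat sat" occurs twice, so it is the only trigram that survives pruning.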
Processing the video tape using the speech recognition system gives us a transcript. This transcript contains errors which, depending on the quality of the tape and the subject matter, currently range from 2% to 7% word error rate.

Acoustic Segmentation

To detect breaks between utterances we use a modification of Signal to Noise Ratio (SNR) techniques which compute signal power. This algorithm computes the power of digitized speech samples

\[ \mathrm{Power} = \log\left(\frac{1}{n}\sum_{i=1}^{n} S_i^2\right) \]

where S_i is a pre-emphasized sample of speech within a frame of 2 milliseconds. A low power level indicates that there is little active speech occurring in this frame (low energy). Segmentation breaks between utterances are set at the minimum power as averaged over a 1 second window. To prevent unusually long segments, we force the system to place at least one break within 3 seconds.

3.3 Image Analysis

Image analysis is primarily used for the identification of breaks between scenes and the identification of a single static frame icon that is representative of a scene.

Histogram Analysis

Video is segmented into scenes through the use of comparative difference measures [Zhang93]. Images with small histogram disparity are considered to be relatively equivalent. By detecting significant changes in the weighted color histogram of each successive frame, image sequences can be separated into individual scenes. A comparison between the cumulative distributions is used as a difference measure. The histogram difference plot is shown in the bottom graph of Figure 3. This result is passed through a high pass filter to further isolate peaks, and an empirical threshold is used to select only those regions where scene breaks occur. To make the analysis more robust, we examine individual images in tiled subwindows. This reduces noise in our difference data and compensates for motion between frames. The images are initially subsampled to provide an efficient means of computation. Using only histogram difference, we have achieved 9% accuracy on a test set of roughly 2, video images (2 hours).

[Figure 3: Scene segmentation and motion vector error. Panels: flow (motion vector confidence measure); histogram difference analysis. Horizontal axis: frames.]

Optical Flow

One important method of visual segmentation and description is based on interpreting camera motion [Akutsu94]. We can interpret camera motion as a pan or zoom by examining the geometric properties of the optical flow vectors. Using the Lucas-Kanade gradient descent method for optical flow, we can track individual regions from one frame to the next [Lucas81]. By measuring the velocity that individual regions show over time, a motion representation of the scene is created. Figure 4 shows examples of optical flow analysis for different types of camera motion. Drastic changes in this flow describe random motion, and therefore, new scenes. These changes will also occur during gradual transitions between images such as fades or special effects. Only regions of low ambiguity are selected for tracking. Trackable regions are found by searching the entire image for subwindows whose gradient derivatives exhibit relatively similar eigenvalues. In order to accurately track a region over large areas, a multiresolution structure is used. With this structure we can track regions across many pixels and reduce the time needed for computation. When optical flow is minimal, the frames are suitable for an iconic frame representation. Since we are primarily interested in distinguishing static frames from motion frames, it was sufficient to track only the top 3 regions.

[Figure 4: Camera motion analysis using optical flow. Flow vectors amplified for visibility.]

These techniques work well when scene changes are abrupt; however, camera motion and gradual changes can severely affect the accuracy of the system. The first graph in Figure 3 shows the optical flow error for a given sequence. When changes are gradual, we combine optical flow results with histogram analysis. This allows for segmentation under conditions that do not involve drastic changes in image content and detection accuracy as high as 95%.
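The core of the histogram analysis in this section, comparing the cumulative intensity distributions of successive frames and flagging a scene break where the difference exceeds a threshold, can be sketched in plain Python. This is a simplified illustration under stated assumptions: frames are flat lists of grayscale intensities rather than weighted color histograms, and the tiled subwindows, subsampling and high pass filter described above are omitted. The function names, the bin count and the threshold value are hypothetical.

```python
def cumulative_histogram(frame, bins=16, max_value=256):
    """Normalized cumulative histogram of a non-empty list of pixel intensities."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    cumulative, running = [], 0
    for count in counts:
        running += count
        cumulative.append(running / len(frame))
    return cumulative


def histogram_difference(frame_a, frame_b, bins=16):
    """Mean absolute difference between two cumulative histograms (0 to 1)."""
    ca = cumulative_histogram(frame_a, bins)
    cb = cumulative_histogram(frame_b, bins)
    return sum(abs(x - y) for x, y in zip(ca, cb)) / bins


def detect_scene_breaks(frames, threshold=0.2, bins=16):
    """Return indices i where frame i differs sharply from frame i - 1.

    The threshold is an assumed value; the text describes it only as empirical.
    """
    return [i for i in range(1, len(frames))
            if histogram_difference(frames[i - 1], frames[i], bins) > threshold]


if __name__ == "__main__":
    dark = [10] * 64      # toy 8x8 frame, all dark pixels
    bright = [240] * 64   # toy 8x8 frame, all bright pixels
    print(detect_scene_breaks([dark, dark, bright, bright, dark]))
```

A purely frame-to-frame threshold of this kind responds to abrupt cuts; as noted above, gradual transitions and camera motion are where the optical flow measure is combined with the histogram result.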

[Figure 5: Analysis of scene changes in the video and audio signal. Panels: histogram scene analysis; scenes; audio segments and text; audio signal (samples).]

4 Technology Synthesis

We now describe how we integrate the different component technologies. In our early work on the Informedia digital video library, all segmentation was done by hand. We have now moved to a procedure where segmentation boundaries are suggested by the system, but adjusted and verified by a person supervising the digital video library creation process. Eventually we will transition from computer-assisted procedures to fully automatic video segmentation, as the algorithms described above become better tested and more robust.

Our current library creation process starts with a raw digitized video tape. The audio portion is fed through the speech analysis routines, which produce a transcript of the spoken text. The speech signal is also analyzed for low energy sections that indicate acoustic paragraphs through silence. This is the first pass at segmentation. If a closed-caption transcript is available, we use that instead of the speech recognition output, since it contains fewer errors. The transcript is processed by the natural language system and important keywords are identified. Using the results returned from image analysis, we then match the acoustic paragraph to the nearest scene break. This gives us an appropriate video paragraph clip in response to a user's request. The keywords and their corresponding paragraph locations in the video are indexed in the Informedia library catalogue.

To obtain video clips suitable for viewers, we first search for keywords from the user query in the recognition transcript. When we find a match, the surrounding video paragraph is returned. For a static icon representative of a video clip, we place the most emphasis on image data.
The paragraph is determined by the transcript and keywords. Within the paragraph the most prominent keywords identify the most prominent scene. The scene boundaries are determined by image analysis: color histogram differences and optical flow analysis. Figure 5 shows the integration of technologies used by the system.

5 Conclusion

We are currently using these techniques to create digital video library collections suitable for full content retrieval. While some steps are not yet fully integrated, each one has been shown to work independently, and several of the techniques are fully integrated within the Informedia system. Another use of the combined technologies will be the development of the video skim [Smith95]. By only presenting significant regions, a short synopsis of the video paragraph can be used as a preview for the actual segment. The Informedia Project will establish an online digital video library consisting of over hours of video material. In order to be able to process this volume of data, practical, effective and efficient tools are essential. We have outlined a practical set of techniques for video segmentation that allows us to automatically process the volume of data required.

6 References

[Akutsu94] Akutsu, A. and Tonomura, Y. Video Tomography: An Efficient Method for Camerawork Extraction and Motion Analysis. Proc. ACM Multimedia 94, Oct. 15-2, 1994, San Francisco, CA, pp.
[Christel94] Christel, M., Stevens, S., and Wactlar, H. Informedia Digital Video Library. Proceedings of the Second ACM International Conference on Multimedia, Video Program. New York: ACM, October 1994, pp.
[Hwang94] Hwang, M., Rosenfeld, R., Thayer, E., Mosur, R., Chase, L., Weide, R., Huang, X., and Alleva, F. Improving Speech Recognition Performance via Phone-Dependent VQ Codebooks and Adaptive Language Models in SPHINX-II. ICASSP-94, vol. I, pp.
[Lucas81] Lucas, B.D. and Kanade, T. An Iterative Technique of Image Registration and Its Application to Stereo. Proc. 7th International Joint Conference on Artificial Intelligence, pp., August
[Mauldin89] Mauldin, M. Information Retrieval by Text Skimming. PhD Thesis, Carnegie Mellon University, August. Revised edition published as Conceptual Information Retrieval: A Case Study in Adaptive Partial Parsing, Kluwer Press, September
[Rudnicky95] Rudnicky, A. Language Modeling with Limited Domain Data. Proceedings of the 1995 ARPA Workshop on Spoken Language Technology, in press.
[Salton83] Salton, G. and McGill, M.J. Introduction to Modern Information Retrieval. McGraw-Hill, New York, McGraw-Hill Computer Science Series.
[Smith95] Smith, M. and Kanade, T. Video Skimming for Quick Browsing Based on Audio and Image Characterization. CS Technical Report, Carnegie Mellon University, Summer
[Stevens94] Stevens, S., Christel, M., and Wactlar, H. Informedia: Improving Access to Digital Video. Interactions 1 (October 1994), pp.
[Zhang93] Zhang, H., Kankanhalli, A., and Smoliar, S. Automatic Partitioning of Full-Motion Video. Multimedia Systems (1993) 1, pp.

Acknowledgment

The authors would like to thank Howard Wactlar and the other members of the Informedia Project for their valuable discussions and contributions. This work is partially funded by the National Science Foundation, the National Aeronautics and Space Administration, and the Advanced Research Projects Agency.
