CIMWOS: A MULTIMEDIA ARCHIVING AND INDEXING SYSTEM


Nick Hatzigeorgiu, Nikolaos Sidiropoulos and Harris Papageorgiou
Institute for Language and Speech Processing
Epidavrou & Artemidos 6, Maroussi, Greece
nikos@xanthi.ilsp.gr, nsidir@xanthi.ilsp.gr, xaris@ilsp.gr

Abstract

This paper describes a multimedia, multilingual and multimodal research system called CIMWOS (Combined IMage and WOrd Spotting). CIMWOS incorporates an extensive set of multimedia technologies, integrating three major subsystems (text, speech, and image processing). It produces a rich collection of XML metadata annotations following the MPEG-7 standard. These XML annotations are further merged and loaded into the CIMWOS Multimedia Database. An ergonomic and user-friendly Web-based interface allows the user to efficiently retrieve video segments by a combination of media description, content metadata and natural language text. Currently the system is under evaluation and contains broadcast news and documentaries in three languages (English, Greek, and French), while the open architecture allows for more languages to be incorporated in the future.

Key Words: Multimedia Databases, Electronic Libraries, User Interfaces, Applications

1. Introduction

There are several projects aiming at developing advanced technologies and systems to tackle the problems encountered in multimedia archiving and indexing [1-3]. CIMWOS (Combined IMage and WOrd Spotting) [4] incorporates an extensive set of multimedia technologies, combining the areas of expertise of eight participating institutions with the goal of creating an integrated system that can retrieve information from video sources. CIMWOS supports content-based indexing, archiving, retrieval and on-demand delivery of audiovisual content. The system uses state-of-the-art algorithms for text, speech and image processing. The audio processing operations employ robust continuous speech recognition, speech/non-speech classification, speaker clustering and speaker identification. Text processing tools operate on the text stream produced by the speech recognizer and perform named entity detection, term recognition, topic detection, and story segmentation. Image processing includes video segmentation and key frame extraction, face detection and face identification, object and scene recognition, video text detection and character recognition. All outputs converge to a textual XML metadata annotation scheme following the MPEG-7 standard. These XML annotations are further merged and loaded into the CIMWOS Multimedia Database. Additionally, they can be dynamically transformed into RDF and Topic Maps documents via XSL stylesheets for interchanging semantic-based information. The CIMWOS retrieval engine is based on a weighted Boolean model with intelligent indexing components. An ergonomic and user-friendly Web-based interface allows the user to efficiently retrieve video segments by a combination of media description, content metadata and natural language text. The database is a large collection of broadcast news and documentaries in three languages (English, Greek, and French), while the open architecture allows for more languages to be incorporated in the future. The system is currently under evaluation.

Figure 1: CIMWOS modules

2. System modules

Each video is processed by the various CIMWOS information extraction modules (Fig. 1). These modules can be grouped in three main categories that comprise the speech, text and image processing subsystems.
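To make this grouping concrete, the following minimal sketch shows how a per-video run of the three subsystems might be orchestrated. It is a sketch only: the class names, the `process()` interfaces and the placeholder XML strings are assumptions for illustration and do not correspond to the actual CIMWOS module APIs.

```python
# Hypothetical orchestration of the three CIMWOS subsystems for one video.
# Class names, process() interfaces and placeholder XML are assumptions.
from dataclasses import dataclass
from typing import List


@dataclass
class Annotation:
    module: str    # e.g. "ASR" or "FaceDetection"
    modality: str  # "audio", "text" or "image"
    xml: str       # MPEG-7 compliant XML fragment


class SpeechSubsystem:
    def process(self, video_path: str) -> List[Annotation]:
        # speaker change detection, ASR, speaker identification, clustering
        return [Annotation("ASR", "audio", "<Mpeg7>...</Mpeg7>")]


class TextSubsystem:
    def process(self, transcript_xml: str) -> List[Annotation]:
        # named entities, terms, stories and topics over the ASR transcript
        return [Annotation("NED", "text", "<Mpeg7>...</Mpeg7>")]


class ImageSubsystem:
    def process(self, video_path: str) -> List[Annotation]:
        # shots, key frames, faces, objects, video text
        return [Annotation("FaceDetection", "image", "<Mpeg7>...</Mpeg7>")]


def process_video(video_path: str) -> List[Annotation]:
    """Run the three subsystems and collect their XML annotations for merging."""
    speech = SpeechSubsystem().process(video_path)
    text = TextSubsystem().process(speech[0].xml)  # text tools consume the transcript
    image = ImageSubsystem().process(video_path)
    return speech + text + image
```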

2.1. Speech Processing Subsystem

Broadcast news exhibits a wide variety of audio characteristics, including clean speech, telephone speech, conference speech, music, and speech corrupted by music or noise. Transcribing the audio (i.e., producing a raw transcript of what is being said), determining who is speaking when, what topic a segment is about, or which organizations are mentioned are all challenging problems. Adverse background conditions can lead to significant degradation in performance. Consequently, adaptation to the varied acoustic properties of the signal or to a particular speaker, and enhancements to the segmentation process, are generally acknowledged to be key areas for research and improvement to render indexing systems usable. This is reflected in the effort dedicated to advancing the current state of the art in these areas. In CIMWOS, these tasks are handled by the speech processing subsystem [5], which comprises speaker change detection (SCD), automatic speech recognition (ASR), speaker identification (SID) and speaker clustering (SC). The ASR engine is a real-time, large vocabulary, speaker-independent, gender-independent, continuous speech recognizer trained on a wide variety of noise conditions encountered in the broadcast news domain.

2.2. Text Processing Subsystem

Text processing tools, operating on the text stream produced by the speech processing subsystem, perform detection, identification and classification tasks.

Named Entity Detection

The NED module [6] identifies all named locations, persons and organizations, as well as dates, percentages and monetary amounts in video transcriptions. CIMWOS uses a series of basic language technology building blocks, which are modular and combined in a pipeline. An initial finite-state preprocessor performs tokenization and sentence boundary identification on the output of the speech recognizer. A part-of-speech tagger trained on a manually annotated corpus and a lexicon-based lemmatizer carry out morphological analysis and lemmatization. A lookup module matches name lists and trigger words against the text, and, finally, a finite-state parser recognizes named entities on the basis of a pattern grammar. Training the NED module consists of populating the gazetteer lists and semi-automatically extracting the pattern rules. A corpus of words per language, already tagged with the NED classes, was used to guide system training and development.

Term Recognition

The term recognizer identifies possible single-word or multiword terms, using both linguistic and statistical modeling. Linguistic processing is based on an augmented term grammar, the results of which are statistically filtered using frequency-based scores.

Story Detection & Topic Classification

Story detection (SD) and topic classification (TC) use a set of models trained on an annotated corpus of stories and their associated topics. The basis of SD and TC is a generative, mixture-based HMM with one state per topic and one state modeling general language, i.e., words not specific to any topic. After emitting a single word, the model re-enters the beginning state and the next word is generated. At the end of the story, a final state is reached. In SD, a sliding window of fixed size is used to track changes in topic-specific words, resulting in a set of stable regions in which topics change only slightly or not at all. TC then classifies the text sections according to the set of topic models.
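As a rough illustration of the sliding-window step just described, the sketch below labels each window with its dominant topic from simple cue-word counts and marks a story boundary wherever that label changes. The cue-word lists, window size and scoring rule are simplified assumptions; CIMWOS uses the generative, mixture-based HMM described above rather than raw counts.

```python
# Simplified sliding-window story detection over an ASR transcript.
# Topic cue-word lists, window size and scoring are illustrative assumptions;
# CIMWOS relies on a generative, mixture-based HMM instead.
from collections import Counter
from typing import Dict, List, Set

TOPIC_WORDS: Dict[str, Set[str]] = {
    "politics": {"election", "minister", "parliament", "vote"},
    "sports":   {"match", "goal", "team", "championship"},
}


def dominant_topic(window: List[str]) -> str:
    """Return the topic whose cue words occur most often in the window."""
    counts = Counter()
    for topic, cues in TOPIC_WORDS.items():
        counts[topic] = sum(1 for w in window if w in cues)
    topic, hits = counts.most_common(1)[0]
    return topic if hits > 0 else "general"


def story_boundaries(words: List[str], window: int = 50, step: int = 25) -> List[int]:
    """Word offsets where the dominant topic of the sliding window changes."""
    boundaries, previous = [], None
    for start in range(0, max(len(words) - window, 1), step):
        topic = dominant_topic(words[start:start + window])
        if previous is not None and topic != previous:
            boundaries.append(start)
        previous = topic
    return boundaries
```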
The modeled inventory of topics is a flat, Reuters-derived structure containing a few main categories and several sub-categories; the structure is shallow enough to provide reasonable amounts of training data per topic, but still fine-grained enough to allow for flexibility and detail in queries. All technologies used are inherently language-independent and of a statistical nature.

2.3. Image Processing Subsystem

A video sequence consists of many individual images called frames, which are generally considered the smallest unit of concern when segmenting a video. An uninterrupted video stream generated by one camera is called a shot. The image processing subsystem consists of video segmentation and key frame extraction; face detection and face identification, spotting persons at any location, scale and orientation; object recognition, marking an object in a 'recognition view' given knowledge accumulated from a set of previously seen 'learning views'; and video text recognition, detecting and recognizing captions in images and video. This work has been reported in [7-10].

3. Integration Architecture

All processing modules in the corresponding three modalities (audio, image and text) converge to a textual XML metadata annotation scheme [11] following the MPEG-7 descriptors. These XML metadata annotations are further processed, merged and loaded into the CIMWOS Multimedia Database. The merging component of CIMWOS is responsible for the amalgamation of the various XML annotations and the creation of one self-contained object that is compliant with the CIMWOS Database. Additionally, the resulting object can be dynamically transformed into RDF and Topic Maps documents via XSL stylesheets for interchanging semantic-based information.
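A minimal sketch of this merging and transformation step is given below, assuming the per-module MPEG-7 annotations are available as XML files on disk. The directory layout, the wrapper element name and the stylesheet file are hypothetical, and lxml is used here only for illustration; it is not implied that the CIMWOS merging component is implemented this way.

```python
# Merge per-module MPEG-7 annotation files into one document and, optionally,
# transform it with an XSL stylesheet (e.g. towards RDF).  File names, the
# wrapper element and the stylesheet are assumptions.
from pathlib import Path
from lxml import etree


def merge_annotations(xml_paths, merged_path="merged_mpeg7.xml"):
    """Wrap the root elements of all annotation files under one root element."""
    merged = etree.Element("CimwosAnnotations")  # hypothetical wrapper element
    for path in xml_paths:
        merged.append(etree.parse(str(path)).getroot())
    etree.ElementTree(merged).write(merged_path, xml_declaration=True,
                                    encoding="UTF-8", pretty_print=True)
    return merged_path


def transform(merged_path, stylesheet="mpeg7_to_rdf.xsl"):
    """Apply an XSLT stylesheet to the merged document (e.g. MPEG-7 to RDF)."""
    xslt = etree.XSLT(etree.parse(stylesheet))
    return str(xslt(etree.parse(merged_path)))


if __name__ == "__main__":
    xml_files = sorted(Path("annotations").glob("*.xml"))  # hypothetical layout
    print(transform(merge_annotations(xml_files)))
```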

The CIMWOS Database is a relational database that contains all the information of each of the XML metadata files, as well as all the relevant indexes needed to facilitate search and retrieval of information based on user-specified search terms.

4. Retrieval and Presentation

Because of its size, a video clip can take a long time to move from one location to another, such as from the digital video library to the user. Likewise, if a library consists of only 30-minute clips, when users check one out, it may take 30 minutes to determine whether the clip meets their needs. To make the retrieval and viewing of information faster, the digital video library needs to support partitioning video into small-sized clips. In CIMWOS, the retrieval engine is based on a weighted Boolean model equipped with intelligent indexing components. The basic retrieval unit is the passage. The passage, which plays the role of a document in a standard retrieval system, is indexed on a set of textual features: words, terms, named entities, speakers and topics. Each passage is linked to one or more shots, and each shot is indexed on another set of textual features: faces, objects and video text (caption or scene). Via the linking of passages and shots, each passage is finally assigned a broader set (union) of features that is used for retrieval. All the features are textual; no image or audio conceptual features (e.g., motion parameters or similarity between histograms) are supported. Each passage is represented as a set of features, and retrieval of passages is performed by computing similarity in the feature space. The results are ranked on the basis of the computed similarity values. The retrieval procedure is the following: first, passages are filtered on the basis of the advanced features in order to reduce the search space of the free query; if no such features have been selected, the search space remains unchanged. Next, a Boolean matching operation takes place in order to compute the similarity between the query and each passage of the possibly reduced space. Finally, passages are ranked based on this similarity. (A schematic sketch of this procedure is given further below.) The result set can be visualized by summarizing the relevant passages. While skimming the result set, the end user can select a passage and view its associated metadata, read a summary containing the first words of the transcribed speech together with a sequence of thumbnails, or play the passage using intelligent streaming.

5. The CIMWOS User Interface

The CIMWOS user interface is a Web interface based on well-known Web technologies such as HTML and ASP. The user views standard HTML Web pages and does not need to download specific software. The user can browse the CIMWOS database or submit queries. The system processes user-submitted queries using both ASP code and DLL modules, formulates an appropriate SQL query, retrieves the matching information from the CIMWOS multimedia database and formats the results according to user selections.

Table 1: Selection steps for the Browse Interface
  Language Selection: English, French, Greek, All Lang.
  Output Format: HTML, XML
  Categories: Creators, Faces, Locations, Named Entities, Objects, Video Text, Speakers, Terms, Topics

Figure 2: Time search form
Figure 3: Standard search form
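The following sketch is a concrete reading of the three-step retrieval procedure of Section 4 (filter on advanced features, Boolean matching, ranking). The passage representation and the per-feature weights are assumptions chosen for illustration, not the actual CIMWOS index or weighting scheme.

```python
# Illustrative weighted-Boolean passage retrieval following the three steps of
# Section 4.  The passage layout and the feature weights are assumptions.
from dataclasses import dataclass
from typing import Dict, List, Optional, Set

WEIGHTS = {"word": 1.0, "term": 2.0, "named_entity": 2.5, "speaker": 3.0,
           "topic": 3.0, "face": 3.0, "object": 2.5, "video_text": 2.0}


@dataclass
class Passage:
    ident: str
    # feature type -> values; the union of passage and linked-shot features
    features: Dict[str, Set[str]]


def retrieve(passages: List[Passage], query_terms: Set[str],
             required: Optional[Dict[str, Set[str]]] = None) -> List[Passage]:
    # Step 1: filter on advanced features (e.g. a required topic or speaker);
    # if none were selected, the search space is left unchanged.
    if required:
        passages = [p for p in passages
                    if all(v <= p.features.get(k, set()) for k, v in required.items())]

    # Step 2: weighted Boolean matching of the free-text query terms.
    def score(p: Passage) -> float:
        return sum(WEIGHTS.get(kind, 1.0)
                   for kind, values in p.features.items()
                   for value in values if value.lower() in query_terms)

    # Step 3: rank by similarity, discarding passages that match nothing.
    scored = [(score(p), p) for p in passages]
    return [p for s, p in sorted(scored, key=lambda sp: -sp[0]) if s > 0]
```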

Figure 4: Advanced Search
Figure 5: Search results

The Web interface is accessible through the following address:
Some of the video material included in the digital library has copyright restrictions (e.g., for research purposes only), so anonymous access is not yet available; however, a limited-time password can be obtained from the webmaster or the authors of this paper. The interface offers four different modes of access: Browse, Time Search, Standard Search and Advanced Search. The Browse mode permits the user to look through the videos using three selection steps (Language Selection, Output Format, Categories) with the choices of Table 1. Time Search (Fig. 2) allows the user to choose a specific video from the database and then view a particular video segment. Standard Search (Fig. 3) is a simplified search interface offering a subset of the advanced search capabilities.

Figure 6: Search results in XML

Advanced Search (Fig. 4) allows the user to access the full power of the search engine. It permits any Boolean combination of multiple search criteria, which include text (extracted by the speech processing subsystem), named entities, topics, speakers, terms, faces, objects, locations and video text. The user can also choose the output format for the results (HTML or XML), whether the ranking engine will be applied to the results, and the number of thumbnails that will be returned for each result. The results from any kind of search can be viewed as a Web page with a list of video segments that satisfy the search criteria (Fig. 5). In this list the user can view the video title, its duration and its format (currently streaming video in MS WMP format is the only available format), as well as the ranking for each result. Alternatively, the user can request the results to be presented in XML format; in that case, the user receives an XML document (Fig. 6) with a limit of 10 results per query.

Figure 7: Details of a video segment

For each of the resulting segments, it is possible to view the full details of the available metadata (Fig. 7), including information on the parent video and the transcription of speech extracted from the segment. Finally, the user can also view the video segment itself together with the transcription (Fig. 8), as well as adjacent video segments.

Figure 8: Viewing a video segment
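To give a flavour of how an Advanced Search submission might be turned into a query over the relational database, the sketch below assembles a parameterized SQL statement from the selected criteria. The table and column names are hypothetical, and the production system performs this step in ASP code and DLL modules rather than in Python.

```python
# Hypothetical mapping of an Advanced Search form onto a parameterized SQL
# query.  Table and column names are assumptions; CIMWOS builds its queries
# inside ASP code and DLL modules.
from typing import List, Tuple

FIELD_TO_COLUMN = {"text": "p.transcript", "named_entity": "p.named_entities",
                   "topic": "p.topics", "speaker": "p.speakers", "term": "p.terms",
                   "face": "s.faces", "object": "s.objects", "video_text": "s.video_text"}


def build_query(criteria: List[Tuple[str, str]], operator: str = "AND",
                max_results: int = 10) -> Tuple[str, List[str]]:
    """criteria is a list of (field, value) pairs combined with AND or OR."""
    clauses, params = [], []
    for fieldname, value in criteria:
        clauses.append(f"{FIELD_TO_COLUMN[fieldname]} LIKE ?")
        params.append(f"%{value}%")
    where = f" {operator} ".join(clauses) or "1=1"
    sql = (f"SELECT TOP {max_results} p.passage_id, p.video_id, p.start_time, p.end_time "
           f"FROM passages p JOIN shots s ON s.passage_id = p.passage_id "
           f"WHERE {where}")
    return sql, params


# Example: passages mentioning 'Athens' spoken by a given (hypothetical) speaker.
sql, params = build_query([("text", "Athens"), ("speaker", "news anchor")])
```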

6. Conclusions

CIMWOS has developed an integrated environment for accessing multimedia digital content, providing advanced searching and browsing capabilities. Using the results of research efforts on emerging technologies for multimedia data processing, indexing and retrieval, and embedding them in a video library system that covers broadcast news and documentaries in English, French and Greek, has produced a system that can index and retrieve video segments based on a variety of search criteria. The CIMWOS Digital Video Library uses intelligent, automatic mechanisms that provide full-content search and retrieval from a large (scaling to several hours) online digital video library. The tools that have been exploited during the project can automatically populate the library and support access to it. The adopted approach uses combined speech, language and image understanding technology to automatically transcribe, segment and index the video. Retrieval is text-based, performed on the material that results from image analysis, speech recognition and natural language processing of the transcribed text. The information produced by the speech, image and text components is suitable for intelligent indexing and retrieval, so that a user can find a video segment of interest within the digital library. The user interface for the resulting indexed multimedia database is a Web-form-based interface. It is a user-friendly interface that offers a number of different modes of access to the data contained in the library. The results of a search are video segments, and the user can view them using streaming video, or save the detailed information for each segment in XML format.

7. Acknowledgements

This material is based on work supported by cost-reimbursement contract number IST for research and technological development projects from the Commission of the European Communities, Directorate-General Information Society.

8. References

[1] Wactlar, H., Olligschlaeger, A., Hauptmann, A., Christel, M., Complementary Video and Audio Analysis for Broadcast News Archives, Communications of the ACM, 43(2), February 2000.
[2] Lyu, M.R., Yau, E., Sze, S., Video and Multimedia Digital Libraries: A Multilingual, Multimodal Digital Video Library System, Proc. 2nd ACM/IEEE-CS Joint Conference on Digital Libraries, July 2002.
[3] Sankar, A., Gadde, R.R., Weng, F., SRI's Broadcast News System: Toward Faster, Smaller and Better Speech Recognition, Proc. DARPA Broadcast News Workshop, 1999.
[4] EU IST project CIMWOS ( /cimwos/).
[5] Kubala, F., Davenport, J., et al., The 1997 BBN BYBLOS System Applied to Broadcast News Transcription, Proc. DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, Feb.
[6] Demiros, I., Boutsis, S., Giouli, V., Liakata, M., Papageorgiou, H., Piperidis, S., Named Entity Recognition in Greek Texts, Proc. 2nd International Conference on Language Resources and Evaluation (LREC 2000), Athens, 2000.
[7] Cardinaux, F., Marcel, S., Face Verification Using MLP and SVM, XI Journees NeuroSciences et Sciences pour l'Ingenieur (NSI 2002), 2002.
[8] Tuytelaars, T., Zaatri, A., Van Gool, L., Van Brussel, H., Automatic Object Recognition as Part of an Integrated Supervisory Control System, Proc. IEEE Conference on Robotics and Automation (ICRA 2000), 2000.
[9] Chen, D., Bourlard, H., Thiran, J.-Ph., Text Identification in Complex Background Using SVM, Proc. International Conference on Computer Vision and Pattern Recognition, Dec. 2001.
[10] Odobez, J.-M., Chen, D., Robust Video Text Segmentation and Recognition with Multiple Hypotheses, Proc. ICIP, Sep. 2002.
[11] Papageorgiou, H., Prokopidis, P., Demiros, I., Giouli, V., Constantinidis, A., Piperidis, S., Multi-level XML-based Corpus Annotation, Proc. 3rd LREC Conference, Las Palmas, 2002.
