COALA: CONTENT-ORIENTED AUDIOVISUAL LIBRARY ACCESS

NASTARAN FATEMI
Swiss Federal Institute of Technology (EPFL)
E-mail: Nastaran.Fatemi@epfl.ch

OMAR ABOU KHALED
University of Applied Sciences of Western Switzerland (EIA-FR)
E-mail: Omar.Aboukhaled@eif.ch

This paper describes the COALA (Content-Oriented Audiovisual Library Access) project, which aims to offer a framework for the retrieval of audiovisual TV news documents. COALA's principal design features are based on the analysis of a classical archive system, the TSR (Television Suisse Romande) archive. COALA provides a retrieval interface that combines simple and advanced search capabilities with hierarchical video browsing. It also uses a TV news model based on the MPEG-7 standard in order to provide an interoperable audiovisual description language for the different steps of news processing: production, retrieval and archiving.

1 Introduction

Audiovisual documents carry rich content information, mainly because of their multimedia nature: they are composed of audio, video (moving images), still images and even text, and therefore aggregate the content information of these different media. Moreover, the different relationships between these media (spatio-temporal, composition, etc.) create supplementary structural information. This rich content makes the management of audiovisual documents, especially their indexing and retrieval, a complex issue. The main problems involved in the retrieval of audiovisual documents are audiovisual data analysis, representation, browsing and querying.

Several research projects and commercial products have recently addressed the problem of audiovisual data analysis and representation, often referred to as audiovisual indexing. For instance, Informedia [1] combines speech recognition, image understanding and natural language processing technologies to automatically transcribe, segment and index video.
Virage [2] provides a video indexing tool, which automatically detects shot boundaries, key frames, speech and teletext. It also offers the possibility of manually annotating video segments. There has also been research on the querying and browsing of audiovisual documents. VideoQ [3] is a retrieval system with a sketch-drawing query tool for defining objects, their visual features and their movement trajectories. In the Fischlar project several video browsing interfaces have been described and analyzed [4, 5].

All these systems offer interesting features for audiovisual indexing and retrieval. However, many issues remain to be explored in order to provide an effective audiovisual retrieval system. Concerning retrieval, a better interactive strategy is needed to reduce the complexity of this task. COALA aims to offer such interaction by integrating querying and browsing methods. Very few existing works are based on a study of the real users' needs in an audiovisual library. COALA is designed following an analysis of a classical TV news archive: the TSR (Television Suisse Romande) archive [6]. One of the main objectives of the project is to provide access to the audiovisual library to users with different needs and profiles. As another result of the TSR study, COALA considers the audiovisual library not only as a retrieval module but also in a larger context, namely the chain from production to archiving and retrieval, and therefore pays particular attention to the descriptions of the content produced and exchanged in this chain. To facilitate this information exchange we have proposed a TV news description model based on the new multimedia description language, MPEG-7. To our knowledge COALA is the first TV news library based on MPEG-7.

The current focus of the project is on TV news documents. This choice was mainly motivated by their very rich content and by the important retrieval need for such documents. However, we believe that the results of this experience will also be useful for other types of audiovisual documents.

The rest of this paper is organised as follows. Section 2 presents the requirements of a digital TV news library by highlighting three main design issues resulting from the study of the TSR news archive. Section 3 gives an introduction to the MPEG-7 standard and describes why we propose it as the news video description language. Section 4 describes the COALA architecture, its corpus and its indexing and retrieval interfaces.
Finally, section 5 concludes the paper and discusses the future directions of the project.

2 Requirements of a TV news library

Our approach to the design of an audiovisual TV news library is based on the results of the analysis of a classical archive system at TSR (Television Suisse Romande). This study was carried out through several discussions, interviews and a survey of TSR users. Below we describe three important observations resulting from this study, which we have taken into account in the design of the COALA application.

2.1 Interchangeable news video description language

The main use of an audiovisual news archive system is the retrieval of news segments for reuse in further applications. The production, archiving and retrieval of TV news programs are indeed three strongly interconnected processes.

Figure 1. Video descriptions in the chain of production, archiving and retrieval

As figure 1 shows, during the archiving of produced material, the production structure of the audiovisual document and the different sources of information created during production, such as scripts, captions, journalist commentaries and descriptions of the video rushes, are exploited to prepare an adequate description of the news video program. The retrieval process uses the descriptions created during archiving, which consist of production information plus a set of interpretations of the audiovisual content. The result of the retrieval is also a set of described video segments, which can be reused for further productions. In order to facilitate the exchange, reuse and automatic management of information between these three processes, an interchangeable news video description language is needed. For this purpose we propose to use MPEG-7 [7,8], the new standard currently being developed by the Moving Picture Experts Group (MPEG) for multimedia content description. More details concerning the use of this standard are given in section 3.

2.2 Users with different levels of expertise

Users of a news audiovisual library range from professional users (producers, journalists, video editors and archivists) to non-professional users (any individual who needs to retrieve news video segments). Also, independently of their professions, they have different levels of expertise in using the audiovisual library. This difference stems from their experience with the retrieval process and is therefore related to their knowledge of the video description scheme. In most current news archive systems the only actual operators of retrieval are the archivists, who know

the video description scheme best and are most familiar with the contents of the library. As will be shown in section 4, in order to make the audiovisual library accessible to other users, COALA provides two different retrieval strategies: simple and advanced. Moreover, we plan to offer personalization capabilities for the retrieval interfaces in order to take better account of different users' profiles and needs.

2.3 Retrieval via combination of querying and browsing

Audiovisual documents carry rich content information. Users of an audiovisual library therefore have a variety of retrieval needs concerning different aspects of the content, the most important of which are the thematic content (what the content is about) and the visual content (what is seen in the content) [9]. Formulating queries over such complex content is not an easy task, because it must take into account the different users' needs on the one hand and the characteristics of the audiovisual content on the other. Currently, COALA uses a query model that we have implemented based on our observations of the TSR users' needs and using the XQL query language [10]. We are also working on a more elaborate query language that allows the formulation of more precise needs based on structure and content (visual and thematic).

Another important issue raised by the users is the possibility of browsing the news video programs. This is especially needed in order to provide a flexible way of exploring the database content and to let users become familiar with the video structure and description scheme. Browsing inside the structure of a news program is also a rapid way of verifying content after the querying phase. We believe (as also proposed in [11]) that an integration of querying and browsing can greatly facilitate the retrieval of complex multimedia documents and therefore, in our case, of news video documents.
3 News video description language based on MPEG-7

As described in section 2, a standard news video description language is needed to make the news video descriptions created in the different processing phases, from production to retrieval, interoperable. Different news description standards have recently been developed, such as XMLNews [12,13] and Encoded Archival Description (EAD) [14]. However, these standards are oriented to textual news (e.g. newspapers) and are not suitable for audiovisual news: they do not offer any tools for describing the audiovisual content of the news, such as video segmentation characteristics, colour histograms, etc. We propose the use of MPEG-7 [8] as the news video description language. It is the new standard currently being developed by the Moving Picture Experts Group (MPEG) for multimedia content description. It provides a rich set of standardized tools to describe multimedia (audiovisual) content and its scope covers a large

number of applications, among which are audiovisual indexing and retrieval systems. MPEG-7 uses a Description Definition Language (DDL) [15] based on XML Schema [16,17] and defines a set of Descriptors (Ds) and Description Schemes (DSs) [18], which allow the description of different aspects of multimedia (audio, video and image) content, such as semantic descriptions, media structure, spatio-temporal characteristics, low-level media features, etc. MPEG-7 Ds and DSs are generic and can be used by different multimedia applications. We have studied the adaptability of these Ds and DSs to news video description and have proposed some improvements to MPEG-7 in order to support the features needed in TV news applications [19,20]. This study resulted in an MPEG-7 news model, which will soon be published in a separate article.

One of the advantages of MPEG-7 as a description language is that its DDL is based on XML Schema. XML Schema offers several possibilities for defining types (basic data types such as integer, and other special data types needed in each application) as well as object-oriented concepts such as inheritance. Moreover, as we will see in section 4, we use XML-based technologies in the implementation of the COALA application, which offer a flexible and interoperable way of managing information.

4 COALA system

In this section we describe the COALA architecture, the corpus containing TV news video descriptions, the indexing and retrieval interfaces and finally the technological choices for the implementation of the system.

4.1 Architecture

Figure 2 shows COALA's architecture, which is composed of three main parts: indexing, retrieval and the corpus. The indexing process is based on several algorithms for temporal segmentation, colour histogram detection, script alignment, etc., each of which creates a specific description of the video content that is stored in the MPEG-7 database.
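As an illustration of the temporal segmentation step, the following minimal sketch detects shot boundaries by comparing the intensity histograms of consecutive frames. This is a generic, widely used technique and not necessarily the algorithm used in COALA; the function names, the toy frames and the threshold value are assumptions made for this example.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 8, max_value: int = 256) -> List[float]:
    """Normalised intensity histogram of one non-empty frame (flat pixel values)."""
    counts = [0] * bins
    for value in frame:
        counts[value * bins // max_value] += 1
    total = len(frame)
    return [count / total for count in counts]


def shot_boundaries(frames: Sequence[Sequence[int]], threshold: float = 0.5) -> List[int]:
    """Return every index i at which a new shot is assumed to start."""
    if not frames:
        return []
    boundaries = []
    previous = histogram(frames[0])
    for i in range(1, len(frames)):
        current = histogram(frames[i])
        # L1 distance between consecutive histograms, always between 0 and 2.
        distance = sum(abs(a - b) for a, b in zip(previous, current))
        if distance > threshold:
            boundaries.append(i)
        previous = current
    return boundaries


# Toy "video" (hypothetical data): three dark frames, then three bright frames.
dark = [10, 20, 30, 25]
bright = [200, 220, 240, 230]
frames = [dark, dark, dark, bright, bright, bright]
print(shot_boundaries(frames))  # prints [3]
```

In an indexing chain of this kind, each detected boundary would then be turned into a shot description (start time, duration, key frame) and stored alongside the other descriptions.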
Particular attention is also given to supporting manual descriptions through a dedicated tool called LogCreator, which is described in section 4.3. The retrieval process consists of a combination of structured querying, inter/intra news video browsing and finally video segment visualization. The structured querying module is based on an XML query language and allows the retrieval of news video segments following the MPEG-7 news model. It offers two modes of querying: simple and advanced.
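The difference between the two querying modes can be illustrated with a minimal sketch. It assumes a hypothetical, much simplified XML description (the element names NewsItem, Title, Presenter and Summary are illustrative and are not taken from the MPEG-7 news model) and uses Python's standard ElementTree module as a stand-in for the XML query engine.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified descriptions; element names are illustrative only.
CORPUS = """
<Corpus>
  <NewsItem id="n1">
    <Title>Election results in Geneva</Title>
    <Presenter>Presenter A</Presenter>
    <Summary>Report on the cantonal election.</Summary>
  </NewsItem>
  <NewsItem id="n2">
    <Title>Flooding in Valais</Title>
    <Presenter>Presenter B</Presenter>
    <Summary>Rivers overflow after heavy rain; election campaign suspended.</Summary>
  </NewsItem>
</Corpus>
"""


def simple_query(root, all_of=(), any_of=()):
    """Simple mode: Boolean combination of terms matched anywhere in an item."""
    hits = []
    for item in root.iter("NewsItem"):
        text = " ".join(item.itertext()).lower()
        has_all = all(term.lower() in text for term in all_of)
        has_any = not any_of or any(term.lower() in text for term in any_of)
        if has_all and has_any:
            hits.append(item.get("id"))
    return hits


def advanced_query(root, **fields):
    """Advanced mode: each term must occur in the named description element."""
    hits = []
    for item in root.iter("NewsItem"):
        if all(term.lower() in (item.findtext(name) or "").lower()
               for name, term in fields.items()):
            hits.append(item.get("id"))
    return hits


root = ET.fromstring(CORPUS.strip())
print(simple_query(root, all_of=["election"]))   # prints ['n1', 'n2']
print(advanced_query(root, Title="election"))    # prints ['n1']
```

The simple query needs no knowledge of the description structure but also matches an item that only mentions the term in its summary, whereas the advanced query is more precise at the cost of requiring the user to know which element to constrain.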

Figure 2. COALA Architecture

In the simple mode, the query consists of a set of terms related by Boolean operators. The query contains no specification of the structure of the news video documents, nor of the description scheme. This mode does not require the user to know the video description scheme; however, it permits less precise queries. In the advanced mode the query model allows retrieval based on each element of the TV news document structure, but it requires advanced knowledge of the description scheme. Browsing is of two types: inter-document browsing, which allows the user to navigate the hierarchical structure of the news video documents, and intra-document browsing, which makes it possible to go through several documents using predefined links and based on similarities in the documents' contents. The visualization module allows selected video segments to be played in order to verify that the query results match the user's needs. The last part of the architecture consists of the TV news video corpus, containing the video data and their corresponding MPEG-7 descriptions. The corpus is described in more detail in the next section.

4.2 Corpus

One of the important design features of the COALA system is its use of a richly described news video corpus. This rich description makes more of the content information explicit and therefore improves the retrieval process. For example, descriptions of the hierarchical structure of the TV news and the types of

the video segments in this hierarchy (program, presentation, report, shot, etc.) make it possible to perform precise queries specifying the type of the desired video segment and also to navigate this hierarchy. Our corpus contains 10 hours of TSR TV news programs in MPEG-1 format, the corresponding scripts and the TSR manual descriptions. This corpus is now being extended to 20 hours and will also contain the teletext subtitles. The descriptions are translated into the MPEG-7 news model. Other information, such as shot boundaries, key frames, etc., has been added to them using the LogCreator application.

4.3 Indexing and retrieval interfaces

COALA offers an interface for the semi-automatic description of video documents called LogCreator (figures 3 and 4). The system makes it possible to upload an MPEG video document and to choose between several algorithms (currently M-Edit [21] and DCU [22]) in order to obtain the video shots with their timing information and their corresponding key frames. The key frames are presented in a mosaic view, as shown in figure 3. The user can then describe the whole document and also each separate shot, by clicking on a key frame to go to the shot description interface shown in figure 4. All descriptions follow a predefined scheme based on the MPEG-7 news model. This model can also represent semantically higher-level segments, according to the needs of the different TV news users. LogCreator is currently being extended to support the creation of the hierarchical structure of TV news programs (by regrouping shots into other segment types such as reports, interviews, etc.), the addition of descriptions related to the segments in the news structure, and finally the alignment of the corresponding teletext subtitles and scripts with the video shots. The results of the annotation are added to the corpus.

The retrieval interface is a combination of a querying module (figure 5) and a browsing module (figure 6).
This approach has the advantage of allowing better interaction with the users. As remarked in [11], querying and browsing are in general two complementary retrieval strategies, each compensating for the drawbacks of the other: browsing suffers from orientation and cognitive load problems that querying avoids, while browsing offers free access to documents that querying does not.
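The hierarchical side of browsing can be sketched as a walk over a tree of segments. The example below assumes a hypothetical XML layout in which nested Segment elements carry the segment types mentioned in section 4.2 (program, presentation, report, shot); the identifiers and function names are illustrative only and do not come from the COALA implementation. It computes the path from the root of a program to a given segment, and the segment's children, which correspond to moving up to a segment's context or down to its details.

```python
import xml.etree.ElementTree as ET

# Hypothetical program hierarchy; attribute names and ids are illustrative only.
PROGRAM = """
<Segment type="program" id="p1">
  <Segment type="presentation" id="pr1">
    <Segment type="shot" id="s1"/>
  </Segment>
  <Segment type="report" id="r1">
    <Segment type="shot" id="s2"/>
    <Segment type="shot" id="s3"/>
  </Segment>
</Segment>
"""


def find_segment(root, segment_id):
    """Return the Segment element with the given id, or None if absent."""
    for node in root.iter("Segment"):
        if node.get("id") == segment_id:
            return node
    return None


def path_to_segment(root, segment_id):
    """Ids from the root of the program down to the requested segment."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent = {child: node for node in root.iter() for child in node}
    node = find_segment(root, segment_id)
    if node is None:
        return []
    path = [node]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return [n.get("id") for n in reversed(path)]


def children_of(root, segment_id):
    """Ids of the direct sub-segments (one level down the tree)."""
    node = find_segment(root, segment_id)
    return [] if node is None else [child.get("id") for child in node]


root = ET.fromstring(PROGRAM.strip())
print(path_to_segment(root, "s3"))  # prints ['p1', 'r1', 's3']
print(children_of(root, "r1"))      # prints ['s2', 's3']
```

A browsing interface could use such a path to mark the key frames leading from the program to a query result, so that the user sees the result in its context rather than as an isolated clip.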

Figure 3: Mosaic interface of LogCreator

Figure 4: Shot detection in LogCreator

Moreover, we consider that in the special case of video documents, the hierarchical inter-document browsing is also a very effective way to verify the

content of the videos returned by querying. This capability has two advantages: first, it transforms the time-consuming linear method of video content verification into a rapid hierarchical method. Second, for each video segment it allows the user to verify the more general context (by going up the tree) and the finer details (by going down the tree).

Figure 5: Advanced querying interface

As mentioned before, COALA provides two modes of querying: simple and advanced. Figure 5 shows the advanced query interface. The upper part of this interface consists of the different fields corresponding to the MPEG-7 TV news model, such as the news broadcast date, the presenter, words that should appear in the title of a news item, its summary, etc. The lower part is dedicated to the results of the query. The description fields shown for each result depend on the type of segment requested in the query. For news items, the title, date, duration and summary are presented. These fields were chosen based on the survey of TSR users. Personalization of the interface will allow each user to choose the description fields that they want in the result interface. Once the results are presented, the user can

select them one by one, by checking the radio button beside each result and then pressing the browse button to go to the browsing interface and verify the content of the selected result.

Figure 6: Hierarchical browsing interface

Figure 6 shows the hierarchical browsing interface. Once the user chooses to browse a video segment in the results list, the browsing interface presents the whole hierarchy of the TV news program in which the segment occurs. In order to help the user find the desired segment quickly, its path in the hierarchy (beginning at the root and ending at the target segment) is highlighted by red rectangles around the corresponding key frames (the key frames surrounded by circles in figure 6). The user can further explore the context of the segment or its details by going up and down the tree. Also, for each video segment in the hierarchy, descriptions can be found by clicking on the information button below the corresponding key frames. These descriptions help the user to get more information on the content of the segment without visualizing it. Once a user decides to visualize a segment, clicking on the

corresponding key frame opens another window in which the video segment is played.

4.4 XML and WEB technologies

Figure 7 shows the three-tiered architecture of the retrieval application. On the client side the user uses a standard web browser as the retrieval interface. The middleware application is based on the Orion WEB server [23], and the server side consists of a native XML database, TAMINO [24]. The main advantage of the three-tiered architecture is the separation between the application logic and the database. The native XML database makes it possible to better exploit the hierarchical structure of the XML documents and supports the XQL query language, in which our query model is implemented.

Figure 7: Three-tiered architecture

Figure 8: Functional architecture

Figure 8 shows the functional architecture of the news video retrieval application. The user interface consists of HTML forms, which are generated automatically from the XML files defining the interface model, using an XSL stylesheet that reflects the user profile.

5 Conclusion and future work

The COALA project described in this paper is an ongoing research project. Several features have been developed, such as the TV news corpus and the indexing, querying and hierarchical browsing interfaces. However, many research and technological aspects are still under investigation. One of the most important is the design of an elaborate indexing and querying model that goes beyond the capabilities of structured document indexing and querying models, especially by integrating element weights. Also, we intend to integrate the MPEG-7 XM software [25] in order

to provide more indexing functionalities in LogCreator. Finally, the integration between the querying and browsing modules needs to be elaborated further; in particular, we intend to investigate relevance feedback from browsing to querying.

6 Acknowledgment

The authors would like to express their gratitude to the TSR TV news production and archiving experts for their valuable collaboration during the study of their system.

References

1. http://www.informedia.cs.cmu.edu/
2. http://www.virage.com/
3. S.-F. Chang, W. Chen, H. Meng, H. Sundaram, and D. Zhong, "VideoQ: An Automated Content Based Video Search System Using Visual Cues", ACM Multimedia 97, Seattle, USA, pp. 313-324, November 1997.
4. H. Lee, A.F. Smeaton, C. Berrut, N. Murphy and N.E. O'Connor, "Implementation and Analysis of Several Keyframe-Based Browsing Interfaces to Digital Video", 4th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2000), Lisbon, Portugal, 18-20 September 2000.
5. http://lorca.compapp.dcu.ie/video/
6. http://www.tsr.ch/
7. MPEG-7 Context, Objectives and Technical Roadmap, ISO/IEC JTC1/SC29/WG11/N2861, Vancouver, July 1999.
8. http://www.mpeg.7.com
9. N. Fatemi and O. Abou Khaled, "Indexing and Retrieval of TV News Programs Based on MPEG-7", in Proceedings of the IEEE International Conference on Consumer Electronics (ICCE'2001), Los Angeles, CA, June 2001.
10. J. Robie, J. Lapp, D. Schach, "XML Query Language (XQL)", in QL'98 - The Query Languages Workshop, Boston, Massachusetts, December 1998. (http://www.w3.org/tands/ql/ql98/pp/xql.html)
11. Y. Chiaramella, "Browsing and Querying: Two Complementary Approaches for Multimedia Information Retrieval", in Proceedings of the Hypertext Information Retrieval Multimedia conference (HIM 97), Dortmund, 1997.
12. NewsML V.1: Document Type Definition. International Press Telecommunications Council (IPTC), October 2000, http://www.iptc.org/site/newsml/dtd/newsmlv1.0.dtds

13. NewsML V.1: Functional Specification. International Press Telecommunications Council (IPTC), October 2000, http://www.iptc.org/site/newsml/specification/newsmlv1.0.pdf
14. Encoded Archival Description, Network Development and MARC Standards Office of the Library of Congress, http://lcweb.loc.gov/ead/
15. MPEG-7 Final Committee Draft, Part 2: Description Definition Language, ISO/IEC JTC1/SC29/WG11/N3575, Beijing, July 2000.
16. XML Schema Part 1: Structures, W3C Candidate Recommendation, October 2000, http://www.w3.org/tr/xmlschema-1/
17. XML Schema Part 2: Datatypes, W3C Candidate Recommendation, October 2000, http://www.w3.org/tr/xmlschema-2/
18. MPEG-7 Final Committee Draft, Part 5: Multimedia Description Schemes, ISO/IEC JTC1/SC29/WG11/N3966, Singapore, March 2001.
19. N. Fatemi, J. A. DeBlasio, and G. Amaudruz, "Some Remarks and Propositions to the MDS CD Resulting from the Study of Archive Applications", MPEG-7 contribution: m6696, Pisa, Italy, January 2001.
20. N. Fatemi, J. A. DeBlasio, and G. Amaudruz, "Adding closed-caption and subtitle to VideoText DS", MPEG-7 contribution: m7092, Singapore, March 2001.
21. http://www.mediawaresolutions.com
22. A. F. Smeaton, J. Gilvarry, G. Gormley, B. Tobin, S. Marlow and M. Murphy, "An Evaluation of Alternative Techniques for Automatic Detection of Shot Boundaries in Digital Video", IMVIP'99 - 3rd Irish Machine Vision and Image Processing Conference, Dublin, Ireland, 8-9 September 1999.
23. http://www.orionserver.com/
24. http://www.softwareag.com/tamino/
25. http://www.lis.ei.tum.de/research/bv/topics/mmdb/e_mpeg7.html