An MPEG-4/7 based Internet Video and Still Image Browsing System

Miroslaw Bober (1), Kohtaro Asai (2) and Ajay Divakaran (3)

(1) Mitsubishi Electric Information Technology Center Europe VIL, Guildford, Surrey, UK
(2) Mitsubishi Electric Information Technology Research Center, Ofuna, Kamakura, Japan
(3) Mitsubishi Electric Research Laboratories, Murray Hill, NJ, USA

ABSTRACT

The MPEG-7 standard, currently under development, intends to provide a Multimedia Content Description Interface. In other words, it will provide a rich set of tools to describe content, with a view to facilitating applications such as content-based querying, browsing and searching of multimedia content. The MPEG-4 standard provides tools for compressing multimedia content at bitrates that are feasible with typical Internet connections. Such bitrates are far lower than those targeted by prior standards such as MPEG-1 and MPEG-2. In this paper, we therefore present a remote video and still image browsing system that uses MPEG-7 for querying, browsing and searching, and MPEG-4 for compressing any transmitted content. We use descriptors of features such as color, shape and motion to annotate the stored content with MPEG-7-like metadata. These descriptors stem from our previous work and are currently in the working draft of the MPEG-7 standard. In our previous work, we have shown the efficacy of each of the descriptors individually. In this paper, we show how we combine some of the features to browse remote video and still image content effectively. Our emphasis is on accurate and quick browsing of the remote content. Our system consists of a video web server with stored MPEG-4 video and still image content that supports remote requests through a simple browser interface. We have used a combination of CGI script and servlet/applet based configurations. We will present a demonstration of our system at the conference. We have already successfully demonstrated it to the Japanese press.
Keywords: Motion Activity, Compressed Domain Feature Extraction, MPEG-7, Video Indexing

1. INTRODUCTION

The MPEG-7 standard, also known as the Multimedia Content Description Interface and currently under development, facilitates content-based querying, searching and browsing of content. It is thus complementary to previous MPEG standards such as MPEG-4, which emphasize content synthesis. The World Wide Web has allowed today's consumer to access content all over the world. However, the bandwidth available to the average consumer remains low and variable, and therefore relatively high-bandwidth compression formats like MPEG-1/2 are unsuitable for Internet-based access. Since MPEG-4 can function over a wide range of bandwidths, from low bit rates such as those of PSTN connections to high bit rates such as those of cinema-quality video, it is the logical choice for Internet-based applications. In this paper we describe a content-based browsing framework that uses MPEG-7 to locate the desired content and MPEG-4 to transmit and present it. We use the MPEG-7 color, motion and shape descriptors, developed at Mitsubishi labs, to automatically extract the content description and attach it to the content. The content is encoded using MPEG-4, which allows us to retrieve it using a standard (Apache) server framework.

2. MOTIVATION AND BACKGROUND

At Mitsubishi Electric (MELCO), we have developed color, motion and shape descriptors that are now part of the MPEG-7 working draft. These descriptors all emphasize compact and effective description of content. They are all easy to extract and match. They have been through a rigorous testing and development process as part of the MPEG-7 standard development. Therefore, they constitute a useful and robust set of content descriptors. The features expressed by our descriptors are:

1. Color Descriptor: Our color descriptor captures the dominant colors of a picture using a mixture-of-Gaussians approach.
2. Motion Descriptor: Our motion descriptor captures the intensity, spatial and temporal characteristics of the gross or overall motion in a video segment, using block motion vectors.
3. Shape Descriptor: Our shape descriptor captures the contour of a region using a curvature scale space representation.
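To make the idea of a motion intensity measure concrete, the following is a minimal sketch rather than the descriptor as specified in the MPEG-7 working draft. It assumes that the intensity of a frame can be approximated by the standard deviation of its block motion vector magnitudes, and that a segment is summarized by averaging over its frames. Both choices, and the function names, are illustrative assumptions. The spatial and temporal attributes of the descriptor are not covered.

```python
import math


def motion_activity_intensity(motion_vectors):
    """Standard deviation of block motion vector magnitudes for one frame.

    motion_vectors: iterable of (dx, dy) pairs, one per block.
    This measure is an assumption made for illustration only.
    """
    magnitudes = [math.hypot(dx, dy) for dx, dy in motion_vectors]
    if not magnitudes:
        return 0.0
    mean = sum(magnitudes) / len(magnitudes)
    variance = sum((m - mean) ** 2 for m in magnitudes) / len(magnitudes)
    return math.sqrt(variance)


def segment_intensity(frames):
    """Average the per-frame intensity over a video segment (assumed summary)."""
    values = [motion_activity_intensity(frame) for frame in frames]
    return sum(values) / len(values) if values else 0.0
```

Because block motion vectors are already present in a compressed bitstream, a measure of this kind needs no pixel-level analysis. Sorting segments by the resulting value in descending order would bring the most active segments to the top.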
Each of these descriptors lends itself to convenient indexing of images and video. Note that our set of descriptors is only a subset of the MPEG-7 descriptors. Furthermore, MPEG-7 includes both low-level descriptors of color, shape, motion, texture etc. and high-level descriptors that capture semantic information such as a goal-scoring moment or a romantic scene. In the next section we describe our proposed system, which is not restricted to the aforementioned subset. It is in fact capable of using any MPEG-7 description.

3. THE PROPOSED SYSTEM

Figure 1: The MPEG-4/7 Content Retrieval System

Figure 1 illustrates our proposed MPEG-4/7 content retrieval framework. The system is divided into two major parts: the server side and the client side. The client side presents a convenient interface to the end user for finding and then playing the desired content. It consists of a browsing interface that could function in a separate box by itself or within one or more of the various information appliances used by today's consumer (see Figure 1). The server side bears the computationally heavier burden of generating the MPEG-7 descriptions, as well as searching the content once the MPEG-7 descriptors have been generated and linked to the content. There are many different ways in which the two sides can be linked, as illustrated in Figure 1. Moreover, note that each application would instantiate the above framework in its own way. For instance, if the content and the client were collocated, as in a TV-Anytime type application, the transmission part illustrated above would be a small part of the system, and the overall system would be a vastly simplified version of the proposed framework. On the other hand, if the access is over the World Wide Web, the available bandwidth would vary widely, and the overall system would be a full-fledged realization of the proposed framework.
Thus our system provides a common framework for the end user to access content regardless of its location. Note that the system is feasible only because of the interoperability enabled by MPEG-4 and MPEG-7.

4. EXAMPLES OF APPLICATIONS

We describe two applications to illustrate our framework. First, in Figure 2, we illustrate the combined use of shape and color to locate a desired cartoon character in a movie. While the descriptors are good by themselves, any application typically needs to combine two or more features to get effective indexing. In this case, the combination of shape and color satisfies the need of the end user, as can be seen. The interface is based on query by example, which is reasonable for a cartoon movie in which the characters are few and are often known in advance.

Second, we illustrate remote video browsing using motion descriptors. In Figure 3, we illustrate browsing a remote video sequence, a news program from Spanish TV, by viewing a few thumbnails or key-frames at a time. However, this becomes tedious for a moderately long program. So in Figure 4, we illustrate retrieval of high-action segments from the news video using the motion descriptor. Notice that the sports segments bubble up to the top when the highest-action segments are requested. Thus, motion descriptors can provide quick access to the sports segments in a news program, for example. We can similarly locate the news anchor using the motion activity descriptor, and thus skim the news video sequence.
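The query-by-example search in the first application combines two features. One simple way to do this, sketched here under assumptions that are not taken from the system itself, is to rank stored items by a weighted sum of per-feature distances to the example. The function name, the dictionary layout, the distance functions and the weights are all hypothetical.

```python
def rank_by_combined_distance(query, database, distances, weights):
    """Rank items by a weighted sum of per-feature distances to the query.

    query:     {feature: description} of the example chosen by the user
    database:  {item id: {feature: description}}
    distances: {feature: function(a, b) -> non-negative float}
    weights:   {feature: float}, relative importance of each feature
    All names and the weighted-sum rule are illustrative assumptions.
    """
    scores = {}
    for item_id, description in database.items():
        scores[item_id] = sum(
            weights[f] * distances[f](query[f], description[f]) for f in weights
        )
    # Smallest combined distance first, i.e. best match first.
    return sorted(scores, key=scores.get)


def l1(a, b):
    """Toy distance between two equal-length feature vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


database = {
    "hero": {"color": [0.9, 0.1], "shape": [0.2, 0.8]},
    "same_color": {"color": [0.9, 0.1], "shape": [0.9, 0.1]},
    "same_shape": {"color": [0.1, 0.9], "shape": [0.2, 0.8]},
}
query = {"color": [0.9, 0.1], "shape": [0.2, 0.8]}
distances = {"color": l1, "shape": l1}

ranked = rank_by_combined_distance(
    query, database, distances, {"color": 0.5, "shape": 0.5}
)
color_only = rank_by_combined_distance(query, database, distances, {"color": 1.0})
```

In this toy data, color alone cannot separate "hero" from "same_color", while the combined score ranks "hero" first. Exposing the weights as a parameter leaves the choice of features to the caller.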
Figure 2: Finding favorite Cartoon Character using shape and color
Figure 3: Browsing remote video using thumbnails
Figure 4: Finding the sports segments by looking for high motion activity

5. DISCUSSION

The applications in the previous section made use of low-level features such as motion and shape. Our results show that content-based querying is rendered much more effective by combining more than one low-level feature. Note that low-level features can be extracted automatically. Higher-level features are difficult or impossible to extract automatically. Manual extraction of such features is tedious and hence is not an option even for a content database of moderate size. Furthermore, while all low-level features can be extracted automatically, the complexity of extraction varies from feature to feature. Features extractable in the compressed domain are easier to extract than other features. In our previous work [3], we show that the combination of color and motion activity in the compressed domain enables quick and effective browsing. Our examples of applications show that the choice of features is crucial to the success of the browsing system. The nature of the application mostly determines the choice of features, followed by the complexity of extraction and matching. For a general-purpose system like ours, it is best to leave the choice of features to the user.

6. CONCLUSION

We presented an MPEG-4/7 content-based retrieval framework. We make use of low-level MPEG-7 descriptors to demonstrate two applications of content-based browsing of remote video content. In future work, we will extend this system to enable much more convenient browsing and querying of remote content.

7. REFERENCES

[1] A. Divakaran and H. Sun, A Descriptor for spatial distribution of motion activity, Proc. SPIE Conf. on Storage and Retrieval from Image and Video Databases, San Jose, CA, 24-28 Jan. 2000.
[2] The MPEG-7 Visual part of the XM 4.0, ISO/IEC MPEG99/W3068, Maui, USA, Dec. 99.
[3] A. Divakaran, A. Vetro, K. Asai and H. Nishikawa, Video Browsing System based on Compressed Domain Feature Extraction, submitted to the IEEE Transactions on Consumer Electronics.
[4] Miroslaw Bober et al Shape
[5] Leszek Cieplinski et al Colour