Adaptive Multimedia Messaging based on MPEG-7: The M3-Box

Jörg Heuer, José Luis Casas, André Kaup
{Joerg.Heuer, Jose.Casas, Andre.Kaup}@mchp.siemens.de
Siemens Corporate Technology, Information and Communications, 81730 Munich, Germany

Abstract

This paper presents a case study of an adaptive multimedia message box (M3-Box) based on MPEG-7 technology. To enable bi-directional communication, MPEG-7-related techniques such as the Extensible Stylesheet Language (XSL) are incorporated into the M3-Box. MPEG-7 descriptions for multimedia access, user preferences, and media summarization are instantiated and transmitted along with the media content of the messages. These components make it possible to set up a message center which integrates different kinds of structured messages containing media elements such as video, voice, and text, and which provides access to these messages from a variety of devices. The paper analyses the scenario, specifies the application requirements, and describes an implementation of the M3-Box.

1. Introduction

Nowadays the number of device classes used for communication is growing rapidly. These devices are usually adapted to the situations in which they are used. Accordingly, we communicate with each other using different devices which are often restricted with respect to bandwidth and capabilities. At the same time, the enormous growth in the number of mobile, multimedia-enabled device types makes it possible to be reachable most of the day, independent of what we are doing or where we are. To control this reachability, adaptive messaging is required. For traditional communication channels such as phone, facsimile, and e-mail, first unified messaging center products exist, providing access to messages without the type of delivered information being restricted by the capabilities of the device in use.

In this paper we present the architecture of the Multimedia Message Box (M3-Box), a unified multimedia message center. While adaptive technologies based on MPEG-7 description schemes are employed here, the main focus of this application is the handling of structured multimedia messages. The structuring information, which can be generated with different communication devices, is coded according to the upcoming MPEG-7 standard [1,2].

The paper is structured in the following way: in the next section, related technologies are discussed. In section three, the scenario of unified multimedia messaging is presented in detail. Then we describe the requirements of a unified multimedia message center. In section five, the overall design of the M3-Box is explained, giving details on the generation, storage, and consumption of structured messages with different communication devices and the M3-Box. Finally, a conclusion is drawn.

2. Related Work

The Multimedia Message Box application focuses on the one hand on the generation of structured messages which include several multimedia elements to enable an intuitive offline communication. On the other hand, it addresses the problem of overcoming the barrier between different communication devices. When these structured messages are described with MPEG-7, the meta information can be used to adapt the presentation to the output device in use. Algorithms to control the adaptation of multimedia data to different devices have already been proposed.
In [3] an InfoPyramid is specified to determine the optimal adaptation of multimedia presentations with respect to the bandwidth needed for transmission and the loss of information. For this purpose, possible adaptations and transcodings of each media element are described with respect to its size and importance. In [4] a dynamic adaptation during transmission of the media element is proposed. This also includes switching the media type if required by the transmission conditions. A consistent representation of the content is achieved by annotating the content of the media elements. Both solutions describe alternative representations of the media content with respect to the loss of information and the reduction in size.
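To make the flavor of such adaptation metadata concrete, a small, purely illustrative description of one media element with three alternative representations could look as follows; the element and attribute names are invented for this sketch and follow neither the InfoPyramid nor the MPEG-7 schema literally.

<!-- Illustrative only: hypothetical element and attribute names. -->
<MediaElement id="subjectVideo">
  <Representation format="video/mpeg" bytes="2400000" importance="1.0"/>
  <Representation format="image/jpeg" bytes="45000"   importance="0.6"/>  <!-- key frame -->
  <Representation format="text/plain" bytes="800"     importance="0.3"/>  <!-- annotation -->
</MediaElement>

A delivery component can then trade off loss of information (importance) against transmission cost (bytes) when selecting a representation for a given channel and device.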

A similar approach has been submitted to the W3C as the note "Annotation of Web Content for Transcoding" [5]. This Resource Description Framework (RDF) based approach allows these hints to be described separately from the content itself. Similar to the approach taken by the InfoPyramid, MPEG-7 provides the possibility to describe media content with respect to the transcodings which can be applied to it, and how significant the resulting representations are with respect to the contained information [6][7]. Additionally, hints to simplify the transcoding can be specified. For instance, for sequences containing only slow motion, the search range of the motion prediction in a video transcoder can be limited with no loss of information but significantly faster transcoding. For transcoding of media on the server side, the client device, the transmission channel, and the user preferences should be known so that the media can be adapted to these circumstances. An architecture to transmit this information is proposed in "Composite Capabilities/Preference Profiles: Requirements and Architecture" [8]. In this approach the device capabilities and the user preferences regarding the media formats are specified and transmitted to the server by two- or three-tier architectures. While a device description is not specified within MPEG-7, describing the media types which can be rendered on the device appears to be sufficient for most cases. So it is not the client device that is described, but the media which should be received. This has the additional advantage that, with MPEG-7, the target media type for transcoding can also depend on the semantic content of the media described within MPEG-7.

As already mentioned, composing and handling structured messages poses different requirements than dealing with sequential media files or multimedia presentations. Therefore the requirements on the components of the communication chain are specified after the scenario of multimedia messaging has been presented in more detail in the next section.

3. The Scenario of a Multimedia Message Center, the M3-Box

The application scenario will be illustrated by following the flow of information starting at the message sender. It is assumed that the message sender can use several devices with different capabilities, as depicted in Fig. 1: a mobile or wireline phone, a pictoPDA (a PDA able to visualize and take pictures), a videophone, or even a personal computer. These devices are used to record a message. Contrary to the generation of a plain message, in this case the message can contain several media elements as well as structuring information. For this purpose the indexing process is adapted to the user interface of the device used. A pictoPDA, for instance, allows several images to be captured and written indices to be specified, while a videophone could allow spoken comments.

[Fig. 1: Scenario of the Multimedia Message Box, showing input devices (phone, mobile phone, PDA/Palm, videophone/pictophone, workstation/PC/HPC), the multimedia box handling video, image, and text data plus structuring hints, and the corresponding output devices.]

After recording the message, it is transmitted to the multimedia message center, the M3-Box. In the receiving part, the box then stores the message, i.e. the media data and the message description. For the storage of the media, it is analyzed whether a transcoding of important media elements into another representation has to be done offline due to the computational complexity.
This can for example be the case for speech recognition tools which convert the audio message into a textual representation. Besides the conversion into another representation, the coded media might also be transcoded into another format so that random access is enabled; this way, storing the same content multiple times can be avoided. This is the case, for instance, for a video with annotated key frames: for fast random access these key frames are transcoded into intra-coded frames if temporal prediction is used within the video.

Once the message is stored, the server part of the M3-Box, which waits for incoming calls, has access to it. When a user checks his account on the media box, he establishes a connection from his output device to the M3-Box. The output device can be one of the set already used for the input of messages, as depicted in Fig. 1, but it does not necessarily have to be of the same type. The advantage is that all messages are accessible regardless of which device is used to generate or fetch them. This is achieved by transmitting a description of the media which can be displayed on the output device together with the user preferences. According to these descriptions, a content presentation of the checked account is generated and transmitted by the server part of the M3-Box. The recipient can then use the output device to browse through the presentation.

4. Requirements on Unified Multimedia Messaging

To enable communication between consumer devices of different brands, the coding format of the messages has to be a common or standardized one. Thus, for multimedia messages, standards like MPEG-4 for multimedia data or XML for textual information should be used. Besides this multimedia data, metadata also has to be sent to describe the structure and the semantics of the contained media elements. With the upcoming MPEG-7 standard it is now possible to transmit and store metadata providing structuring information in a standardized fashion. So the overall requirement is to transmit and store metadata together with the multimedia data using standards as far as possible.

4.1. Requirements for the Input Devices

While the indexing process for a broadcast may take hours, a customer formulating a message will not use indexing possibilities if the process takes too long. Therefore, in a realistic scenario, the main constraint on the usage of the input device is that the time for indexing the message is limited to roughly the time spent generating the message itself. In addition to this restriction in time, the indexing tools have to be adapted to the capabilities of the input device to allow easy and intuitive usage.

4.2. Requirements for the Multimedia Mail Box

For access to the media data, additional index information has to be generated and added to the existing metadata to enable fast access. If real-time on-demand transcoding of media data into other formats is not possible, the necessary process should be started upon reception of the message, and the resulting data should be stored and indexed as described. In the server section, a handshake between the mobile device and the M3-Box is required at the beginning of a session to allow the exchange of access authorization, user preferences, and the description of the accessing device. This initialization should configure the information transmitted from the box to the client device, enabling the client to browse through the content of his multimedia mail box.

4.3. Requirements for the Output Devices

The characteristics for the rendering or display of messages have to be described during the initial handshake.
Furthermore, the user preferences should be adapted to these capabilities of the communication device used. Transmitting color key frames to a grayscale display, for example, wastes bandwidth and is time consuming. The user interaction with the M3-Box should also be adapted to the user interface of the client device. For example, this may require mapping links to buttons in the message viewer.

5. System Design and Realization

MPEG-7 specifies only the format for encoding metadata for exchange. Querying with query languages such as XQL is currently out of the scope of MPEG-7. Also the interaction with an MPEG-7 server has not been standardized yet. Nevertheless, the MPEG-7 description can already be used to manipulate the presentation of the described media on the server side and on client devices. On the input side of the M3-Box no interactive communication is required; the MPEG-7 information can already be used to initiate certain transformations such as transcoding of videos for fast access to indexed frames. In contrast, on the output side the presentation is interactive with respect to the presentation of the M3-Box account. To avoid a proprietary solution for the communication from the client device to the server, Extensible Stylesheet Language Transformations (XSLT), one part of XSL [12], can be used. The MPEG-7 description is based on XML Schema definitions, and the namespace of the elements is known by both server and client. In this way, XSL documents applicable to MPEG-7 descriptions on the server side can be defined by the user or the client device. The messages themselves are most likely presented in a description-centric way (see also section 5.3): the message is not played back sequentially; instead, the structure described within the metadata is used to display the message structure with representatives specified within MPEG-7. Thus the XSL document can be edited without knowledge of the programs (e.g. CGIs) or URIs which have to be called to generate a representative of a message section according to the requirements of the client device. The specification of user preferences and the device description is therefore applicable to MPEG-7 descriptions in general and independent of the message center in use. The mentioned mechanisms are specified in more detail in the M3-Box output section 5.3.

As described in the previous section, the system can be divided into three main parts: the input device with a user interface for adding indexing information, the message center, composed of a core and a server, and the output device. The functionalities of these elements are depicted in Fig. 2. The input and the accessing devices are simulated on a PC platform. For simplicity, the communication is based on the TCP/IP protocol, but this does not restrict the communication at a higher level. The four components of the M3-Box scenario will be described in detail in the following sections.

[Fig. 2: Block diagram of the M3-Box scenario, showing the four components and their functions: the input/indexer (simulating the input device interface, storing media data, post-editing, generating the XML-based message description), the MM mail box core (transcoding, e.g. speech recognition, storing and indexing media elements and descriptions), the output/server (validating the login information, generating the message presentation based on user preferences and device description, transcoding the media elements), and the client device (editing and transmitting the user's preferences and device description, rendering the message presentation, validating the user interaction), connected via socket connections.]

5.1. Input Part

For the presented prototype, the input devices are simulated on a PC platform. For this purpose we have adopted user interfaces of mobile devices (see Fig. 3a for a mobile videophone) as the input interface of a standalone input program simulating the device behavior. The simulated input device is capable of recording several media elements which build a message. In the case of the videophone, for instance, audio, video, and short text messages, restricted by the small keyboard, can be captured. In plain multimedia messages these media elements are transmitted in the parallel or sequential order in which they were recorded. For an intuitive use of multimedia elements within a message, the semantics of the media elements is also important and often not obvious to the recipient. Fig. 3b shows an example of a structured message as it is recorded. It is composed of different media elements as mentioned above. Some of these media elements are used for the introduction and to welcome the recipient. Then the message subject is given by a video element.

This video element is divided into segments in which representative frames are used to comment on or explain what can be seen or what happened. For this purpose, audio or text comments can be attached to the media elements. Furthermore, references to related shots can be specified, for instance to discuss different occurrences of the same object.

[Fig. 3: a) An example of a user interface of a mobile videophone (upper left corner); b) An example of a recorded message with media elements, key scenes, frames, and audio comments.]

This example shows that, besides the information contained in the media objects themselves, the semantics of the media elements should be captured in the way intended by the sender. This is realized by instantiating the following set of top-level description schemes of the Multimedia Description Schemes (MDS) [7] in an MPEG-7-conformant description: the Segment DS for video and audio, the StructuredAnnotation DS, and the Summarization DS. The Segment DS and the contained segment decomposition divide the media elements, such as those for the introduction and the subject of the message, into segments. This part also incorporates the annotation of these segments. The segments themselves can have several representations specified within the summary sections. Besides the description of the media content and semantics, the format and the size of the media elements are described using the MediaInformation DS contained in the root segments of the media descriptions. Alternative representations of the content are abstracts and are therefore described in the Summary DS. The benefit, also with respect to a protocol-based solution which could carry the same information, is presented in the next section.

The standardization of MPEG-7 has not been finished yet. The current implementation therefore also uses DSs which may change in the final standard. On the other hand, some DSs have been derived from existing ones to enable descriptions of, for instance, annotation relations between message elements. The wrapper around the MPEG-7 description is currently proprietary as well, since this part has not been standardized yet. Furthermore, it has to be emphasized that even though the Description Definition Language of MPEG-7 is based on XML Schema, the instantiation of a standard-conformant MPEG-7 description is an XML document. Assuming that this XML document is valid, it only has to be well formed to be processed by a parser. So for MPEG-7-conformant communication an MPEG-7 parser evaluating the validity of the document is not required. Since an MPEG-7 parser is still under development, a standard XML parser (Xerces [13]) is used for the communication components of this application, with no restrictions in functionality.

5.2. M3-Box Core

The Multimedia Message Box itself is divided into two parts: the core and the server. The core watches a port for connections of a message sender to the M3-Box and processes the retrieved message. The server processes the connections of output devices requesting messages in a box account.

The core has four functions: receiving the message, adding supplemental metadata information, transcoding the message, and storing the message for server access. When receiving the message, the time point and the sender's IP address and domain name (if available) are stored within the transmitted MPEG-7 description. During the following transcoding, descriptions of alternative representations, such as the format description, are added to the MPEG-7 description of the message; this also includes the format and size of the media elements. Once the core has received the message, it also starts to transcode the message into a format that allows fast random access to media chunks addressed by the description data. For instance, encoding indexed frames as I-frames within H.263 or MPEG videos speeds up frame extraction. Another, more time-consuming task is the offline transcription of the spoken content of audio files using a speech recognition engine. This task is enabled by specifying the speaker profile and a server which stores this profile (this is currently not contained as MPEG-7 information). To further enable a structured description of these transcoded media elements, the segment specifications of the spoken content are transformed into text segments. Even though the recognition rate varies, the transformed content still preserves information. This is especially valuable when retrieving this component of the message, e.g. with a Palm device that is not able to play audio. In the opposite direction, text-to-speech transcoding is also done at this stage. This last function is currently implemented in the core, but could also be implemented using the Common Gateway Interface (CGI) of the server to generate the audio message on request and thus reduce the storage footprint of the messages.

5.3. The M3-Box Server and Output Devices

After the messages have passed through the M3-Box core, they are stored in the data structure of the M3-Box server. The M3-Box server has three functions: establishing the connection with an output device, generating the presentation of an M3-Box account online according to the user preferences and the client device description, and enabling the user to browse through the messages. This part of the multimedia message box is based on a web server, e.g. the Apache server [9]. When an output device establishes the connection to the server, an initial handshake takes place. First the authentication information, e.g. based on the GNU Privacy Guard (GPG) [10], is sent. This information is user but not device dependent. Then the client transmits the device description and the user preferences. These specifications are encoded in XSL files containing XSLT instructions to process the MPEG-7 XML documents.
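As an illustration of such a client-supplied specification, a minimal XSLT stylesheet generating a TOC page from the account description could look as follows. The sketch assumes the element names used in the example of section 5.3 (list, videosegment, MediaInformation); the Title element and the CGI script message.cgi are hypothetical placeholders and not part of the described prototype.

<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- One TOC entry per message segment of the account description. -->
  <xsl:template match="/list">
    <html>
      <body>
        <h3>M3-Box account</h3>
        <ul>
          <xsl:for-each select="videosegment">
            <li>
              <!-- link to a (hypothetical) CGI delivering the selected segment -->
              <a href="message.cgi?id={@id}">
                <xsl:value-of select="Title"/>
              </a>
              (<xsl:value-of select="MediaInformation/MediaProfile/MediaFormat/FileFormat"/>)
            </li>
          </xsl:for-each>
        </ul>
      </body>
    </html>
  </xsl:template>
</xsl:stylesheet>

Because such a stylesheet refers only to element names fixed by the MPEG-7 descriptions, it can in principle be reused with any message center serving MPEG-7 descriptions.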
Even though XSLT as a language does not standardize the kind of information it operates on, this is well defined in the presented case since the tags it operates on are standardized by MPEG-7. Based on the MPEG-7 encoded structure of the messages of one account and the XSL file, an HTML presentation embedding multimedia elements is generated. For this purpose an XML/XSL parser, e.g. the one of the Microsoft Internet Explorer 5.0 [11], is used. At this stage media elements can be included in or excluded from the presentation depending on their format, size, and semantic meaning. Processing the message description after the server has received the XSL file is also possible: specifying all possible video transcodings in the message description, for instance, would increase its size tremendously. These specifications can therefore be inserted after receiving and analyzing the XSL file from the client device. The server can then change the MPEG-7 description to refer to CGI programs which transcode the media elements with respect to the specification of the formats that can be rendered on the client device. This can also be initiated by a user interaction, by sending a new XSL file. Usually, however, the communication of user interactions is realized using linking mechanisms within the HTML file parsed from the MPEG-7 descriptions.
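As a sketch of this server-side rewriting, the locator of a video segment could be redirected from the stored file to a transcoding CGI. The element names are modeled on the MPEG-7 MediaLocator/MediaUri tools, and transcode.cgi together with its parameters is invented for this example.

<!-- Before: locator of the stored media file. -->
<MediaLocator>
  <MediaUri>file:///m3box/store/msg17/segment3.mpg</MediaUri>
</MediaLocator>

<!-- After analyzing the client's XSL file: locator pointing to an on-the-fly transcoder. -->
<MediaLocator>
  <MediaUri>http://m3box.example.org/cgi-bin/transcode.cgi?msg=17&amp;seg=3&amp;format=h263&amp;bitrate=64000</MediaUri>
</MediaLocator>

Only the formats announced by the client device are offered in this way, so the description does not grow by the full set of conceivable transcodings.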

To demonstrate the flexibility of this solution, the presentation of an M3-Box account based on an XSL specification of the device and user preferences is presented in the following. The XSL file is configured to show the content of the account in three levels (see Fig. 4): initially, after the connection of the client device has been established, the TOC of the account is displayed using the message subject and a key media element. To give an idea of how the information listed in the message box presentation is specified, an example of the XSL specification that determines the media type from the MPEG-7 description is given:

...
<xsl:variable name="mediatype"
    select="/list/videosegment[@id=($n - 1)]/MediaInformation/MediaProfile/MediaFormat/FileFormat[position()=2]" />
...
<xsl:if test="$mediatype= ">
    <IMG SRC="../images/speaker5.gif" />
</xsl:if>
...

The first line assigns values from the media information context of the MPEG-7 description to a variable, which is checked in order to display media type icons already stored on the client device for the TOC presentation. Similar XSL elements can be specified to choose the representations for media elements and to perform even more complex evaluations of the description.

[Fig. 4: Account presentation of the M3-Box according to the user preferences and device description of a color PDA in an XSL file. In the user preferences three levels are specified: a TOC of the account (a), a TOC of a message if selected by user interaction (b), and the video playback of message segments (not shown).]

After selecting a message, its content is presented in summarized fashion so that the user can select the specific part he is interested in and retrieve it in full detail. Based on the preferences, the client device, or the bandwidth of the connection, it is also possible to render other elements of the message in the representations, or even to use a different structuring than the three presented levels. Currently, the output devices used, such as PDAs and notebooks, support rendering of and interaction with HTML pages. Even for devices which do not provide a visual interface, it has been shown in [14] that HTML-based content can be presented comprehensibly. Beyond this, an even better adapted presentation can be generated easily as long as a markup-based language for the rendering of information exists; in that case the XSL file can be adapted to generate a presentation based on this markup language and the MPEG-7 description of the message.
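Along the same lines, the media type test of the example above could be extended to map several formats to different icons. The format strings and the additional icon files below are illustrative placeholders for whatever values and resources the M3-Box core and the client device actually use (only speaker5.gif appears in the example above).

<xsl:choose>
  <!-- audio segment: loudspeaker icon -->
  <xsl:when test="contains($mediatype, 'mp3')">
    <IMG SRC="../images/speaker5.gif"/>
  </xsl:when>
  <!-- video segment: hypothetical film-strip icon -->
  <xsl:when test="contains($mediatype, 'mpg')">
    <IMG SRC="../images/film.gif"/>
  </xsl:when>
  <!-- default: hypothetical text icon -->
  <xsl:otherwise>
    <IMG SRC="../images/text.gif"/>
  </xsl:otherwise>
</xsl:choose>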

6. Conclusions

In this paper the basic concept of a Multimedia Message Box using structured messages based on the MPEG-7 standard was presented. Even though the standardization process has not been finished yet, the present definition of the MPEG-7 description schemes can be used to describe the content of a structured message. This can be used to present the message on different output devices while maintaining most of the contained information. In contrast to alternative solutions, the description of the message content and its media formats with MPEG-7 metadata allows media elements of the message to be transcoded according to their semantic importance for the message. To evaluate this meta information on the server side, it has been shown that XSLT can be used to achieve a standard-conformant communication from the output device to the server. Subsequently, the realization of this application was investigated for a set of different custom communication devices, the core, and the output server of the M3-Box. At this stage, the presented application can serve as an integration platform for technology developed in the domains of multimedia analysis, summarization, editing and structuring, and automatic universal multimedia access adaptation. It turned out that adapting the user interface of the custom devices to the annotation and interaction possibilities, including the aforementioned technologies, will be important to prove the usability of such a system.

References

[1] MPEG-7 Applications Document v.9, ISO/IEC JTC1/SC29/WG11/N2860, http://www.cselt.stet.it/mpeg/, July 1999.
[2] MPEG-7: Context, Objectives and Technical Roadmap, V.12, ISO/IEC JTC1/SC29/WG11/N2861, http://www.cselt.stet.it/mpeg/, July 1999.
[3] Smith J. R., Mohan R., Li C., Scalable Multimedia Delivery for Pervasive Computing, in Proc. ACM International Conference on Multimedia '99, pp. 130-139, Orlando, November 1999.
[4] Boll S., Klas W., Wandel J., A Cross-Media Adaptation Strategy for Multimedia Presentations, in Proc. ACM International Conference on Multimedia '99, pp. 37-46, Orlando, November 1999.
[5] Annotation of Web Content for Transcoding, W3C Note, http://www.w3.org/tr/annot/, July 1999.
[6] MPEG-7 Multimedia Description Schemes XM v.3.0, ISO/IEC JTC1/SC29/WG11/N3410, http://www.cselt.stet.it/mpeg/, May 2000.
[7] MPEG-7 Multimedia Description Schemes WD v.3.0, ISO/IEC JTC1/SC29/WG11/N3411, http://www.cselt.stet.it/mpeg/, May 2000.
[8] Composite Capabilities/Preference Profiles: Requirements and Architecture, W3C, http://www.w3.org/tr/ccpp-ra/, July 2000.
[9] The Apache Software Foundation, http://www.apache.org/.
[10] The GNU Privacy Guard, http://www.gnupg.org/.
[11] Microsoft XML Parser Preview Release, http://msdn.microsoft.com/downloads/webtechnology/xml/msxml.asp, May 2000.
[12] Extensible Stylesheet Language: An Overview, http://www.w3.org/style/xsl/overview.html.
[13] The Apache XML Project, http://xml.apache.org/.
[14] Goose S., Wynblatt M., Mollenhauer H., 1-800-Hypertext: Browsing Hypertext with a Telephone, in Proc. ACM International Conference on Hypertext, pp. 287-288, June 1998.