MPEG-4 Authoring Tool for the Composition of 3D Audiovisual Scenes


Petros Daras, Ioannis Kompatsiaris, Member, IEEE, Theodoros Raptis and Michael G. Strintzis*, Fellow, IEEE

This work was supported by the PENED99 project of the Greek Secretariat of Research and Technology and by the EC IST project INTERFACE. The authors are with the Informatics and Telematics Institute, 1st Km Thermi-Panorama Road, 57001 Thermi-Thessaloniki, Greece, and with the Information Processing Laboratory, Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, 540 06 Thessaloniki, Greece. Email: strintzi@eng.auth.gr

Abstract

We describe an authoring tool with 3D functionalities for the MPEG-4 multimedia standard. MPEG-4 offers numerous novel capabilities beyond more efficient compression, such as coding of audiovisual (natural and synthetic) objects rather than frames, integration of 2D and 3D content, face- and body-specific features, and separate transmission of the elementary stream corresponding to each audiovisual object. However, the implementation of these capabilities requires a complex authoring process, employing many different functionalities, from the encoding of audio, visual and BIFS streams to the implementation of different delivery scenarios: local access on CD/DVD-ROM, Internet, or broadcast. As the history of multimedia systems teaches, however powerful the underlying technologies, the success of such systems depends on their ease of authoring. Existing MPEG-4 authoring tools allow only the creation of 2D MPEG-4 scenes. For this reason, we have developed a novel authoring tool that fully exploits the 3D functionalities of the MPEG-4 standard. The scenes presented in the paper integrate unique MPEG-4 features such as Updates and Facial Animation, demonstrating the production of fully MPEG-4 compliant scenes that would be almost impossible for a non-expert to build from scratch using only text. The presented authoring tool is based upon an open and modular architecture able to progress with MPEG-4 versions, and it is easily adaptable to newly emerging, higher-level authoring features. The authoring tool is available for download from our web site: http://uranus.ee.auth.gr/pened99/demos/Authoring Tool/authoring tool.html

EDICS: 8-STDS Standards and Related Issues

I. Introduction

MPEG-4 is the next-generation representation standard following MPEG-1 and MPEG-2. Whereas the former two MPEG standards dealt with the coding of general audio and video streams, MPEG-4 specifies a standard mechanism for the coding of audio-visual objects. MPEG-4 builds on the proven success of three fields [1], [2], [3]:
- Digital television.
- Interactive graphics applications (synthetic content).
- Interactive multimedia (World Wide Web, distribution of and access to content).

MPEG-4 audiovisual scenes are composed of several media objects, organized in a hierarchical fashion. At the leaves of the hierarchy we find primitive media objects, such as still images (e.g. a fixed background), video objects (e.g. a talking person, without the background), audio objects (e.g. the voice associated with this person), etc.

Apart from natural objects, MPEG-4 also allows the coding of two-dimensional and three-dimensional, synthetic and hybrid, audio and visual objects. Coding of objects enables content-based interactivity and scalability (Figure 1) [4].

MPEG-4 Systems facilitates the organization of the audio-visual objects decoded from elementary streams into a presentation [5]. The coded stream describing the spatio-temporal relationships between the coded audio-visual objects is called the Scene Description, or BIFS (Binary Format for Scenes), stream. Scene description in MPEG-4 is an extension of that in VRML (Virtual Reality Modeling Language) [6], extended to include coding and streaming, timing, and the integration of 2D and 3D objects. Furthermore, the Extensible MPEG-4 Textual format (XMT) [7] has been designed to provide an exchangeable format between content authors while preserving the authors' intentions in a high-level textual format. In addition to providing a suitable, author-friendly abstraction of the underlying MPEG-4 technologies, another important consideration in the XMT design was to respect the existing practices of content authors, such as those of Web3D X3D and HTML. Other 3D scene description and authoring frameworks, such as the Extensible 3D (X3D) Graphics specification [8], are still under active development.

Thus, the objective of MPEG-4 is to provide an audiovisual representation standard supporting new ways of communication, access, and interaction with digital audiovisual data, and offering a common technical solution to various service paradigms (telecommunications, broadcast, and interactive) whose separating borders are rapidly disappearing. MPEG-4 supplies an answer to the emerging needs of application fields such as video on the Internet, multimedia broadcasting, content-based audiovisual database access, games, audiovisual home editing, advanced audiovisual communications (notably over mobile networks), tele-shopping, and remote monitoring and control.

MPEG-4 authoring is undoubtedly a challenge.

Fig. 1. Overview of MPEG-4 Systems.

Far from the past simplicity of MPEG-2's one-video-plus-two-audio-streams model, MPEG-4 allows the content creator to compose together, spatially and temporally, large numbers of objects of many different types: rectangular video, arbitrarily shaped video, still image, speech synthesis, voice, music, text, 2D graphics, 3D, and more. In [9] the best-known MPEG-4 authoring tool, MPEG-Pro, was presented; it includes a user interface, BIFS updates and a timeline, but it can only handle 2D scenes. In [10] an MPEG-4 compliant authoring tool was presented which likewise allows the content creator to compose 2D scenes only. In other articles [11], [12], [13], [14], MPEG-4 related algorithms are presented for the segmentation and generation of video objects; these, however, do not provide a complete MPEG-4 authoring suite. Commercial multimedia authoring tools such as IBM HotMedia and Veon [15], [16] are based on proprietary formats rather than widely accepted standards.

In this paper we present a 3D MPEG-4 authoring tool capable of creating MPEG-4 content with 3D functionalities, from the end-user interface specification phase to the cross-platform MP4 file. Existing MPEG-4 authoring tools allow only the creation of 2D MPEG-4 scenes. The presented authoring tool integrates unique MPEG-4 3D functionalities and features, such as Updates and Facial Animation, allowing the production of fully MPEG-4 compliant scenes that would be almost impossible for a non-expert to build from scratch using only text. More specifically, the user can insert basic 3D objects such as box, sphere, cone, cylinder and text, and modify their attributes.

Generic 3D models can be created or inserted and modified using the IndexedFaceSet node. Furthermore, the behavior of the objects can be controlled by various sensors (time, touch, cylinder, sphere, plane) and interpolators (color, position, orientation). Static images and video can be texture-mapped onto the 3D objects. The user can modify the temporal behavior of the scene by adding, deleting and/or replacing nodes over time using the Update commands. Synthetic faces can also be added using the Face node and their associated Facial Animation Parameter (FAP) files. Although several FAP extraction [17], [18], [19] and 3D motion estimation algorithms [20] have been presented, there is no authoring suite for integrating such synthetic faces into a complete scene. It is shown that our choice of an open and modular architecture for the MPEG-4 authoring system endows it with the ability to easily integrate new modules.

MPEG-4 provides a large and rich set of tools for the coding of audio-visual objects [21]. In order to allow effective implementations of the standard, subsets of the MPEG-4 Systems, Visual and Audio tool sets have been identified that can be used for specific applications. These subsets, called Profiles, limit the tool set a decoder has to implement. For each of these Profiles, one or more Levels have been set, restricting the computational complexity. Profiles exist for various types of media content (audio, visual and graphics) and for scene descriptions. The presented authoring tool is compliant with the following profiles: the Simple Facial Animation Visual Profile, the Scalable Texture Visual Profile, the Hybrid Visual Profile, the Natural Audio Profile, the Complete Graphics Profile, the Complete Scene Graph Profile, and the Object Descriptor Profile, which includes the Object Descriptor (OD) tool.

The paper is organized as follows. In Section II MPEG-4 BIFS is presented. In Section III the classes of nodes in an MPEG-4 scene are defined. In Section IV an overview of the authoring tool architecture and the user interface is given. In Section V the procedure for building an MPEG-4 3D scene using the authoring tool is described. Some important implementation-specific issues, especially the MPEG-4 reference software and the method by which OpenGL is used to enable a 3D preview of the scene, are examined in Section VI. In Section VII, experiments demonstrate 3D scenes composed with the authoring tool. Finally, conclusions are drawn in Section VIII.

II. BINARY FORMAT FOR SCENES (BIFS)

The BIFS description language [22] has been designed as an extension of the VRML 2.0 specification [6]. VRML is designed to be used on the Internet, intranets, and local client systems, in application areas such as engineering and scientific visualization, multimedia presentations, entertainment and educational titles, web pages, and shared virtual worlds. Version 2 BIFS (Advanced BIFS, included in MPEG-4 Version 2) is a superset of VRML and can be used as an effective tool for compressing VRML scenes; in Version 2 of MPEG-4 Systems, all VRML nodes are supported. BIFS extends the base VRML specification in various respects:
i. New media capabilities in the scene:
- 2D nodes containing 2D graphics and a 2D scene graph description, and the mixing of 2D and 3D graphics;
- new audio nodes supporting advanced audio features: mixing of sources, a streaming audio interface, and the creation of synthetic audio content;
- face- and body-specific nodes that link to Face and Body animation streams;
- specific nodes linked to the streaming client/server environment, such as media time sensors and back-channel messages.
ii. A binary encoding of the scene, so that the scene can be transmitted efficiently.
iii. Specific protocols to stream scene and animation data:
- the BIFS-Command protocol, to send synchronized modifications of the scene within a stream;
- the BIFS-Anim protocol, to stream continuous animation of the scene.

BIFS is a compact binary format representing a pre-defined set of scene objects and behaviors along with their spatio-temporal relationships. In particular, BIFS contains the following four types of information:
- the attributes of media objects, which define their audio-visual properties;
- the structure of the scene graph which contains these objects;
- the pre-defined spatio-temporal changes of these objects, independent of user input;
- the spatio-temporal changes triggered by user interaction.

Audiovisual objects have both spatial and temporal extent. Temporally, all objects have a single dimension: time. Spatially, objects may be located in 2-dimensional or 3-dimensional space. Each object has a local coordinate system, in which the object has a fixed spatio-temporal location and scale (size and orientation). Objects are positioned in the scene by specifying a coordinate transformation from the object's local coordinate system into another coordinate system defined by a parent node. The coordinate transformation locating an object in a scene is an attribute of the scene, rather than of the object; therefore, the scene description has to be sent as a separate elementary stream.

Elementary streams are a key notion in MPEG-4. A complete MPEG-4 presentation transports each media object in a different elementary stream. Elementary streams are composed of access units (e.g. a video object frame), packetized into Sync Layer (SL) packets. Some objects may be transported in several elementary streams, for instance if scalability is involved; this is an important feature for bitstream editing, one of the content-based functionalities in MPEG-4.

The scene description follows a hierarchical structure that can be represented as a tree (Figures 2, 3). Each node of the tree is an audiovisual object, and complex objects are constructed by using appropriate scene description nodes.
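To make the hierarchical structure concrete, the sketch below shows a small scene description in the VRML-like textual form from which BIFS is encoded. The node names are standard BIFS/VRML; the DEF identifiers and field values are invented for the illustration:

    Group {
      children [
        DEF BACKDROP Background {
          skyColor [ 0.0 0.0 0.3 ]        # leaf object: a still background
        }
        DEF SPEAKER Transform {           # grouping node: positions its subtree
          translation 0.0 1.0 0.0         # local-to-parent coordinate transform
          children [
            Shape {                       # leaf object: a synthetic visual object
              appearance Appearance { material Material { diffuseColor 1.0 0.8 0.6 } }
              geometry Sphere { radius 0.5 }
            }
          ]
        }
      ]
    }

Note that the coordinate transformation locating the Shape is carried by the parent Transform node, i.e. it is an attribute of the scene rather than of the object, as discussed above.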

Fig. 2. Example MPEG-4 scene.

Fig. 3. Corresponding scene tree.

The tree structure is not necessarily static: the relationships can evolve over time, and nodes may be deleted, added or modified. Individual scene description nodes expose a set of parameters through which several aspects of their behavior can be controlled. Examples include the pitch of a sound, the color of a synthetic visual object, or the speed at which a video sequence is to be played. There is a clear distinction between the audiovisual object itself, the attributes that enable the control of its position and behavior, and any elementary streams that contain coded information representing some attributes of the object.

The scene description does not directly refer to elementary streams when specifying a media object, but uses the concept of object descriptors. The purpose of the object descriptor framework is to identify elementary streams and properly associate them with the media objects used in the scene description. A media object that requires elementary stream data points to an object descriptor by means of a numeric identifier, an ObjectDescriptorID. An ObjectDescriptor (OD) is a structure containing pointers to elementary streams; typically these pointers lead not to remote hosts, but to elementary streams that are being received by the client. ODs also contain additional information, such as Quality of Service parameters. Each object descriptor is itself a collection of descriptors that describe the elementary streams comprising a single media object. An ES descriptor identifies a single stream with a numeric identifier, the ES_ID. In the simplest case, an OD contains just one ES descriptor that identifies, for example, the audio stream belonging to the AudioSource node by which this OD is referenced [23]. The same object descriptor may also be referenced from two distinct scene description nodes. On the other hand, within a single OD it is also possible to have two or more ES descriptors, for example one identifying a low bit-rate audio stream and another identifying a higher bit-rate stream with the same content. In that case the terminal (or rather the user) has a choice between two audio qualities. Specifically for audio, it is also possible to have multiple audio streams in different languages, selected according to user preferences. In general, all kinds of streams with different resolutions or bit-rates representing the same audio or visual content may be advertised in a single object descriptor in order to offer a choice of quality. By contrast, streams that represent different audio or visual content must be referenced through distinct object descriptors. As an example, an AudioSource and a MovieTexture node, which obviously refer to different elementary streams, have to use two distinct ODs (Figure 4).
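A textual sketch of this mechanism is given below. The structure (an OD carrying one or more ES descriptors) follows the standard, but the exact field spellings vary with the tool used, and the IDs here are invented for the illustration:

    ObjectDescriptor {
      objectDescriptorID 1               # referenced by an AudioSource node
      esDescr [
        ES_Descriptor { ES_ID 101 }      # low bit-rate audio stream
        ES_Descriptor { ES_ID 102 }      # higher bit-rate stream, same content
      ]
    }
    ObjectDescriptor {
      objectDescriptorID 2               # referenced by a MovieTexture node
      esDescr [
        ES_Descriptor { ES_ID 201 }      # visual stream: different content,
      ]                                  # hence a distinct OD
    }

Here the terminal may choose between ES_ID 101 and 102 according to the available bit-rate, whereas the visual stream must be advertised through its own descriptor.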

Fig. 4. Different scene description node types need different object descriptors.

III. BIFS scene description features

The proposed MPEG-4 authoring tool implements the BIFS node graph structure, allowing authors to take full advantage of MPEG-4 node functionalities through a user-friendly interface.

A. Scene structure

Every MPEG-4 scene is constructed as a directed acyclic graph of nodes. The following types of nodes may be defined:
- Grouping nodes construct the scene structure.
- Children nodes are offspring of grouping nodes, representing the multimedia objects in the scene.

- Bindable children nodes are a specific type of children node of which only one instance can be active at a time in the scene (a typical example is the Viewpoint of a 3D scene: a scene may contain multiple viewpoints, or cameras, but only one can be active at a time).
- Interpolator nodes constitute another subtype of children nodes; they represent interpolation data used to perform key-frame animation, generating a sequence of values as a function of time or other input parameters.
- Sensor nodes sense user and environment changes, enabling the authoring of interactive scenes.

B. Nodes and fields

BIFS and VRML scenes are both composed of a collection of nodes arranged in a hierarchical tree. Each node represents, groups or transforms an object in the scene and consists of a list of fields that define the particular behavior of the node. For example, a Sphere node has a radius field that specifies the size of the sphere. MPEG-4 has roughly 100 nodes with 20 basic field types representing the basic field data types: boolean, integer, floating point, two- and three-dimensional vectors, time, normal vectors, rotations, colors, URLs, strings, images, and other more arcane data types such as scripts. Figure 16 shows the list of the most common MPEG-4 nodes. The nodes supported by the current version of the MPEG-4 Authoring Tool are indicated in bold; full functionality of these nodes is provided to the author.

C. ROUTEs and dynamical behavior

The event model of BIFS uses the VRML concept of ROUTEs to propagate events between scene elements. ROUTEs are connections that assign the value of one field to another field. As is the case with nodes, ROUTEs can be assigned a name in order to identify specific ROUTEs for modification or deletion. ROUTEs combined with interpolators can cause animation in a scene. For example, the value of an interpolator can be ROUTEd to the rotation field of a Transform node, causing the nodes in the Transform node's children field to be rotated as the values in the corresponding field of the interpolator node change with time. This event model has been implemented as shown in Figure 5, allowing users to add interactivity and animation to the scene.

Fig. 5. The interpolators panel.
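A minimal sketch of this event model in textual form (standard BIFS/VRML node names; the DEF identifiers and values are invented) shows a TimeSensor driving a PositionInterpolator whose output is routed to a Transform:

    DEF CLOCK TimeSensor {
      loop TRUE
      cycleInterval 5                    # one animation cycle every 5 seconds
    }
    DEF MOVER PositionInterpolator {
      key [ 0.0, 0.5, 1.0 ]
      keyValue [ 0 0 0, 0 2 0, 0 0 0 ]   # up and back down
    }
    DEF BALL Transform {
      children Shape { geometry Sphere { radius 0.5 } }
    }
    ROUTE CLOCK.fraction_changed TO MOVER.set_fraction
    ROUTE MOVER.value_changed TO BALL.set_translation

The interpolators panel of Figure 5 builds this kind of sensor/interpolator/ROUTE chain on the author's behalf, without any text being written.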

D. Streaming scene description updates: BIFS-Command

MPEG-4 is designed to be used in broadcast applications as well as in interactive and one-to-one communication applications. To meet this requirement, an important concept developed within MPEG-4 BIFS is that the application itself can be seen as a temporal stream: the presentation, or the scene itself, has a temporal dimension. On the web, the model used for multimedia presentations is that a scene description (for instance an HTML page or a VRML scene) is downloaded once and then played locally. In the MPEG-4 model, a BIFS presentation, which describes the scene itself, is delivered over time. The basic model is that an initial scene is loaded and may then receive further updates; in fact, the initial scene loading is itself considered an update. The concept of a scene in MPEG-4 therefore encapsulates the elementary stream(s) that convey it over time. The mechanism by which BIFS information is provided to the receiver over time is the BIFS-Command protocol (also known as BIFS-Update), and the elementary stream that carries it is called a BIFS-Command stream. BIFS-Command conveys commands for the replacement of a scene, the addition or deletion of nodes, the modification of fields, etc. For example, a ReplaceScene command becomes the entry (or random access) point for a BIFS stream, in exactly the same way as an Intra frame serves as a random access point for video. BIFS commands come in four main functionalities: scene replacement, node/field/route insertion, node/value/route deletion, and node/field/value/route replacement. The BIFS-Command protocol has been implemented so as to allow the user to temporally modify the scene through the authoring tool's user interface.
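The four functionalities can be sketched in textual form as follows. The AT/INSERT/DELETE/REPLACE keywords loosely follow the BIFS-text conventions accepted by encoders such as BifsEnc; the node names and times are invented for the illustration:

    AT 0 {                               # initial scene load: itself an update
      REPLACE SCENE BY Group { children [ DEF ROOT Transform {} ] }
    }
    AT 500 {                             # 500 ms later: node insertion
      INSERT AT ROOT.children Shape { geometry Box {} }
    }
    AT 1000 {
      REPLACE ROOT.translation BY 0 1 0  # field value replacement
    }
    AT 1500 {
      DELETE ROOT                        # node deletion, subtree included
    }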

E. Facial Animation

The Facial and Body Animation nodes can be used to render an animated face. The shape, texture and expressions of the face are controlled by the Facial Definition Parameters (FDPs) and the Facial Animation Parameters (FAPs). Upon construction, the face object contains a generic face with a neutral expression, which can already be rendered. The face can also immediately receive animation parameters from the bitstream, producing animation: expressions, speech, etc. Meanwhile, definition parameters can be sent to change the appearance of the face from the generic one to a particular face with its own shape and, optionally, texture. If so desired, a complete face model can be downloaded via the FDP set. The described application implements the Face node using the generic MPEG-4 3D face model, allowing the user to insert a synthetic 3D animated face.

IV. MPEG-4 Authoring Tool

A. System Architecture

Fig. 6. System Architecture.

The process of creating MPEG-4 content can be characterized as a development cycle with four stages: Open, Format, Play and Save (Figure 6). In this somewhat simplified model, the content creators can:
i. Open an existing file.
ii. Format saved scenes or create their own scenes:
- Insert 3D objects, such as spheres, cones, cylinders, text, boxes and a background, by clicking the appropriate icon (Figure 8).
- Modify the attributes, such as 3D position, size and color (Figure 9), of the edited objects, or delete objects from the content created.

- Add realism to the scene by associating image and video textures with the inserted objects.
- Duplicate already inserted objects using the copy-and-paste functionality.
- Group objects in order to change their attributes simultaneously (e.g. move a group of objects), or duplicate a group of objects using the copy-and-paste operation.
- Insert sound and video streams.
- Add interactivity to the scene using sensors and interpolators, enabling, for example, motion of objects or periodic changes of color. Sensors allow interactivity between objects; for example, when an object is clicked a new one is inserted.
- Dynamically control the scene using an implementation of the BIFS-Command protocol. For example, the author can define that a specific part (group) of the scene appears 10 seconds after the initial loading of the scene.
- Create, insert and modify generic 3D models using the IndexedFaceSet node.

Fig. 7. Authoring tool application toolbar.

Details of how all these procedures can be accomplished are given in the following subsections describing the user interface of the authoring tool, and in the Example of Use section. During the creation process, the attributes of the objects and the commands, as defined in the MPEG-4 standard and more specifically in BIFS, are stored in an internal program structure which is continuously updated depending on the actions of the user. At the same time, the creator can see a real-time 3D preview of the scene in an integrated window using OpenGL tools (Figure 8).
iii. Play the created content by interpreting the commands issued during the editing phase, allowing the author to check the final presentation of the current description.
iv. Save the file either in a custom format or, after encoding/multiplexing and packaging, as an MP4 file [21], the standard MPEG-4 file format. The MP4 file format is designed to contain the media information of an MPEG-4 presentation in a flexible, extensible format which facilitates the interchange, management, editing and presentation of the media.

B. User Interface

To improve the authoring process, powerful tools must be provided to the author [24]. The temporal dependence and variability of multimedia applications hinder the author from obtaining a real perception of what he is editing. OpenGL was used to create an environment with multiple, synchronized views in order to overcome this difficulty. The interface is composed of three main views, as shown in Figure 8.

Edit/Preview: By integrating the presentation and editing phases in the same view, we enable the author to see a partial result of the created object in an OpenGL window. When an object is inserted in the scene, it is immediately visible in the presentation (OpenGL) window, at exactly the given 3D position. If a particular behavior is assigned to an object, for example a video texture, the full video can be seen during scene play only; in the preview window only the first frame is shown.

If an object already has a video texture (or an image texture) and the user tries to map an image texture (or a video texture, respectively) onto it, a warning message appears. The integration of the two views is very useful for the initial scene composition.

Fig. 8. Main Window, indicating the different components of the user interface.

Scene Tree: This pane provides a structural view of the scene as a tree (a BIFS scene is a graph, but for ease of presentation the graph is reduced to a tree for display). Since the edit view cannot be used to display the behavior of the objects, the scene tree is used to provide more detailed information about them. The drag-and-drop and copy-and-paste operations can also be used in this view.

Object Details: This window, shown in Figure 9, offers object properties that the author can use to assign values other than the defaults to the objects. These properties are: 3D position, 3D rotation and 3D scale; color (diffuse, specular, emission) and shininess; texture; video stream and audio stream (the audio and video streams are transmitted as two separate elementary streams according to the object descriptor mechanism); cylinder and cone radius and height; text style (plain, bold, italic, bold-italic) and fonts (serif, sans, typewriter); sky and ground background, and background texture; and interpolators (color, position, orientation) and sensors (sphere, cylinder, plane, touch, time) for adding interactivity and animation to the scene.

Fig. 9. Object Details window, indicating the properties of the objects.

Furthermore, the author can insert, create and manipulate generic 3D models using the IndexedFaceSet node; simple VRML files can easily be inserted. Synthetically animated 3D faces can be inserted via the Face node. The author must provide a FAP file [25] and the corresponding EPF file (Encoder Parameter File, designed to give the FAP encoder all the information related to the corresponding FAP file, such as I- and P-frames, masks, frame rate, quantization scaling factor and so on). A bifa file (binary format for animation) is then automatically created, which is used in the Scene Description and Object Descriptor files.
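For orientation, the properties set through this panel end up as ordinary BIFS fields. The sketch below shows a generic 3D model of the kind handled via the IndexedFaceSet node; the coordinates, colors and DEF name are invented for the example:

    DEF PYRAMID Shape {
      appearance Appearance {
        material Material {
          diffuseColor  0.8 0.6 0.2      # set via the color controls
          specularColor 0.5 0.5 0.5
          shininess     0.4              # the "shine" property
        }
      }
      geometry IndexedFaceSet {
        coord Coordinate {               # five vertices of a square pyramid
          point [ -1 0 -1,  1 0 -1,  1 0 1,  -1 0 1,  0 1.5 0 ]
        }
        coordIndex [ 0 1 4 -1, 1 2 4 -1, # -1 terminates each face
                     2 3 4 -1, 3 0 4 -1,
                     3 2 1 0 -1 ]
      }
    }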

V. Building a complete MPEG-4 scene

Scene creation: While the user changes the fields of a particular node through the dialogue boxes of the application, the program automatically creates the two files that are needed in order to create the scene:
1. Scene description file (.txt). The scene description has several similarities to VRML, as the set of nodes defined by VRML was used as an initial set of composition nodes for MPEG-4.
2. Object Descriptor list file (.scr). This file provides facilities to identify and name elementary streams, which can then be referred to in a scene description and attached to individual audiovisual objects. This association is performed in object descriptors, which are transmitted in their own elementary streams.

Scene generation: Following the creation of the two text files described in the previous step, it is necessary to construct the corresponding binary files, which can be processed locally or transmitted to the receiver side over the network. This is done using the software provided by the MPEG-4 Implementation Study Group, in two successive stages: the BifsEncoder constructs the binary BIFS file (.bif) from the textual BIFS file, and the Multiplexer creates the final MPEG-4 file.

Scene utilization: The scene can now be saved or viewed in the MPEG-4 player. The user can also open an existing scene.

VI. IMPLEMENTATION SPECIFICS

The 3D MPEG-4 authoring tool was developed using C/C++ for Windows, specifically C++ Builder 5.0 and OpenGL, interfaced with the core module and the tools of the IM1 (MPEG-4 implementation group) software platform. The IM1 3D player is a software implementation of an MPEG-4 Systems player [26].

Fig. 10. Tools of the MPEG-4 IM1 reference software and scene generation.

The player is built on top of the Core framework, which also includes tools to encode and multiplex test scenes. It aims to be compliant with the Complete 3D profile. The core module provides the infrastructure for full implementations of MPEG-4 players [27]. It includes support for all the functionalities described in Section II, such as demultiplexing, BIFS and OD decoding, and scene construction and update. It manages the synchronized flow of data between the multiplexer, the decoders and the compositor through decoding and composition buffers. It supports plug-ins through APIs (Application Programming Interfaces) for decoders, for DMIF (Delivery Multimedia Integration Framework, the name in MPEG of the layer that handles the delivery of MPEG-4 content over various kinds of networks and media) and for IPMP (Intellectual Property Management and Protection). It also provides the functionality of MediaObject, the base class for all specific node types. The core module is the foundation layer for customized MPEG-4 applications. It contains hooks for plugging in all kinds of decoders (JPEG, AAC, H.263, G.723, etc.) and customized compositors.

It is written in C++; its code is platform independent and has been used by the group as the infrastructure for applications that run on either Windows or Unix. The core module is accompanied by a test application: a Windows console application that reads a multiplexed file containing scene description and media streams (the output of Mux) and produces two text files. One file shows the presentation time of each composition unit (CU), i.e. the time when a plug-in compositor would receive the CU for presentation, compared to the composition time stamp attached to the encoded unit. The other file shows a textual presentation of the decoded binary scene description (BIFS) and object description (OD).

The software tools include a BIFS/OD encoder and a TRIF file-format multiplexer. BifsEnc reads a textual description of a scene, scene updates and ObjectDescriptor stream commands (which may include ObjectDescriptor objects and IPMP objects), and produces two binary files: a BIFS file and an OD stream [28]. BifsEnc is used by the presented authoring tool to encode its textual output. Both output files have the same name as the input file, one with the extension .bif and the other with the extension .od. In addition, a text file with the same name and the .lst extension is produced; it lists all the input lines, each followed by error descriptions, if any, and a textual description of the binary encoding.

The TRIF multiplexer is a software tool that reads a set of files, each containing an MPEG-4 elementary stream, and multiplexes them according to the TRIF specifications into one bitstream. In addition, the TRIF multiplexer may encode a bootstrap Object Descriptor (InitialObjectDescriptor) and place it at the beginning of the multiplexed file. The MP4Enc multiplexer is an application that reads MPEG-4 elementary streams and multiplexes them into a single MP4 file. It is based on the TRIF multiplexer Mux developed by Zvi Lifshitz (Optibase Ltd.) and on the MP4 file format API libisomp4.lib. Im1Player (for 2D scenes) and the 3D player (for 3D scenes) are two tools which verify the compliance of MPEG-4 Systems bitstreams [29]. The tools take MP4 files as input and produce text files that describe the content of the file.

The output of the tools includes a full textual description of all the Systems elementary streams (BIFS, OD) they process.

OpenGL [30] is a software interface to graphics hardware. The main purpose of OpenGL is to render two- and three-dimensional objects into a framebuffer. These objects are described as sequences of vertices (which define geometric objects) or pixels (which define images). OpenGL performs several processes on this data to convert it to pixels forming the final desired image in the buffer.

Our authoring tool provides a front-end user interface to the MPEG-4 IM1 reference software described above. More specifically, the .txt and .scr files are produced (Figure 10) and used as inputs to BifsEnc and MP4Enc (and Mux), respectively.

VII. EXAMPLE OF USE

In this section we present two scenes and explain the creation of one of them, which can easily be built with the authoring tool. The scene represents an ancient Greek temple (Figure 13), made of several groups of cylinders and boxes, which is continuously rotated around its y-axis. The steps for the creation of the temple are relatively simple if the capabilities of the authoring tool are used appropriately. The basic steps are the following.

A. Create the front part of the temple

First a vertical cylinder is created. Its position and scaling are changed to make it resemble a column of the temple. Then, with a copy-and-paste operation in the Scene Tree view, a second identical column is created. After repositioning the second object to the desired place, the first two columns of the temple are ready. More columns will be created later, after the front part of the whole temple is ready. The second step is to create the roof of the temple, so a box is created and, after repositioning, placed on top of the two columns; the box's z-dimension should be equal to the diameter of the columns. Afterwards, one more box is created; after its resizing and rotation it is placed on top of the former box.

This box is rotated about 45 degrees around its z-axis. The box is then duplicated (with a copy-and-paste operation) and, by changing its z-axis rotation to the symmetric negative value, two similar, antisymmetric boxes are created. At this point the roof looks like the extrusion of an isosceles triangle. The front of the temple is ready.

B. Duplicate identical portions of the scene

The front of the temple, created in the previous step, is an important part of the temple's geometry. By duplicating it twice, the back and the middle sections of the temple are created. For this purpose a group object is inserted in the scene, and with a drag-and-drop operation (in the Scene Tree view) all items in the scene are included in the group object. This makes it easier to manipulate them as a set of objects rather than as single items. The remaining portions of the temple are created by copy-pasting the whole group several times. The only adjustment requiring care is the z-position of the groups: the z-values of the front and back portions of the temple must be symmetrical.

C. Add final details to the geometry

At this point, the gaps in the roof must be filled. For this purpose identical boxes are created and placed between the front and middle portions of the roof, or between the middle and back portions. This can be done either from scratch or by duplicating parts of the roof; after duplicating, for example, the front part of the roof, an appropriate repositioning and scaling along its z-axis take place. All that is now needed for the temple is a floor; a stretched box can serve this purpose. At this point more specific details, such as textures or colors, are added to the objects.
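In the generated scene description, each column is simply a Transform wrapped around a Cylinder, and the copy-and-paste operation duplicates such subtrees under new identifiers. A minimal sketch follows (the dimensions, colors and DEF names are invented for the example):

    DEF COLUMN_1 Transform {
      translation -2.0 0.0 0.0           # position of the first column
      scale 1.0 3.0 1.0                  # stretched vertically to column proportions
      children Shape {
        appearance Appearance { material Material { diffuseColor 0.9 0.9 0.8 } }
        geometry Cylinder { radius 0.3 height 1.0 }
      }
    }
    DEF COLUMN_2 Transform {             # pasted copy, repositioned
      translation 2.0 0.0 0.0
      scale 1.0 3.0 1.0
      children Shape {
        appearance Appearance { material Material { diffuseColor 0.9 0.9 0.8 } }
        geometry Cylinder { radius 0.3 height 1.0 }
      }
    }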

D. Use update commands for the gradual presentation of the temple

It is assumed that this scene is going to be used to demonstrate the gradual presentation of the temple, i.e. the historical process of its construction. This gradual appearance can be achieved using BIFS-Commands (updates), so that the temple appears in the player piece by piece. The exact steps are the following: on the Updates tab (Figure 12b) the Insert command is selected (Insert button). In the main window, in the scene tree (Figure 13a), the group of nodes for the gradual presentation is selected and copied (it is assumed that the whole scene has already been created in the authoring tool). On the Update Command Details panel (Figure 11), in the General tab, the selected group of nodes is pasted, the target group is specified (Set Target button) and the time of action is set (Time of Action button), e.g. 500 ms. Finally, by pressing the Play button, the result is shown in the 3D MPEG-4 player. A textual sketch of such an update command is given below, after the next subsection.

E. Add movement to the scene

Animation in the scene can be activated by using interpolators. The first step is to group all the objects together in a global group object. This object can be set in motion by activating its interpolator properties: in the Interpolators menu the Orientation Interpolator property is checked and, with the appropriate selections, the object rotates around its y-axis. The movement can be seen by playing the scene with the MPEG-4 player. The movement can be as complex as needed, by nesting group nodes inside others and activating each one's interpolator properties. Finally, the scene is saved so that it can be viewed either externally with an MPEG-4 player or via the Preview/Play button available on the toolbar of the interface. Every scene produced by the authoring tool is fully compatible with the MPEG-4 BIFS standard and can be presented by any MPEG-4 player capable of reproducing BIFS.
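The gradual presentation configured in subsection D corresponds, in the generated textual description, to a BIFS-Command of roughly the following form (the syntax loosely follows the BifsEnc conventions; the ROOT and FRONT_PART names are invented, while the 500 ms time is the one chosen above):

    AT 500 {                             # 500 ms after the initial scene load
      INSERT AT ROOT.children DEF FRONT_PART Group {
        children [ USE COLUMN_1, USE COLUMN_2 ]   # the copied front-part subtrees
      }
    }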

Fig. 11. The updates panel (Insert node(s)).

The second scene represents a virtual studio (Figure 15). The scene contains several groups of synthetic objects, including a synthetic face, boxes with textures, text objects and IndexedFaceSets (Figure 14). The logo group, located in the upper left corner of the studio, is composed of a rotating box and a text object carrying the name of the channel. The background contains four boxes (left and right sides, floor and back side) with image textures. The desk is created with another two boxes. In the upper right corner of the scene a box with a video texture is presented; an H.263 video is loaded onto this video box. The body of the newscaster is an IndexedFaceSet imported from a VRML 3D model. The 3D face was inserted using the corresponding button. Finally, a rolling text is inserted in the scene for the headlines.

After the selection of a FAP (Facial Animation Parameters) file and an audio stream (a saxophone icon appears in the upper left corner), the face is configured to animate according to the selected FAP file. The video stream (H.263) and the audio stream (G.723) are transmitted as two separate elementary streams according to the object descriptor mechanism. All the animation (except the face animation) is implemented using interpolator nodes. Some major parts of the produced scene description file (.txt) are the following:

    DEF ID_014 AnimationStream {         # FAP animation stream
      url 50
    }
    Transform {
      translation 0.000 1.529 1.690
      rotation 0.000 0.000 0.000 0.000
      scale 0.013 0.013 0.013
      children Face {                    # face node
        fap DEF ID_104 FAP {}
        renderedFace []
      }
    }
    ...
    DEF T120661744 Transform {
      translation 0.000 0.000 0.000
      rotation 1.786 1.014 0.000 0.911
      children Shape {
        appearance Appearance {
          texture ImageTexture { url 10 }
          textureTransform TextureTransform {}
        }
        geometry Box {                   # box with image texture
          size 0.796 0.796 0.694
        }
      }
    }
    DEF OrientTS120658180 TimeSensor {   # time sensor for interpolation purposes
      stopTime -1
      startTime 0
      loop TRUE
      cycleInterval 15
    }
    DEF ORI120658180 OrientationInterpolator {
      key [ 0, 1 ]
      keyValue [ 0.000 0.000 0.000 0.000, 0.000 0.200 0.000 3.143 ]
    }
    ...
    ROUTE OrientTS120658180.fraction_changed TO ORI120658180.set_fraction
    ROUTE ORI120658180.value_changed TO T120661744.rotation

The AnimationStream node reads the selected FAP file from an external source. The Transform node inserted before the Face node controls the position of the animated face in the scene. The Face node inserts the animated face and connects it with the FAP file defined earlier. The following group creates the logo located in the upper left corner, more specifically the textured rotating box. First the position of the box (Transform node) is defined, then the image to be applied as a texture (appearance and texture fields), and finally the geometry and the dimensions of the object (geometry field); in our case the object is a box. The final part contains the nodes necessary for creating the rotating motion. First, the period of the motion is defined (how fast the box will rotate) and whether the rotation speed is constant; this is controlled by the TimeSensor node and its loop and cycleInterval fields. The OrientationInterpolator node defines the intermediate positions of the motion. Finally, the ROUTE statements connect the defined parameters of the movement to the textured object. The objects are uniquely identified by their DEF names; for example, the textured box is object T120661744.

As can be seen from the above, the text-based description format for MPEG-4 is very complicated. It is almost impossible to develop an MPEG-4 scene from scratch using only text: the user must master a complicated syntax and a great number of MPEG-4 BIFS node names, while at the same time keeping track of all defined object names. The presented authoring tool allows non-expert MPEG-4 users to develop complicated scenes by converting this text-based description to a more intuitive, graphical description.

VIII. Conclusions

In this paper an authoring tool with 3D functionalities for the MPEG-4 multimedia standard was presented. The tool maps BIFS features and functionalities to common Windows controls, allowing users to efficiently create, edit and finally play MPEG-4 compliant scenes using an external MPEG-4 player. The scenes presented in the previous section demonstrate that it is possible to create complex scenes using unique MPEG-4 features such as Updates and Facial Animation.

The presented parts of the corresponding text description files show that it is almost impossible for the non-expert to build even simple MPEG-4 scenes from scratch using only text. We found that while content developers were satisfied with the efficiency and effectiveness of the system, those not familiar with the MPEG-4 standard had problems understanding the terminology used. Thus, further development and refinement are needed before the tool can be useful for large-scale deployment. Another important feature of the authoring tool is that it produces fully MPEG-4 compliant scenes. These scenes can be visualized using the IM1 3D player developed by the MPEG-4 group without any modifications; thus, the tool may be used to create MPEG-4 compliant applications without introducing proprietary features. The present paper also highlights and exemplifies the manner in which non-expert MPEG-4 users may create and manipulate MPEG-4 content using appropriate tools. The tool is intended to help MPEG-4 algorithm and system developers integrate their algorithms and make them available through a user-friendly interface, and it may also serve as a starting point for the development of new tools. Finally, the tool may serve as a benchmark for comparing other or proprietary authoring tools against one with the capabilities of the MPEG-4 system.

References

[1] Tutorial issue on MPEG-4, Signal Processing: Image Communication, vol. 15, no. 4-5, 2000.
[2] R. Koenen, "MPEG-4: Multimedia for our Time," IEEE Spectrum, vol. 36, pp. 26-33, Feb. 1999.
[3] L. Chiariglione, "MPEG and Multimedia Communications," IEEE Trans. on Circuits and Systems for Video Technology, vol. 7, pp. 5-18, Feb. 1997.
[4] F. Pereira, "MPEG-4: Why, what, how and when?," Signal Processing: Image Communication, vol. 15, pp. 271-279, 2000.
[5] MPEG-4 Systems, ISO/IEC 14496-1: Coding of Audio-Visual Objects: Systems, Final Draft International Standard, ISO/IEC JTC1/SC29/WG11 N2501, October 1998.
[6] ISO/IEC 14772-1, The Virtual Reality Modeling Language, http://www.vrml.org/specifications/vrml97, 1997.

[7] M. Kim, S. Wood, and L.-T. Cheok, "Extensible MPEG-4 Textual Format (XMT)," in ACM Multimedia 2000, Oct. 30-Nov. 4, 2000.
[8] Extensible 3D (X3D) Graphics Working Group, http://www.web3d.org/x3d.html.
[9] S. Boughoufalah, J. C. Dufourd, and F. Bouilhaguet, "MPEG-Pro, an Authoring System for MPEG-4," in ISCAS 2000, IEEE International Symposium on Circuits and Systems, Geneva, Switzerland, May 2000.
[10] V. K. Papastathis, I. Kompatsiaris, and M. G. Strintzis, "Authoring tool for the composition of MPEG-4 audiovisual scenes," in International Workshop on Synthetic Natural Hybrid Coding and 3D Imaging, Santorini, Greece, September 1999.
[11] H. Luo and A. Eleftheriadis, "Designing an interactive tool for video object segmentation and annotation," in ACM Multimedia 99, March 1999.
[12] P. Correia and F. Pereira, "The role of analysis in content-based video coding and interaction," Signal Processing Journal, Special Issue on Video Sequence Segmentation for Content-Based Processing and Manipulation, vol. 26, no. 2, 1998.
[13] B. Erol and F. Kossentini, "Automatic key video object plane selection using the shape information in the MPEG-4 compressed domain," IEEE Trans. on Multimedia, vol. 2, pp. 129-138, June 2000.
[14] B. Erol, S. Shirani, and F. Kossentini, "A concealment method for shape information in MPEG-4 coded video sequences," IEEE Trans. on Multimedia, vol. 2, no. 3, pp. 185-190, 2000.
[15] IBM HotMedia website, http://www-4.ibm.com/software/net.media/, 2000.
[16] Veon website, http://www.veon.com, 2000.
[17] F. Lavagetto and R. Pockaj, "The facial animation engine: toward a high-level interface for the design of MPEG-4 compliant animated faces," IEEE Trans. Circuits and Systems for Video Technology, vol. 9, pp. 277-289, March 1999.
[18] G. A. Abrantes and F. Pereira, "MPEG-4 facial animation technology: survey, implementation, and results," IEEE Trans. Circuits and Systems for Video Technology, vol. 9, pp. 290-305, March 1999.
[19] H. Tao, H. Chen, W. Wu, and T. Huang, "Compression of MPEG-4 facial animation parameters for transmission of talking heads," IEEE Trans. Circuits and Systems for Video Technology, vol. 9, pp. 264-276, March 1999.
[20] I. Kompatsiaris, D. Tzovaras, and M. G. Strintzis, "3D Model-Based Segmentation of Videoconference Image Sequences," IEEE Trans. on Circuits and Systems for Video Technology, Special Issue on Image and Video Processing for Emerging Interactive Multimedia Services, vol. 8, Sept. 1998.
[21] R. Koenen, "MPEG-4 Overview (V.16 La Baule Version)," ISO/IEC JTC1/SC29/WG11 N3747, October 2000.
[22] J. Signès, Y. Fisher, and A. Eleftheriadis, "MPEG-4's Binary Format for Scene Description," Signal Processing: Image Communication, Special Issue on MPEG-4, vol. 15, no. 4-5, pp. 321-345, 2000.
[23] E. D. Scheirer, R. Väänänen, and J. Huopaniemi, "AudioBIFS: Describing audio scenes with the MPEG-4 multimedia standard," IEEE Trans. on Multimedia, vol. 1, pp. 237-250, June 1999.
[24] B. MacIntyre and S. Feiner, "Future multimedia user interfaces," Multimedia Systems, vol. 4, no. 5, pp. 250-268, 1996.
[25] University of Genova, Digital Signal Processing Laboratory, http://www-dsp.com.dist.unige.it/snhc/fba_ce/facefrmt.htm, 2000.
[26] Z. Lifshitz, "Status of the Systems Version 1, 2, 3 Software Implementation," tech. rep., ISO/IEC JTC1/SC29/WG11 N3564, July 2000.

[27] Z. Lifshitz, "Part 5 - Reference Software - Systems (ISO/IEC 14496-5 - Systems)," tech. rep., ISO/IEC JTC1/SC29/WG11 MPEG2001, Mar. 2001.
[28] Z. Lifshitz, "BIFS/OD Encoder," tech. rep., ISO/IEC JTC1/SC29/WG11, Mar. 2001.
[29] Z. Lifshitz, "Im1 Player - A Bitstream Verification Tool," tech. rep., ISO/IEC JTC1/SC29/WG11, Mar. 2001.
[30] OpenGL, "The Industry's Foundation for High Performance Graphics," http://www.opengl.org, 2000.

Fig. 12. An ancient Greek temple.

Fig. 13. An ancient Greek temple.

Fig. 14. The virtual studio scene in the authoring tool.

Fig. 15. The virtual studio scene in the IM1 3D player.