A MULTIPOINT VIDEOCONFERENCE RECEIVER BASED ON MPEG-4 OBJECT VIDEO

Chih-Kai Chien, Chen-Yu Tsai, and David W. Lin
Dept. of Electronics Engineering and Center for Telecommunications Research
National Chiao Tung University
Hsinchu, Taiwan 30010, R.O.C.
E-mails: kai.ee92g@nctu.edu.tw, dwlin@mail.nctu.edu.tw

ABSTRACT

We consider the design and implementation of a novel type of software-based multipoint videoconference receiver on a personal computer (PC). Its distinguishing features are that MPEG-4 object-based coding is used to encode each video stream and that the decoded videos are composed into one scene for display. The resulting receiver includes an RTP-based network interface, a set of MPEG-4 video and AAC audio decoders (whose number depends on the number of source sites), and a unit that composes the decoded media streams for display. We develop a graphical user interface using the Windows SDK for convenience in system control and monitoring as well as display of results. On a PC with a 2.1-GHz AMD CPU and 512 MB of RAM, the current unoptimized implementation achieves a speed on the order of 10 frames per second for CIF (352 × 288) video when receiving from a single source site. The frame rate decreases approximately in proportion to the number of sources.

(Work supported by the National Science Council of R.O.C. under Grant NSC 93-2219-E-009-022.)

1. INTRODUCTION

In a typical multipoint videoconference system, the receiver places the decoded videos from different sources in different windows. We consider constructing a different kind of system, in which the decoded videos are composed into a virtual conference room scene. For this, the most natural and simplest approach is to segment and encode the source videos separately at their respective transmitter sites, and to let each receiver decode all received videos and compose and display the result. The MPEG-4 standards, with their provision for object-based video coding, appear naturally suited to this use.

In this work, we consider the design and implementation of the receiver, in software, on a personal computer (PC). The receiver has four major components: the network interface, the video decoder, the audio decoder, and the composition unit, as illustrated in Fig. 1. As will be explained below, we decide to use the Real-time Transport Protocol (RTP) for the network interface and to develop our own composition method, leaving the video and the audio encoded and decoded according to the MPEG-4 specifications.

Fig. 1. Structure of the proposed videoconference receiver.

In what follows, Section 2 gives an overview of the (earliest) MPEG-4 standards and comments on the usefulness of each part in our work. Section 3 discusses how we decode and compose multiple videos into one scene. Section 4 describes the integration of the receiver system. Some experimental results are presented in Section 5. Finally, Section 6 contains the conclusion.

2. THE MPEG-4 STANDARDS AND THEIR APPLICATION IN THIS WORK

The original MPEG-4 standards are divided into four basic parts: Systems (ISO/IEC 14496-1), Visual (ISO/IEC 14496-2), Audio (ISO/IEC 14496-3), and the Delivery Multimedia Integration Framework, or DMIF for short (ISO/IEC 14496-6) [1], [2].

The MPEG-4 Systems part specifies how audio-visual scenes can be created from individual objects. To actually create the audio-visual scenes for a particular application, one may employ a suitable authoring tool designed to the MPEG-4 specifications. Unfortunately, no suitable authoring tool was found for our application in the course of this work. Hence we decide to develop a simple composition method and write our own program for it.

Aside from other innovations, a major novelty of the MPEG-4 Visual part is the provision for coding of arbitrary-shaped objects. In fact, each picture is considered to be composed of a number of video objects. For each object, a so-called alpha-plane image sequence defines the support of the object in each video frame as well as the transparency of each pixel therein. Pixels belonging to the object are encoded largely with typical motion-compensated block-DCT techniques, and so are the alpha-plane images. These features are used in our implementation.

The MPEG-4 Audio part is rather generic in that it considers different kinds of audio signal (e.g., speech, natural sound, and synthetic sound) and different kinds of signal manipulation (e.g., compression and synthesis). As a result, it facilitates object-based audio coding and manipulation. Our system merely makes use of its compression functionality; we employ a publicly available MPEG-4 AAC (Advanced Audio Coding) decoder in this work.

The MPEG-4 DMIF part specifies the interface between the application and the transport, so that the transport can be relatively transparent to the application and the application can be developed largely free from concerns over the transport. Despite this intent, recent MPEG activities show waning interest in DMIF, and existing DMIF software, such as [3], has suffered functionality and maintenance problems. Therefore, we decide to use RTP [4] for the network interface.

3. VIDEO DECODING AND COMPOSITION

As discussed previously, we develop a simple video composition method in this work. This involves not only the composition of multiple videos into one scene, but also synchronization among the multiple decoders. For simplicity, we assume that all videos have the same frame rate and relegate the situation with disparate frame rates to potential future research. Note, however, that in the latter situation we may simply skip some frames in the videos with higher frame rates, which should not cause much of a problem in videoconferencing when the frame rates are high enough. In that case, only minor modifications to the present program would be needed.

We first consider the situation with two videos and two decoders. The two decoders are placed in two threads on the PC. To synchronize the decoded videos, for each frame we let the first decoder wait until the second decoder completes; then it starts working on the next frame. Composition and display of the videos are done in the second thread. Figure 2 shows the temporal relation between the two decoders.

Fig. 2. Relation between the two decoders in the time domain.
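To make the per-frame handshake concrete, the following C++ sketch shows one way the wait could be arranged with a condition variable. This is our illustration, not the actual implementation (which uses Windows threads); DecodeNextFrame and ComposeAndDisplay are hypothetical wrappers around the decoder and composition code.

```cpp
#include <condition_variable>
#include <mutex>

void DecodeNextFrame(int decoderIndex);  // assumed decoder wrapper
void ComposeAndDisplay();                // assumed composition/display routine

std::mutex m;
std::condition_variable cv;
bool frame2Done = false;  // set when decoder 2 finishes a frame

void Decoder1Thread() {
    for (;;) {
        DecodeNextFrame(1);
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return frame2Done; });  // wait for decoder 2
        frame2Done = false;                      // consume; start next frame
    }
}

void Decoder2Thread() {
    for (;;) {
        DecodeNextFrame(2);
        ComposeAndDisplay();  // composition and display run in this thread
        {
            std::lock_guard<std::mutex> lk(m);
            frame2Done = true;
        }
        cv.notify_one();
    }
}
```

A production version would also need the reverse handshake (decoder 2 waiting for decoder 1's frame before composing), which the paper's description leaves implicit.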
To compose the two images decoded by the two decoders, note that there are several possible spatial relations between them, as illustrated in Fig. 3. Case 0 is where there is no overlap between the two images. If there is some overlap, we let the first image occlude the second in the overlapped area. In the integrated receiver system, we let the user determine and specify where each decoded video is to be placed in the display window.

Fig. 3. Different spatial relations between two images. (a) Case 0, no overlap. (b)-(e) Cases 1-4, respectively, with overlaps.
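The occlusion rule is simple enough to state in code. The sketch below (our own illustration; the ObjectFrame type and its fields are assumptions, not the reference software's) paints the second image first and the first image last, so the first occludes the second wherever their supports overlap, uniformly covering all five cases of Fig. 3.

```cpp
#include <cstdint>
#include <vector>

struct ObjectFrame {
    int w, h;                    // object bounding-box size
    int x0, y0;                  // user-specified position in the scene
    std::vector<uint32_t> rgb;   // w*h pixels, already converted to RGB
    std::vector<uint8_t> alpha;  // w*h entries, nonzero = inside the object
};

// Paint 'f' into the scene buffer, skipping transparent pixels.
void Paint(std::vector<uint32_t>& scene, int sw, int sh, const ObjectFrame& f) {
    for (int y = 0; y < f.h; ++y)
        for (int x = 0; x < f.w; ++x) {
            int sx = f.x0 + x, sy = f.y0 + y;
            if (f.alpha[y * f.w + x] && sx >= 0 && sx < sw && sy >= 0 && sy < sh)
                scene[sy * sw + sx] = f.rgb[y * f.w + x];
        }
}

// The later an image is painted, the higher its layer: painting 'first'
// last makes it occlude 'second' in any overlapped area.
void Compose(std::vector<uint32_t>& scene, int sw, int sh,
             const ObjectFrame& first, const ObjectFrame& second) {
    Paint(scene, sw, sh, second);  // lower layer
    Paint(scene, sw, sh, first);   // upper layer
}
```

The same pairwise routine extends directly to the tree-structured composition of three or four videos discussed next.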

Now consider the situation with four (or three) videos. Figure 4 explains our method of composition. The outputs of the first and the second decoders are composed together first, and so are the outputs of the third and the fourth decoders. Then the two composed images are composed to yield the final displayed image. The padding referred to in the figure pads the image into the form of a box for display purposes. For more videos, we simply extend the composition tree. In any case, it is always the highest-indexed video that controls the composition and display operation.

Fig. 4. Composition of four (or three) videos.

4. RECEIVER SYSTEM INTEGRATION

4.1. Overall System Structure

Figure 5 shows how the integrated receiver program works. The GUI block creates a window in which the user can input the ports and the positions of the different videos. Since there may be multiple video and audio streams to handle, we use multi-threading to manage the video and the audio decoders. Multi-threading lets the operating system handle the scheduling of the threads, and good scheduling makes efficient use of the PC's available computing, storage, and communication resources. So, after the user has specified the ports and the positions of the videos, the system creates the decoder threads. It also determines which decoders are currently active, so as to identify the highest-indexed video decoder and pass the control responsibility (including video composition and display) to it. The video and the audio decoders can then begin decoding the data received through the RTP network interface. The composed video is displayed in a window and the composed audio is played.

Fig. 5. Flow diagram of the integrated receiver.

In the following subsections we give some further details of the system components.

4.2. The RTP Network Interface

RTP facilitates end-to-end delivery of data with real-time characteristics, such as interactive audio and video, and is thus suitable for our application [4]. It supports sequence numbering, timestamping, and delivery monitoring. It can also support multicasting if the underlying network has that capability, which makes it naturally suited to multiparty multimedia conferences; however, our implementation does not make use of this feature, nor do we assume that the network has multicasting capability. RTP is usually run on top of UDP. One should note that RTP itself does not provide quality-of-service guarantees (such as orderly and timely delivery of packets); it is up to the higher layers (such as the user) to provide them, aided by the information RTP supplies and insofar as the lower-layer services permit.

We use the jrtplib 3.1.0 software [5], developed at the Expertise Centre for Digital Media (EDM), to realize the RTP network interface. One RTP session is created for each video or audio stream. The transmitter [6] handles the required RTP packetization. Two important parameters that must be set for each session are the timestamp and the portbase. The timestamp parameter is set to one section per second for a video stream and one section per four seconds for an audio stream. (These values are somewhat inappropriate for real-time applications, but are chosen based on experience for smooth running of the program. Further work is needed to determine the underlying problem and a potential solution.) For the portbase, since each transmitter sends two streams (video and audio), the video stream uses a user-specified port number and the audio stream uses that number plus 100.

In our system, after these parameters are set, the receiving of RTP packets runs in a main loop. We use the function StartRTP to enable the main loop to receive RTP data and the function ResetRTP to disable the loop and stop receiving. For a reason yet unclear, the software loses the first two packets but receives the third and later packets successfully.
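For illustration, here is a minimal sketch of how a receive-side session with these two parameters might be created and polled with jrtplib 3.x. Mapping the paper's "timestamp" setting onto SetOwnTimestampUnit, the videoPort variable, and the polling structure are our assumptions; consult the library's documentation for the authoritative API.

```cpp
#include "rtpsession.h"
#include "rtpsessionparams.h"
#include "rtpudpv4transmitter.h"
#include "rtppacket.h"

bool OpenVideoSession(RTPSession& session, int videoPort)
{
    RTPSessionParams sessparams;
    RTPUDPv4TransmissionParams transparams;
    // One timestamp unit per second (our reading of the paper's setting);
    // the audio session would use one unit per four seconds.
    sessparams.SetOwnTimestampUnit(1.0);
    // RTP uses the portbase, RTCP the next port; an audio session would
    // be created the same way with portbase videoPort + 100.
    transparams.SetPortbase(videoPort);
    return session.Create(sessparams, &transparams) >= 0;
}

void PollOnce(RTPSession& session)
{
    session.BeginDataAccess();
    if (session.GotoFirstSourceWithData()) {
        do {
            RTPPacket* pack;
            while ((pack = session.GetNextPacket()) != 0) {
                // Hand pack->GetPayloadData() / pack->GetPayloadLength()
                // to the corresponding MPEG-4 or AAC decoder here.
                session.DeletePacket(pack);
            }
        } while (session.GotoNextSourceWithData());
    }
    session.EndDataAccess();
}
```

In our system a loop of this kind would sit inside the StartRTP/ResetRTP-controlled main loop described above.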

4.3. Video Decoding and Display

Our video decoder is from the Microsoft MPEG-4 Video Reference Software [7], a public source for encoding and decoding of video in the MPEG-4 format. We have optimized the encoder for Intel processors [8], but not the decoder. For our application, we use the binary shape coding feature belonging to the main profile of MPEG-4 video.

To integrate the video decoder into the overall system, we modify the original decoder program into a function called MPEG4VDecoder with two parameters: the handle (of the display window) and the thread index. The handle gives the decoder the information needed to control the window and display the video stream.

To display the video, we convert the decoder output from the original 4:2:0 format to the 4:4:4 format. Then we calculate the RGB values of each pixel from the luminance and the chrominance values, and use the SetPixelV function provided by the Windows SDK library to display the RGB values pixel by pixel. However, experience shows that SetPixelV is very slow and can significantly reduce the overall speed of the receiver system. Hence, to limit its use, instead of calling it on all pixels in the display window, we only use it to update the pixels in the object areas of two successive frames.
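As a concrete illustration of this display path, the sketch below upsamples the chroma by sample repetition, converts to RGB with fixed-point BT.601-style coefficients, and calls SetPixelV only where the current or previous frame's alpha mask is set. The buffer layout, the alpha-mask arguments, and the exact coefficients are our assumptions; SetPixelV and RGB are the Windows GDI calls named in the text.

```cpp
#include <windows.h>
#include <algorithm>

static inline int Clamp255(int v) { return std::min(255, std::max(0, v)); }

void DisplayFrame(HDC hdc, int w, int h,
                  const BYTE* Y, const BYTE* U, const BYTE* V,  // 4:2:0 planes
                  const BYTE* alphaCur, const BYTE* alphaPrev)  // binary masks
{
    for (int y = 0; y < h; ++y) {
        for (int x = 0; x < w; ++x) {
            int i = y * w + x;
            // Update only pixels where an object is (or was) present.
            if (!alphaCur[i] && !alphaPrev[i]) continue;
            // 4:2:0 -> 4:4:4 by repeating each chroma sample.
            int cb = U[(y / 2) * (w / 2) + x / 2] - 128;
            int cr = V[(y / 2) * (w / 2) + x / 2] - 128;
            int yy = Y[i];
            // Fixed-point YCbCr -> RGB: R = Y + 1.402 Cr, etc.
            int r = Clamp255(yy + ((91881 * cr) >> 16));
            int g = Clamp255(yy - ((22554 * cb + 46802 * cr) >> 16));
            int b = Clamp255(yy + ((116130 * cb) >> 16));
            SetPixelV(hdc, x, y, RGB(r, g, b));  // slow, hence the mask test
        }
    }
}
```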
4.4. Audio Decoding and Composition

For audio decoding, we use the Freeware Advanced Audio Decoder 2 (FAAD2) [9], written by M. Bakker. The decoder can handle the HE, LC, MAIN, and LTP profiles, but we only make use of the MAIN profile. After decoding the audio stream received from the RTP network interface, the result is saved as a temporary audio file in the WAV format, a common format for PC audio. The Media Control Interface (MCI), a high-level open interface, provides two ways to play WAV-format audio. Since each audio section is four seconds long, the decoder waits that long while its output is played by the MCI.

For audio composition, two intuitive methods are (1) to sum all audio streams and (2) to play only one stream. The first method suffers from an overflow problem, which can be solved by proper scaling; this is left to potential future work. For simplicity, in the final system we only play the audio from the first transmitter site.
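A minimal sketch of playing one decoded four-second WAV section through the MCI command-string interface might look as follows. The file name, the alias, and the blocking "wait" flag (which matches the four-second wait described above) are illustrative choices, not the program's actual code.

```cpp
#include <windows.h>
#include <cstdio>
#pragma comment(lib, "winmm.lib")

void PlaySection(const char* wavPath)
{
    char cmd[MAX_PATH + 64];
    std::snprintf(cmd, sizeof(cmd),
                  "open \"%s\" type waveaudio alias section", wavPath);
    mciSendStringA(cmd, NULL, 0, NULL);
    mciSendStringA("play section wait", NULL, 0, NULL);  // blocks ~4 s
    mciSendStringA("close section", NULL, 0, NULL);
}
```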

5. EXPERIMENTAL RESULTS

We present some experimental results in this section. For convenience, for the video part we use the common CIF (352 × 288) test sequence Bream with its associated binary shape information. Multiple instances of the sequence are treated as separate video transmissions. Figure 6 shows some typical composed scenes with two videos.

Fig. 6. Some typical composed scenes with two videos.

Table 1 shows some performance data obtained on a PC with a 2.1-GHz AMD CPU and 512 MB of RAM. The program is not yet optimized.

Table 1. Processing Rate (in Frames per Second) Under Various Configurations

                            No. of Decoders
  Configuration            1      2      3      4
  Original                53.4   28.8   18.8   14.2
  With Padding            42.7   22.1   14.8   11.4
  With Compos.+Display     5.0    2.6    -      -
  Only Composition          -    17.6    -      -
  Reduced SetPixel        12.0    6.2    3.9    -
  With Audio              11.2    5.2    3.6    -

We see that with only one original video decoder present, the rate reaches 53.4 frames per second (fps). The processing speed decreases roughly in proportion to the number of decoders, but somewhat more slowly. This presumably is because the computing resources are used a little more efficiently when there are multiple decoder threads.

As explained previously, we pad the decoded video to the 4:4:4 format and convert the result to RGB values for display. As shown in Table 1, the frame rate decreases by about 20% in each case when the padding function is added. Adding video composition and display of the result, with the SetPixelV function used over the whole video frame, reduces the frame rates drastically (by one order of magnitude), as can be seen in Table 1 for the cases with one and two videos. With only video composition but not display of the result, the frame rate is much higher, as shown in Table 1 for the case with two videos.

Time analysis shows that image display alone takes over 70% of the overall processing time, as illustrated in Fig. 7. Hence, unless we can find a more efficient way to set the display pixels, we should minimize the use of the time-consuming SetPixelV function. After limiting SetPixelV calls to the video object areas, as described earlier, the frame rates increase by about 140%, as shown in Table 1.

Fig. 7. Time analysis of the receiver system.

Finally, we add the audio decoder to the system. From Table 1, we see that the frame rate then decreases only by roughly 10% in each case. Hence the audio decoder has relatively low complexity.

6. CONCLUSION AND FUTURE WORK

We considered the design and implementation of a novel type of software-based multipoint videoconference receiver on a PC, whose distinguishing features are the use of MPEG-4 object-based coding and the composition of the decoded videos into one scene. The resulting receiver includes an RTP-based network interface, a set of MPEG-4 video and AAC audio decoders (whose number depends on the number of source sites), and a unit that composes the decoded media streams for display.

Some topics for potential future research are as follows.

1. Optimization of software components to increase speed and reduce transmission latency: top of the list should be a faster method to display the composed video. Other components, such as the video and the audio decoders, can also be improved in speed.
2. More elegant audio composition.
3. More sophisticated handling of videos with different frame rates.
4. More elaborate ways of video composition: this may include scaling and three-dimensional rotation of the decoded videos, as well as integration with proper background and foreground.

7. REFERENCES

[1] International Committee for Information Technology Standards, http://www.ncits.org/.
[2] MPEG-4 Video Group, "MPEG-4 overview (V.21 Jeju Version)," doc. no. ISO/IEC JTC1/SC29/WG11 N4668, Mar. 2002.
[3] DMIF software, http://lan.ece.ubc.ca/memberza.html.
[4] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications," RFC 3550, Network Working Group, July 2003.
[5] JRTPLIB, http://research.edm.luc.ac.be/jori/page.html.
[6] C.-Y. Tsai, "Integration of videoconference transmitter with MPEG-4 object-based video encoding," M.S. thesis, Dept. of Electronics Engineering, National Chiao Tung University, Hsinchu, Taiwan, R.O.C., June 2005.
[7] Microsoft, ISO/IEC 14496 (MPEG-4) Video Reference Software User Manual, Oct. 2004.
[8] M.-Y. Liu, "Real-time implementation of MPEG-4 video encoder using SIMD-enhanced Intel processor," M.S. thesis, Degree Program of Electrical Engineering and Computer Science, National Chiao Tung University, Hsinchu, Taiwan, R.O.C., July 2004.
[9] AudioCoding.com, http://www.audiocoding.com/.