Layered Self-Identifiable and Scalable Video Codec for Delivery to Heterogeneous Receivers

Wei Feng, Ashraf A. Kassim, Chen-Khong Tham
Department of Electrical and Computer Engineering
National University of Singapore
10 Kent Ridge Crescent, Singapore 119260

ABSTRACT

This paper describes the development of a layered structure for a multi-resolution scalable video codec based on the Color Set Partitioning in Hierarchical Trees (CSPIHT) scheme. The new structure supports network Quality of Service (QoS) by allowing packet marking in a real-time layered multicast system with heterogeneous clients. It also provides spatial resolution and frame rate scalability from a single embedded bit stream. The codec is self-identifiable since it labels the encoded bit stream according to resolution. We also introduce asymmetry between the CSPIHT encoder and decoder, which makes it possible to decode lossy bit streams at heterogeneous clients.

Keywords: Multi-resolutional scalability, CSPIHT, automatic adaptive video streaming, layered coding, heterogeneous receivers.

1. INTRODUCTION

In recent years, the discrete wavelet transform (DWT) has been increasingly applied to the coding of images and video sequences. Set Partitioning in Hierarchical Trees (SPIHT) [2] is an image compression technique that exploits decaying spectrum density. SPIHT was extended to Color-SPIHT (CSPIHT) [4,5,6] by linking selected Y nodes in the lowest subband to the corresponding Cr and Cb nodes. This scheme provides satisfactory results for color image and video coding in terms of PSNR. However, its main drawback is that once the video is encoded, network elements (e.g., servers, routers and clients) cannot tell how a particular chunk of bits contributes to the reconstruction of the compressed video. Furthermore, it is not suited to real-time QoS enabled transmission since it provides streams at different bit rates only at the expense of re-encoding.
Finally, lossy video streams cannot be correctly decoded.

Taking possible network QoS implementations into consideration, our modified CSPIHT codec inserts specially designed flags between consecutive layers to indicate where each layer begins to network elements and to the decoder. These flags, called layer IDs, indicate which layer a particular packet belongs to.

The rest of this paper is organized as follows. We provide a brief overview of the CSPIHT in section 2 and describe our modified codec in detail in sections 3 and 4. Section 5 provides the experimental results and performance analysis. We present our conclusions in section 6.

2. CSPIHT VIDEO CODEC

In this section, we describe the CSPIHT scheme and the limitations that severely restrict its use in a real-time QoS enabled video delivery system.

Like the SPIHT, the CSPIHT is essentially an algorithm for sorting the wavelet coefficients across subbands. The coefficients in the transformed domain are linked by a spatial orientation tree (SOT) structure, which is then partitioned such that the coefficients are divided into sets defined by the level of the most significant bit in a bit-plane representation of their magnitudes [3,4,5,6]. Significant bits are coded with higher priority under a given bit budget constraint, thus creating a rate controllable bit stream.

For video sequences, the CSPIHT encoder takes a group of frames (GOF), which normally consists of 16 frames, as input [7], performs a three-dimensional wavelet transform and subsequently links and codes the wavelet coefficients in the 3-D CSPIHT [7] kernel. In the luminance plane, the SPIHT [2] algorithm is used to define the SOT while the EZW [1] structure is adopted in the chrominance planes [6,7]. Fig. 1 depicts the SOT structure of the CSPIHT for image coding.

Fig. 1 Structure of Spatial Orientation Tree of 2-D Color SPIHT (Y: SPIHT structure; Cb, Cr: EZW structure)

In the original CSPIHT, sorting terminates when the user-defined bit budget is used up. The decoder must know the same budget information to decode a particular stream. For video reconstruction at a different bit rate, the encoder needs to run again with the desired bit budget, which is a problem if real-time multicast is required. Thus, the original CSPIHT is not suitable for realizing a real-time multicast system that supports QoS.

Networks rely on packet marking to provide QoS. Packets from different resolution layers can be marked with different priorities since they contribute differently to the reconstruction. Hence, it is highly desirable for the different layers in the encoded bit stream to be easily identified by network elements.

Since data loss is possible, the decoder cannot rely on the bit budget parameter but must be able to decode, in real time, differently truncated versions as well as lossy versions of the original encoded stream, i.e., the stream before network transmission. Fig. 2 illustrates the problem when decoding lossy data under the original CSPIHT scheme. The first stream in Fig. 2 is the encoded video data sent from the network server, and the second is the one that arrives at the decoder. As shown, the loss of one packet in GOF2 renders the decoder unable to correctly decode all data that arrive after it. To overcome this problem, additional flags are needed to enable the decoder to identify the beginning of new layers and GOFs.
Finally, CSPIHT is scalable only in bit rate; it does not provide resolution or frame rate scalability.

Fig. 2 Confusion when decoding lossy data using original CSPIHT decoder (upper stream: block headers and data of GOF1 to GOF4 as sent; lower stream: the same GOFs as received, with lost data followed by confused data)

3. MODIFIED CSPIHT VIDEO CODEC

In this section we present the modified CSPIHT codec, which overcomes the limitations of the CSPIHT highlighted above so that it can be applied to a real-time QoS enabled transmission system. Our main consideration is how to cooperate with the network elements and enable the decoder to decode lossy data. We assume that the network is real-time and multicast. The new codec can also be used in a unicast delivery system. In this paper, we only discuss the multicast case since the codec functions similarly in the unicast situation.

Fig. 3 Network scenario considered for design of modified codec (elements shown: encoder, server/sender, layers 1 to 7, and PDA, laptop PC and desktop PC receivers, each with decoder and player)

Layered multicast enables receiver-based subscription of multicast groups that carry incrementally decodable video layers [9]. In a layered multicast system, the server sends only one video stream and the receivers/clients can choose different resolutions, sizes and frame rates for display, resulting in different bit rates. As shown in Fig. 3, the encoder is executed offline. The resulting stream is separated into layers according to resolution, and the layers are stored separately so that they can be sent over different multicast groups.

The server must make various decisions, including the number of layers to be sent, the layers to be discarded if bandwidth is not adequate for all layers, and so on. The server is able to do this because it has information about the network status, including the available bandwidth and the congestion level. Heterogeneous clients subscribe to the layers they want based on the capacity of the client machines and user requests. Users may not always want to see the best quality video even if they could, because it takes more time and costs more. In Fig. 3, for example, the PDA client only subscribes to the first layer of each GOF, the laptop PC client subscribes to the first two layers, while the powerful desktop PC client subscribes to all seven layers.

For QoS marking, the network elements in the layered multicast system need to know which layer the data currently being processed belongs to, and the decoder needs the same information to decode lossy streams. In a real-time multicast network, delivery at different bit rates cannot be achieved with the bit budget, since each new rate requires re-encoding. We modified the original CSPIHT codec so that it can be used in layered multicast environments by removing the bit budget parameter, so that all coefficients are coded, and by incorporating:

i. a new sorting algorithm that produces resolution/frame rate scalable bit streams in a layered format;
ii. a specially designed layer ID in the encoded bit stream that identifies the layer that a particular data packet belongs to.
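The receiver-driven subscription described above can be illustrated with a minimal Python sketch. It assumes the layer combinations that the paper lists in Table 1; the packet tuples, the function names `layers_to_join` and `receive`, and the two-GOF example stream are hypothetical and are not part of the codec.

```python
# Layer sets taken from Table 1 of the paper:
# (spatial resolution, frame rate) -> layers a receiver must subscribe to.
RESOLUTION_OPTIONS = {
    ("low", "low"): {1},
    ("medium", "low"): {1, 2},
    ("low", "medium"): {1, 3},
    ("medium", "medium"): {1, 2, 3, 4},
    ("high", "medium"): {1, 2, 3, 4, 5},
    ("medium", "high"): {1, 2, 3, 4, 6},
    ("high", "high"): {1, 2, 3, 4, 5, 6, 7},
}


def layers_to_join(spatial, frame_rate):
    """Return the sorted list of layers needed for the requested resolution."""
    return sorted(RESOLUTION_OPTIONS[(spatial, frame_rate)])


def receive(packets, subscribed):
    """Keep only the packets carried by the layers a receiver has joined.

    Each packet is a hypothetical (gof, layer, payload) tuple.
    """
    wanted = set(subscribed)
    return [p for p in packets if p[1] in wanted]


# Hypothetical stream: two GOFs, seven layers each, one packet per layer.
packets = [(gof, layer, f"gof{gof}-layer{layer}")
           for gof in (1, 2) for layer in range(1, 8)]

pda = receive(packets, layers_to_join("low", "low"))          # layer 1 only
laptop = receive(packets, layers_to_join("medium", "low"))    # layers 1 and 2
desktop = receive(packets, layers_to_join("high", "high"))    # all seven layers
print(len(pda), len(laptop), len(desktop))  # 2 4 14
```

In this sketch the server never re-encodes: every receiver filters the same stored layers, which is the property the modified codec is designed to provide.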

3.1 Layer IDs

The layer ID must be unique and must result in minimal overhead for any combination of video data. Synchronization bits consisting of k consecutive 1s, i.e., 1111...11, are introduced as the layer ID at the beginning of each layer. To make the ID unique, occurrences of k consecutive 1s in the video data stream are modified by inserting a 0 bit after k-1 1s so that the sequence becomes 1111...101. If the data bit after k-1 consecutive 1s is already 0, an additional 0 is still added to protect the original 0 from being removed at the decoder. Once the video stream is received at the decoder, the 0 is removed from occurrences of 1111...10 while conducting normal CSPIHT decoding (Fig. 4).

A good value of k is one that results in the smallest overhead. If k is too large, the layer ID itself becomes a high overhead, while if it is too small many zeroes have to be inserted. From simulation experiments that examined the resulting compression ratios for different reasonable values of k, k=8 was found to be a good choice.

Fig. 4 The bit stream after layer ID is added (header, then a k-bit layer ID before the data of each layer; a 0 is added after every k-1 consecutive 1s in the data)

In a congestion adaptive network, the network elements can consciously select less important data to discard when congestion occurs. In our proposed codec, different layers have different priorities according to their resolution. The layer ID is essential as it enables network elements and decoders to identify the beginning of a new layer, i.e., the boundary between two layers. Knowing the boundaries between layers not only helps QoS marking but also enables decoding of lossy video streams. All layers use the same ID; they are told apart by maintaining an ID counter that counts the number of times the layer ID has been detected. For example, the decoder knows that the data that follow belong to layer 3 when it detects the layer ID for the third time.
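The stuffing rule and the ID counter can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: bits are represented as strings of "0" and "1", the function names are hypothetical, and a single guard 0 is appended to each layer before the next ID so that trailing data 1s cannot merge with the ID (the paper does not specify how this case is handled). The example uses k=4 for readability rather than the k=8 chosen in the paper, and it assumes that the layer IDs themselves are never lost.

```python
def stuff(payload, k):
    """Insert a 0 after every run of k-1 ones, so k ones never occur in the data."""
    out, run = [], 0
    for bit in payload:
        out.append(bit)
        run = run + 1 if bit == "1" else 0
        if run == k - 1:
            out.append("0")  # stuffed bit, added even if the next data bit is 0
            run = 0
    return "".join(out)


def encode_layers(payloads, k):
    """Prefix each stuffed layer with the ID (k ones); append the assumed guard 0."""
    layer_id = "1" * k
    return "".join(layer_id + stuff(p, k) + "0" for p in payloads)


def decode_layers(stream, k, num_layers=7):
    """Split a stream at layer IDs, remove stuffed zeros, and number the layers.

    The layer number comes from an ID counter that wraps after num_layers.
    Bits before the first ID are ignored.
    """
    layers = []
    current = None  # bits of the layer being read; None until the first ID
    layer_no = 0
    ids_seen = 0
    run = 0
    for bit in stream:
        if bit == "1":
            run += 1
            if current is not None:
                current.append(bit)
            if run == k:  # k consecutive ones: this is a layer ID
                if current is not None:
                    del current[-k:]  # drop the ID bits collected as data
                    layers.append((layer_no, "".join(current[:-1])))  # drop guard
                layer_no = ids_seen % num_layers + 1
                ids_seen += 1
                current = []
                run = 0
        else:
            stuffed = run == k - 1  # a 0 right after k-1 ones was inserted
            run = 0
            if current is not None and not stuffed:
                current.append(bit)
    if current is not None:
        layers.append((layer_no, "".join(current[:-1])))
    return layers


payloads = ["0111101", "1111", "10"]
stream = encode_layers(payloads, k=4)
print(decode_layers(stream, k=4))

# Simulate a loss inside layer 2: the decoder resynchronizes at the next ID,
# so layers 1 and 3 are still recovered intact.
damaged = stream[:18] + stream[21:]
print(decode_layers(damaged, k=4))
```

A single pass with one run counter serves both purposes: a 0 after k-1 ones is discarded as stuffing, while a 1 after k-1 ones completes a layer ID and closes the current layer.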
The ID counter is reset to zero as soon as it reaches the maximum number of layers. Our modified codec has 7 layers as it uses a 3-level spatial and temporal wavelet transform.

Fig. 5 Bit stream structure of the modified encoder (each GOF is divided into layer 1 to layer 7, with a block header per GOF and a layer ID before each layer; a block of lost data is marked in GOF2)

The modified CSPIHT solves the problem mentioned in section 2 (Fig. 2). Fig. 5 is a detailed version of the first stream in Fig. 2. Each GOF is re-sorted and separated into 7 resolution layers. A layer ID is inserted between every two layers at the encoding stage. As the layer ID is designed to be a unique binary code, the decoder can easily identify it while reading the stream. When a packet in layer 6 is lost (dark area in Fig. 5), the decoder stops decoding the current layer (layer 6) on detecting the subsequent layer ID. Thus, the confusion in Fig. 2 is avoided and the layers after the lost packet are decoded correctly. If the lost packet is not the last packet in a layer, the decoder has to discard the rest of the stream up to the next layer ID and resume decoding from the next layer. In the network, block headers and layer IDs should be marked with the lowest drop precedence.

3.2 Production of Resolution/Frame Rate Scalable Bit Stream

Fig. 6 (a) Resolution layers for one GOF and (b) the progressive decoding that results from it (quality increases from layer 1 to layer 7).

Layers          Spatial resolution    Frame rate
1               Low                   Low
1+2             Medium                Low
1+3             Low                   Medium
1+2+3+4         Medium                Medium
1+2+3+4+5       High                  Medium
1+2+3+4+6       Medium                High
All 7 layers    High                  High

Table 1 Resolution options

In the original CSPIHT, a node is significant when the magnitude of its coefficient is larger than a given threshold. In the modified CSPIHT, a node is significant when its magnitude is larger than the threshold and it lies in an effective subband, i.e., a subband that is currently being coded. Thus, two conditions must be checked for every node to determine whether it is significant. Fig. 6(a) depicts the relationship between the subbands and the layers for one GOF. The total of 22 subbands are divided into 7 layers that are coded in order, i.e., layer 1, layer 2, up to layer 7. Table 1 shows 7 resolution options that provide different combinations of spatial resolution and frame rate. Fig. 6(b) shows how the resolution layers are progressively decoded to generate increasingly higher quality reconstructions.

4. MODIFIED CSPIHT ALGORITHM

Our modified CSPIHT algorithm is similar to the original CSPIHT algorithm [4,5,6,7] except for the following:

i. coefficients are re-sorted by redefining the criterion for a node to be significant;
ii. the layer ID is inserted in the encoded bit stream between consecutive layers;
iii. additional zeros are inserted to protect the uniqueness of the layer ID.

A bit-counter is used to keep track of the number of consecutive 1 bits written to the stream. In the initialization stage, the bit-counter is reset to zero and the subbands belonging to layer 1 are marked as effective subbands. Next, the modified CSPIHT sorting pass is conducted. Nodes in the List of Insignificant Pixels (LIP) are coded as in the original CSPIHT algorithm, while layer effectiveness is checked when judging the significance of nodes in the List of Insignificant Sets (LIS). A check on layer effectiveness is not necessary in the LIP because the check done in the LIS prevents nodes from non-effective subbands from entering the LIP. If a node (i,j,k) does not satisfy layer effectiveness, it is considered an insignificant node and undergoes the normal CSPIHT processing for insignificant nodes. If it does, the magnitude of node (i,j,k) is compared against the current threshold to finally determine whether it is significant or not.

A special step called layer ID protecting is also carried out during the sorting pass in the modified CSPIHT. This step is executed whenever a 1 is output to the encoded bit stream. In layer ID protecting, we increment the bit-counter by 1. When the bit-counter reaches k-1, a 0 is added to the encoded bit stream to prevent the occurrence of 1111...11. At the end of coding each layer, layer effectiveness is updated to the next layer and the layer ID is written to the encoded bit stream.

5. PERFORMANCE DATA

In this section, we discuss the coding performance of the modified CSPIHT video codec. Experiments are done with color QCIF format video sequences (foreman, suzie, news and container) at 10 frames per second. All experiments are performed on Pentium IV 1.6 GHz computers.

Fig. 7 shows frame by frame PSNR results for the foreman and the container sequences at three different resolutions: the lowest resolution in both the spatial and temporal dimensions (resolution 1), the medium resolution (resolution 2) and the highest resolution (resolution 3). Clearly, higher resolution results in higher PSNR. Foreman at resolution 1 has an average PSNR of 25.61 dB. When 3 more layers (layers 2, 3 and 4) are coded, the PSNR improves by 0.38 to 8.81 dB. In full resolution coding, the resulting PSNR can reach as high as 46.63 dB. The average PSNR results on foreman, news, container and suzie are given in Table 2.
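The paper does not restate how PSNR is computed. As a reminder, the following minimal Python sketch uses the standard definition for 8-bit samples, PSNR = 10 log10(255^2 / MSE). The peak value of 255, the flat sample lists, and the averaging of per-frame PSNR values are assumptions made for illustration; the function names are hypothetical.

```python
import math


def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equally sized sequences of sample values."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical planes
    return 10 * math.log10(peak ** 2 / mse)


def average_psnr(frame_pairs, peak=255):
    """Mean of per-frame PSNR values (one common convention, assumed here)."""
    values = [psnr(orig, rec, peak) for orig, rec in frame_pairs]
    return sum(values) / len(values)


# Toy luminance planes of two frames, flattened to lists (hypothetical data).
frames = [
    ([100, 100, 100, 100], [101, 99, 101, 99]),   # MSE = 1
    ([50, 60, 70, 80], [52, 58, 72, 78]),         # MSE = 4
]
print(round(psnr(*frames[0]), 2), round(average_psnr(frames), 2))
```

The same function would be applied separately to the luminance, Cb and Cr planes to obtain per-plane figures such as those reported in this section.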
Fig. 7 Frame by frame PSNR results on (a) foreman and (b) container sequences at 3 different resolutions (axes: PSNR(Y) in dB versus frame number; curves: resolution 1, resolution 2, resolution 3).

We compare the performance of the modified CSPIHT codec and the original CSPIHT codec in terms of PSNR, encoding time and compression ratio. All comparisons are done at the same bit rate and frame rate. The modified codec is run at resolution 1, i.e., only layer 1 is coded. The bit rate required to fully code layer 1 is computed and the original codec is run at this bit rate. In our experiment the bit rate is 216580 bps.

Fig. 8 gives a frame by frame comparison in terms of PSNR in the luminance and chrominance planes. In the luminance plane, the original codec outperforms the modified codec significantly. This is expected because confining the coding to the first resolution layer causes the coder to miss significant coefficients in the higher resolution subbands. These coefficients may be very large and discarding them can cause the PSNR to decrease significantly. However, in the chrominance planes, the modified codec performs on par with or even better than the original codec. Because chrominance coefficients are normally smaller than luminance coefficients, the effect of restricting the resolution is less significant. Visually, the modified codec gives a quite pleasant reconstruction with less brightness.

Table 3 shows the encoding time and the compression ratio of the two codecs. As the modified codec adds special markings to the encoded bit stream to support QoS implementation and lossy stream decoding, it takes more time to code and produces a less compressed bit stream. When coding 16, 128 and 256 frames, the original codec saves 0.37, 2.28 and 4.25 seconds respectively. The modified codec also has a lower compression ratio than the original CSPIHT due to the extra bits introduced by the layer ID.

             Resolution 1           Resolution 2           Resolution 3
             Lum    Cb     Cr       Lum    Cb     Cr       Lum    Cb     Cr
foreman      26.21  37.96  37.58    31.09  41.78  42.31    46.09  51.97  50.97
news         21.92  31.51  37.06    26.    37.57  42.71    48.42  50.01  52.18
container    21.62  37.88  36.      26.75  43.68  41.06    51.39  54.24  54.58
suzie        .31    46.02  44.88    35.06  50.32  49.59    52.09  54.69  54.32

Table 2 Average PSNR (dB) at 3 different resolutions

Fig. 8 PSNR (dB) comparison of the original and the modified codec on (a) luminance plane, (b) Cb plane and (c) Cr plane for the foreman sequence (axes: PSNR in dB versus frame number; curves: Original CSPIHT, Modified CSPIHT)

            16 frames    128 frames    256 frames    Compression ratio
Original    0.72 sec     5. sec        10.83 sec     1:825
Modified    1.09 sec     7.68 sec      15.08 sec     1:729
Δt          0.37 sec     2.28 sec      4.25 sec      N/A

Table 3 Encoding time and compression ratio of the original and modified codec

6. CONCLUSION

This paper presents a modified version of the CSPIHT codec that can be applied to real-time layered network transmission. Our codec achieves this at the expense of lower PSNR in the luminance plane. In the chrominance planes it performs on par with the original CSPIHT codec.

ACKNOWLEDGEMENT

The authors would like to thank Prof. Kamisetty R. Rao from the University of Texas at Arlington for his valuable inputs during his visit to our laboratory at the National University of Singapore in January 2003, and Mr. Tan Eng Hong for his valuable inputs on the 3-D CSPIHT.

REFERENCES

1. J.M. Shapiro, Embedded Image Coding Using Zerotrees of Wavelets, IEEE Transactions on Signal Processing, vol. 41, pp. 3445-3462, Dec. 1993.
2. A. Said and W.A. Pearlman, A New, Fast, and Efficient Image Codec Based on Set Partitioning in Hierarchical Trees, IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, pp. 243-250, Jun. 1996.
3. B.J. Kim, Z. Xiong and W.A. Pearlman, Low Bit-Rate Scalable Video Coding with 3-D Set Partitioning in Hierarchical Trees (3-D SPIHT), IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, pp. 1374-1386, Dec. 2000.
4. W.S. Lee and A.A. Kassim, Low Bit-Rate Video Coding Using Color Set Partitioning In Hierarchical Trees Scheme, International Conference on Communication Systems 2001, Singapore, Nov. 2001.
5. A.A. Kassim and W.S. Lee, Performance of the Color Set Partitioning In Hierarchical Tree Scheme (C-SPIHT) in Video Coding, Circuits, Systems and Signal Processing, vol. 20, pp. 253-270, 2001.
6. A.A. Kassim and W.S. Lee, Embedded Color Image Coding Using SPIHT with Partial Linked Spatial Orientation Trees, IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, pp. 203-206, Feb. 2003.
7. A.A. Kassim and E.H. Tan, 3D Wavelet Video Codec based on Color Set Partitioning in Hierarchical Trees (CSPIHT), in preparation.
8. T. Kim, S. Choi, R.V. Dyck and N. Bose, Classified Zerotree Wavelet Image Coding and Adaptive Packetization for Low-Bit-Rate Transport, IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 1022-1034, Sep. 2001.
9. C.K. Tham, Y.S. Gan and Y. Jiang, Congestion Adaptation and Layer Prioritization in a Multicast Scalable Video Delivery System, IFIP/IEEE International Conference on Management of Multimedia Networks and Services 2002, Santa Barbara, USA, Oct. 2002.
10. Y.S. Gan and C.K. Tham, Random Early Detection Assisted Layered Multicast, in Managing IP Multimedia End-to-End, IFIP/IEEE International Conference on Management of Multimedia Networks and Services 2002, pp. 341-353, Santa Barbara, USA, Oct. 2002.
11. J.Y. Tham, S. Ranganath, and A.A. Kassim, Highly Scalable Wavelet-based Video Codec for Very Low Bit Rate Environment, IEEE Journal on Selected Areas in Communications Special Issue on Very Low Bit-rate Video Coding, vol. 16, pp. 12-27, Jan. 1998.
12. E.C. Reed and F. Dufaux, Constrained Bit-Rate Control for Very Low Bit-Rate Streaming-Video Applications, IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 882-889, Jul. 2001.
13. J.R. Ohm, Three-Dimensional Subband Coding with Motion Compensation, IEEE Transactions on Image Processing, vol. 3, pp. 559-571, Sep. 1994.
14. J.W. Woods and G. Lilienfield, A Resolution and Frame-Rate Scalable Subband/Wavelet Video Coder, IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, pp. 1035-1044, Sep. 2001.
15. D. Wu, Y. Hou and Y.Q. Zhang, Scalable Video Coding and Transport over Broad-Band Wireless Networks, Invited Paper, Proceedings of the IEEE, vol. 89, pp. 6-20, Jan. 2001.