Audio Engineering Society Convention Paper
Presented at the 121st Convention, 2006 October 5-8, San Francisco, CA, USA

This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

Error-Robust Frame Splitting for Audio Streaming over the Lossy Packet Network

Jong Kyu Kim 1, Hwan Sik Yun 1, Jung Su Kim 1, Joon-Hyuk Chang 2, and Nam Soo Kim 1
1 School of Electrical Engineering, Seoul National University, Seoul 151-744, Korea
2 School of Electronic and Electrical Engineering, Inha University, Incheon 402-751, Korea
Correspondence should be addressed to Chong Kyu Kim (ckkim@hi.snu.ac.kr)

ABSTRACT
In this paper, we propose a novel audio streaming scheme for perceptual audio coders over packet-switching networks. Each frame is split into several subframes which are independently decoded and matched to the specified packet size for robust error concealment. We further improve the subframe splitting technique by adaptively allocating the spectral lines to each subframe. Through an informal listening test, it is found that our approach enhances the audio quality under lossy packet network environments.

1. INTRODUCTION
Audio streaming has become one of the most popular data services in mobile communications these days. Most audio streaming services are based on the packet-switching network, where messages are divided into packets and each packet is transmitted individually. In the mobile packet-switching network, one of the most typical types of error is packet loss.
Packet loss may arise in many different forms on the internet or wireless networks [1]. Under such packet loss conditions, it is crucial to guarantee the user-perceived quality of service (QoS). There are several practical techniques for audio streaming such as the error resilience (ER) and error protection (EP) tools in the MPEG Advanced Audio Coding (AAC) standard [2]. These tools can be applied to cope with bit errors caused by packet losses. When a packet loss occurs, error concealment is usually applied to substitute suitable data for the lost part. Error concealment algorithms are implemented at the receiver of the audio stream and usually do not require any side information from the transmitter [3]. The major objective of packet error concealment is to regenerate the lost data so that it is perceptually indistinguishable from the original.

There have also been several proposals on packetization schemes for error-robust audio streaming over packet-switching networks. The RTP payload format [4] defines a general and configurable payload structure to transport MPEG-4 elementary streams, which includes detection of the loss of crucial information in the bitstream, optional interleaving of audio frames, and retransmission or forward error correction with due consideration to congestion control. A more specific strategy for the packetization of audio bitstreams, exploiting the internal structure of the encoded audio frame, has been proposed in [5]. This strategy arranges MPEG-AAC frames in different packets according to proportional priority, considering the tradeoff between redundancy overhead and retransmission delay.

In this paper, we propose a novel frame splitting scheme for robust audio streaming over packet-switching networks. The proposed scheme can be applied to cases in which a single audio frame should be split into a number of separate packets. Such a situation happens when the size of the packet is smaller than that of the audio frame, or when the audio frame should be segmented into subblocks and interleaved. The proposed technique is found effective in enhancing the audio quality that may otherwise be degraded due to packet losses and the mismatch between the audio frame and packet sizes specified in the network and audio codec configurations.

The rest of this paper is organized as follows: in Section 2 we address the general structure of a perceptual audio coder and its bitstream, and discuss its defects when applied to transmission over a packet-switching network. Then we describe the proposed frame splitting scheme in Section 3 and present an adaptive frame splitting technique which prevents deterioration of coding efficiency in Section 4. Following the experimental results in Section 5, we conclude this paper in Section 6.

2. STRUCTURE OF AUDIO BITSTREAM
Generally, perceptual audio coders are developed with little attention to transmission errors.
In this section, we consider the problem of streaming compressed audio data over a lossy packet network. In conventional audio coding algorithms, each block of audio samples is converted to a frame of bitstream which is independently decoded. A block of input samples is transformed into a set of spectral lines in the frequency domain through a time-frequency transformation. Perceptual audio coding algorithms achieve a high coding gain by exploiting both the perceptual irrelevancies and the statistical redundancies in the spectral domain [6]. Perceptually irrelevant components are removed by adjusting the quantization stepsizes depending on the masking level computed from the psychoacoustic model. On the other hand, statistical redundancies are removed with the use of an entropy coding technique such as Huffman coding or DPCM. Consequently, the compressed audio bitstream consists of the entropy-encoded spectral information, side information which is used to decode each spectral line, and header information which conveys the configuration, e.g., sampling rate, number of channels, and so on. In general, this information is written sequentially as shown in Fig. 1.

As mentioned above, an audio frame is the smallest unit that can be decoded independently in a perceptual audio coder. For the transmission of the compressed audio data over a packet-switching network, the bitstream should be segmented into several packets of appropriate size. If the audio encoder has been developed without knowledge of the network specification, there usually exist some mismatches between the audio frame and packet sizes. If the frame border does not coincide with the packet border, the information in a frame spans two adjacent packets. When either of these two packets is missing at the receiver, the frame cannot be decoded perfectly; more precisely, only part of the frame can be decoded, depending on which part of the frame was lost.
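To make the frame/packet mismatch concrete, the following sketch (our illustration, not part of the paper; the frame and packet sizes are hypothetical) computes which frames are damaged when a single fixed-size packet is lost from a bitstream of concatenated variable-size frames:

```python
# Sketch: how fixed-size packets cut across variable-size audio frames,
# and which frames a single lost packet damages. The frame sizes and
# packet size below are illustrative assumptions.

def frames_hit_by_packet(frame_sizes, packet_size, lost_packet):
    """Return indices of frames that overlap the lost packet's byte range."""
    lost_start = lost_packet * packet_size
    lost_end = lost_start + packet_size
    hit, offset = [], 0
    for i, size in enumerate(frame_sizes):
        # Frame i occupies bytes [offset, offset + size) of the bitstream.
        if offset < lost_end and offset + size > lost_start:
            hit.append(i)
        offset += size
    return hit

# Three frames of unequal size streamed through 100-byte packets:
# losing packet 1 (bytes 100..200) damages every frame it overlaps.
print(frames_hit_by_packet([120, 150, 130], 100, 1))  # → [0, 1]
```

Because frame borders rarely coincide with packet borders, one missing packet typically corrupts two adjacent frames, which is exactly the mismatch the proposed splitting scheme is designed to remove.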
However, the chance of even partial decoding is low, since the audio bitstream is highly vulnerable to the consecutive bit errors caused by a missing packet. An even worse case arises when a single frame is segmented into several packets. This happens when the packet size should be smaller than the frame size or when a frame is segmented into partitions and interleaved

Fig. 1: Structure of the bitstream in the conventional audio coding algorithms
Fig. 2: Synchronization between audio frames and packets

over several packets. In this case, the loss of a single packet results in the loss of the whole audio frame even though all the other packets are received successfully.

Fig. 3: Sequential Splitting

3. FRAME SPLITTING
In order to cope with the mismatch between the audio frame and packet sizes and to achieve efficient audio data streaming which is robust to packet losses, we modify the conventional audio encoding technique. Our approach splits each audio frame into several subframes such that the size of a subframe matches the size of a packet and each subframe can be decoded independently. Even though the basic idea can be applied to various perceptual audio coders, we focus on the modification of MPEG-AAC in this work.

3.1. Splitting into Subframes
Every time a frame is encoded, the number of available packets is given so as to maintain time synchrony. This number comes from the network bandwidth configuration. As a result, the audio coder operates in a variable-rate mode. An example of assigning each packet to the corresponding audio frame is shown in Fig. 2, where we can see that the number of packets varies from frame to frame. One drawback of this scheme is that a frame which is split into a smaller number of packets is given fewer bits for audio compression, irrespective of its spectral contents. This effect becomes weaker as the packet size gets smaller.

In conventional perceptual audio coders, spectral lines are grouped into frequency bands which are referred to as scalefactor bands (as in MPEG-AAC). The spectral lines in each scalefactor band are entropy coded, with side information added to the bitstream. Since the spectral lines in each scalefactor band are coded jointly, it is desirable to split the audio frame by treating each scalefactor band as the basic unit.

Fig. 4: Interleaving Splitting

3.2.
Scalefactor Band Allocation Rule
The rule according to which the scalefactor bands are allocated to each packet can be chosen arbitrarily, as long as both the encoder and the decoder know it exactly. It is easy to devise two simple rules: the sequential and the interleaving splitting schemes. In the sequential splitting scheme, adjacent scalefactor bands are allocated to the same packet, as shown in Fig. 3. A major shortcoming of this scheme is that it creates a large spectral gap in the frequency domain when a packet is missing. To alleviate this deterioration, the interleaving scheme interleaves the order of the scalefactor bands before sequentially assigning them to each packet, as shown in Fig. 4. Since the interleaving operation disperses the effect of a packet loss, missing spectral lines which would appear as a large spectral gap in the original sequential splitting scheme are replaced by multiple
small gaps, which are perceptually preferred. This also helps error concealment because the missing spectral lines can be predicted based on the correlation with neighboring spectral lines. A disadvantage of the interleaving scheme is that it decreases the coding efficiency, since spectral lines collected over a wide frequency range have low redundancy. This results in a deterioration of the perceived audio quality at the same bitrate. Consequently, an optimal splitting rule should be designed based on a tradeoff between coding efficiency and error concealment.

In the splitting rule, the number of scalefactor bands allocated to each subframe is an important factor that determines the audio quality. This is due to the fact that spectral lines are distributed unequally over the whole frequency range. For instance, there are usually more spectral contents in the low frequency bands than in the high frequency bands. If every packet is assigned an equal number of scalefactor bands, the low frequency bands are likely to be coded with fewer bits than required. Adjusting the splitting rule according to the overall statistical distribution of the audio signals can alleviate this effect to some degree. However, the distribution of spectral information varies rapidly over time, and a fixed rule may not guarantee a proper splitting for some frames, leading to a degradation of the audio quality. A suboptimal splitting rule will be discussed in the following section.

3.3. Encoding
After the allocation of scalefactor bands according to the splitting rule mentioned above, each subframe is independently encoded. Encoding is executed based on the general audio coding algorithm with a slight modification. First, the number of available bits is given as a parameter for the separate encoding of each subframe. This parameter is used for rate control. Since each subframe should be quantized independently, it is unavoidable to modify the conventional audio coding algorithm.
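The sequential and interleaving allocation rules of Section 3.2 can be sketched as follows; this is a minimal illustration of the two mappings, with band and packet counts that are assumptions rather than the paper's configuration:

```python
# Sketch of the two scalefactor-band allocation rules of Section 3.2.
# Band indices stand in for scalefactor bands; counts are illustrative.

def sequential_split(num_bands, num_packets):
    """Adjacent scalefactor bands go to the same packet."""
    per_packet = -(-num_bands // num_packets)  # ceiling division
    return [list(range(p * per_packet, min((p + 1) * per_packet, num_bands)))
            for p in range(num_packets)]

def interleaving_split(num_bands, num_packets):
    """Scalefactor bands are dealt out round-robin across packets."""
    return [list(range(p, num_bands, num_packets))
            for p in range(num_packets)]

# With 8 bands and 2 packets, losing packet 0 removes bands 0-3 (one
# wide spectral gap) under sequential splitting, but only every other
# band (several narrow gaps) under interleaving.
print(sequential_split(8, 2))    # → [[0, 1, 2, 3], [4, 5, 6, 7]]
print(interleaving_split(8, 2))  # → [[0, 2, 4, 6], [1, 3, 5, 7]]
```

Either mapping is usable as long as encoder and decoder agree on it; interleaving simply trades some entropy-coding efficiency for more concealable loss patterns.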
Instead of computing a global gain over all the scalefactor bands, a separate global gain is obtained for each subframe, considering only the scalefactor bands that belong to it. Once the global gain is obtained, each subframe is fed into the rate control loop, which iteratively determines the quantization levels of the scalefactor bands according to the bitrate constraint [2]. The rate control loop is almost the same as that of the conventional audio coder. The only difference lies in that, in our approach, only the scalefactor bands that belong to the same subframe are considered simultaneously. After the bit allocation, the encoded data of each subframe is separately written as a bitstream for later packetization. The bitstream structure of each subframe is not much different from that of the normal audio frame specified in conventional audio coding. For decoding robust to packet loss, a header is added to every subframe. Even though this may be considered an overhead on the limited network bandwidth, the header data is usually much smaller than the other data that describes the audio contents.

4. ADAPTIVE SPLITTING
Now, what remains is how to optimally split the audio frame into a finite number of subframes. Splitting here means a mapping that allocates each scalefactor band to a specific subframe or packet. As mentioned in the previous section, a fixed allocation is not desirable for achieving high audio quality, despite its advantage that it does not require any side information to be delivered to the receiver. A more promising approach is to split the audio frame such that all the subframes are encoded with an equal level of coding efficiency, so that no specific subframe causes low audio quality. To measure the coding efficiency of each subframe, we apply the noise-to-mask ratio (NMR), which represents the ratio of the quantization noise to the masking threshold [2]. Let R_i denote the NMR for the i-th scalefactor band.
Then,

R_i = N_i / M_i   (1)

where N_i is the power of the quantization noise and M_i is the masking threshold computed from the psychoacoustic model for the i-th scalefactor band. Maintaining a constant NMR over all scalefactor bands is an objective of the rate-distortion control module of MPEG-AAC when the number of available bits is higher or lower than the number of required bits [2]. Analogous to this method, we also aim to allocate scalefactor bands to subframes such that all
the subframes have almost the same level of NMR.

For the adaptive frame splitting, we propose an algorithm that operates in an iterative manner. A flowchart of the overall algorithm is shown in Fig. 5. At the initial phase, scalefactor bands are allocated to each subframe with a default splitting rule, and then each subframe is encoded. After encoding, the NMR of each subframe is calculated, and it is checked whether the NMRs are equally distributed. If the NMRs are found to be unbalanced, the scalefactor bands are reallocated by increasing the number of scalefactor bands in the subframe with the maximum NMR while decreasing the number of scalefactor bands in the subframe with the minimum NMR. Then the process of encoding and NMR computation is executed again. As this iteration continues, the number of scalefactor bands allocated to each subframe converges, as shown in Fig. 6, in which the number of scalefactor bands in each subframe is plotted. The iteration stops when the frame splitting does not change any further.

Information on the adaptive splitting should be included in the bitstream of each subframe; the decoder arranges the decoded scalefactor bands according to it. For independent decoding, each packet should carry the location index as well as the number of scalefactor bands in its subframe. In our implementation, we assign 5 bits to the starting location index and another 5 bits to the number of scalefactor bands. An example of frame splitting with the relevant information to be coded is given in Table 1.

Subframe Index:                1   2   3   4   5
First Scalefactor Band Index:  0   8   19  28  33
Last Scalefactor Band Index:   7   18  27  32  37
Starting Location Index:       8   11  6
Number of Scalefactor Bands:   7   11  9   5   5

Table 1: Representation of Allocation Information.

5. TEST RESULTS
To evaluate the performance of the proposed scheme, we implemented the frame splitting module on the MPEG-AAC platform. For simplicity, we made several modifications to the original MPEG-AAC algorithm.
Fig. 5: A flowchart of the overall frame splitting algorithm

First, we did not apply the block switching technique, so that the audio analysis was performed based on the long block only. Second, additional encoding tools such as temporal noise shaping
(TNS) and gain control were not applied. Finally, the bit reservoir was not adopted.

Sampling Rate:       11,025 Hz
Frame Size:          92.88 ms
Input File Length:   40 s
Number of Channels:  Mono

Table 2: Test audio coder specifications.

Bitrate:                  8.4 kbps
Packet Length:            20 ms
Packet Size:              168 bits
PER (Packet Error Rate):  over 20%

Table 3: Packet network specifications.

The specifications for the audio frame, packet size and network conditions used in the experiments are shown in Tables 2 and 3. These specifications were derived from an audio streaming application in which the input signal is compressed by MPEG-AAC and then transmitted over a Code Division Multiple Access (CDMA) packet-switching network. When a packet loss occurred, an error concealment algorithm was applied to reconstruct the missing spectral lines. For the error concealment, we took the simple repetition strategy, in which the lost spectral lines were substituted with the spectral components that had been successfully received most recently. If packet losses occurred continuously (burst packet loss), the corresponding spectral lines were faded out exponentially and muted after a certain number of consecutive packet losses. The same error concealment scheme was applied to both the original audio coder and the proposed one.

Fig. 6: An iteration to find the numbers of scalefactor bands assigned to the subframes

Fig. 7: Decoded Waveform

A comparison of the waveforms decoded from an audio bitstream damaged by packet losses is given in Fig. 7. The first plot shows the original input waveform, and the second plot is the waveform obtained from the conventional MPEG-AAC decoder. The third plot displays the waveform obtained from the proposed frame splitting algorithm. The graph at the bottom illustrates the applied error sequence, where 0 represents no packet error and 1 indicates a packet loss. At the locations where packet losses occurred, the original algorithm could not decode the received frames and faded out the waveform. In contrast, the proposed algorithm recovered the partly
lost audio frames, and the lost spectral lines could be concealed more faithfully. This example clearly demonstrates the advantage of our frame splitting technique in lossy packet environments.

For a further evaluation of the performance, an informal listening test was carried out with ten listeners. All ten subjects expressed the opinion that the decoded audio obtained from the proposed approach had much less interruption caused by packet losses compared to that from the original MPEG-AAC.

6. CONCLUSIONS
In this paper, we have proposed a frame splitting scheme for perceptual audio coding algorithms. Each subframe is independently encoded such that it fits the specified packet size. Received packets are independently decoded without being affected by other missing packets. An informal subjective listening evaluation has shown that the suggested scheme dramatically improves the audio streaming quality under lossy packet network environments.

7. ACKNOWLEDGEMENT
This work was supported by SK Telecom, and the authors would like to thank Dr. D. H. Lee, Dr. S. S. Park and D. S. Woo at SK Telecom for their helpful discussions.

8. REFERENCES
[1] Y. Wang, A. Ahmaniemi, D. Isherwood and W. Huang, "Content-based UEP: A new scheme for packet loss recovery in music streaming," ACM Multimedia Conference, Berkeley, CA, USA, Nov. 2003.
[2] ISO, "Information Technology - Coding of Audio-Visual Objects," ISO/IEC JTC1/SC29 WG11, ISO/IEC IS-14496 (Part 3, Audio), 1999.
[3] B. W. Wah, X. Su and D. Lin, "A survey of error concealment schemes for real-time audio and video transmissions over the internet," IEEE International Symposium on Multimedia Software Engineering, Taipei, Taiwan, pp. 17-24, Dec. 2000.
[4] J. van der Meer, D. Mackie, V. Swaminathan, D. Singer and P. Singer, "RTP payload format for transport of MPEG-4 elementary streams," IETF RFC 3640, 2003.
[5] J. Korhonen, Y. Wang and D. Isherwood, "Toward bandwidth-efficient and error-robust audio streaming over lossy packet networks," Multimedia Systems Journal (MMSJ), 2005.
[6] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proc. IEEE, April 2000.