Audio Engineering Society. Convention Paper. Presented at the 126th Convention, 2009 May 7-10, Munich, Germany

Convention Paper 7712

The papers at this Convention have been selected on the basis of a submitted abstract and extended precis that have been peer reviewed by at least two qualified anonymous reviewers. This convention paper has been reproduced from the author's advance manuscript, without editing, corrections, or consideration by the Review Board. The AES takes no responsibility for the contents. Additional papers may be obtained by sending request and remittance to Audio Engineering Society, 60 East 42nd Street, New York, New York 10165-2520, USA; also see www.aes.org. All rights reserved. Reproduction of this paper, or any portion thereof, is not permitted without direct permission from the Journal of the Audio Engineering Society.

for transitions between LPC-based and non-LPC based audio coding

Jérémie Lecomte 1, Philippe Gournay 2, Ralf Geiger 1, Bruno Bessette 2 and Max Neuendorf 1

1 Fraunhofer IIS, Erlangen, 91058, Germany
2 Université de Sherbrooke, Sherbrooke, Québec, J1K2R1, Canada

Correspondence should be addressed to Jérémie Lecomte (amm-info@iis.fraunhofer.de)

ABSTRACT

The reference model selected by MPEG for the forthcoming unified speech and audio codec (USAC) switches between a non-LPC based coding mode (based on AAC) operating in the transform domain and an LPC-based coding mode (derived from AMR-WB+) operating either in the time domain (ACELP) or in the frequency domain (wLPT). Seamlessly switching between these different coding modes required the design of a new set of cross-fade windows optimized to minimize the amount of overhead information sent during transitions between LPC-based and non-LPC based coding. This paper presents the new set of windows, which was designed to provide an adequate trade-off between overlap duration and time/frequency resolution, and to maintain the benefits of critical sampling through all coding modes.

1. INTRODUCTION

It is widely known that time domain codecs based on a Linear Predictive Coding (LPC) representation, such as CELP and its derivatives, perform better on speech signals, while frequency domain codecs based on Modified Discrete Cosine Transform (MDCT) decomposition, such as the Advanced Audio Coding (AAC) codec, perform better on music and on general audio signals. In the past few years, there has been a growing demand for a codec capable of good performance on both speech and music signals at low bit rates (below 64 kbps). To fulfill this need, MPEG issued a Call for Proposals for a unified speech and audio codec (USAC) in 2007 [1] and a first reference model was selected in July 2008. This reference model is a switched codec which makes use of either a non-LPC based transform domain audio core codec or an LPC-based core codec.

Seamlessly switching between these different core codecs requires the use of properly designed cross-fade windows. This paper presents the new set of windows which was designed for the USAC codec. The goal of this new set of windows is to minimize the amount of overhead information sent during transitions between LPC-based and non-LPC based coding, to provide an adequate trade-off between overlap duration and time/frequency resolution, and to maintain the benefits of critical sampling through all coding modes.

The paper is organized as follows. The AMR-WB+ and HE-AAC codecs which form the basis of the USAC codec are briefly reviewed in section 2. Section 3 gives an overview of the USAC codec. The set of windows developed for transitions between LPC-based and non-LPC based core coding in USAC is described in section 4. Section 5 gives an overview of the complete family of windows. Finally, conclusions are drawn in section 6.

2. STATE-OF-THE-ART SPEECH AND AUDIO CODECS

This section briefly describes the main characteristics of the two standards, AMR-WB+ and HE-AAC, which form the basis of the new USAC codec.

2.1. An LPC-based codec: AMR-WB+

The Extended AMR-WB (AMR-WB+) [2] audio coder is a multi-rate audio coder, capable of encoding mono and stereo signals at bit rates ranging from 6 to 48 kbps. Based on LPC, the coder uses a multi-mode encoding model which can switch, on a frame-by-frame basis, between time domain and frequency domain encoding. In time domain mode, the input signal is encoded using the Algebraic Code Excited Linear Prediction (ACELP) encoder from the 3GPP AMR-WB speech coding standard [3]. In frequency domain mode, an LPC-weighted version of the input signal is encoded in the FFT domain using Transform Coded Excitation (TCX). The input signal is split into super-frames of 1024 samples. Each super-frame can be divided into frames of 256, 512 or 1024 samples.
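As a rough check of this framing, the possible ways of tiling a super-frame can be enumerated. The sketch below assumes, in line with this section, that ACELP is restricted to 256-sample frames while TCX may use 256, 512 or 1024 samples, and additionally assumes that every frame starts on a multiple of its own length. The labels and the function names `tilings` and `frame_length` are illustrative only and are not taken from the AMR-WB+ specification.

```python
from itertools import product

SUPERFRAME = 1024  # samples per super-frame
SHORT = 256        # shortest frame length in samples


def tilings(length):
    """Return every mode sequence that covers `length` samples.

    Assumption: ACELP exists only for 256-sample frames, TCX for 256, 512
    and 1024 samples, and frames start on multiples of their own length.
    """
    if length == SHORT:
        return [("ACELP",), ("TCX256",)]
    halves = tilings(length // 2)
    # Either split the segment into two independently coded halves ...
    split = [left + right for left, right in product(halves, halves)]
    # ... or cover it with a single TCX frame of the full length.
    return split + [(f"TCX{length}",)]


def frame_length(label):
    """Length in samples of one of the illustrative mode labels."""
    return SHORT if label == "ACELP" else int(label[3:])


combinations = tilings(SUPERFRAME)
print(len(combinations))
```

Under these assumptions there are 2 choices per short frame, 2 x 2 + 1 = 5 per 512-sample half, and 5 x 5 + 1 = 26 per super-frame.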
Only short frames (nominally 256 samples) are used in time domain mode, while short, medium or long frames (nominally 256, 512 or 1024 samples) are used in frequency domain mode. A super-frame can therefore be encoded using 26 different ACELP/TCX mode combinations (each 512-sample half is either one medium TCX frame or one of the four ACELP/TCX pairs of short frames, which gives 5 x 5 = 25 combinations, plus the single long TCX frame). Although some open-loop strategies exist, the mode combination is normally determined by a closed-loop mode selection procedure which minimizes the total weighted error.

The AMR-WB+ codec also includes tools for bandwidth and stereo extension. In bandwidth extension, the upper half band of the signal is encoded at a very low bit rate (typically 800 bps) using a parametric approach which relies on spectral folding and spectral envelope shaping (using an LP filter). The stereo image of the input audio signal is encoded using a mid/side representation and a sub-band coding approach.

2.2. A non-LPC based codec: HE-AAC

MPEG-4 Advanced Audio Coding (AAC) [4] is a generic audio coding scheme. It utilizes the Modified Discrete Cosine Transform (MDCT) [5] to represent the audio signal in the frequency domain. The quantization and coding of the MDCT spectrum is controlled by a perceptual model. Several additional coding tools ensure that the codec provides efficient coding for general audio signals. These tools include a time-variant MDCT filter bank, which allows switching between finer frequency resolution and finer time resolution. To this end, two transform lengths of 1024 and 128 samples are available. Table 1 shows the AAC standard windows. The two MDCT lengths are realized in the long and short window respectively. The transitional start and stop windows are used to lead over from one transform length to the other.
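The relation between these four window types can be sketched numerically. The example below is a minimal illustration that assumes sine-shaped slopes (AAC also defines Kaiser-Bessel derived slopes) and a start window made of a long rising slope, a 448-sample flat part, a 128-sample falling slope and 448 zeros; the function names are hypothetical. It checks that the short falling slope of the start window and the rising slope of a short window are power complementary, which is what allows one transform length to lead over to the other.

```python
import math

LONG = 1024                 # long transform length (window spans 2 * LONG samples)
SHORT = 128                 # short transform length (window spans 2 * SHORT samples)
FLAT = (LONG - SHORT) // 2  # 448 samples of flat or zero region


def sine_window(n_coeffs):
    """Sine window spanning 2 * n_coeffs samples."""
    size = 2 * n_coeffs
    return [math.sin(math.pi * (n + 0.5) / size) for n in range(size)]


def start_window():
    """Long rising slope followed by a short falling slope (assumed layout)."""
    long_w, short_w = sine_window(LONG), sine_window(SHORT)
    return (long_w[:LONG]        # long rising slope, overlaps the previous long block
            + [1.0] * FLAT       # flat part
            + short_w[SHORT:]    # short falling slope, overlaps the first short block
            + [0.0] * FLAT)      # zero tail


def stop_window():
    """Time-reversed start window, leading back from short to long blocks."""
    return start_window()[::-1]


start, stop = start_window(), stop_window()
short_rise = sine_window(SHORT)[:SHORT]
fall = start[LONG + FLAT:LONG + FLAT + SHORT]
# Largest deviation from fall^2 + rise^2 = 1 over the 128-sample overlap.
worst = max(abs(f * f + r * r - 1.0) for f, r in zip(fall, short_rise))
print(len(start), len(stop), worst)
```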
[Table 1: Transform windows in AAC standard. Rows: long window, start window, eight short windows, stop window; column "General Design" (drawings of the window shapes).]

To further reduce the bitrate, High Efficiency AAC (HE-AAC) [4] combines an AAC core in the low frequency band with a parametric coding approach for the high frequency band (Spectral Band Replication: SBR). The high frequency band is reconstructed from replicated low frequency signal portions, and is controlled by parameter sets containing level, noise and tonality adjustment parameters.

HE-AAC inherits its generic M/S stereo and multichannel coding capabilities from AAC. The further enhanced HE-AACv2 possesses a parametric stereo coding tool, which extracts binaural cues from the input channels that are transmitted in addition to a mono downmix. Similarly, HE-AAC can be paired with MPEG Surround, a generic parametric multichannel extension, to efficiently code multiple channels [6].

3. UNIFIED SPEECH AND AUDIO CODING (USAC)

This section describes the context of USAC development and gives a brief overview of its reference model version 0 (RM0).

3.1. Motivation for USAC

At the 82nd MPEG Meeting in Shenzhen, China, in 2007 the MPEG Audio Subgroup issued a Call for Proposals (CfP) on Unified Speech and Audio Coding [1]. The aim of this activity is to standardize an audio codec which performs consistently and equally well for speech, music and mixed content over a large bitrate range, and at the same time reaches the quality of the best performing state-of-the-art codecs for each type of content.

Seven candidates were provided as responses to the CfP. The evaluation of these responses was based on nine listening tests covering bitrates from 12 kbps mono up to 64 kbps stereo and assessing performance on all signal types. Eventually a candidate designed jointly by VoiceAge Corporation and Fraunhofer IIS was selected as the RM0 for USAC. At this stage, the proposed technology already performed as well as or better than the two state-of-the-art reference codecs at all test points.

3.2. RM0 system overview

The technology in USAC RM0 combines state-of-the-art MPEG technology such as AAC, SBR and MPEG Surround with state-of-the-art LPC based speech coder technology such as ACELP and TCX. At the core of the USAC RM0 is a hybrid coding scheme with two core codecs, derived from AAC and AMR-WB+. A mode switch selects one of the two coders (see Figure 1).

Figure 1: Simplified encoder diagram of the USAC RM0

As a characteristic feature, the core coder derived from AMR-WB+ applies an LPC tool on the input signal as one of the first processing steps, making the rest of the signal flow operate in an LPC-filtered domain (LPD). The input signal is further encoded using both time and frequency coding, where switching between both modes can be performed on a frame-by-frame basis. The time domain mode of the LPD coder is based on ACELP technology, which is known for its excellent speech coding capability. Alternatively, an MDCT-based transform coder allows coding of the weighted LPC filtered signal, similar to the TCX known from AMR-WB+. In order to better distinguish the original TCX from the new MDCT-based coding used in the RM0, the latter is called weighted Linear Predictive Transform (wLPT) coding throughout this paper. The LPD coding path is typically activated for speech-like signals. The other, AAC-derived, non-LPD coding algorithm is used for general audio and music signals; at high bit rates it is usually used exclusively, because it scales towards transparency as known from existing MPEG technologies.

For bandwidth extension and parametric stereo coding, modified and enhanced versions of SBR [4] and MPEG Surround [6] are used on top of the switched AAC and ACELP/wLPT core. More details on the USAC reference model 0 can be found in [7, 8].

4. THE NEW SET OF WINDOWS DEVELOPED FOR USAC

Using two fundamentally different coding paradigms in one unified system poses a series of problems at the transition points where one core codec switches over to the other: risk of blocking artifacts, possible overhead of information required by transitions, and the necessity for constant framing. In the USAC framework all this is particularly challenging because the non-LPD core codec uses an MDCT. The MDCT allows an overlapping of adjacent blocks by a maximum of 50% without introducing additional overhead. This is particularly helpful to smooth blocking artifacts, but it introduces Time Domain Aliasing (TDA), which has to be canceled out during synthesis [5]. Time Domain Aliasing Cancellation (TDAC) is achieved by an adequate overlap-add operation of adjacent MDCT blocks on the synthesis side.

In USAC however, adjacent blocks can be coded using the LPD coder, which has either:
a) Time Domain Aliasing (TDA) in a weighted LPC domain (not in the signal domain), or
b) no TDA at all.

In order to allow proper aliasing cancellation with the non-LPD mode (which introduces aliasing in the signal domain), the required aliasing components must be converted into the signal domain (case a) or introduced artificially by simulating the MDCT operations of analysis windowing, folding, unfolding and synthesis windowing (case b). Another solution to this problem is the design of MDCT analysis/synthesis windows without a TDAC region. The overlap-add operation is then the same as a simple cross-fade over the range of the window slope. Both methods are used in USAC RM0. In order to get the necessary and appropriate overlap areas for cross-fade and TDAC, a slightly different time alignment between the two coding modes had to be introduced, as explained in section 4.4.1.

4.1. Categories of Windows

The complete set of windows is divided into 4 categories depending on the coding mode of the previous, current and following frames:
- The first category of windows is used when the core coder stays in the LPD mode. This case is presented in section 4.2.
- The second category of windows is used when the core coder stays in the non-LPD mode. This case is presented in section 4.3.
- The last two categories deal with transitions between the two coding modes. Transitions from the non-LPD mode to the LPD mode are presented in section 4.4.1. Transitions from the LPD mode to the non-LPD mode are presented in section 4.4.2. Two special cases are considered, depending on the coding mode (wLPT or ACELP) of the LPD frame.

Figure 2 represents a basic scheme for switching back and forth between the LPD and the non-LPD modes. In the presented case the LPC processed block corresponds to four AMR-WB+ frames or ACELP frames (of size 256 samples).

4.2. LPD mode to LPD mode

When comparing the LPD mode in USAC to the original AMR-WB+ codec, it can be seen that the TCX filterbank was replaced by an MDCT. In this wLPT, the aliasing is computed in the weighted LPC domain (i.e. after the weighting filter W(z)) as shown in Figure 3. Therefore the original window switching procedure presented in section 5.3 of [2] had to be modified. In the original TCX, the right hand slope of the window covers 1/9th of the entire window length and the left hand slope length is equal to the right hand slope length of the previous frame to achieve perfect reconstruction. In wLPT, the use of the MDCT allows larger and homogeneous overlap regions; therefore the overlap size is fixed to 128 samples.

Figure 3: MDCT computation for the wLPT: aliasing occurs in the weighted LPC domain

4.3. Non-LPD mode to non-LPD mode

If both the previous and the following frames are encoded in the non-LPD mode, then generally the regular AAC transform windows presented in Table 1 are used. Short windows, which have a better time resolution, are used to avoid pre- or post-echoes on transients caused by temporal smearing of the quantization noise. However, since coding efficiency is generally worse for lower frequency resolutions, the coder switches to short windows only when necessary and back to normal long windows as soon as possible. Unfortunately the windowing specification presented in [4] does not allow switching from short windows to an isolated long window and immediately back to short windows. Instead transition windows, called stop and start windows, must be used to switch from short to long windows and vice versa.

In addition to the regular AAC transition windows, a new combined stop_start window producing an isolated long block was introduced in USAC RM0. Figure 4 shows this new window, which is useful to avoid undesirable latency for closely spaced transients, e.g. in case of transients occurring at a time interval longer than one long frame but shorter than two. In such a case it would make no sense to apply the short-stop-start-short sequence authorized by the AAC standard, because the codec might miss the transient and might also apply short windows to a possibly stationary signal.

Figure 4: Stop_start window sequence

4.4. Transitions between LPD and non-LPD

The following transition procedures represent the main innovation of this paper. The main challenge is to connect smoothly two different domains.
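Before turning to the individual transitions, the aliasing mechanism recalled at the start of section 4 can be reproduced numerically. The following minimal sketch uses a direct (slow) MDCT with a sine window on a toy block size. The block size N = 16, the 4-sample slope and all function names are assumptions made for illustration and are not the USAC RM0 window definitions. It shows three things: a single decoded block contains aliasing, the overlap-add of two adjacent blocks cancels it (TDAC), and a window whose right slope ends at the folding point produces no aliasing on that side, so its output is a plain fade-out suitable for a simple cross-fade.

```python
import math
import random


def mdct(block):
    """MDCT of a block of 2N samples, returning N coefficients."""
    n_coeffs = len(block) // 2
    return [
        sum(x * math.cos(math.pi / n_coeffs * (n + 0.5 + n_coeffs / 2) * (k + 0.5))
            for n, x in enumerate(block))
        for k in range(n_coeffs)
    ]


def imdct(coeffs):
    """Inverse MDCT returning 2N time-aliased samples (scaled by 2/N)."""
    n_coeffs = len(coeffs)
    return [
        2.0 / n_coeffs * sum(
            c * math.cos(math.pi / n_coeffs * (n + 0.5 + n_coeffs / 2) * (k + 0.5))
            for k, c in enumerate(coeffs))
        for n in range(2 * n_coeffs)
    ]


def code_block(block, window):
    """Analysis windowing, MDCT, IMDCT and synthesis windowing of one block."""
    windowed = [w * x for w, x in zip(window, block)]
    return [w * y for w, y in zip(window, imdct(mdct(windowed)))]


N = 16
random.seed(0)
signal = [random.uniform(-1.0, 1.0) for _ in range(3 * N)]
sine = [math.sin(math.pi * (n + 0.5) / (2 * N)) for n in range(2 * N)]

first = code_block(signal[:2 * N], sine)   # covers samples 0 .. 2N-1
second = code_block(signal[N:], sine)      # covers samples N .. 3N-1

# One block alone: its second half differs from the merely windowed signal.
aliasing = max(abs(first[N + n] - sine[N + n] ** 2 * signal[N + n]) for n in range(N))
# Overlap-add of both blocks: the aliasing terms cancel (TDAC).
tdac_error = max(abs(first[N + n] + second[n] - signal[N + n]) for n in range(N))

# Window whose right slope (4 samples here) ends at the folding point 3N/2.
slope = 4
no_tda = (sine[:N] + [1.0] * (N // 2 - slope)
          + [math.cos(math.pi * (n + 0.5) / (2 * slope)) for n in range(slope)]
          + [0.0] * (N // 2))
faded = code_block(signal[:2 * N], no_tda)
# The right half is now the signal weighted by the squared window, nothing else.
crossfade_error = max(
    abs(faded[N + n] - no_tda[N + n] ** 2 * signal[N + n]) for n in range(N))
print(aliasing, tdac_error, crossfade_error)
```

The first printed value is clearly non-zero, while the other two are at the level of floating-point rounding.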
Section 4.4.1 presents transitions from the non-LPD to the LPD mode, while section 4.4.2 presents transitions from the LPD to the non-LPD mode.

4.4.1. Non-LPD mode to LPD mode

The LPD codec includes some predictors and internal filters which, during start-up, need a short time to reach a state which ensures an accurate filter synthesis. Using a rectangular window at the beginning of the first LPD frame and resetting the LPD-based codec to a zero state is therefore not the ideal option for these transitions, because it would not leave enough time for the LPD codec to build up a good signal and as a result would introduce blocking artifacts.

Using a rectangular window but properly resetting the internal state of the LPD codec, including filter memories and the adaptive codebook used by ACELP, from past synthesis samples of the previous non-LPD frame was also considered. This operation requires, among other things, decoding the previous non-LPD frame, performing an LPC analysis, and applying the LPC analysis filter to the non-LPD synthesis signal. The resulting quality improvement is, however, minimal, and given the large increase in complexity this approach was not pursued further.

Introducing time domain aliasing in the original signal before LPD coding is not feasible either, because time domain aliasing is not compatible with prediction-based time domain coding such as ACELP. A possibility was to introduce an artificial aliasing at the beginning of the LPD segment and to apply TDAC in the same way as for ACELP to non-LPD transitions (see 4.4.2). However, in this case the artificial aliasing is produced from the synthesis signal instead of the original one. Since the synthesis signal is inaccurate, especially at the LPD start-up, the introduction of artificial TDA would emphasize this error rather than reduce artifacts.

To avoid these problems, a modified start window without any time domain aliasing on its right side was designed. The right part of this window, which is represented in Figure 5, finishes before the centre of the TDA (i.e. the folding point) of the MDCT. Consequently, the modified start window is free of time domain aliasing on its right side. Compared to the standard short window, which has an overlap of 128 samples (including TDA), the overlap region of the modified start window is reduced to 64 samples. This overlap region is however still sufficient to smooth the blocking effect. Furthermore, it reduces the impact of the inaccuracy due to the start of the LPD coder by feeding it with a faded-in input. Note that this transition requires an overhead of 64 samples, i.e. 64 samples are coded by both the non-LPD codec and the LPD codec. This results in a small difference in alignment between the non-LPD and the LPD core codecs. This small misalignment is compensated when the codec switches back again to the non-LPD codec, as explained in section 4.4.2.

Figure 5: Window scheme for transitions from the non-LPD mode to the LPD mode

4.4.2. LPD mode to non-LPD mode

This section describes transitions from the LPD to the non-LPD mode. Two cases are considered: first, transitions from wLPT to the non-LPD mode; then, transitions from ACELP to the non-LPD mode. The examples are given for the stop-like case only, i.e. when the second half of the first non-LPD window has a 1024-sample overlap. As explained in section 5, the complete family of transition windows also includes the stop_start-like case, i.e. when the second half of the first non-LPD window has a 128-sample overlap.

wLPT to non-LPD mode

Figure 6 shows an example of transition from the LPD to the non-LPD mode, when the previous frame was coded using wLPT. Unlike ACELP, which only codes the samples inside the frame, wLPT naturally provides some overlap after the end of the previous frame. Therefore, the standard AAC window length (1024-sample kernel) can be used and critical sampling is preserved. Note that in this case the overlap provided by wLPT is sufficient to compensate the misalignment between the LPD and non-LPD core codecs.

Figure 6: Transitions from the LPD to non-LPD mode, when the previous frame was coded using wLPT

During these transitions, the TDA introduced by the non-LPD mode in the overlap region (which is 128 samples long) has to be canceled out. The normal way to do this is to use the aliasing present in the previous frame, that is to say present in the overlap part of the MDCT-based wLPT frame. But TDAC is not straightforward in this case, because the aliasing in the two frames was produced in two different domains (LPD and non-LPD mode). The MDCT in wLPT is not computed directly in the signal domain, but after filtering the signal with a filter W(z) based on the LPC coefficients. W(z) is called the weighted analysis filter; it both whitens the input signal and shapes the quantization noise according to a formant-based curve, in line with psychoacoustic theory. Therefore, in order to have the aliasing contribution of the wLPT overlap part in the same domain as in AAC, i.e. in the signal domain, the weighting filter W(z) is moved between the folding operation and the DCT-IV of the MDCT in wLPT. Figure 7 shows the modification compared to Figure 3 in the specific case of overlap-and-add with a non-LPD frame.
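This modification relies on the fact that an MDCT can be split into a folding step, which creates the time domain aliasing, followed by a DCT-IV. The sketch below, in plain Python on a toy block size, checks this factorisation against the defining sum of the MDCT. The `stage` argument only marks the position at which a filter such as W(z) would sit; it is a hypothetical placeholder, since the actual filtering and its state handling in USAC RM0 are not specified in this paper. Windowing is assumed to have been applied before the call, and all names are illustrative.

```python
import math
import random


def mdct_direct(block):
    """Reference MDCT computed from its defining sum (2N samples -> N coefficients)."""
    n_coeffs = len(block) // 2
    return [
        sum(x * math.cos(math.pi / n_coeffs * (n + 0.5 + n_coeffs / 2) * (k + 0.5))
            for n, x in enumerate(block))
        for k in range(n_coeffs)
    ]


def fold(block):
    """Fold 2N samples (a, b, c, d) into N samples (-c_r - d, a - b_r).

    a, b, c, d are the four quarters of the block; _r denotes time reversal.
    """
    q = len(block) // 4
    a, b, c, d = (block[i * q:(i + 1) * q] for i in range(4))
    first = [-c[q - 1 - i] - d[i] for i in range(q)]
    second = [a[i] - b[q - 1 - i] for i in range(q)]
    return first + second


def dct_iv(samples):
    """Unnormalised DCT-IV of N samples."""
    size = len(samples)
    return [
        sum(x * math.cos(math.pi / size * (n + 0.5) * (k + 0.5))
            for n, x in enumerate(samples))
        for k in range(size)
    ]


def mdct_with_stage(block, stage=lambda folded: folded):
    """MDCT expressed as fold -> optional processing stage -> DCT-IV.

    `stage` is a hypothetical hook marking where a weighting filter could be
    inserted between folding and the DCT-IV; by default it does nothing.
    """
    return dct_iv(stage(fold(block)))


random.seed(1)
N = 16
block = [random.uniform(-1.0, 1.0) for _ in range(2 * N)]
mismatch = max(abs(u - v) for u, v in zip(mdct_direct(block), mdct_with_stage(block)))
print(mismatch)
```

With the default stage the two computations agree to rounding precision, which confirms that the aliasing is created entirely by the folding step, in whatever domain the folded signal lives.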

Figure 7: Time domain aliasing introduction in the wLPT overlap segment for wLPT to non-LPD mode transitions

ACELP to non-LPD mode

Figure 8 shows an example of a transition from the LPD mode to the non-LPD mode, when the previous frame was encoded using ACELP. This case is characterized by a transition from a codec operating in the LPC residual domain to a codec operating directly in the signal domain. The time domain aliasing introduced by the AAC codec in the overlap region (which is 128 samples long) is kept unchanged. The corresponding aliasing, required for perfect reconstruction, is artificially introduced in the right end of the synthesized ACELP frame delivered by the LPC-based codec. Time domain aliasing cancellation is achieved by windowing, folding, unfolding and windowing again the ACELP contribution, then overlap-adding it to the non-LPD mode contribution, in an MDCT/IMDCT manner. This process is illustrated in Figure 9. This approach requires the introduction of 64 overhead samples only.

Figure 8: Window scheme for transitions from the LPD mode to the non-LPD mode, when the previous frame was coded using ACELP

Figure 9: Artificial time domain aliasing introduction and time domain aliasing cancellation for ACELP to non-LPD mode transitions

Unlike wLPT, ACELP does not provide any overlap region which can be used to compensate the difference in alignment between the LPD and the non-LPD modes. Since the non-LPD mode is very flexible in terms of window length, number of transmitted spectral coefficients, and bit allocation, it was chosen to let this codec provide the necessary number of overhead samples. The window size for this transition has therefore been enlarged from 2048 to 2304 samples. The flat region on the left side of this window is 128 samples longer than the standard length of 448 samples to compensate for the misalignment. A new MDCT kernel of 1152 samples was consequently introduced in the USAC codec.

5. OVERVIEW OF THE NEW FAMILY OF WINDOWS

Figure 10 shows all possible windows and the allowed changeovers between windows. Here, circles represent the different windows; lines indicate allowed sequences of windows. Please note that circles marked with an asterisk (*) actually represent a group of windows, which all share the same characteristics in terms of underlying transform length (short or long), window slope (short or long) and coding mode (LPD or non-LPD), but may vary in detail depending on the adjacent windows. The general appearance is hinted at by small icons in the lower part of the circles. For example, as shown in section 4.4.2, two stop windows are available depending on the core mode of the previous frame, one of which has an increased transform length (1152 instead of 1024), but the overall appearance is the same for both (non-LPD mode, short slope on the left, long transform, long slope on the right). Similarly, there are two start windows, depending on the following core mode. For the stop_start window, as many as four different variants are conceivable.

The lines indicating the allowed succession of windows are accompanied by a three-digit code [x1 x2 x3], which helps to understand on what condition a particular line would be followed. The first digit x1 indicates whether the frame will be coded in the LPD mode (=1) or in the non-LPD mode (=0). If in non-LPD mode (first digit is 0), the second digit x2 indicates whether an attack (i.e. a transient audio event) is present in the frame (=1) or not (=0), essentially triggering the use of short windows. The last digit x3 looks one frame further ahead and indicates whether the frame following the current frame will be a non-LPD frame without attacks (=0) or whether it contains an attack or will be coded in LPD (in both cases =1). The wildcard symbol means that the value can be either 0 or 1 and is ignored. As an example, if the encoder decides to encode the next frame in non-LPD mode ([0..]), this frame is reasonably stationary (no attacks, [.0.]), and the next but one frame will be coded in LPD mode ([..1]), it would travel along the line [001]. The meaning of the digits is summarized in Table 2.

Table 2: Legend for the window state transitions of Figure 10

Digit   Meaning                      Values
x1      Mode decision                0 = non-LPD; 1 = LPD
x2      Attack in present frame?     0 = No; 1 = Yes; wildcard = ignored
x3      Following frame              0 = attack-free, non-LPD; 1 = attack or LPD mode; wildcard = ignored

6. CONCLUSION

This paper presented a new family of windows designed for transitions between an LPC-domain codec such as AMR-WB+ and a purely transform based codec such as HE-AAC.
The major problem solved was the transition between a frequency domain codec with time domain aliasing (such as both wLPT and AAC) and a time domain codec which is normally incompatible with time domain aliasing because of its use of long-term prediction (ACELP). The proposed family of windows provides smooth transitions (no blocking artifacts) and introduces a minimum of overhead (in terms of the number of samples coded twice during the overlap regions and in terms of bits transmitted). This undoubtedly contributed to the success of the candidate technology selected as the reference model version 0 of the MPEG unified speech and audio codec USAC.

7. REFERENCES

[1] ISO/IEC JTC1/SC29/WG11 MPEG2007/N9519, Call for Proposals on Unified Speech and Audio Coding
[2] 3GPP Technical Specification TS26.290, Audio codec processing functions; Extended Adaptive Multi-Rate - Wideband (AMR-WB+) codec; Transcoding functions, March 2007
[3] 3GPP Technical Specification TS26.171, Adaptive Multi-Rate Wideband (AMR-WB) speech codec; General description, 2002
[4] ISO/IEC 14496-3, Information technology: Coding of audio-visual objects, Part 3: Audio
[5] J. Princen and A. Bradley, Analysis/Synthesis Filter Bank Design Based on Time Domain Aliasing Cancellation, IEEE Trans. on Acoustics, Speech and Signal Processing, vol. 34, no. 5, Oct. 1986
[6] ISO/IEC 23003-1:2007, Information technology, MPEG audio technologies, Part 1: MPEG Surround
[7] M. Neuendorf et al., A Novel Scheme for Low Bitrate Unified Speech and Audio Coding - MPEG RM0, paper submitted to the 126th AES Convention, Munich, Germany, May 2009
[8] M. Neuendorf, P. Gournay, M. Multrus, J. Lecomte, B. Bessette, R. Geiger, S. Bayer, G. Fuchs, J. Hilpert, N. Rettelbach, R. Salami, G. Schuller, R. Lefebvre, and B. Grill, Unified Speech and Audio Coding Scheme for High Quality at Low Bitrates, accepted for publication at ICASSP 09, Taipei, Taiwan