MPEG-4 Advanced Audio Coding

Peter Doliwa

Abstract

The goal of the MPEG-4 Audio standard is to provide a universal toolbox for transparent and efficient coding of natural audio signals for many different application areas. First, the MPEG-2 Advanced Audio Coder, the core of MPEG-4 Audio, is described, followed by the new tools added in MPEG-4 Audio version 1 and version 2 to improve coding efficiency and perceived audio quality and to add new functionalities such as error robustness, low-delay AAC and fine-grain scalability. Then the standardisation process of MPEG-4 Audio is outlined, together with current and future developments of the standard towards new extensions for lossless audio coding, bandwidth extension of audio signals and parametric audio coding of wide-band signals. Finally, some examples of existing standards based on MPEG-4 Audio and of implementations of the MPEG-4 specifications are given, showing the variety of its application areas.

1 Introduction

MPEG-4 Advanced Audio Coding (AAC) is a universal coding toolbox for transparent coding of general audio signals. It is not optimized for a specific source, but instead uses information about the signal receiver (the human auditory system). A psychoacoustic model is used to simulate the ability of the human auditory system to perceive different frequencies: tones at different frequencies with equal power are not perceived as equally loud. The perceptual model is also used to model masking, where a loud tone masks quieter tones and quantization noise around its frequency. The perceivable frequency range is divided into several frequency bands; each band of the signal spectrum is then analyzed and a masking threshold is calculated. Quantization needs only as many bits as are required to keep the introduced quantization noise below the calculated masking threshold.
In comparison to the first MPEG-2 multichannel audio standard, MPEG-2 AAC offers higher quality at lower bitrates, because it is not restricted to being backward compatible with MPEG-1. Many new perceptual audio technologies have been developed since the standardization of older formats like MPEG-1 Layer 3 (MP3). They allow much higher coding efficiency, so the MP3 format is outdated in terms of coding efficiency. MPEG-2 AAC shows excellent encoding performance at very low bitrates in addition to its efficiency at standard bitrates, so it was selected as the core of the MPEG-4 general audio (time/frequency) coder.

The main principle of MPEG-4 is universality, i.e. one universal standard with several different tools optimized for different kinds of applications and bitrates (quality). Its tools offer several new functionalities such as low delay, error robustness and scalability in addition to the standard compression functionality of older standards. To reach this universality, the interoperability between all these different tools is very important. This universality makes it possible to use one standard (MPEG-4 Audio) for every kind of application. Its usage is therefore not as limited as that of older standards such as MP3: it can be adapted to an application by selecting only the needed tools out of the predefined tools in MPEG-4. Another advantage in comparison to older standards is the expandability of the standard: new developments are still being made for MPEG-4 to provide new tools for even
more applications. The predefined profiles, each optimized for an important application, define the tools used for that application. Possible applications for MPEG-4 Audio are internet streaming or downloads, digital radio broadcast, digital satellite and cable broadcast, portable players, data storage (audio), multimedia services in third generation mobile phone and wireless networks, and bidirectional communications.

Figure 1: Masking threshold

2 MPEG-4 General Audio Coding Tools

2.1 MPEG-2 Advanced Audio Coding (AAC)

MPEG-2 AAC is based on time/frequency audio coding, which exploits linear correlation between subsequent samples (redundancy reduction), and uses a perceptual model of the human auditory system to remove unperceivable signal parts (irrelevancy removal).

2.1.1 AAC-Filterbank

The filterbank maps the signal samples into a spectral representation using a modified discrete cosine transform (MDCT) with critical subsampling (one spectral coefficient per sample) and overlapping subsequent analysis windows. A long window gives efficient redundancy reduction and efficient coding of stationary tonal signals. For transient signals, however, a long window would violate the conditions for temporal masking, because quantization noise is distributed evenly over the whole MDCT window while the masking threshold can vary within that time. The MDCT window size is therefore switched adaptively. There are two different resolutions in AAC, one with 1024 spectral coefficients (one long window) and one with eight sets of 128 coefficients (eight short windows); switching between them is supported through transition windows. The encoder also selects the optimal shape for each of these windows between a Kaiser-Bessel-derived window (KBD) with improved far-off rejection of its filter response and a sine window with a wider main lobe.
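The filterbank described in section 2.1.1 can be illustrated with a small sketch. The code below is not the AAC filterbank itself: it is a direct, unoptimized MDCT with a sine window, and the function names, the tiny block size and the zero padding at the signal edges are assumptions made for this illustration. It shows the two properties named in the text: each frame of 2N overlapping samples yields only N coefficients (critical subsampling), and overlap-adding the inverse transforms of neighbouring frames restores the input.

```python
import math

def sine_window(length):
    """Sine window; w[n]^2 + w[n + length/2]^2 == 1, which overlap-add needs."""
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def mdct(block, window):
    """Map 2N windowed time samples to N spectral coefficients."""
    n = len(block) // 2
    shift = 0.5 + n / 2.0
    return [
        sum(window[t] * block[t] * math.cos(math.pi / n * (t + shift) * (k + 0.5))
            for t in range(2 * n))
        for k in range(n)
    ]

def imdct(coeffs, window):
    """Map N coefficients back to 2N windowed, time-aliased samples."""
    n = len(coeffs)
    shift = 0.5 + n / 2.0
    return [
        2.0 * window[t] / n
        * sum(coeffs[k] * math.cos(math.pi / n * (t + shift) * (k + 0.5))
              for k in range(n))
        for t in range(2 * n)
    ]

def analysis_synthesis(signal, n):
    """Run overlapping MDCT/IMDCT frames (hop size N) and overlap-add them."""
    if len(signal) % n != 0:
        raise ValueError("signal length must be a multiple of n")
    window = sine_window(2 * n)
    # Assumption of this sketch: pad with N zeros so every sample is
    # covered by exactly two overlapping frames.
    padded = [0.0] * n + list(signal) + [0.0] * n
    out = [0.0] * len(padded)
    for start in range(0, len(padded) - n, n):
        frame = padded[start:start + 2 * n]
        for t, value in enumerate(imdct(mdct(frame, window), window)):
            out[start + t] += value  # aliasing cancels between neighbours
    return out[n:-n]
```

With N = 8, `analysis_synthesis(signal, 8)` returns the input up to rounding error. AAC uses N = 1024 for long windows and N = 128 for short windows, and a real implementation would use a fast algorithm instead of the direct sums shown here.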

Figure 2: AAC encoder

2.1.2 Quantization

AAC uses a nonuniform power-law quantizer: smaller values are quantized more finely and larger values more coarsely, so that the quantization noise is stronger at larger values, where it is more easily masked. Scalefactors scale the spectral coefficients before quantization and thereby control the power of the introduced quantization noise. The spectral coefficients are grouped into scalefactor bands so that different frequency bands can use different scalefactors. All scalefactors are differentially Huffman encoded, i.e. only the difference between the values of subsequent bands is coded.

2.1.3 Noiseless Coding

The noiseless coding stage uses sectioning and Huffman coding (entropy coding) and exploits statistical redundancy to encode the 1024 coefficients efficiently without further loss of information. One section can comprise several subsequent scalefactor bands, which use the same Huffman codebook to minimize the resulting bitrate. Several predefined 2- and 4-dimensional Huffman codebooks are available, optimized for different distribution statistics.

2.1.4 Temporal Noise Shaping (TNS)

The TNS tool allows finer temporal shaping of the introduced quantization noise, which is needed for transient and pitched signals (section 2.1.1). Signals with a nonflat spectral envelope are time correlated and can be encoded efficiently by predictive coding of the time signal or by coding spectral coefficients, while signals with a nonflat time structure can be coded efficiently by predictive coding of the spectral coefficients or by coding time domain values. Predictive coding of spectral coefficients adapts the temporal shape of the introduced quantization noise to the temporal shape of the input signal and resolves the problem of the varying masking threshold of transient or pitched signals.
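As an illustration of the power-law quantization from section 2.1.2, the following sketch quantizes a single spectral coefficient. The exponent 3/4 and the gain step of 2^(1/4) per scalefactor increment follow the commonly described AAC quantizer and are not stated in the text above; the plain round-to-nearest rule, the function names and the omission of any scalefactor offset are simplifying assumptions of this example.

```python
def quantize(x, scalefactor):
    """Power-law quantization of one spectral coefficient (illustrative)."""
    gain = 2.0 ** (scalefactor / 4.0)      # larger scalefactor: coarser steps
    magnitude = (abs(x) / gain) ** 0.75    # compress large values
    q = int(magnitude + 0.5)               # round to nearest (assumption)
    return -q if x < 0 else q

def dequantize(q, scalefactor):
    """Inverse mapping used on the decoder side of this sketch."""
    gain = 2.0 ** (scalefactor / 4.0)
    magnitude = abs(q) ** (4.0 / 3.0) * gain
    return -magnitude if q < 0 else magnitude
```

The spacing between neighbouring reconstruction values grows with the magnitude: `dequantize(101, 0) - dequantize(100, 0)` is several times larger than `dequantize(2, 0) - dequantize(1, 0)`. This is the coarser quantization of larger values described above, and raising the scalefactor of a band increases the noise in that band as a whole.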

2.1.5 Prediction

The AAC coder can exploit the time redundancy of stationary or periodic signals only to a limited extent, because the MDCT window size is limited. The AAC prediction tool therefore allows more efficient redundancy reduction for long-term periodic or stationary signals. The predictor estimates the current spectral coefficient from the corresponding spectral coefficients of the preceding two frames (backward prediction), and only the prediction errors (the difference between the actual and the predicted value) need to be transmitted.

2.1.6 Joint Stereo Coding

AAC joint stereo coding reduces the bitrate needed for stereo or multichannel signals more efficiently than separate coding of the individual channels. Two different joint stereo methods can be selected for different frequency bands to optimize the resulting bitrate: M/S stereo coding and intensity stereo coding. M/S stereo coding is very efficient for near-monophonic signals, because it uses a sum (M or middle) and a difference (S or side) channel instead of left and right channels, and the difference signal is very small in this case. Another advantage of M/S stereo coding for near-monophonic signals is its grouping of channel pairs on a left/right axis, which avoids spatial unmasking (different masking thresholds in space because of the phase and noise). Intensity stereo coding uses equal energy-time envelopes for all channels (the same spectral coefficients) and only scales them differently for the different channels. The result is perceived as almost identical to the original signal and lowers the needed bitrate. AAC offers two different intensity stereo coding modes: AAC intensity stereo coding uses a restricted channel-pair concept as in M/S stereo coding, while AAC coupling channels make it possible to share common spectral coefficients between arbitrary channels.
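The M/S principle from section 2.1.6 can be sketched in a few lines. The code below is only an illustration: it operates on plain lists of coefficients, the normalization by 1/2 and the function names are assumptions of this example, and the per-band selection between L/R, M/S and intensity coding is left out.

```python
def ms_encode(left, right):
    """Convert left/right coefficients to mid (sum) and side (difference)."""
    mid = [(l + r) / 2.0 for l, r in zip(left, right)]
    side = [(l - r) / 2.0 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Recover left/right from mid/side; exact inverse of ms_encode."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

def energy(values):
    """Sum of squares, a rough indicator of the bits a channel will need."""
    return sum(v * v for v in values)
```

For a near-monophonic pair such as `left = [1.0, -2.0, 3.0]` and `right = [1.0, -2.0, 2.5]`, the side channel becomes `[0.0, 0.0, 0.25]`. Its energy is far below that of either input channel, which is why coding M and S is cheaper than coding L and R in this case.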
2.2 MPEG-4 Extensions to AAC

MPEG-4 AAC offers some new tools to improve the coding efficiency and performance of AAC and to add some new functionalities.

2.2.1 Perceptual Noise Substitution (PNS)

The PNS tool increases the coding efficiency of AAC by representing noiselike signal components with a compact parametric representation instead of coding the exact waveform. Each noiselike scalefactor band is represented by a noise substitution flag and the total power of its spectral coefficients; the coefficients themselves are neither quantized nor transmitted. The decoder replaces these coefficients with random numbers scaled to the received total power.

2.2.2 Long-Term Prediction (LTP)

Another tool to improve the coding efficiency of AAC is the LTP tool. It exploits time redundancy between the current and the preceding frame (backward prediction). The spectral coefficients of the preceding frame are remapped into the time domain, filtered by an inverse TNS filter and matched to the current signal to find the best prediction parameters (delay and gain), from which the predicted signal is derived. Then the spectral representations of the predicted and the current signal are TNS filtered and subtracted from each other to get a residual signal. A frequency selective switch chooses either the residual or the original signal for each scalefactor band: whichever needs the smaller bitrate is passed on for further coding.
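The search for the LTP prediction parameters (delay and gain) described in section 2.2.2 can be sketched as a correlation search in the time domain. This is a simplified illustration, not the procedure of the standard: the function names are hypothetical, only lags of at least one frame length are tried so that the predictor reads past samples exclusively, and the frequency-domain residual with its per-band switch is not modelled.

```python
def ltp_search(history, frame, max_lag):
    """Return (lag, gain) that best predicts `frame` from past samples.

    Only lags >= len(frame) are tried, so the candidate segment lies
    entirely inside `history` (a simplification made for this sketch).
    """
    n = len(frame)
    best_lag, best_gain, best_score = None, 0.0, 0.0
    for lag in range(n, min(max_lag, len(history)) + 1):
        start = len(history) - lag
        candidate = history[start:start + n]
        cand_energy = sum(c * c for c in candidate)
        if cand_energy == 0.0:
            continue
        corr = sum(f * c for f, c in zip(frame, candidate))
        score = corr * corr / cand_energy  # error energy removed by this lag
        if score > best_score:
            best_lag, best_gain, best_score = lag, corr / cand_energy, score
    return best_lag, best_gain

def ltp_residual(history, frame, lag, gain):
    """Time-domain prediction error for the chosen lag and gain."""
    start = len(history) - lag
    candidate = history[start:start + len(frame)]
    return [f - gain * c for f, c in zip(frame, candidate)]
```

For a signal with a period of 10 samples whose current frame is half as loud as the history, the search returns a lag of 10 and a gain of 0.5, and the residual vanishes. For such a frame the encoder would prefer the residual over the original signal.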

Figure 3: Perceptual noise substitution

Figure 4: Long-term prediction

2.2.3 TwinVQ

MPEG-4 also adds a coding kernel as an alternative to the MPEG-2 AAC coding process: the Transform-Domain Weighted Interleave Vector Quantization (TwinVQ), designed for good coding performance at extremely low bitrates. First the spectral coefficients are normalized to a specified amplitude range, then they are interleaved and divided into subvectors. To keep the quantization noise under the masking threshold, lower frequency coefficients need more quantization bits than higher frequency coefficients, so the coefficients are interleaved to obtain subvectors with a constant number of quantization bits. The subvectors are then vector quantized using an optimized codebook, with the codebook entry selected through a weighted distortion measure.
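The effect of the interleaving step can be shown with a minimal sketch. It assumes the simplest possible rule, in which subvector k collects every K-th coefficient starting at index k; the interleaving actually specified for TwinVQ is more elaborate, and the function names and the example bit demands are made up for this illustration.

```python
def interleave(coeffs, num_subvectors):
    """Distribute coefficients round-robin over the subvectors."""
    return [coeffs[k::num_subvectors] for k in range(num_subvectors)]

def deinterleave(subvectors):
    """Inverse of interleave: restore the original coefficient order."""
    count = len(subvectors)
    out = [0] * sum(len(v) for v in subvectors)
    for k, vector in enumerate(subvectors):
        out[k::count] = vector
    return out
```

If the bit demand per coefficient falls with frequency, for example `[6, 6, 6, 5, 5, 5, 4, 4, 4, 3, 3, 3]`, then `interleave(demand, 3)` yields three subvectors `[6, 5, 4, 3]` that each need 18 bits. Every subvector mixes low and high frequencies, so all of them can be quantized with the same number of bits.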

Figure 5: TwinVQ interleaving

2.2.4 Low-Delay AAC (AAC-LD)

The algorithmic delay of the standard MPEG-4 T/F coder of up to several hundred milliseconds is too high for realtime applications like bidirectional communication. MPEG-4 AAC therefore offers a low-delay audio coding mode with a reduced frame length (512/480 samples instead of 1024/960), which shortens the analysis window. Window switching is not supported, which avoids the look-ahead delay needed to decide which window to use. An additional window shape, the low-overlap window, is used for transient signals to improve TNS performance, and the bit reservoir is minimized or not used at all to further reduce the delay.

Figure 6: Low-delay AAC compared to standard AAC

2.2.5 Error Robustness

Improved error robustness is achieved by reducing the perceived degradation of the decoded audio signal caused by bit errors. The virtual codebook tool (VCB11) enhances the error resilience of scalefactor bands with large spectral coefficients, because bit errors in these bands
can be perceived more easily. Virtual codebooks with different maximum values are used to detect errors that lead to excessively high values, which can then be concealed. Reversible variable length coding uses symmetric code words to allow forward and backward decoding of the scalefactors and thus improves error resilience, while the Huffman codeword reordering (HCR) tool places priority codewords at predefined positions in the bitstream, so that error propagation is avoided for the most important spectral coefficients.

2.3 Scalable Audio Coding

Scalable audio coding allows a single bitstream to be received at different bitrates depending on the actual transmission capacity of the channel. Two different tools provide scalable audio coding for MPEG-4: large-step scalable audio coding and bit-sliced arithmetic coding. In large-step scalable audio coding, the input signal is coded by a first coder (the base layer coder); each following layer then codes the residual between the original input signal and the decoded signal of the preceding layer. Each enhancement layer is optional: it is not needed to decode the signal, but it improves the perceived audio quality. Decoding additional layers can improve coding precision (SNR), increase the signal bandwidth and/or add stereo information to monophonic signals. Bit-sliced arithmetic coding (BSAC) is used to avoid the overhead caused by side information in enhancement layers with very low bitrate. The absolute values of the spectral coefficients are processed in slices from most significant bit (MSB) to least significant bit (LSB). The first slice contains the MSBs of all coefficients, beginning with low frequencies and ending with high frequencies. The sign bit of each spectral value is inserted directly after its first 1 bit. All bit slices are then entropy encoded (arithmetic coding) with minimal redundancy by an optimized BSAC coding model.
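The bit-slicing idea behind BSAC can be sketched as follows. The example only covers the decomposition of coefficient magnitudes into bit planes and the reconstruction from a truncated set of planes; sign bits, the arithmetic coder and the BSAC coding model are left out, and the function names are assumptions of this illustration.

```python
def bit_slices(magnitudes, num_bits):
    """Split non-negative integers into bit planes, MSB plane first."""
    return [[(m >> bit) & 1 for m in magnitudes]
            for bit in range(num_bits - 1, -1, -1)]

def reconstruct(slices, num_bits, num_values):
    """Rebuild magnitudes from the received planes; missing LSBs stay zero."""
    values = [0] * num_values
    for index, plane in enumerate(slices):
        shift = num_bits - 1 - index
        for position, bit in enumerate(plane):
            values[position] |= bit << shift
    return values
```

For the magnitudes `[5, 3, 0, 6]` with 3 bits, the planes are `[1, 0, 0, 1]`, `[0, 1, 0, 1]` and `[1, 1, 0, 0]`. Decoding only the first two planes gives `[4, 2, 0, 6]`, a coarser version of the same spectrum, and each further plane refines it. Fine-grain scalability comes from truncating the stream between planes in this way.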
2.4 Parametric Audio Coding

The HILN (harmonic and individual lines plus noise) parametric audio coder is designed for coding general audio signals at very low bitrates. As in the AAC filterbank, frames are generated by overlapping analysis windows. These frames are then analysed for individual sinusoids (described by frequency and amplitude), harmonic tones (described by their fundamental frequency, amplitude and the spectral envelope of their partials) and noise components (described by their amplitude and spectral envelope). To minimize the resulting bitrate, a perceptual model is used to select the most relevant components, which are then transmitted.

3 Standardization and Implementations

3.1 Standardization

MPEG-4 Audio is standardised as ISO/IEC 14496-3 by the International Organization for Standardization and reached final draft international standard status in October 1998. Version 1 extends MPEG-2 AAC through enhancements for improved coding efficiency such as perceptual noise substitution (PNS), long-term prediction (LTP) and the TwinVQ coding core, and through new functionalities such as large-step scalable audio coding. MPEG-4 standards only define the bitstream syntax of the various audio object types and the decoding processes in terms of a set of tools, but not the encoding processes. MPEG-4 Audio Version 2 was approved as final draft international standard in December 1999 and extends version 1 through new functionalities such as fine granularity bitrate scalability (bit-sliced arithmetic coding in addition to large-step scalable
audio coding), error robustness (virtual codebook tool, reversible variable length coder and Huffman codeword reordering), parametric general audio coding (HILN) and low-delay audio coding. MPEG-4 standardises general audio coding at bitrates from 6 kbit/s up to 64 kbit/s and sampling rates from 8 kHz up to 96 kHz, with MPEG-2 AAC as the standard coder for general audio.

Figure 7: HILN parametric encoder

Figure 8: Bitrates covered by the MPEG-4 Audio coders

There are also standardization efforts using MPEG-4 by the IETF, which is trying to develop a full Internet Standard Protocol for real-time transmission of MPEG-4 Audio and Video over the Real-time Transport Protocol (RTP). RFC 3016 describes the RTP payload format for MPEG-4 Audio and Visual bitstreams without using MPEG-4 Systems synchronisation and stream management. It can be used for systems with their own stream management and has
the advantage that these payloads can be handled in the same way as other payload formats for non-MPEG-4 audio. Its disadvantage is the lack of compatibility with other systems based on the MPEG-4 Systems specifications. RFC 3640 defines the payload format for MPEG-4 elementary streams such as MPEG-4 Audio, Video and Systems (e.g. binary format for scenes BIFS, object descriptor OD, intellectual property management and protection IPMP) bitstreams. This RTP payload is simple to implement, very efficient and allows interleaving to increase error resilience.

3.2 Current Developments

Current developments for the further extension of MPEG-4 Audio are the bandwidth extension of audio signals, parametric coding of wide-band signals and lossless audio coding.

The principle of the bandwidth extension of audio signals is to recover the high frequency (> 5 kHz) parts of the input signal from the lower frequency part, which improves the perceived audio quality at a low cost in bits. This technique exploits the fact that the psychoacoustic importance of high frequencies is usually relatively low. Traditional perceptual audio coders such as MPEG-4 AAC reduce the bandwidth of the audio signal at lower bitrates to keep the introduced quantization noise below the masking threshold. Most audio material has a very high correlation between the lower and the higher frequencies of its spectrum. The spectral band replication (SBR) technology proposed for bandwidth extension of MPEG-4 exploits this fact by transposing the lower frequency coefficients to the higher frequencies and adjusting them there, which needs only a small amount of side information.

Parametric coding of wide-band signals complements the existing MPEG-4 standards towards higher quality and bitrates. The HILN parametric coder of MPEG-4 Audio is targeted at very low bitrates (6-16 kbps) only, although the parametric representation of audio data is very efficient and allows easy post-processing (speed and pitch changes).
Standard applications for the technique are internet streaming or download, mobile applications and storage. Easy pitch and speed scaling can be used for games, answering machines, spoken books and music production.

Lossless audio coding for MPEG-4 Audio is being developed to meet the demands of digital audio archiving and to follow the general trend towards high resolution audio. The decoded signal is an exact reconstruction of the input signal at the predefined sampling rate and word length. The algorithm provides efficient lossless data compression (comparable to ZIP) optimized for audio signals. It is designed to work either as a stand-alone coder or in combination with perceptual audio coding. In the combined case, the lossy core coder is complemented by a lossless enhancement layer, and backward compatibility is provided by omitting the lossless enhancement. Scalability can be achieved through continuous enhancement layers until lossless audio quality is reached. For pure lossless coding, the core coder can be omitted (zero kbps core bitrate), which is not backward compatible. Applications for lossless audio coding are lossless archiving, lossless editing in distributed productions, lossless consumer delivery for home archives and scalable lossless streaming depending on channel capabilities. Lossless audio coding is expected to become an international standard by the end of 2004.

3.3 Implementations

Because of its efficiency and universality, MPEG-4 Audio is widely used in different application areas such as internet streaming (audio and video), solid state players, ISDN music transmission, high definition television (HDTV), satellite and terrestrial digital audio broadcasting and audio transmission in third generation mobile networks (UMTS, CDMA2000). Several
other standards (such as 3GPP and 3GPP2 for UMTS/CDMA2000) are based on the MPEG-4 standards.

3.3.1 Coding Technologies aacPlus

Coding Technologies improved the coding efficiency of MPEG-4 AAC through a new technique called spectral band replication (SBR), which is combined with MPEG-4 AAC to create aacPlus. This technique replicates the lower frequency parts of the decoded audio signal to retrieve the higher frequency parts, with only a small amount of side information added. Using SBR with any perceptual t/f audio coder increases its efficiency by up to a factor of two. aacPlus delivers streaming or download 5.1 surround audio at 128 kbps, CD-quality stereo at 48 kbps, excellent quality stereo at 32 kbps, parametric stereo down to 20 kbps and optimized speech or mixed speech with music down to 8 kbps mono. It also features built-in error concealment for wireless mobile applications and the widest available audio bandwidth.

Figure 9: aacPlus efficiency in comparison to other standards (original = 100)

Example applications are third generation mobile and wireless audio and A/V services, internet audio streaming or download, digital radio broadcast, digital satellite and cable broadcast and portable players (especially those with built-in flash memory, for memory efficiency reasons). Supported platforms include Win32, Linux, Mac OS X, and DSPs from TI, Motorola and others. Because of its coding efficiency and error resilience, aacPlus is or will be used for audio broadcasting by XM Satellite Radio and Digital Radio Mondiale.

3.3.2 Fraunhofer IIS

Fraunhofer IIS offers software implementations of the MPEG-4 Audio encoding and decoding algorithms on several platforms, optimized for quality and resource usage. There are three generic kinds of implementations: PC software, core design kit (CDK) software and digital signal processor (DSP) software.
PC-Software: PC software from Fraunhofer IIS is available for a variety of operating systems, mostly supporting x86 compatible or PowerPC platforms with hardware support for
floating point arithmetic. Fraunhofer offers two different MPEG-4 encoders for the PC, the professional encoder and the consumer encoder, and one decoder. The professional encoder (PcEncPro) supports almost all natural MPEG-4 Audio object types at maximum audio coding quality, while the consumer encoder (PcEncCons) is designed to offer minimum encoding time (processing complexity) with no noticeable quality loss compared to the professional encoder. The consumer encoder uses special optimization techniques for Intel Pentium 4 processors by supporting their Streaming SIMD Extensions (SSE) and the Hyper-Threading technology. The decoder (PcDec) decodes all the natural audio object types of MPEG-4, i.e. it is an MPEG-4 compliant natural audio decoder.

CDK-Software: The core design kit software consists of bit-precise reference and template code with optimized memory and processing power requirements. Fraunhofer IIS offers one version that is directly compilable for 16-bit/32-bit fixed point processors such as ARM, MIPS, PowerPC, ADI, TI, Motorola, etc. Another version contains template code for DSPs with fractional or integer arithmetic of any word length.

DSP-Software: The DSP software contains highly optimized source code or libraries for several DSPs and allows different levels of support or integration.

3.3.3 FAAC/FAAD2

AudioCoding.com provides free MPEG-2/4 AAC encoders, decoders and tools (e.g. an ID3v2 tag tool). The encoder FAAC currently supports the MPEG-2 Main and Low Complexity and the MPEG-4 LTP (long-term prediction), Main and LC (low complexity) audio object types. The decoder FAAD2 is the fastest ISO AAC audio decoder available. It supports the MPEG-2/4 Main, LC, HE (high efficiency), LTP, LD (low delay) and ER (error resilience) audio object types and can be used for Digital Radio Mondiale (DRM) with a few changes. AudioCoding.com also provides a plugin for the Winamp 5 player that supports high efficiency AAC in addition to the built-in AAC support of Winamp 5.
3.3.4 Apple QuickTime

Apple QuickTime features native support for MPEG-4 Audio, because the MPEG-4 file format is based on the flexible (extensible) QuickTime file format. The new version of QuickTime (version 6.5) allows the import, export and playback of MP4, 3GPP and 3GPP2 content with an AAC codec built upon signal processing technology from Dolby Laboratories. Apple offers several other products using MPEG-4 Audio for a variety of applications. The QuickTime browser plugin displays streamed MPEG-4 media embedded in web pages, QuickTime 6 Pro is an MPEG-4 authoring tool, QuickTime Broadcaster is designed for MPEG-4 live encoding and broadcasting, and the QuickTime Streaming Server 5 can be used for streaming MPEG-4 content.

3.3.5 Stego-lame

Another open source project is called stego-lame. Stego-lame is developing steganography tools for the analysis and synthesis of audio files in formats such as MP3, Ogg Vorbis, MPEG-2/4 AAC and G.72x.

3.3.6 BonkEncoder

The BonkEncoder is an open source audio CD ripper and an encoder for different audio formats. It currently supports Ogg Vorbis, MPEG-2/4 AAC, MP3 and Bonk files; more formats can be added through plugins.

4 Summary

MPEG-4 Audio provides several different interoperable tools that improve the coding efficiency of MPEG-2 AAC and adds new functionalities to provide a standard for many different kinds of applications. MPEG-2 AAC is the core coder for MPEG-4 Audio, a powerful time/frequency multichannel coder that combines redundancy reduction with a perceptual model for irrelevancy removal. The PNS tool added by MPEG-4 reduces the bit requirements of noiselike signal components by parametric coding, while the LTP tool replaces the prediction tool of MPEG-2 AAC and exploits the redundancy of stationary signals even further. For extremely low bitrates, MPEG-4 offers an alternative quantizer/coder, the TwinVQ coder, which can be used for scalable audio coding. New functionalities added by MPEG-4 are bitrate scalability, error resilience and low-delay audio coding. Large-step scalability is reached through additional enhancement layers and can improve coding precision, signal bandwidth and the number of channels, while fine granularity scalability enables enhancements in small steps down to 1 kbps per enhancement layer. The error resilience tools improve the received signal quality over error prone channels (e.g. wireless applications, broadcasting). Low-delay AAC is designed for applications demanding low algorithmic delay (e.g. bidirectional communications) without significant quality losses. With the HILN parametric coder, MPEG-4 reaches bitrates down to 6 kbps by decomposing the input signal into individual sinusoids, harmonic tones and noise components.
These tools are components of two final draft international standards of the International Organization for Standardization (ISO/IEC 14496-3 MPEG-4 Audio version 1 and 2), which define the bitstream syntax and the decoding processes, but new technological developments are still being made for MPEG-4 Audio. The first extension, bandwidth enhancement of audio signals using the spectral band replication (SBR) technology, allows more efficient audio coding by omitting the spectral data of the higher frequencies and using adapted data recovered from the lower frequencies instead. Another extension uses parametric audio coding for signals of higher bitrate and quality than the HILN coder, to extend its advantages, such as coding efficiency and easy post-processing, to higher bitrates. Finally, lossless enhancements for lossless audio archiving and distributed productions are being developed for even more applications of the MPEG-4 standards (broadband applications based on MPEG-4).

There are many implementations of the MPEG-4 Audio standard for different applications, supporting many platforms. Fraunhofer IIS offers encoders and decoders for professionals and consumers on many platforms and optimized code for several DSPs, while Coding Technologies combined AAC and SBR into aacPlus, which is used by digital satellite and digital terrestrial radio broadcast.
