Chapter 4: Audio Coding


1 Chapter 4: Audio Coding
Lossy and lossless audio compression: Traditional lossless data compression methods usually do not work well when applied directly to audio signals. Many audio coders are therefore lossy, e.g., ADPCM, MP3 and MPEG AAC. For high-fidelity audio, lossless coding is preferred, e.g., MLP compression for DVD-Audio and MPEG-4 ALS coding.
Issues of compression performance: Lossy audio coding generally achieves much higher compression than lossless audio coding; for example, MP3 has an average compression ratio of about 12, while MPEG-4 ALS achieves an average compression ratio of only about 2.
Issues of audio quality: Lossless coding reproduces the original audio signal exactly, so there is no quality issue. For lossy coding, the reproduced signal contains distortion (coding error or noise), which may be audible and degrade the perceived quality when listening to it. A quality measure is therefore needed to rate lossy coders: SNR and spectral distortion are objective measures, the Mean Opinion Score is a subjective measure, and PESQ is an objective algorithm designed to predict subjective scores.

2 Linear Prediction for Audio Coding
Linear prediction has been widely applied in speech and audio processing and coding, for example the backward predictor in ADPCM, LTP and TNS in MPEG-2/4 AAC, the high-order predictor in MPEG-4 ALS, and the predictor/decorrelator in MLP coding.
Basic concept of linear prediction: consider predicting the value of a stationary random process $x(n)$ one step ahead from a weighted linear combination of its past values. The predicted value is
$\hat{x}(n) = \sum_{k=1}^{P} a_k\, x(n-k)$,
where the $a_k$ are the prediction coefficients of the one-step linear predictor of order $P$. The difference between $x(n)$ and $\hat{x}(n)$ is the prediction error $e(n) = x(n) - \hat{x}(n)$. The mean-square value of the linear prediction error is
$E = \mathbb{E}\!\left[e^2(n)\right] = r(0) - 2\sum_{k=1}^{P} a_k\, r(k) + \sum_{j=1}^{P}\sum_{k=1}^{P} a_j a_k\, r(j-k)$,
where $r(m)$ is the autocorrelation of $x(n)$. Note that $E$ is a quadratic function of the predictor coefficients.

3 Linear Prediction for Audio Coding
The objective of linear predictive analysis is to obtain the prediction coefficients $a_k$ such that the mean-square error $E$ is minimized. Since $E$ is quadratic, setting $\partial E / \partial a_k = 0$ for $k = 1, \ldots, P$ yields the optimum prediction coefficients as the solution of the set of linear equations
$\sum_{k=1}^{P} a_k\, r(j-k) = r(j), \quad j = 1, \ldots, P.$
These are called the normal equations for the coefficients of the linear predictor and can be written in matrix form as $\mathbf{R}\mathbf{a} = \mathbf{r}$, where $\mathbf{R}$ is the $P \times P$ autocorrelation (Toeplitz) matrix. The optimum predictor can be computed by inverting the matrix, i.e., $\mathbf{a} = \mathbf{R}^{-1}\mathbf{r}$. A fast algorithm such as the Levinson-Durbin recursion can be used to solve the system efficiently.
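As an illustration of how the normal equations can be solved in practice, here is a minimal Python sketch of the Levinson-Durbin recursion; the function name and the random test signal are illustrative choices, not part of any standard.

```python
import numpy as np

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations R a = r for the LP coefficients.

    r: autocorrelation values r[0..order]; returns (a, prediction-error power).
    """
    a = np.zeros(order)
    err = r[0]
    for m in range(order):
        # reflection coefficient for the order-(m+1) predictor
        k = (r[m + 1] - np.dot(a[:m], r[m:0:-1])) / err
        a_new = a.copy()
        a_new[m] = k
        a_new[:m] = a[:m] - k * a[m - 1::-1]
        a = a_new
        err *= (1.0 - k * k)
    return a, err

# toy usage: fit a 4th-order predictor to a windowed signal segment
x = np.convolve(np.random.randn(1024), [1.0, 0.8, 0.4], mode="same")  # correlated noise
xw = x * np.hamming(len(x))                                           # window before autocorrelation
r = np.array([np.dot(xw[:len(xw) - m], xw[m:]) for m in range(5)])
a, e = levinson_durbin(r, 4)
print("prediction coefficients:", a, "residual power:", e)
```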

4 Linear Prediction for Audio Coding
In practice, to compute the autocorrelation we apply a window, e.g., a Hamming window, so that the analysed signal segment has finite duration.
Linear prediction can be viewed as linear filtering, where the predictor is embedded in a linear filter: the prediction-error filter, also called the inverse prediction filter. This FIR filter produces the prediction error as its output. The z-transform relationship between input and output is
$E(z) = A(z)\,X(z)$, where $A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$.
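A minimal sketch of running a signal through the prediction-error filter $A(z)$ and its inverse, assuming the coefficients `a` were obtained as in the Levinson-Durbin example above; the helper names are illustrative.

```python
import numpy as np
from scipy.signal import lfilter

def prediction_error(x, a):
    """Filter x with A(z) = 1 - sum_k a_k z^{-k} to obtain the LP residual."""
    b = np.concatenate(([1.0], -np.asarray(a)))   # FIR numerator of the inverse filter
    return lfilter(b, [1.0], x)

def lp_synthesis(e, a):
    """Apply 1/A(z) to the residual; exactly undoes the prediction-error filtering."""
    b = np.concatenate(([1.0], -np.asarray(a)))
    return lfilter([1.0], b, e)

x = np.sin(2 * np.pi * 0.05 * np.arange(512)) + 0.01 * np.random.randn(512)
a = [1.6, -0.9]                                    # example 2nd-order coefficients
e = prediction_error(x, a)
assert np.allclose(lp_synthesis(e, a), x)          # analysis and synthesis are exact inverses
print("residual energy / signal energy:", np.sum(e**2) / np.sum(x**2))
```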

5 Lossy Audio Coding
Waveform coding vs psychoacoustic coding: Waveform coding techniques try to code the signal as faithfully as possible so that the reproduced signal is close to the original waveform (or spectrum), e.g., ADPCM and subband coding. The compression performance of waveform coding is generally lower than that of psychoacoustic coding, with typical compression ratios from 2 to 4.
Psychoacoustic coding techniques exploit human psychoacoustic properties such as temporal and frequency masking so that the coding noise falls below the perceptual masking levels and becomes inaudible; MP3 and MPEG-2 AAC are well-known psychoacoustic audio coders. The compression ratios achieved are generally higher than those of waveform coders, typically 6 to 20.

6 Audio Waveform Coding
Adaptive Differential Pulse Code Modulation (ADPCM)
Tries to code the signal waveform as closely to the original as possible. The difference between an input sample and a predicted sample is quantized and encoded. The quantization levels are adapted to the signal strength so that the quantization errors are minimized. Because of the quantization, ADPCM is a lossy coding technique. The compression ratio achieved by ADPCM is modest, between 2 and 4; quantization noise becomes prominent and the audio quality drops significantly if the compression ratio is pushed further.
Some international standards:
ITU G.721, 32 kbps ADPCM at 8 kHz sampling frequency
ITU G.722, 64 kbps subband ADPCM, with the signal split into two subbands
ITU G.726, 40, 32, 24 and 16 kbps at 8 kHz sampling frequency
Digital Theater Systems (DTS), subband ADPCM with some perceptual coding, 5.1 channels at 1.5 Mbps

7 Adaptive Differential Pulse Code Modulation (ADPCM)
[Figure: G.726 ADPCM encoder and decoder block diagrams]

8 Adaptive Differential Pulse Code Modulation (ADPCM)
Principle of the ADPCM coder: A difference signal is obtained by subtracting an estimate of the input signal, $\hat{s}(n)$, from the input signal itself, $s(n)$. An adaptive linear quantizer quantizes the difference signal, and the quantized codeword is transmitted to the decoder. An inverse quantizer produces a quantized difference signal $d_q(n)$ from the codeword. The signal estimate is added to the quantized difference signal to produce the reconstructed version of the input signal. Both the reconstructed signal and the quantized difference signal are operated upon by an adaptive predictor, which produces the estimate of the input signal, thereby closing the feedback loop. The ADPCM decoder contains a structure identical to the feedback portion of the encoder.
In the ITU 32 kbit/s ADPCM coding standard, the adaptive predictor consists of a pole-zero filter with 2 poles and 6 zeros. The G.726 standard defines a multiplier constant $\alpha$ that changes for every difference value, depending on the current scale of the signal. A scaled difference signal is defined as
$e(n) = s(n) - \hat{s}(n), \qquad g(n) = e(n)/\alpha,$
where $\hat{s}(n)$ is the predicted signal value. $g(n)$ is then sent to the quantizer. By changing $\alpha$, the quantizer can adapt to changes in the range of the difference signal.

9 Adaptive Differential Pulse Code Modulation (ADPCM)
Backward adaptive quantizer. Basic principle: if too many values are quantized to levels far from zero, the quantizer step size is too small; if too many values fall close to zero too much of the time, the step size is too large. A backward adapter adjusts the step size after receiving just one output: it expands the step size if the quantized input falls in the outer levels of the quantizer and reduces it if the input is near zero. This is done by assigning a multiplier $M_l$ to each quantizer level $l$, with values smaller than unity for levels near zero and values larger than unity for the outer levels. The step size for sample $n$ is updated from the quantized level $l(n-1)$ of the previous sample by the simple formula
$\Delta(n) = M_{l(n-1)}\,\Delta(n-1).$
G.726 uses fixed quantizer decision levels based on the logarithm of the scaled difference signal $g(n)$.
[Figure: G.726 quantizer characteristic]
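A minimal sketch of such a backward-adaptive (Jayant) quantizer in Python; the 2-bit quantizer and the multiplier table are illustrative values, not those of G.726.

```python
import numpy as np

def jayant_quantize(d, step0=0.5, multipliers=(0.8, 1.6)):
    """2-bit backward-adaptive (Jayant) quantizer for a difference signal d.

    Returns (codes, reconstructed values). A decoder can run the same step-size
    recursion because it is driven only by the transmitted codes.
    """
    step = step0
    codes = np.empty(len(d), dtype=int)
    recon = np.empty(len(d))
    for i, x in enumerate(d):
        level = 0 if abs(x) < step else 1            # inner / outer magnitude level
        sign = 1.0 if x >= 0.0 else -1.0
        codes[i] = level + (0 if sign > 0 else 2)    # 4 codewords in total
        recon[i] = sign * (level + 0.5) * step       # mid-rise reconstruction value
        step = float(np.clip(step * multipliers[level], 1e-4, 10.0))  # shrink or expand
    return codes, recon

d = np.random.randn(2000) * np.repeat([0.2, 2.0], 1000)   # quiet half, loud half
codes, dq = jayant_quantize(d)
print("SNR (dB):", 10 * np.log10(np.sum(d ** 2) / np.sum((d - dq) ** 2)))
```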

10 Adaptive Differential Pulse Code Modulation (ADPCM)
G.726 backward predictor: the signal estimate is computed from the reconstructed signal $s_r(n)$ and the quantized difference signal $d_q(n)$ as
$\hat{s}(n) = \sum_{i=1}^{2} a_i\, s_r(n-i) + \sum_{i=1}^{6} b_i\, d_q(n-i).$
In the z-domain the predictor therefore consists of a 6-tap all-zero path and a 2-tap all-pole path, giving a pole-zero transfer function (2 poles, 6 zeros). Both sets of predictor coefficients are updated using a simplified (sign-sign) gradient algorithm; for example, the first coefficient of the second-order (pole) section is adapted from the signs of the current and previous prediction errors.

11 Audio Compression by Exploiting Psychoacoustic Properties
Coding of audio signals by trying to follow the waveform shape cannot achieve a compression ratio higher than about 4 if quality is to be maintained. For applications such as a flash-card music player, a compression ratio of at least 10 is required. To achieve this, it is necessary to exploit the properties of psychoacoustics so that the quantization errors inherent in the coding process are inaudible. This is so-called transparent coding: the coded audio and the source audio are indistinguishable even to expert listeners.
[Figure: basic structure of a psychoacoustic coder]

12 Basic Structure of a Psychoacoustic Coder
Principle: the input signal is first divided into several sub-bands using a filter bank, typically a quadrature mirror filter (QMF) bank implemented with a polyphase structure. Each sub-band signal is then quantized according to the bit-allocation information obtained from masking analysis. The masking analysis is done via an FFT, critical-band grouping of spectral components, and computation of the signal-to-mask ratio (SMR). The quantized sub-band signals are then packed into the output bit stream.
Filter bank: divides the signal band into several sub-bands. The sub-band filtered signals are then downsampled to the sub-band rate for further processing.

13 Basic Structure of a Psychoacoustic Coder
Filter bank: after downsampling, the output of each analysis filter contains the wanted baseband term plus aliasing terms. It is impossible to construct an ideal (brick-wall) sub-band filter, so aliasing results after downsampling. In general, sub-band filtering is therefore not a perfect-reconstruction process; however, it is possible to build near-perfect sub-band filter banks by designing the filters so that the aliasing is small (or cancels in the synthesis bank).
Downsampling: the signal is re-sampled from the input sampling rate $f_s$ to the sub-band rate $f_s/M$, where $M$ is the number of sub-bands and each sub-band has bandwidth $f_s/(2M)$, with upper band edges at multiples of $f_s/(2M)$.
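As a toy illustration of sub-band splitting and downsampling (not the 32-band polyphase bank used in MPEG audio), the sketch below uses the 2-band Haar QMF pair, which happens to give perfect reconstruction:

```python
import numpy as np

def haar_analysis(x):
    """Split x into low and high sub-bands and downsample by 2 (2-band Haar QMF)."""
    x = x[: len(x) // 2 * 2]                       # force even length
    low = (x[0::2] + x[1::2]) / np.sqrt(2)         # averaging filter, decimated
    high = (x[0::2] - x[1::2]) / np.sqrt(2)        # differencing filter, decimated
    return low, high

def haar_synthesis(low, high):
    """Upsample and recombine the two sub-bands; exact inverse of the analysis."""
    x = np.empty(2 * len(low))
    x[0::2] = (low + high) / np.sqrt(2)
    x[1::2] = (low - high) / np.sqrt(2)
    return x

x = np.random.randn(64)
lo, hi = haar_analysis(x)
assert np.allclose(haar_synthesis(lo, hi), x)      # perfect reconstruction
```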

14 Basic Structure of a Psychoacoustic Coder
Masking analysis: critical bands mask neighbouring bands according to the psychoacoustic model. Signals are coded only when they are above the masking threshold and ignored if they are below it; the bit allocations for quantizing the coded signals are dynamic, depending on how far they are above the threshold.
MUSICAM (Masking-pattern adapted Universal Subband Integrated Coding And Multiplexing) algorithm example: after analysis, the levels of the first 16 of the 32 bands are obtained (in this example, band 7 is at 10 dB, band 8 at 60 dB and band 9 at 35 dB). If the 60 dB signal in the 8th band gives a masking of 12 dB in the 7th band and 15 dB in the 9th band, then:
the level in the 7th band is 10 dB (< 12 dB), so it is inaudible and is ignored;
the level in the 9th band is 35 dB (20 dB above the 15 dB mask), so it is sent. Only the amount above the masking level needs to be coded, so instead of using 6 bits we can use 4 bits, giving 24 dB SQNR; the quantization noise then sits at 35 - 24 = 11 dB, still 4 dB below the 15 dB mask, saving 2 bits.
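The sketch below reproduces this arithmetic for arbitrary band levels and masking thresholds, assuming the usual rule of thumb of roughly 6 dB of SQNR per bit; the threshold values are the ones quoted in the example above.

```python
import math

def bits_needed(level_db, mask_db, db_per_bit=6.0):
    """Bits so that the quantization noise (level - 6*bits) stays below the mask."""
    if level_db <= mask_db:
        return 0                              # fully masked: do not code the band at all
    return math.ceil((level_db - mask_db) / db_per_bit)

# the MUSICAM example: band 7 masked at 12 dB, band 9 masked at 15 dB
print(bits_needed(10, 12))   # 0 bits -> inaudible, skip the band
print(bits_needed(35, 15))   # 4 bits -> 24 dB SQNR, noise ends up 4 dB under the mask
```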

15 Psychoacoustics Coding: MPEG Coding Standards
MPEG, which stands for Moving Picture Experts Group, is the name of a family of standards used for coding audio-visual information (e.g., movies, video, music) in a digital compressed format.
History of MPEG Audio:
MPEG-1 two-channel coding standard (Nov. 1992)
MPEG-2 extension towards Lower Sampling Frequencies (LSF) (1994)
MPEG-2 backwards-compatible multichannel coding (1994)
MPEG-2 higher-quality multichannel standard (MPEG-2 AAC) (1997)
MPEG-4 Advanced Audio Coding (AAC)
MPEG-4 with added functionalities, e.g., Spectral Band Replication (SBR), 2003; SinuSoidal Coding (SSC), 2004
MPEG-4 ALS (Audio Lossless Coding), 2006
MPEG Surround, 2007
MPEG Spatial Audio Object Coding (SAOC)

16 Psychoacoustics Coding: MPEG-1 Audio Coding
The MPEG-1 standard (ISO/IEC 11172) has a gross bit rate of 1.5 Mbit/s for audio and video: about 1.2 Mbit/s for video and 0.3 Mbit/s for audio. Compression factors range from 2.7 to 24 (typically around 12). With a compression ratio of 6:1 (16-bit stereo sampled at 48 kHz reduced to 256 kbit/s) and optimal listening conditions, expert listeners could not distinguish between coded and original audio clips.
MPEG-1 audio supports sampling frequencies of 32, 44.1 and 48 kHz and one or two audio channels in one of four modes:
Monophonic -- a single audio channel
Dual-monophonic -- two independent channels, e.g., English and French
Stereo -- stereo channels that share bits but do not use joint-stereo coding
Joint-stereo -- takes advantage of the correlation between the stereo channels
The basic algorithm:
1. Use a quadrature mirror filter bank to divide the audio signal (e.g., 48 kHz sound) into 32 frequency sub-bands (sub-band filtering).
2. Determine the amount of masking for each band caused by nearby bands using the psychoacoustic model, via a separate FFT analysis.
3. If the power in a band is below the masking threshold, do not encode it.
4. Otherwise, determine the number of bits needed to represent the coefficients such that the noise introduced by quantization stays below the masking level (recall that one fewer bit of quantization adds about 6 dB of noise).
5. Format the bitstream.

17 MPEG-1 Audio Coding
MPEG provides 3 layers of compression with various coding complexities and bit rates. The basic model is the same, but codec complexity increases with each layer. MP3 coders implement Layer 3 of the MPEG-1 audio coding standard.
[Figure: block diagram of the MPEG audio coder]

18 MPEG-1 Audio Coding
Polyphase filterbank in the MPEG-1 audio coder:
32 sub-bands of equal width
512-tap FIR prototype filter
non-perfect reconstruction, with frequency overlap between bands
low complexity: about 80 multiplications/additions per output sample
The sub-bands overlap at 3 dB with the adjacent bands; the leakage into other bands is small, and the total response almost adds up to unity (0 dB).

19 MPEG-1 Audio Coding
Layer I coding: divides the data into frames, each containing 384 samples (a duration of 8 ms at 48 kHz sampling), i.e., 12 samples from each of the 32 filtered sub-bands. The sub-bands are equally spaced in frequency, i.e., they are not critical bands. The psychoacoustic model uses frequency masking only.
The Layer 1 psychoacoustic model uses a 512-point FFT to get detailed spectral information about the signal. Both tonal (sinusoidal) and non-tonal (noise) maskers are derived from the FFT spectrum. Each masker produces a masking threshold depending on its frequency, intensity and tonality. For each sub-band, the individual masking thresholds are combined into a global masking threshold. The masking threshold is compared with the maximum signal level for the sub-band, producing a signal-to-mask ratio (SMR) which determines the bit allocation for the quantizer.

20 MPEG-1 Audio Coding
Layer I psychoacoustic model: the spectral intensity of the input signal is computed from the FFT spectrum $X(k)$, with a gain-correction factor of 8/3 to compensate for the Hanning analysis window, and the input amplitude is assumed to be limited to a fixed reference level. From the spectral intensities the model derives the signal intensity, the intensity within each scale-factor band, and the relative intensity of the masked threshold (this is also the basis of the perceptual-entropy concept). Tonal maskers are detected as local maxima of the spectrum.

21 MPEG-1 Audio Coding
Layer I psychoacoustic model (continued): the SMR is the difference in level between a signal component and the masking threshold at a given frequency. The spread of masking is expressed in terms of a frequency-dependent factor on the critical-band (Bark) scale $z$ and a constant $c$ of about 3 to 6 dB.
Addition of masking: the combined masking curve can be obtained either by simply adding the intensities of the individual masking curves (used in MPEG-1) or by taking the highest, i.e., dominant, individual masking curve (used in MPEG-2 and AC-2).

22 MPEG-1 Audio Coding
Layer I coding, quantization and frame packing: the Layer 1 quantizer/encoder first examines each sub-band's samples, finds the maximum absolute value of these samples, and quantizes it to 6 bits; this is the scale factor for the sub-band. It then determines the bit allocation for each sub-band by minimizing the total noise-to-mask ratio with respect to the bits allocated to each sub-band. (Heavily masked sub-bands may end up with zero bits, so that no samples are encoded.) Finally, the sub-band samples are linearly quantized to the bit allocation for that sub-band.
Layer I frame packing: each frame starts with header information for synchronization and bookkeeping (32 bits, detailed on the next slide) and an optional 16-bit cyclic redundancy check (CRC) for error detection. Each of the 32 sub-bands gets 4 bits to describe its bit allocation and 6 bits for its scale factor. The remaining bits in the frame carry the sub-band samples, with an optional trailer for extra (ancillary) information.
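A minimal sketch of the per-sub-band scaling and linear quantization described above (12 samples per sub-band, a scale factor taken from the block maximum, then a uniform mid-tread quantizer at the allocated word length); the normalization details are simplified relative to the real Layer I tables.

```python
import numpy as np

def quantize_subband(samples, bits):
    """Scale a block of 12 sub-band samples and quantize them uniformly."""
    scale = np.max(np.abs(samples)) + 1e-12        # block scale factor (6 bits in Layer I)
    if bits == 0:
        return scale, np.zeros(len(samples), dtype=int), np.zeros(len(samples))
    levels = 2 ** bits - 1                         # odd number of levels (mid-tread)
    normalized = samples / scale                   # now in [-1, 1]
    codes = np.round((normalized + 1.0) / 2.0 * (levels - 1)).astype(int)
    dequant = (codes / (levels - 1)) * 2.0 - 1.0
    return scale, codes, dequant * scale

block = 0.3 * np.sin(2 * np.pi * 0.1 * np.arange(12))
scale, codes, recon = quantize_subband(block, bits=4)
print("max error:", np.max(np.abs(block - recon)))
```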

23 MPEG-1 Audio Coding
Layer I coding, quantization and frame packing (continued):
Header = 12-bit sync word + 20 bits of system information: 1 bit for ID (1 = MPEG), 2 bits for layer (I, II or III), 1 bit for error protection, 4 bits for the bit-rate index, 2 bits for sampling frequency, 1 padding bit, 1 private bit, 2 bits for mode, 2 bits for mode extension, 1 copyright bit, 1 bit for original/copy, 2 bits for emphasis.
Scale factor: 6 bits covering indices 0 to 62, representing a quantized gain ranging downward from 2.0 with a dynamic range of 120 dB.
Bit allocation: 4 bits per sub-band. An allocation of one bit is not used because a mid-tread quantizer is employed; zero bits are allocated to heavily masked (low-SMR) bands; the allocation value 15 is forbidden.
The highest quality is achieved at a bit rate of 384 kbps. Typical applications of Layer 1 include digital recording on tapes, hard disks or magneto-optical disks, which can tolerate the high bit rate.

24 MPEG-1 Audio Coding
Layer II coding: uses three frames in filtering (previous, current and next, a total of 3 x 384 = 1152 samples), which models a little of the temporal masking. The Layer 2 time-frequency mapping is the same as in Layer 1. The Layer 2 psychoacoustic model is similar to the Layer 1 model, but it uses a 1024-point FFT for greater frequency resolution; it uses the same procedure as the Layer 1 model to produce signal-to-mask ratios for each of the 32 sub-bands.
The Layer 2 quantizer/encoder is similar to that of Layer 1. However, Layer 2 frames are three times as long as Layer 1 frames, so Layer 2 allows each sub-band a sequence of three successive scale factors, and the encoder uses one, two or all three, depending on how much they differ from each other. This gives, on average, a factor-of-2 reduction in the bit rate for the scale factors compared with Layer 1. Bit allocations are computed in a similar way to Layer 1.
Layer 2 processes the input signal in frames of 1152 PCM samples; at 48 kHz, each frame carries 24 ms of sound. The highest quality is achieved at a bit rate of 256 kbps, but quality is often good down to 64 kbps. Typical applications of Layer 2 include audio broadcasting, television, consumer and professional recording, and multimedia. Audio files on the World Wide Web with the extension .mpeg2 or .mp2 are encoded with MPEG-1 Layer 2.

25 MPEG-1 Audio Coding
Layer II coding (continued): the Layer 2 frame packer uses the same header and CRC structure as Layer 1. The number of bits used to describe the bit allocation varies with sub-band: 4 bits for the low sub-bands, 3 bits for the middle sub-bands, and 2 bits for the high sub-bands (this follows the critical bandwidths). The scale factors (one, two or three depending on the data) are encoded along with a 2-bit code describing which combination of scale factors is used. The sub-band samples are quantized according to the bit allocation and then combined into groups of three (called granules); each granule is encoded with one code word. This allows Layer 2 to exploit much more of the redundant signal information than Layer 1.

26 MPEG-1 Audio Coding Psychoacoustics Coding Layer III coding Layer 3 uses a better critical band filter (non-equal frequencies), psychoacoustic model includes temporal masking effects, takes into account stereo redundancy, and uses Huffman coder for further lossless compression of the quantized signals The filter bank used in MPEG Layer-3 is a hybrid filter bank which consists of a 32 channel polyphase filter bank and a Modified Discrete Cosine Transform (MDCT). This provides a fine grain frequency resolution. This hybrid form was chosen for reasons of compatibility to its predecessors, Layer-1 and Layer-2 because MDCT is applied after the polyphase filter. 26

27 MPEG-1 Audio Coding
Layer III coding: Layer 3 is substantially more complicated than Layer 2. It uses both polyphase and (modified) discrete cosine transform filter banks, a polynomial-prediction psychoacoustic model, and sophisticated non-uniform quantization and encoding schemes allowing variable-length frames. The frame packer includes a bit reservoir which allows more bits to be used for portions of the signal that need them. Layer 3 is intended for applications where a critical need for a low bit rate justifies the expensive and sophisticated encoding system; it allows high-quality results at bit rates as low as 64 kbps. Typical applications are in telecommunications and professional audio, such as commercially published music and video. The widely used MP3 format is Layer III coding of MPEG-1.
Stereo redundancy coding:
Intensity stereo coding -- in the upper-frequency sub-bands, encode the summed signal instead of independent signals for the left and right channels.
Middle/Side (MS) stereo coding -- encode the middle (sum of left and right) and side (difference of left and right) channels.

28 MPEG-1 Audio Coding
Layer III vs Layer I and Layer II:
modified DCT (MDCT) filter bank
critical bands on the Bark scale
Huffman coding for entropy reduction
dynamics compression
coding of the difference and sum of the stereo signals
Effectiveness of MPEG audio. Quality factor: 5 - perfect, 4 - just noticeable, 3 - slightly annoying, 2 - annoying, 1 - very annoying. The real delay (encoding + transmission + decoding) is about 3 times the theoretical minimum delay.
Layer 1: target 192 kb/s, compression ratio 4:1, theoretical minimum delay 19 ms
Layer 2: target 128 kb/s, compression ratio 6:1 - 8:1, quality (MOS) 2.1 to 2.6 at 64 kb/s, theoretical minimum delay 35 ms
Layer 3: target 64 kb/s, compression ratio 10:1 - 12:1, quality (MOS) 3.6 to 3.8 at 64 kb/s, theoretical minimum delay 59 ms
[Figure: MPEG audio bit-stream format]

29 MPEG-2 Audio Coding
MPEG-2 BC is a multichannel standard that is backward compatible with the MPEG-1 coder. The number of channels can be between 1 and 48. It allows for ITU-R "indistinguishable" quality at data rates of 320 kbps for five full-bandwidth audio channels. It supports scalable sampling rates from 8 kHz to 96 kHz with low and high complexities. The multichannel MPEG-2 format uses a basic five-channel approach sometimes referred to as 3/2+1 stereo (3 front and 2 surround channels + subwoofer); the low-frequency effects (LFE) subwoofer channel is optional. An encoder matrix allows a two-channel decoder to decode a compatible two-channel signal that is a subset of the multichannel bit stream.

30 MPEG-2 Audio Coding
The MPEG-1 left and right channels are replaced by matrixed MPEG-2 left and right channels, and these are encoded into backward-compatible MPEG frames with an MPEG-1 encoder. Additional multichannel data is placed in the expanded ancillary data field; a standard two-channel decoder ignores the ancillary information and reproduces the front main channels.

31 MPEG-2 Advanced Audio Coding (AAC)
Initially, MPEG-2 AAC was called MPEG-2 non-backward-compatible (NBC) coding. The standardization of MPEG-2 NBC started after the completion of the MPEG-2 BC multichannel standard; MPEG-2 NBC was renamed MPEG-2 AAC and was adopted in 1997 as part of the MPEG-2 standard. The MPEG-2 AAC format codes stereo or multichannel sound at a bit rate of about 64 kbps per channel, and provides 5.1-channel coding at an overall rate of 384 kbps. MPEG-2 AAC is not backward compatible with MPEG-1. AAC supports input channel configurations of 1/0 (mono), 2/0 (two-channel stereo) and multichannel configurations up to 3/2+1, with provision for up to 48 channels.
[Figure: bit-rate comparison of the audio coding systems used in DAB, DVB and DVD, including MPEG-1/2 Audio Layers I-III, MPEG-2 "LSF", Dolby AC-3, CCITT G.722 and NICAM]

32 MPEG-2 Advanced Audio Coding (AAC)
MPEG-2 AAC profiles: to allow flexibility in audio quality versus processing requirements, the AAC format defines three profiles: main, low complexity, and scalable sampling rate. The main profile provides the highest audio quality at any given bit rate, and all the features of AAC are employed; a main-profile decoder can also decode low-complexity bit streams.
Main profile: all AAC tools except the preprocessing tool are used.
Low Complexity (LC) profile: the prediction and preprocessing tools are not used and the TNS module has a limited order.
Scalable Sampling Rate (SSR) profile: the preprocessing tools are required; they comprise a polyphase quadrature filter, gain detectors and gain modifiers. In this profile the prediction tool is not used and the TNS order and bandwidth are limited.

33 MPEG-2 Advanced Audio Coding (AAC)
[Figure: MPEG-2 AAC encoder block diagram]

34 MPEG-2 Advanced Audio Coding (AAC) Psychoacoustics Coding MPEG-2 AAC Encoder The crucial differences between MPEG-2 AAC and its predecessor MPEG- 1 Audio Layer-3 are shown as follows: Filter bank: in contrast to the hybrid filter bank of ISO/MPEG-1 Audio Layer-3 - chosen for reasons of compatibility but displaying certain structural weaknesses - MPEG-2 AAC uses a plain Modified Discrete Cosine Transform (MDCT). Together with the increased window length (1024 instead of 576 spectral lines per transform) the MDCT outperforms the filter banks of previous coding methods. Temporal Noise Shaping (TNS): A true novelty in the area of time/frequency coding schemes. It allows controlling the fine structure of quantization noise within the filter bank window and shapes the distribution of quantization noise in time by prediction in the frequency domain. In particular voice signals experience considerable improvement through TNS. Prediction: with better signal prediction. It benefits from the fact that certain types of audio signals are easy to predict so incorporating a better signal prediction algorithm can improve the performance. Quantization: by allowing finer control of quantization resolution, the given bit rate can be used more efficiently. Bit-stream format: the information to be transmitted undergoes entropy coding in order to keep redundancy as low as possible. The optimization of these coding methods together with a flexible bit-stream structure has made further improvement of the coding efficiency possible. 34

35 MPEG-2 Advanced Audio Coding (AAC)
Principle of MPEG AAC encoding - preprocessing (used in the SSR profile only): it includes a polyphase quadrature filter (PQF), gain detectors and gain modifiers. The PQF produces four bandwidth-limited outputs; for a signal with 48 kHz sampling, it outputs signals with bandwidths of 24 kHz, 18 kHz, 12 kHz and 6 kHz. The pre-echo effect can be suppressed by using the gain-control tool: the amplitude of each PQF band can be manipulated independently by the gain detectors and modifiers, and the gain control can be used together with all types of window sequences. The time resolution of the gain control is about 0.7 ms at a 48 kHz sampling rate.

36 MPEG-2 Advanced Audio Coding (AAC)
Principle of MPEG AAC encoding - filter bank: the AAC encoder uses the modified discrete cosine transform (MDCT) as the input filter bank. The MDCT adopts a technique called time-domain aliasing cancellation (TDAC), in which the windowed signals are overlapped and added to achieve perfect reconstruction.
Forward MDCT: $X(k) = \sum_{n=0}^{N-1} x(n)\cos\!\left(\frac{2\pi}{N}\,(n + n_0)\,(k + \tfrac{1}{2})\right), \quad k = 0, \ldots, N/2 - 1$
Inverse MDCT: $x(n) = \frac{4}{N}\sum_{k=0}^{N/2-1} X(k)\cos\!\left(\frac{2\pi}{N}\,(n + n_0)\,(k + \tfrac{1}{2})\right)$
where $N$ is the transform block length and $n_0 = (N/2 + 1)/2$.
The MPEG AAC filter bank was designed to enable a smooth change from one window shape to another in order to better adapt to input signal conditions. It uses transform blocks of 2048 samples for stationary signals and 256 samples for transient signals. Two window shapes are employed in the filter bank - one is used when perceptually important components are spaced closer than 140 Hz, and the other when components are spaced more than 220 Hz apart.
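A numerical sketch of the MDCT/IMDCT pair and the TDAC property, using a sine window that satisfies the Princen-Bradley condition; this illustrates the transform itself, not the exact normalization or window tables of the AAC standard.

```python
import numpy as np

def mdct(frame, window):
    """MDCT of one 2M-sample windowed frame -> M coefficients."""
    two_m = len(frame)
    m = two_m // 2
    n = np.arange(two_m)
    k = np.arange(m)
    basis = np.cos(np.pi / m * (n[None, :] + 0.5 + m / 2.0) * (k[:, None] + 0.5))
    return basis @ (window * frame)

def imdct(coeffs, window):
    """Inverse MDCT -> 2M time-aliased samples, to be windowed and overlap-added."""
    m = len(coeffs)
    n = np.arange(2 * m)
    k = np.arange(m)
    basis = np.cos(np.pi / m * (n[None, :] + 0.5 + m / 2.0) * (k[:, None] + 0.5))
    return (2.0 / m) * window * (basis.T @ coeffs)

M = 64
win = np.sin(np.pi / (2 * M) * (np.arange(2 * M) + 0.5))   # sine window: w[n]^2 + w[n+M]^2 = 1
x = np.random.randn(4 * M)
y = np.zeros_like(x)
for start in range(0, 2 * M + 1, M):                        # 50%-overlapped frames
    frame = x[start:start + 2 * M]
    y[start:start + 2 * M] += imdct(mdct(frame, win), win)
# the time-domain aliasing cancels in the fully overlapped region
assert np.allclose(y[M:3 * M], x[M:3 * M])
```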

37 MPEG-2 Advanced Audio Coding (AAC)
Principle of MPEG AAC encoding - window switching: kept synchronized at the encoder and decoder. It allows the encoder to control temporal pre-echo noise: since quantization is performed in the frequency domain, the quantization error spreads over the whole window in the time domain, producing pre-echo noise for transient signals, so transient signals are best encoded with a short transform length. A typical window-switching sequence is: (a) long window, (b) long-start window, (c) eight short windows, (d) long-stop window, (e) long window.
Sine or Kaiser-Bessel Derived (KBD) window: the encoder can select the optimum window shape according to the characteristics of the input signal. To maintain perfect reconstruction, the shape of the left half of each window must always match the shape of the right half of the preceding window.

38 MPEG-2 Advanced Audio Coding (AAC) Principle of MPEG AAC Encoding Temporal noise shaping Psychoacoustics Coding Perceptual coding is especially difficult if there are temporal mismatches between the masking threshold and the quantization noise (pre-echo). The TNS technique allows the encoder to control the fine temporal structure of the quantization noise even within a filterbank window. TNS is basically a technique that performs noise shaping by applying linear prediction at MDCT coefficients. The TNS filtering block implements an in-place filtering operation on the spectral values, which means that it replaces the set of spectral coefficients input to the TNS filtering block with the corresponding prediction residual. Prediction coefficients of the TNS filtering are obtained by applying a linear predictive analysis of the spectral coefficients. The combination of the encoder filterbank and the adaptive TNS prediction filter is a compound continuous signal adaptive filterbank which adapts between a high-frequency resolution filterbank (for stationary signals) and a high-time resolution filterbank (for transient signals) dynamically. 38
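A rough sketch of the TNS idea, assuming LPC coefficients are estimated from the MDCT coefficients of one window and the spectrum is then replaced in place by the prediction residual; the real tool restricts the filter order and the frequency range, which is omitted here.

```python
import numpy as np
from scipy.signal import lfilter

def tns_analysis(spectrum, order=4):
    """Filter the spectrum with its own prediction-error filter A(z);
    returns (residual spectrum, prediction coefficients)."""
    r = np.array([np.dot(spectrum[: len(spectrum) - m], spectrum[m:])
                  for m in range(order + 1)])
    toeplitz = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(toeplitz, r[1:order + 1])   # normal equations R a = r
    b = np.concatenate(([1.0], -a))
    return lfilter(b, [1.0], spectrum), a

def tns_synthesis(residual, a):
    """Decoder side: run the inverse (all-pole) filter over the residual spectrum."""
    b = np.concatenate(([1.0], -a))
    return lfilter([1.0], b, residual)

spectrum = np.random.randn(1024) * np.linspace(2.0, 0.1, 1024)   # toy MDCT coefficients
res, a = tns_analysis(spectrum)
assert np.allclose(tns_synthesis(res, a), spectrum)              # in-place filtering is invertible
```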

39 MPEG-2 Advanced Audio Coding (AAC)
Principle of MPEG AAC encoding - joint stereo coding: exploits binaural psychoacoustic effects to reduce the bit rate for stereophonic signals significantly below the rate required for coding the input channels separately. Two techniques are used:
M/S stereo coding: also known as sum/difference coding. M/S stereo coding can be used to overcome the problem caused by binaural masking level depression, where a signal at lower frequencies (< 2 kHz) can show as much as a 20 dB difference in masking threshold depending on the phase of the signal and the noise present. M/S stereo coding is applied to each channel pair of the multichannel audio source, i.e., the left/right front and left/right surround channels. M/S coding can be switched on and off in time or in frequency depending on the characteristics of the input signal.
Intensity stereo coding: similar to techniques based on dynamic crosstalk or channel coupling. The technique is based on the fact that the perception of high-frequency sound components relies mainly on the analysis of their energy-time envelopes rather than on the waveforms themselves. It is therefore possible to transmit only a single set of spectral values shared among several audio channels while still achieving excellent sound quality.
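The M/S matrixing itself is just a sum/difference rotation of the channel pair; a minimal sketch follows (the per-band on/off switching that AAC applies on top of this is omitted):

```python
import numpy as np

def ms_encode(left, right):
    """Mid/side matrixing: an exactly invertible channel rotation."""
    mid = (left + right) / 2.0
    side = (left - right) / 2.0
    return mid, side

def ms_decode(mid, side):
    return mid + side, mid - side

left = np.random.randn(1024)
right = 0.9 * left + 0.1 * np.random.randn(1024)    # highly correlated channels
mid, side = ms_encode(left, right)
l2, r2 = ms_decode(mid, side)
assert np.allclose(l2, left) and np.allclose(r2, right)
# most of the energy ends up in the mid channel, so the side channel is cheap to code
print("side/mid energy ratio:", np.sum(side**2) / np.sum(mid**2))
```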

40 MPEG-2 Advanced Audio Coding (AAC) Principle of MPEG AAC Encoding Psychoacoustics Coding Prediction is used to further reduce the redundancy of signals, especially stationary signals. Each spectral component at the filterbank s output (up to 16kHz) are input to the prediction module. The predictor is a backward adaptive predictor which exploits the correlation of spectral components of consecutive frames. An LMS adaptation algorithm is used to calculate the predictor coefficients (order two) on a frame-by-frame basis. Prediction can be switched on and off dynamically according to the bit saving achieved with prediction or not. 40

41 MPEG-2 Advanced Audio Coding (AAC)
Principle of MPEG AAC encoding - quantization and coding:
A global gain, quantized to 8 bits.
49 scale factors, differentially quantized and then Huffman coded.
A non-uniform quantizer, approximately of the form $x_q = \mathrm{sgn}(x)\,\mathrm{nint}\!\big[(|x|\cdot 2^{-QS/4})^{3/4}\big]$, where QS is a global quantization step size with a resolution of 1.5 dB.
Two constraints exist during quantization: meet the requirements of the psychoacoustic model, and keep the total number of bits below a certain limit. The strategy adopted by AAC is to use two nested iteration loops, the inner and outer iteration loops.
Main features of AAC quantization:
Non-uniform quantization increases SNR; the finest step is 1.5 dB.
Huffman coding of spectral values with different probability models; probability tables of two and four dimensions are used.
Noise shaping by amplifying groups of spectral values; the amplification information is stored in the scale factors, shaping the quantization noise in units similar to the critical bands of the human auditory system.
Huffman coding of the differences between scale factors.
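A sketch of such a 3/4-power companding quantizer and its inverse in Python; the rounding offset used by real encoders is omitted, so this only illustrates the companding and the 1.5 dB granularity implied by the $2^{QS/4}$ step-size factor (all names are illustrative).

```python
import numpy as np

def aac_like_quantize(x, qs):
    """Non-uniform (power-law) quantization controlled by a global step index qs."""
    return np.sign(x) * np.round((np.abs(x) * 2.0 ** (-qs / 4.0)) ** 0.75)

def aac_like_dequantize(q, qs):
    """Inverse companding: expand by the 4/3 power and undo the step-size scaling."""
    return np.sign(q) * (np.abs(q) ** (4.0 / 3.0)) * 2.0 ** (qs / 4.0)

x = np.random.randn(1024) * 100.0
for qs in (0, 4, 8):     # each +4 in qs doubles the effective step (+6 dB); +1 is 1.5 dB
    err = x - aac_like_dequantize(aac_like_quantize(x, qs), qs)
    print(f"qs={qs:2d}  rms error={np.sqrt(np.mean(err**2)):.3f}")
```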

42 MPEG-2 Advanced Audio Coding (AAC)
[Figure: MPEG AAC quantization and coding module]

43 MPEG-2 Advanced Audio Coding (AAC)
Principle of MPEG AAC encoding - quantization and coding (continued):
The task of the inner iteration loop is to control the quantizer step size so that the given spectral data can be coded within the number of available bits: start with an initial quantizer step size, quantize, count the bits, and repeat with an increased step size if the bit count exceeds the available limit.
The task of the outer iteration loop is to change the amplification (scale factors) of the spectral coefficients in all scale-factor bands so that the demands of the psychoacoustic model are fulfilled as far as possible.
The inner and outer loops are applied iteratively until the best result is achieved or a termination condition is reached.
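A toy version of the inner rate loop, reusing the power-law quantizer sketched earlier and a crude stand-in for the bit count (a real encoder would count the actual Huffman code bits):

```python
import numpy as np

def quantize(x, qs):
    return np.sign(x) * np.round((np.abs(x) * 2.0 ** (-qs / 4.0)) ** 0.75)

def estimate_bits(q):
    """Stand-in bit count: roughly log2 of each magnitude plus a sign bit."""
    return int(np.sum(np.ceil(np.log2(np.abs(q) + 1)) + (q != 0)))

def inner_rate_loop(x, bit_budget, qs=0):
    """Increase the global step size until the spectrum fits in the bit budget."""
    while True:
        q = quantize(x, qs)
        if estimate_bits(q) <= bit_budget:
            return qs, q
        qs += 1                      # one step = 1.5 dB coarser quantization

x = np.random.randn(1024) * 50.0
qs, q = inner_rate_loop(x, bit_budget=3000)
print("chosen step index:", qs, "bits used:", estimate_bits(q))
```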

44 MPEG-2 Advanced Audio Coding (AAC) Quantization and coding Noiseless coding Group 4 quantized coefficients as magnitudes in excess of one, with a value of +/- 1 left in the quantized coefficient array to indicate the sign. The clipped coefficients are coded by integer magnitudes and an offset from the base of the coefficient array to mark their location. Each set of 1024 quantized coefficients is separated into sections so that a single Huffman codebook can be used to code each section. Sectioning is performed dynamically in order to minimize the required number of bits to represent the full set of quantized spectral coefficients. Grouping and interleaving For the case of 8 short windows, the set of 1024 coefficients is actually a matrix of 8 by 128 frequency coefficients. More coding gain can be achieved if the coefficients of these short windows are grouped and interleaved. Scale factors Psychoacoustics Coding There is a global gain that normalizes the scale factors. The global gain is represented by an 8-bit unsigned integer. There are 49 scale factors, most scale factor bands have 32 coefficients. Both the global gain and scale factors are quantized in 1.5dB steps. The difference between each scale factor and the previous scale factor are Huffman coded. 44

45 MPEG-4 Audio
MPEG-4 is a standard for every application that requires the use of advanced sound compression, synthesis, manipulation or playback. MPEG-4 audio integrates:
low bit rates with high-quality compression;
synthetic with natural sound processing;
speech with audio coding;
single-channel with multichannel audio configurations;
traditional with interactive and virtual-reality content.
Overview of MPEG-4 capabilities:
Speech tools: used for the transmission and decoding of natural or synthetic speech.
Audio tools: used for the transmission and decoding of recorded music and other audio soundtracks.
Synthesis tools: used for very-low-bit-rate description, transmission and synthesis of synthetic music and other sounds.
Composition tools: used for object-based coding, interactive functionality, and audiovisual synchronization.
Scalability tools: used for creating bit streams that can be transmitted, without re-coding, at several different bit rates.

46 MPEG-4 Audio Psychoacoustics Coding MPEG-4 General Audio Coding Tools MPEG-4 standard is capable of coding of natural audio at a wide bit range, including bit rates from 6kbit/s up to several hundred kbit/s per audio channel for mono, two-channel, and multichannel signals. In the upper bit rate range of MPEG-4 general audio coder, high-quality compression can be achieved by MPEG-2 AAC standard with certain improvements within the MPEG-4 tool set. Tools: Speech, General Audio, Scalability, Synthesis, Composition, Streaming, Error Protection General Frame Work of MPEG-4 Audio 46

47 MPEG-4 Audio Psychoacoustics Coding MPEG-4 Additions to AAC Besides the building blocks provided by MPEG-2 AAC, several new tools are added in the MPEG-4 T/F coder in order to improve the coding efficiency and offer new functionalities. Perceptual noise substitution (PNS) Long-term prediction (LTP) Transform-domain weighted interleave vector quantization (TwinVQ) coding kernel Low-delay AAC (AAC-LD) Error-resilience (ER) Bit Slice Arithmetic Coding (BSAC) instead of Huffman coding Perceptual noise substitution (PNS) The principle of PNS is to represent noise-like components of the input signal with a very compact parameter representation. It is based on the fact that the subjective sensation stimulated by a noise-like signal is determined by the input signal s spectral and temporal fine structure rather than its actual waveform. 47

48 MPEG-4 Audio Psychoacoustics Coding Perceptual noise substitution (PNS) The encoder analyses the input signal and determines noise-like signal components for each scalefactor band in each coding frame. If a particular scalefactor band is considered to be noise-like, no quantization and entropy coding is performed. Instead, these two steps are omitted and a noise substitution flag and the total power of the substituted set of spectral coefficients are transmitted to the decoder. The decoder analyses the transmitted information. If a noise-substitution flag is detected, pseudorandom noise with a total noise power equal to the transmitted level are inserted into the reconstructed data to replace the actual spectral coefficients. Since only a signaling flag and energy information are transmitted for each selected scalefactor band, PNS results in a highly compact representation for noiselike components in the input signals. It was found that PNS tool has the ability to enhance the coding efficiency for complex musical signals when coded at low bit rates. 48
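On the decoder side, PNS amounts to filling a scale-factor band with pseudo-random values scaled to the transmitted energy; a minimal sketch (the band boundaries and the way the energy is coded are illustrative):

```python
import numpy as np

def pns_encode_band(coeffs):
    """Encoder side: transmit only the total energy of a noise-like band."""
    return float(np.sum(coeffs ** 2))

def pns_decode_band(energy, width, rng):
    """Decoder side: insert pseudo-random noise with the transmitted total energy."""
    noise = rng.standard_normal(width)
    noise *= np.sqrt(energy / np.sum(noise ** 2))   # match the band energy exactly
    return noise

rng = np.random.default_rng(0)
band = rng.standard_normal(16) * 0.3                # a noise-like scale-factor band
energy = pns_encode_band(band)                      # only this number is transmitted
reconstructed = pns_decode_band(energy, len(band), rng)
print(energy, float(np.sum(reconstructed ** 2)))    # energies match; the waveforms differ
```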

49 MPEG-4 Audio
Long-term prediction (LTP): based on a well-known technique for coding speech signals, where it is used to reduce the redundancy of voiced speech, which is periodic in nature.

50 MPEG-4 Audio
Long-term prediction: the LTP tool uses the quantized spectral values of preceding frames to predict the input signal, as follows. An inverse TNS filter and a synthesis filter bank map the quantized spectral values back to a time-domain representation. The optimum parameters for the delay (long-term lag) and amplitude scaling (gain) are calculated by matching this reconstructed time signal to the actual input signal; they are then used to form the predicted signal. An analysis filter bank and a forward TNS filter transform both the input and the predicted signal into their spectral representations. A residual signal is obtained by subtracting the spectral values of the predicted signal from those of the input signal, and this difference signal is sent for quantization and entropy coding. To achieve the best performance, either the difference signal or the original input can be selected for quantization and coding, depending on the resulting bit rate.

51 MPEG-4 Audio Psychoacoustics Coding Transform-domain weighted interleave vector quantization (TwinVQ) TwinVQ is an additional quantization/coding process provided by MPEG-4 T/F coder other than the MPEG-2 AAC. It is an alternative coding kernel and is mainly designed to be used with the MPEG-4 scalable T/F audio coder. This TwinVQ tool is capable of providing good coding performance at extremely low-bit-rates for general types of audio signal, including music. Two steps involved in TwinVQ: Spectral normalization: flatten the amplitudes of spectral coefficients to a desirable range by using the LPC technique. An LPC model is used for representing the overall coarse spectral envelope of the input signal. The LPC model parameters are quantized and sent to the decoder as side information. Weighted vector quantization: the flattened spectral coefficients are first interleaved and divided into subvectors. Perceptual shaping of the quantization noise can be achieved by using an adaptive weighted distortion measure that is controlled by a perceptual model. Vector quantization is then applied on the weighted subvector with equal bit allocation for all subvectors. 51

52 MPEG-4 Audio
[Figure: block diagram of the TwinVQ quantization scheme]

53 MPEG-4 Audio
Low-Delay AAC (AAC-LD): the standard MPEG-4 T/F coder provides excellent coding performance for general audio at low bit rates, but there is a penalty to pay: its algorithmic delay can be up to several hundred milliseconds, which is not well suited to applications requiring low delay, such as real-time bidirectional communication. A low-delay version of MPEG-4 AAC LTP was developed to enable coding with an algorithmic delay down to 20 ms. It uses a frame length of 512 or 480 samples at 48 kHz sampling in low-delay mode. The look-ahead delay is avoided by disabling window switching: a sine window is used for non-transient parts of the signal, whereas a low-overlap window is applied to transient signals so that optimum TNS performance can be achieved and the effects of temporal aliasing from the MDCT filter bank are reduced. The use of the bit reservoir is minimized or disabled at the encoder.

54 MPEG-4 Audio
MPEG-4 scalable audio coding: the traditional perceptual audio coder is designed for a fixed target bit rate, which is not appropriate for applications where bit streams must be distributed over transmission channels with time-varying capacity. MPEG-4 audio provides a scalability functionality, integrated into the AAC coding framework, that generates a single bit stream which adapts to different channel characteristics.
Large-step scalable audio coding: the generated bit stream is composed of several partial bit streams that can be independently decoded and then combined. The base-layer coder is the first coder in the scheme and codes the input signal at a basic perceptual quality. The residual obtained by subtracting a local decoder's output from the input is coded by an enhancement-layer coder. The process continues to refine the signal by adding more enhancement layers to the bit stream until the target bit rate is reached.

55 MPEG-4 Audio
[Figures: architecture of the MPEG-4 large-step scalable audio encoder and decoder]

56 MPEG-4 Audio
MPEG-4 High Efficiency AAC (HE-AAC): an extension of low-complexity AAC (AAC-LC) optimized for low-bit-rate applications such as streaming audio. The HE-AAC version 1 profile uses spectral band replication (SBR) to enhance compression efficiency in the frequency domain. The HE-AAC version 2 profile couples SBR with Parametric Stereo (PS) to enhance the compression efficiency of stereo signals. It is a standardized and improved version of the aacPlus codec. HE-AAC is used in the DAB+ digital radio standard.
Spectral Band Replication: based on extending the harmonic spectrum of the low-frequency band into the high-frequency band. The codec encodes and transmits the low and mid frequencies of the spectrum, while SBR reconstructs the higher-frequency content at the decoder by transposing harmonics up from the low and mid frequencies; some guidance information for reconstructing the high-frequency spectral envelope is transmitted as side information.
Parametric Stereo: performs sparse coding in the spatial domain, somewhat as SBR does in the frequency domain. The stereo audio is down-mixed to mono, and additional PS side information, about 2.3 kbps, is sent along with the encoded mono stream.

57 Other Lossy Audio Waveform Coders
AC-3 (Dolby Digital) was preceded by the AC-1 and AC-2 codecs.
AC-1 uses adaptive delta modulation combined with analog companding. It is not a perceptual coder and has approximately a 3:1 compression ratio. It is used in satellite relays of television and FM programming as well as cable radio services.
AC-2 is a perceptual coder. It uses a 512-point FFT with 50% overlap and a Kaiser-Bessel window. Coefficients are grouped into subbands containing 1 to 15 coefficients to model critical bandwidths. It provides high-quality audio at a data rate of 256 kbps per channel. AC-2 is a registered .wav type, so AC-2 files are interchangeable between computer platforms.
AC-3 (Dolby Digital) is an outgrowth of the AC-2 encoding format. It is in widespread use in commercial cinema and is widely used to convey multichannel audio in applications such as DVD-Video, DTV and DBS. AC-3 is a perceptual coder. It can code from 1 to 6 channels as 3/2, 3/1, 3/0, 2/2, 2/1, 2/0 or 1/0, plus an optional LFE channel. It codes 5.1 (i.e., six) channels at 48 kHz at a nominal rate of 384 kbps (compression ratio about 13:1) and also supports bit rates from 32 to 640 kbps. The AC-3 coder is backward compatible with matrix surround sound, two-channel stereo and monaural formats.

58 Other Lossy Audio Waveform Coders
AC-3 coder description:
Filter bank: implements the MDCT using a 512-point transform yielding 256 spectral coefficients, with a Kaiser-Bessel-based window and 50% overlap. There is a total of 50 bands between 0 and 24 kHz; the bandwidths vary between 3/4 and 1/4 of the critical-bandwidth values. A transient detector can dynamically reduce the transform length from 512 to 256 for wideband transient signals.
Spectral coefficient quantization: each frequency coefficient is processed in a floating-point representation with a mantissa (0 to 16 bits) and a 5-bit exponent to maintain dynamic range. The coded exponents act as scale factors for the mantissas and represent the signal's spectral envelope. The spectral envelope is coded differentially from the lower-frequency adjacent bands; the first (DC) term is coded as an absolute value, and each differential is quantized to one of 5 possible values. The differential exponents are grouped for coding, e.g., a group of three differentials is coded in a 7-bit word; the choice of grouping depends on the signal characteristics, trading frequency against time resolution. The bit allocation for coding the mantissas follows the masking criteria; assignment is performed globally across all channels from a common bit pool. Quantized mantissas are scaled and offset, and dither is optionally employed when zero bits are allocated to a mantissa.

59 Other Lossy Audio Waveform Coders
[Figure: AC-3 encoder block diagram]

60 Psychoacoustics Coding Other Lossy Audio Waveform Coders Digital Theater System (DTS) The DTS perceptual coding algorithm (known as Coherent Acoustics) nominally codes 5.1 channels at a bit rate of 1.5 Mbps. It can operate over a range of bit rates (e.g., 8 to 512 kbps/channel), sampling frequencies (e.g., 24 to 192 khz) and resolution (e.g., 16 to 25 bits). It is a subband ADPCM coder with 32 uniform subbands. Input signal is divided into frame of 256, 512, 1024, 2048 or 4096 samples depending on the sampling frequency and output bit rate. Each subband is ADPCM coded. Audio signal is examined for psychoacoustic and transient information. A global bit management system allocates bits over all the coded subbands. The algorithm calculates scale factors and bit allocation indices and ultimately quantizes the ADPCM samples using from 0 to 24 bits. Words can be represented using variable-length entropy coding. The LFE channel is coded independently by decimating a full-bandwidth input, yielding a LFE bandwidth; ADPCM coding is then applied. 60

61 Performance Evaluation of Perceptual Audio Coders
Traditional audio devices are measured according to their small deviations from linearity; perceptual coders are highly nonlinear. The best objective test for a perceptual coder is an "artificial ear": to measure perceived accuracy, the algorithm contains a model that emulates the human hearing response.
Subjective listening test - Mean Opinion Score (MOS): subband coders can have unmasked quantization noise that appears as a burst of noise in a processing block; a coder with a long block length can exhibit a pre-echo burst of noise just before a transient, or there might be a tinkling sound. Expert listeners can be engaged to perform the subjective listening tests. The CCIR has developed a five-point impairment scale for the subjective evaluation of compression algorithms:
5 - Imperceptible
4 - Perceptible, but not annoying
3 - Slightly annoying
2 - Annoying
1 - Very annoying
Panels of listeners rate the compression algorithms on a continuous scale from 5.0 to 1.0. Original uncompressed material may receive an average score of 4.8 on this scale; when a coder also obtains an average score of 4.8, it is said to be transparent. Lower scores indicate how far from transparency a coder is; higher compression ratios generally score lower.

62 Performance Evaluation of Perceptual Audio Coders
Coding margin as noise-to-mask ratio (NMR): the original signal and the error signal (the difference between the original and coded signals) are subjected to FFT analysis, and the resulting spectra are divided into subbands. The masking threshold (maximum masked error energy) in each original-signal subband is estimated; the actual error energy in each coded-signal subband is determined and compared with the masking threshold; the ratio of error energy to masking threshold is the NMR in each subband. A positive NMR (in dB) indicates an audible artifact. The NMR values can be linearly averaged and expressed in dB; the NMR thus measures the remaining audibility headroom in the coded signal. NMR can be plotted over time to identify areas of coding difficulty. A masking flag, generated whenever the NMR exceeds 0 dB in a subband, can be used to count the impairments in a coded signal.
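A small sketch of this measurement, assuming the per-subband masking thresholds have already been obtained from a psychoacoustic model (here they are simply given as an array):

```python
import numpy as np

def nmr_db(original, coded, mask_energy, n_bands=8):
    """Per-band noise-to-mask ratio in dB from FFT analysis of the error signal."""
    err_spec = np.abs(np.fft.rfft(original - coded)) ** 2
    bands = np.array_split(err_spec, n_bands)              # crude uniform subbands
    err_energy = np.array([b.sum() for b in bands])
    return 10.0 * np.log10(err_energy / mask_energy)

rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
x_coded = x + 0.01 * rng.standard_normal(1024)             # pretend coding error
mask = np.full(8, 10.0)                                    # assumed masked energy per band
nmr = nmr_db(x, x_coded, mask)
print("mean NMR (dB):", nmr.mean(), "masking flags:", int(np.sum(nmr > 0.0)))
```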

63 Performance Evaluation of Perceptual Audio Coders
Perceptual Evaluation of Audio Quality (PEAQ): PEAQ is a standardized algorithm for objectively measuring perceived audio quality, developed by the ITU in 2001 (ITU-R Recommendation BS.1387). It uses software to simulate the perceptual properties of the human ear and then integrates multiple model output variables (MOVs) into a single metric. PEAQ characterizes the perceived audio quality as subjects would in a listening test conducted according to the corresponding ITU-R BS listening-test methodology. PEAQ results principally model mean opinion scores (MOS) covering a scale from 1 (bad) to 5 (excellent).

64 Performance Evaluation of Perceptual Audio Coders
Perceptual Evaluation of Audio Quality (PEAQ): the model follows the fundamental properties of the auditory system and its different stages of physiological and psychoacoustic effects. The first part models the peripheral processing of the signal with a DFT and filter banks; the other part provides cognitive processing, as the human brain does. From the model comparison of the test signal with the (original) reference signal, a number of model output variables (MOVs) are derived; each MOV may measure a different psychoacoustic dimension. In the final stage the MOV values are combined to produce a MOS-like result that corresponds to subjective quality assessment. Two model versions are defined in the standard, basic and advanced: the basic version uses 11 MOVs for the final mapping to a quality measure, whereas the advanced version uses 5 MOVs.

65 Lossless Audio Coding: Audio Coding Formats for the Compact Disc (CD) and Digital Versatile Disc (DVD)
The compact disc: the CD-DA (Digital Audio) standard was developed by Philips and Sony and first introduced as a commercial product in 1982. It was followed by CD-ROM (1984), CD-I (1986), CD-WO (1988), Video CD (1994), CD-RW (1996) and SACD (1999). It uses a 780 nm laser and stores about 780 MB.
Digital Versatile Disc (DVD): uses a 650 nm laser with 4.7 GB storage per layer.
Blu-ray disc: introduced in 2006, uses a blue-violet laser (405 nm), with 27 GB on a single layer and 54 GB on a dual-layer disc; supported by Apple, Hitachi, Philips, Samsung, Sharp and Sony.
HD-DVD: 15 GB storage, by Toshiba; it lost the format war, so production has ended.

66 Audio Coding Formats for CD and DVD
CD-DA format: 16-bit PCM data sampled at 44.1 kHz, giving a bit rate of 1.41 Mbps; additional overhead such as error correction, synchronization and modulation is required. A disc holds 784 MB of user information, or 74 minutes of stereo audio. Information is contained in pits impressed into the disc's plastic substrate. The disc diameter is 12 cm; a pit is about 0.6 µm wide and a disc may hold about two billion of them. Each pit edge represents a binary 1; flat areas between pits, or areas within pits, are decoded as binary 0s. Data is read from the disc as a change in intensity of the reflected laser light. The pits are aligned in a spiral track running from the inside diameter of the disc to the outside; there are 22,188 revolutions across the disc's signal surface of 35.5 mm. The data are encoded with eight-to-fourteen modulation (EFM) for greater storage density, and with the Cross-Interleaved Reed-Solomon Code (CIRC) for error correction. EFM is an efficient and highly structured (2,10) run-length-limited (RLL) code; it is very tolerant of imperfections, provides very high density, and promotes stable clock recovery by a self-clocking decoder.

67 Audio Coding Formats for CD and DVD
CD-DA format (continued): all data on a CD is formatted in frames; all of the required data is placed into the frame format during encoding. Each frame consists of 588 channel bits. Six 32-bit PCM audio samples (left and right channels) are grouped into a frame, eight 8-bit parity symbols are generated per frame, and one subcode symbol is added per frame. Subcodes contain information describing where tracks begin and end, track numbers, disc timing, index points and other parameters. There are eight subcode channels (P, Q, R, S, T, U, V and W); CD-DA uses only the P and Q subcodes. The data in frame format is modulated using EFM, which increases the data density and helps facilitate control of the spindle-motor speed. Blocks of 14 channel bits are linked by three merging bits to maintain the proper run length between words, as well as to suppress dc content and aid clock synchronization.

68 Audio Coding Formats for CD and DVD
CD-DA format: a frame contains 588 channel bits. The three merging bits added between consecutive EFM words are also chosen to drive the running Digital Sum Value (DSV) towards zero.
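The 588-bit figure can be checked from the frame contents listed above; a short script doing the arithmetic (the 24-bit sync pattern plus its merging bits is the one item not spelled out in the text and is added here as an assumption):

```python
# CD frame bit budget (channel bits after EFM)
audio_bytes = 6 * 4          # six 32-bit (L+R 16-bit) PCM samples = 24 data bytes
parity_bytes = 8             # eight 8-bit CIRC parity symbols
subcode_bytes = 1            # one subcode symbol per frame
symbols = audio_bytes + parity_bytes + subcode_bytes   # 33 eight-bit symbols
channel_bits = symbols * (14 + 3)    # each symbol -> 14 EFM bits + 3 merging bits
channel_bits += 24 + 3               # frame sync pattern + its merging bits (assumption)
print(channel_bits)                  # 588
```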

69 Audio Coding Formats for CD and DVD
Super Audio CD (SACD): introduced by Philips and Sony in 1999. Supports discrete-channel (two-channel and multichannel) audio recording and uses the one-bit Direct Stream Digital (DSD) coding method. It has the same dimensions as a CD (12 cm diameter and 1.2 mm thickness); the laser wavelength is 650 nm and the pit length 0.4 µm. It holds 4.7 GB of data; for two-channel stereo this provides about 110 minutes of playing time. There are single-layer, dual-layer and hybrid disc constructions (4.7 GB per layer). All SACD discs incorporate an invisible watermark that is physically embedded in the substrate of the disc; the watermark is used for mutual authentication of the player and the disc, and an SACD player will reject any disc that does not bear an authentic watermark. SACD players can play back both SACD and CD discs, and text and graphics can be included on an SACD disc.
Direct Stream Digital (DSD) coding for SACD: a one-bit pulse-density representation produced by sigma-delta modulation. It is similar to a one-bit sigma-delta A/D converter, but DSD does not apply decimation filtering; instead, the original sampling frequency is retained and the one-bit data is recorded directly on the disc.

70 Audio Coding Formats for CD and DVD Lossless Audio Coding Direct Stream Digital (DSD) coding for SACD DSD does not employ interpolation (oversampling) filtering in the playback process. DSD uses a sampling frequency that is 64 times 44.1 kHz, or 2.8224 MHz, with one-bit quantization. The overall bit rate is 4 times higher than on a CD. A lossless coding algorithm known as Direct Stream Transfer (DST) has been adopted for the SACD format; it uses an adaptive prediction filter and arithmetic coding to effectively double the disc capacity. Eight DSD channels (6 surround plus a stereo mix) on a 4.7 GB data layer allow a playing time of 27 minutes; with DST compression, a 74-minute playing time is accommodated. Sigma-Delta Modulation in SACD The block H(z) represents the noise-shaping filter. Its function is to shape the quantization noise, introduced by the coarse 1-bit quantization, in such a way that virtually all of the quantization noise falls outside the 0-20 kHz frequency band. H(z) is required to be a low-pass filter with very high gain in the audio band. In order to achieve a dynamic range that is good enough for SACD, e.g., 120 dB, the order of the noise-shaping filter should be at least five. Advantages: the high sampling frequency reduces wrap-around (aliasing) of high harmonics caused by nonlinear processing, and it reduces distortion due to warping and ringing caused by sharp anti-alias filters. Disadvantage: the signal must be converted to linear PCM format for post-editing. 70
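A minimal first-order sigma-delta modulator, shown only to illustrate the noise-shaping idea described above; the actual SACD modulator uses a noise-shaping loop of at least fifth order, which this sketch does not attempt to reproduce:

```python
import numpy as np

# First-order sigma-delta modulator (illustration only, not the SACD loop).
fs = 64 * 44_100                          # DSD sampling rate, 2.8224 MHz
t = np.arange(200_000) / fs
x = 0.5 * np.sin(2 * np.pi * 1_000 * t)   # 1 kHz test tone, amplitude 0.5

y = np.empty_like(x)
integrator = 0.0
for n, xn in enumerate(x):
    integrator += xn - (y[n - 1] if n else 0.0)   # feed back previous 1-bit output
    y[n] = 1.0 if integrator >= 0 else -1.0       # coarse 1-bit quantizer

# The 1-bit stream y still carries the tone; most of the quantization noise
# is pushed far above the 0-20 kHz audio band and can be removed by a
# low-pass filter at the receiver.
```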

71 Audio Coding Formats for CD and DVD Direct Stream Transfer (DST) Lossless Audio Coding DST is a lossless compression scheme with a compression ratio of at least 2.7; the resulting bit rate is between 5.6 and 8.4 Mb/s for multi-channel audio. Unlike linear PCM, DSD data is one-bit data, so standard lossless audio compression algorithms that work on multi-bit PCM data cannot be used. The main building blocks of DST are framing, prediction, and entropy encoding. Framing: 37,632 bits per frame. Predictor: a look-ahead linear predictor whose order ranges between 1 and 128; although it predicts a 1-bit signal, the predicted signal itself is a multi-bit signal. Probability table: based on the predicted multi-bit signal, an error probability table is calculated. The difference between the quantized predicted signal (converted from multi-bit back to 1 bit) and the input signal gives the error signal, which is then encoded by an entropy coder. The filter coefficients and the entropy-coded error signal are packed and stored on the disc. 71
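As a small check, the 37,632-bit frame size quoted above corresponds to exactly 1/75 of a second of a single DSD channel:

```python
# DST frame size: one frame = 1/75 s of one DSD channel.
dsd_rate = 64 * 44_100           # 2,822,400 one-bit samples per second
print(dsd_rate // 75)            # 37,632 bits per frame
```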

72 Audio Coding Formats for CD and DVD Digital Versatile Disc (DVD) Lossless Audio Coding The DVD family of formats was developed by a consortium of manufacturers known as the DVD Forum. A preliminary DVD format was announced in 1995. The DVD family includes formats for video, audio, and computer applications. The DVD-Video and DVD-ROM families were first introduced in 1996. 72

73 Audio Coding Formats for CD and DVD Lossless Audio Coding Physical specification of DVD Uses the same diameter (120 mm) and thickness (1.2 mm) as CD. The read-only formats for DVD-Video, DVD-Audio and DVD-ROM share the same disc construction, modulation code and error correction. Single- or dual-layer-per-substrate construction. Track pitch 0.74 µm, minimum pit length 0.4 µm, laser wavelength 635 or 650 nm (CD: 780 nm). A DVD layer can store 4.7 GB of data; multiple layers provide greater capacity. DVD data layers are embedded deep within the disc and are thus less vulnerable to damage than those of a CD. DVD-Audio The first version was finalized in 1999. All DVD-Audio discs must contain an uncompressed or MLP-compressed LPCM version of the DVD-Audio portion of the program. DVD-Audio discs may also include video programs with Dolby Digital, DTS and/or LPCM tracks. Two types of DVD-Audio discs are defined. An Audio-only disc contains only music information; it can optionally include still pictures (one per track), text information, and a visual menu. An Audio with Video (AV) disc can contain motion-video information formatted as a subset of the DVD-Video format. 73

74 Audio Coding Formats for CD and DVD Coding Formats Supported for DVD-Audio Lossless Audio Coding

Parameter          | DVD-Audio                              | SACD         | CD
Audio coding       | 16-, 20-, or 24-bit LPCM               | 1-bit DSD    | 16-bit LPCM
Sampling rate      | 44.1, 48, 88.2, 96, 176.4, or 192 kHz  | 2,822.4 kHz  | 44.1 kHz
Channels           | up to 6                                | up to 6      | 2
Compression        | Yes (MLP)                              | Yes (DST)    | None
Content protection | Yes                                    | Yes          | No
Playback time      | 62-843 min *                           | —            | 74 min
Frequency response | DC-96 kHz                              | DC-100 kHz   | DC-20 kHz
Dynamic range      | Up to 144 dB                           | Over 120 dB  | 96 dB

* For 62 min, 96 kHz sampling, 20-bit samples, and five channels are assumed. For 843 min, 44.1 kHz sampling, 16-bit samples, and one channel are assumed. 74
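The footnote playing times can be checked with simple arithmetic; the sketch below assumes a usable audio payload of about 4.46 GB (slightly below the nominal 4.7 GB), which is the value implied by the quoted 62- and 843-minute figures:

```python
# Rough check of the footnote playing times (straight LPCM, no overhead).
def minutes(capacity_bytes, fs, bits, channels):
    rate_bytes_per_s = fs * bits * channels / 8
    return capacity_bytes / rate_bytes_per_s / 60

payload = 4.46e9                               # assumed usable payload in bytes
print(minutes(payload, 96_000, 20, 5))         # ~62 min
print(minutes(payload, 44_100, 16, 1))         # ~843 min
```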

75 Lossless Audio Coding Audio Coding Formats for CD and DVD Parameter Configurations Supported for DVD-Audio (table of the allowed sampling-rate, word-length and channel combinations; not reproduced here). 75

76 Audio Coding Formats for CD and DVD Meridian Lossless Packing (MLP) Coding for DVD-Audio MLP is the lossless coding standard for DVD-Audio; it allows 74 minutes of high-quality multichannel music to be recorded on a single-layer 4.7 GB DVD. Problems with lossless coding: it generally achieves a low compression ratio; the compression ratio achieved depends very much on the data (low compression for random signals, high compression for silent or near-silent signals); and it produces a variable data rate on normal audio content. Fortunately, it turns out that real acoustic signals tend not to present full-scale white noise in all channels for any significant duration. Lossless Audio Coding MLP tackles this by attempting to maximize the compression at all times using this set of techniques: looking for "dead air"; exploiting channels that do not exercise all the available word length; exploiting channels that do not use the available bandwidth; removing interchannel correlations; efficiently coding the residual information; and smoothing the coded data rate by buffering. 76

77 Audio Coding Formats for CD and DVD MLP Encoder Important novel techniques used: lossless processing; lossless matrixing; lossless use of infinite impulse response (IIR) filters; managed first-in, first-out (FIFO) buffering across transmission; decoder lossless self-check; and operation on heterogeneous channels and sampling frequencies. Lossless Audio Coding 77

78 Audio Coding Formats for CD and DVD MLP Encoding Incoming channels may be re-mapped to optimize the use of substreams. The MLP stream contains a hierarchical structure of substreams; incoming channels can be matrixed into two (or more) substreams. This method allows simpler decoders to access a substream of the overall signal. Each channel is shifted to recover unused capacity, e.g., less than 24-bit precision or less than full scale. A lossless matrix technique optimizes the channel use by reducing the interchannel correlations. The signal in each channel is decorrelated using a separate predictor for each channel. The decorrelated audio is further optimized using entropy coding. Each substream is buffered using a FIFO memory system to smooth the encoded data rate. Multiple data substreams are interleaved. Lossless Audio Coding The stream is packetized at a fixed or variable data rate for the target carrier. 78

79 Audio Coding Formats for CD and DVD MLP Encoding: Lossless Matrix Lossless Audio Coding In general, the encoded data rate is minimized by reducing the commonality between channels, e.g., by rotating a stereo mix from left/right to sum/difference. Conventional matrixing is not lossless, since the inverse matrix reconstructs the original signals with rounding errors. The MLP encoder therefore decomposes the general matrix into a cascade of affine transformations. Each affine transformation modifies just one channel by adding a quantized linear combination of the other channels. If the encoder subtracts a particular linear combination, then the decoder must add it back. The quantizers Q ensure constant input-output word width and lossless operation on different computing platforms. A Single Lossless Matrix Encode and Decode (figure) 79
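The following sketch illustrates one lossless affine matrixing step of the kind described above. The coefficient value and the quantizer (a simple fixed-point shift) are assumptions for illustration, not the MLP syntax; the point is that the decoder recomputes exactly the same quantized value from the untouched channels and adds it back, so the integer samples are recovered bit-exactly:

```python
import numpy as np

def encode_step(x, target, coeffs, shift=14):
    """Return a copy of x in which channel `target` has a quantized linear
    combination of the other channels subtracted (one affine encode step)."""
    y = x.copy()
    others = [c for c in range(x.shape[0]) if c != target]
    acc = np.zeros(x.shape[1], dtype=np.int64)
    for c, a in zip(others, coeffs):
        acc += a * x[c]                  # fixed-point coefficient products
    q = acc >> shift                     # quantizer Q (integer shift)
    y[target] = x[target] - q            # encoder subtracts
    return y

def decode_step(y, target, coeffs, shift=14):
    """Inverse step: recompute the identical quantized value and add it back."""
    x = y.copy()
    others = [c for c in range(y.shape[0]) if c != target]
    acc = np.zeros(y.shape[1], dtype=np.int64)
    for c, a in zip(others, coeffs):
        acc += a * y[c]                  # the other channels are untouched
    q = acc >> shift
    x[target] = y[target] + q            # decoder adds it back
    return x

rng = np.random.default_rng(0)
x = rng.integers(-2**23, 2**23, size=(2, 1024), dtype=np.int64)   # 24-bit samples
coeffs = [11585]                                                   # ~0.707 in Q14 (assumed value)
y = encode_step(x, target=0, coeffs=coeffs)
assert np.array_equal(decode_step(y, target=0, coeffs=coeffs), x)  # exactly lossless
```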

80 Audio Coding Formats for CD and DVD MLP Encoding: Prediction The function of a decorrelator (predictive filter) is to decorrelate the input signal such that there is no correlation between the currently transmitted difference signal and its previous values. A decorrelator can make significant gains by flattening the spectrum of the audio signal; ideally, the transmitted difference signal should have a flat (white-noise-like) spectrum. The average power of the decorrelated difference signal is significantly lower than that of the original signal, hence the reduction in data rate. The MLP encoder uses a separate predictor for each encoded channel. The encoder is free to select IIR or FIR filters up to eighth order from a wide palette. Most lossless compression schemes use FIR filters; however, IIR filters have advantages in some situations where control of the peak data rate is important and the input spectrum exhibits an extremely wide dynamic range. (Figure: lowest curve, MLP nominal compression rate; middle curve, lossless matrix switched off; upper curve, constrained to FIR prediction only; the top line shows the 9.6 Mb/s data-rate limit for DVD-Audio.) Lossless Audio Coding 80

81 Audio Coding Formats for CD and DVD MLP Encoding: Entropy Coding Once the cross-channel and inter-sample correlations have been removed, it remains to encode the individual samples of the decorrelated signal as efficiently as possible. Audio signals, even after decorrelation, tend to be peaky, with a distribution resembling a Laplacian (a two-sided decaying exponential); there is therefore a coding gain to be had from entropy coding. The MLP encoder may choose from a number of entropy coding methods, including Huffman and arithmetic coding. Buffering Lossless Audio Coding Normal audio signals can be well predicted; however, there are occasional fragments, such as sibilants, synthesized noise, or percussive events, that have high entropy. MLP uses a particular form of stream buffering that can reduce the variations in transmitted data rate, absorbing transients that are hard to compress. FIFO memory buffers are used in the encoder and decoder. These buffers are configured to give a constant notional delay across encode and decode; this overall delay is small, typically of the order of 75 ms. FIFO management minimizes the part of the delay due to the decoder buffer, so this buffer is normally empty and fills only ahead of sections with a high instantaneous data rate. During these sections, the decoder's buffer empties and is thus able to deliver data to the decoder core at a higher rate than the transmission channel is able to provide. In the context of a disc, this strategy has the effect of moving excess data away from the stress peaks. The encoder can use the buffering for a number of purposes, e.g., keeping the data rate below a preset (format) limit, or minimizing the peak data rate over an encoded section. 81
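To see why entropy coding pays off on Laplacian-shaped residuals, the sketch below measures the empirical entropy of synthetic integer residuals; the scale and word length are assumed values, and this does not use MLP's actual Huffman tables:

```python
import numpy as np

# Empirical entropy of Laplacian-like integer residuals vs. a fixed word length.
rng = np.random.default_rng(1)
residual = np.rint(rng.laplace(scale=200.0, size=100_000)).astype(int)

values, counts = np.unique(residual, return_counts=True)
p = counts / counts.sum()
entropy = -(p * np.log2(p)).sum()
print(f"empirical entropy: {entropy:.1f} bits/sample")   # ~10 bits for this scale
print("fixed word length: 16 bits/sample")               # what straight packing would cost
```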

82 Audio Coding Formats for CD and DVD MLP Encoding: Lossless Audio Coding Use of Substreams The MLP stream contains a hierarchical structure of substreams. Incoming channels can be matrixed into two (or more) substreams. This method allows simpler decoders to access a subset of the overall signal. 82

83 Audio Coding Formats for CD and DVD MLP Encoding: Two-Channel Downmix Lossless Audio Coding It is often useful to provide a means for accessing high-resolution multichannel audio streams on two-channel playback devices. In an application such as DVD-Audio, the content provider can place separate multichannel and two-channel streams on the disc; however, doing so requires separate mix, mastering, and authoring processes and uses more disc capacity. In cases where only one multichannel stream is available, a fixed or guided downmix is required, which means it is first necessary to decode the full multichannel signal. MLP provides an elegant and unique solution: the encoder combines lossless matrixing with the use of two substreams in such a way as to optimally encode both the L0/R0 downmix and the multichannel version. Two Substreams in Encoding (figure) 83

84 Audio Coding Formats for CD and DVD MLP Encoding: Downmix in the Lossless Encoder Lossless Audio Coding Downmix instructions are fed to matrix 1 to determine some of the coefficients for the lossless matrix. The matrices then perform a rotation such that the two channels in substream 0 decode to the desired stereo mix and combine with substream 1 to provide the full multichannel signal. Decoding Two Substreams Because the two-channel downmix is a linear combination of the multichannel mix, strictly speaking no new information has been added; in practice there is only a modest increase in overall data rate (typically 1 bit per sample). The advantages of this method are: the quality of the mix-down is guaranteed, and the producer can listen to it at the encoding stage; a two-channel-only playback device does not need to decode the multichannel stream and then perform a mix-down, since the lossless decoder only needs to decode substream 0; a more complex decoder may access both the two-channel and multichannel versions losslessly; and the downmix coefficients do not have to be constant for a whole track, but can be varied under artist control. 84

85 Audio Coding Formats for CD and DVD MLP Compression Rate Lossless Audio Coding (Table: peak and average data-rate reduction, in bits/sample/channel, for each supported sampling rate; not reproduced here.) The MLP decoder is of relatively low complexity: a decoder capable of extracting a two-channel stream at 192 kHz requires approximately 27 MIPS, while 40 MIPS is required to decode 6 channels at 96 kHz. Playing Time on DVD-Audio A DVD-Audio disc holds approximately 4.7 GB of data and has a maximum data transfer rate of 9.6 Mb/s for an audio stream. Six channels of 96 kHz/24-bit LPCM audio have a data rate of 13.824 Mb/s, which is well in excess of 9.6 Mb/s; in addition, at 13.824 Mb/s the data capacity of the disc would be used up in approximately 45 min. MLP meets the industry-norm requirement of 74 min. Here are some examples of playing times that can be obtained: 5.1 channels, 96 kHz/24-bit: 100 min; 6 channels, 96 kHz/24-bit: 86 min; 2 channels, 96 kHz/24-bit: 4 hours; 2 channels, 192 kHz/24-bit: 2 hours; 2 channels, 44.1 kHz/16-bit: 12 hours; 1 channel, 44.1 kHz/16-bit: 25 hours. 85
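The LPCM figures above can be verified directly:

```python
# Uncompressed 6-channel 96 kHz / 24-bit LPCM against the DVD-Audio limits.
rate = 6 * 96_000 * 24                 # LPCM bit rate in bit/s
print(rate / 1e6)                      # 13.824 Mb/s, well above the 9.6 Mb/s limit

disc_bits = 4.7e9 * 8                  # nominal single-layer capacity in bits
print(disc_bits / rate / 60)           # ~45 minutes before the disc is full
```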

86 MPEG-4 Audio Lossless Coding (MPEG-4 ALS) Lossless Audio Coding MPEG-4 ALS is an extension to the MPEG-4 Part 3 audio coding standard (ISO/IEC 14496-3) that allows lossless compression. MPEG-4 ALS Encoder (block diagram) MPEG-4 ALS Decoder (block diagram) 86

87 MPEG-4 Audio Lossless Coding (MPEG-4 ALS) MPEG-4 ALS Predictor Lossless Audio Coding The predictor used in MPEG-4 ALS is a combination of a short-term and a long-term predictor. The long-term predictor aims to remove the long-range time correlation in audio signals, which are largely periodic in nature; it operates around a tap delay (lag) with M non-zero taps (M = 5 in MPEG-4 ALS). The short-term predictor is an order-K linear predictor; the predicted signal is computed from previous samples using coefficients represented in fixed point with Q bits. The computation of the linear predictor coefficients can be done using standard techniques such as the autocorrelation approach with the Levinson-Durbin algorithm. In MPEG-4 ALS, the PARCOR (reflection) coefficients are quantized to 8 bits each after an arcsine transformation. 87
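A minimal sketch of the two-stage predictor structure described above. The coefficient values, lag, tap gains, and the exact way MPEG-4 ALS cascades and quantizes the two stages are assumptions for illustration, not the normative specification:

```python
import numpy as np

# Toy correlated integer signal standing in for audio samples.
rng = np.random.default_rng(2)
x = np.cumsum(rng.integers(-50, 50, size=4_000)).astype(np.int64)

def short_term_predict(sig, n, coeffs, Q=15):
    """Order-K linear predictor with Q-bit fixed-point coefficients."""
    acc = sum(int(c) * int(sig[n - k - 1]) for k, c in enumerate(coeffs))
    return acc >> Q                               # integer division by 2**Q

def long_term_predict(res, n, lag, gains):
    """Five-tap long-term predictor centred on the lag (M = 5 non-zero taps)."""
    return int(round(sum(g * res[n - lag + j] for g, j in zip(gains, range(-2, 3)))))

coeffs = [24000, -9000, 3000]                     # hypothetical Q15 short-term coefficients
e = np.array([int(x[n]) - short_term_predict(x, n, coeffs)
              for n in range(3, len(x))], dtype=np.int64)     # short-term residual

lag, gains = 400, [0.1, 0.2, 0.4, 0.2, 0.1]       # hypothetical lag and tap gains
n = 1000
e2 = int(e[n]) - long_term_predict(e, n, lag, gains)          # residual after the long-term stage
print(e[n], e2)
```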
