
All-American-Advanced-Audio-Codec (AAAAC)

Tim O'Brien, Colin Sullivan, Jennifer Hsu, Mayank Sanganeria

Perceptual Audio Coding, MUSIC 422 / EE 367C
Center for Computer Research in Music and Acoustics, Stanford University

1 Project Overview

In AAAAC, we aim to deliver perceptually lossless audio quality at minimal data rates. Rather than design for low-latency streaming applications, we assume that encoding delay is acceptable given the realizable gains in audio quality. On top of the baseline MDCT-domain audio codec, we implement Huffman coding, stereo coding, block switching, and a variable bit rate. These are discussed in more detail below.

2 Huffman Coding

Huffman coding is an entropy coding algorithm used for lossless data compression. The term refers to the use of a variable-length code table for encoding source symbols, where the table is derived from the estimated probability of occurrence of each possible symbol value. The more skewed the symbol probabilities, the greater the gains from Huffman coding.

2.1 Motivation

Huffman coding is a lossless compression technique and therefore does not affect the output quality of the coder. As long as the bits saved in transmission exceed the overhead of including information about the Huffman code tables, using Huffman makes the coder better. Though Huffman is not the only form of lossless compression, it is popular among audio coders, which means it has been tried and tested successfully in the past.

2.2 Methodology

As in MPEG-1 Layer III, the frame of data containing the mantissas is split into three regions. The first region contains the low-frequency data, which typically consists of varied values with no significant pattern. The second region contains only 1s and 0s. The third region covers the high-frequency content and is typically all 0s. The third region can be coded by simply specifying the run length of 0s (or inferring the length from the other two regions), while the second region gains by collapsing the 16-bit integers representing 0 and 1 into single-bit codes.
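The three-region split can be sketched as follows (an illustration of the idea, not our codec's actual frame-parsing code):

```python
def split_regions(mantissas):
    """Split a frame of quantized mantissas into the three regions of
    Section 2.2: trailing zeros (region 3), a preceding run of 0/1
    values (region 2), and everything below that (region 1)."""
    n = len(mantissas)
    end3 = n
    while end3 > 0 and mantissas[end3 - 1] == 0:
        end3 -= 1                      # region 3: all zeros
    end2 = end3
    while end2 > 0 and mantissas[end2 - 1] in (0, 1):
        end2 -= 1                      # region 2: only 0s and 1s
    return mantissas[:end2], mantissas[end2:end3], mantissas[end3:]

# Typical frame: varied low-frequency values, then 0/1s, then zeros.
frame = [5, 3, 0, 2, 1, 0, 1, 1, 0, 0, 0, 0]
r1, r2, r3 = split_regions(frame)
assert r1 == [5, 3, 0, 2]      # arbitrary low-frequency values
assert set(r2) <= {0, 1}       # codable with one bit per value
assert r3 == [0, 0, 0, 0]      # codable as a run length
```

Region 3 then needs only its length, and region 2 only one bit per value, leaving region 1 for table-based Huffman coding.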
The first region goes through more traditional Huffman coding. We have not yet decided which code tables to use for this region.

2.3 Harmony with VBR

The Huffman routine adds the bits it saves to the bit reservoir used by the VBR routine, enabling the VBR to work more efficiently by counting the bits saved via Huffman encoding as well.

3 Stereo Coding

Our basic coder codes stereo files in a dual-mono process, encoding each of the two channels separately. Since most signals have strong correlations between their left and right channels, we can reduce redundancy by coding the stereo signals jointly. We implemented Mid/Side (M/S) stereo coding: instead of transmitting the left and right signals, we transmit their sum and difference signals. We can fully recover the original left and right signals by adding or subtracting the sum and difference signals from each other. When the left and right channels are very similar, the difference signal is small and requires fewer bits to encode.
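For the first region of Section 2.2, a Huffman table could eventually be derived from mantissa symbol statistics along these lines (a generic sketch, not the MPEG-style tables we may ultimately adopt):

```python
import heapq
from collections import Counter

def build_huffman_code(symbols):
    """Build a Huffman code table {symbol: bitstring} from a symbol sequence."""
    freq = Counter(symbols)
    if len(freq) == 1:  # degenerate case: only one distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (count, tiebreak, {symbol: partial codeword}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        n0, _, lo = heapq.heappop(heap)   # two least probable subtrees
        n1, _, hi = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in lo.items()}
        merged.update({s: "1" + c for s, c in hi.items()})
        heapq.heappush(heap, (n0 + n1, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Quantized mantissas in a low-frequency region; small values dominate.
region = [0, 0, 1, 0, 2, 1, 0, 0, 3, 1, 0, 0]
table = build_huffman_code(region)
# Frequent symbols (0) get short codewords; rare ones (2, 3) get long ones.
assert len(table[0]) < len(table[3])
```

The more skewed the symbol distribution, the shorter the average codeword, which is exactly why region 1's unpatterned values are the hardest to compress.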

3.1 M/S and L/R Switching

For every block of an input signal, our codec must decide whether to transmit M/S or L/R on a subband-by-subband basis. We transmit the M/S signals if the following inequality holds:

    sum_{k=b_lower}^{b_upper} (l_k - r_k)^2 < 0.8 * sum_{k=b_lower}^{b_upper} (l_k + r_k)^2    (1)

Here l_k and r_k are the kth MDCT frequency lines of the left and right signals, and b_lower and b_upper are the lower and upper indices of the subband. In other words, for every subband of our MDCT, if the energy in the side (difference) signal is sufficiently lower than the energy in the mid (sum) signal, we encode that subband as mid and side information. The factor 0.8 means we switch to M/S mode when the side energy falls below 80% of the mid energy. This is a tunable parameter, although most of the current literature uses the value 0.8. Also, note that we make this decision on the MDCT frequency lines; many other codecs use the FFT instead.

3.2 M/S Encoding and Decoding

Once we have decided to encode a subband of a signal block as M/S, we calculate the encoded subband information as follows:

    M_i = (L_i + R_i) / 2    (2)
    S_i = (L_i - R_i) / 2    (3)

In these equations, L_i and R_i are samples of the left and right signal data, and M_i and S_i are the encoded values. With simple substitutions, we can see that the original left and right channel information is completely recovered:

    L_i = M_i + S_i    (4)
    R_i = M_i - S_i    (5)

3.3 Masking Threshold for Stereo Signals

When calculating our masking thresholds, we first calculate the masking curve for the L/R signals in the same way as our original coder.
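The subband decision of Eq. (1) and the M/S transform of Eqs. (2)-(5) can be sketched as follows (illustrative NumPy, not our codec's implementation):

```python
import numpy as np

def use_mid_side(l, r, threshold=0.8):
    """Per-subband M/S decision from Eq. (1): transmit M/S when the
    side energy is below `threshold` times the mid energy.
    `l`, `r` hold the MDCT lines of one subband."""
    l = np.asarray(l, dtype=float)
    r = np.asarray(r, dtype=float)
    return np.sum((l - r) ** 2) < threshold * np.sum((l + r) ** 2)

def ms_encode(l, r):
    """Eqs. (2)-(3): mid = average, side = half-difference."""
    return (l + r) / 2.0, (l - r) / 2.0

def ms_decode(m, s):
    """Eqs. (4)-(5): perfect reconstruction of L/R."""
    return m + s, m - s

l = np.array([1.0, 0.9, -0.5, 0.3])
r = np.array([1.1, 0.8, -0.4, 0.2])   # highly correlated with l
assert use_mid_side(l, r)             # similar channels favor M/S
m, s = ms_encode(l, r)
l2, r2 = ms_decode(m, s)
assert np.allclose(l, l2) and np.allclose(r, r2)
```

Because the transform is exactly invertible, the lossy part of the coder is unchanged; only the quantized domain differs per subband.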
For each masker that we detect, we use the following spreading function to calculate the overall masking threshold for the L/R channels:

    10 log10 F(z, L_M) = 0                                            for -1/2 <= z <= 1/2
    10 log10 F(z, L_M) = -27 (|z| - 1/2)                              for z < -1/2          (6)
    10 log10 F(z, L_M) = (-27 + 0.367 max(L_M - 40, 0)) (|z| - 1/2)   for z > 1/2

where z = Bark(f_maskee) - Bark(f_masker) and L_M is the sound pressure level (SPL) of the masker in dB. After we calculate this spreading function, we downshift it by some amount delta to emulate perceptual masking. We then calculate the signal-to-mask ratio (SMR). Let SMR_L/R denote the maximum SMR per band for the left and right channels.

Having calculated the above for the L/R signals, we perform a modified version of the procedure to find the masking thresholds and maximum SMRs for the M/S signals. We first calculate basic thresholds on the M/S signals using the same spreading function and downshift as stated above, and denote these thresholds BTHR_M and BTHR_S.
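The spreading function of Eq. (6) can be expressed directly in code (a sketch; the variable names are ours, not the codec's):

```python
import numpy as np

def spread_db(z, masker_spl):
    """Spreading function of Eq. (6), in dB, as a function of the
    maskee-masker Bark distance z and the masker SPL (dB)."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)                  # flat within +/- 1/2 Bark
    lower = z < -0.5
    upper = z > 0.5
    out[lower] = -27.0 * (np.abs(z[lower]) - 0.5)
    out[upper] = (-27.0 + 0.367 * max(masker_spl - 40.0, 0.0)) * (np.abs(z[upper]) - 0.5)
    return out

z = np.array([-2.0, 0.0, 2.0])
quiet = spread_db(z, 40.0)   # at or below 40 dB SPL the slopes are symmetric
loud = spread_db(z, 80.0)    # a louder masker spreads further upward
assert quiet[1] == 0.0
assert loud[2] > quiet[2]    # shallower upper slope at high SPL
```

The level-dependent upper slope is what lets loud maskers hide more high-frequency quantization noise.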

We then calculate the spreading function on the M/S signals again, except this time without the downshift factor delta. Call these thresholds BTHR'_M and BTHR'_S. We then calculate the masking level difference (MLD) factor, which gives us another level of detecting noise across the M/S signals. The equation we used for the MLD is:

    MLD(z) = 10^(1.25 (1 - cos(pi * min(z, 15.5) / 15.5)) - 2.5)    (7)

for a frequency z in Barks. Please see Figure 9 for a plot of the MLD. We incorporate this difference factor by multiplying the basic M/S thresholds (without the downshift) by the MLD calculated at each frequency of our MDCT lines, giving MLD_M = BTHR'_M * MLD and MLD_S = BTHR'_S * MLD. From here, we derive the overall masking threshold for each M/S signal as:

    THR_M = max(BTHR_M, min(BTHR_S, MLD_S))    (8)
    THR_S = max(BTHR_S, min(BTHR_M, MLD_M))    (9)

These equations mean that we use the MLD signal whenever there is a chance of stereo unmasking. We calculate the maximum SMRs per band in the same way as for the left and right masking thresholds, but using the overall thresholds for the M/S signals; denote these SMR_M/S. Last, we go through SMR_L/R and SMR_M/S and create an overall SMR array. For each band, we check whether we will transmit L/R or M/S based on the decision function described above. If we will send the L/R signal for band b, we set SMR[b] = SMR_L/R[b]; otherwise, SMR[b] = SMR_M/S[b]. This overall SMR array is used for simultaneous bit allocation across both channels for maximum bit savings.

3.4 Some results and thoughts

Please see the appendix (Figures 10 and 11) for two plots comparing the MDCT SPL, overall masking threshold, and max SMRs for the L/R signals and the M/S signals. The original signal used to generate these blocks had the same data in the left and right channels. Notice that in the M/S plot, we have extremely low SMRs for the difference signal.
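Equations (7)-(9) can be sketched as follows (illustrative; array shapes and names are assumptions on our part):

```python
import numpy as np

def mld_factor(z):
    """Masking level difference of Eq. (7) for Bark frequency z."""
    z = np.minimum(np.asarray(z, dtype=float), 15.5)
    return 10.0 ** (1.25 * (1.0 - np.cos(np.pi * z / 15.5)) - 2.5)

def ms_thresholds(bthr_m, bthr_s, bthr_m_nodrop, bthr_s_nodrop, z):
    """Eqs. (8)-(9): guard against stereo unmasking by limiting each
    M/S threshold with the MLD-scaled threshold of the opposite channel.
    The *_nodrop thresholds are computed without the downshift delta."""
    mld = mld_factor(z)
    mld_m = bthr_m_nodrop * mld
    mld_s = bthr_s_nodrop * mld
    thr_m = np.maximum(bthr_m, np.minimum(bthr_s, mld_s))
    thr_s = np.maximum(bthr_s, np.minimum(bthr_m, mld_m))
    return thr_m, thr_s

# The MLD is smallest at low Bark values and saturates at z = 15.5.
assert mld_factor(0.0) < mld_factor(15.5)
assert np.isclose(mld_factor(15.5), mld_factor(20.0))
```

The min/max structure means the MLD-scaled threshold only ever lowers a channel's threshold, never raises it above the ordinary masking estimate.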
We will save space by not using bits for that channel; in this case, we would use roughly half the bits needed for the L/R signals. After implementing all of the stereo coding, we found that it works well for some signals (castanets, speech, a single sinusoidal addition tone) but not for others (especially the quartet sample). In the bad cases, the sound seems to travel back and forth between the left and right channels. This may be due to a bug in the code. [1]

4 Block Switching

To provide maximum frequency resolution for quasi-stationary signals while minimizing pre-echo artifacts around transients, we use a block switching algorithm similar to MPEG-1 Layer III. Based on a look-ahead transient detection algorithm, short, long, or transition windows are used (see Figs. 1 and 2). To preserve block synchronicity between the stereo channels, multiple short blocks are run in series so that together they equal one or more long-window blocks. Transitions are handled with asymmetric long-to-short ("start") and short-to-long ("stop") windows.

We model our transient detection algorithm on the one used in Dolby AC-3. That is, we assume that transients are characterized by the sudden presence of broad high-frequency components in the signal spectrum. Thus at each frame, we high-pass the second half of the block and compare its spectrum to a threshold level (tuned for best results) and to the previous block's spectrum values. If the high-frequency spectrum has sufficient energy and is sufficiently different from the previous values, we switch to a block of short windows and pass one bit signifying this change as a control message to the decoder. Similarly, within a block of short windows, we test whether to return to the long-window, steady-state regime or to continue with another block of short windows suitable for a continuing transient. The results of this detection regime for various input audio files are shown in Figs. 3-8.

Figure 1: Transition windows used in block switching.

Figure 2: Typical block switching transition to one set of short blocks, then back to the long block length.

More detailed description to come.

5 Variable Bitrate

Our coder now includes variable-data-rate coding via a bit reservoir similar to MPEG-1 Layer III. This allows the codec to adapt locally to variations in the masking threshold, allocating fewer bits to blocks with heavy masking and drawing extra bits from the reservoir for more complicated blocks.

Implementing VBR under the current infrastructure was fairly straightforward. Most of the work involved understanding how the data is processed and stored in the compressed file. The bit allocation

Figure 3: Transient detection in a recording of castanets.

Figure 4: Transient detection in a monophonic recording of a glockenspiel.

Figure 5: Transient detection in a polyphonic recording of a glockenspiel.

Figure 6: Transient detection in a harpsichord recording.

Figure 7: Transient detection in a recording of a soprano.

Figure 8: Transient detection in a recording of a trumpet. Note that no transients meet the threshold for block switching.
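The transient test of Section 4 can be sketched as follows (the cutoff and thresholds here are illustrative placeholders, not our tuned values):

```python
import numpy as np

def transient_in_block(block, prev_hf_energy, fs=44100.0,
                       cutoff_hz=4000.0, energy_thresh=1e-4, ratio_thresh=4.0):
    """Flag a transient in the second half of `block` by high-passing it
    and checking the high-frequency energy against an absolute threshold
    and against the previous block's value."""
    half = np.asarray(block, dtype=float)[len(block) // 2:]
    spectrum = np.fft.rfft(half * np.hanning(len(half)))
    freqs = np.fft.rfftfreq(len(half), d=1.0 / fs)
    # High-pass by keeping only bins above the cutoff.
    hf_energy = np.sum(np.abs(spectrum[freqs >= cutoff_hz]) ** 2) / len(half)
    is_transient = (hf_energy > energy_thresh and
                    hf_energy > ratio_thresh * prev_hf_energy)
    return is_transient, hf_energy

fs = 44100.0
n = 1024
t = np.arange(n) / fs
quiet = 0.01 * np.sin(2 * np.pi * 440.0 * t)   # steady low tone
click = quiet.copy()
click[n // 2 + 100] += 1.0                     # impulse in the second half
_, prev = transient_in_block(quiet, 0.0, fs)
hit, _ = transient_in_block(click, prev, fs)
assert hit                                      # the click is detected
```

Only the second half of the block is examined because, with look-ahead, a transient there is what forces the switch to short windows for the next block.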

algorithm in our baseline coder was fairly conservative, and the variable bit allocation puts the unused bits to good use. Implementing the feature was simply a matter of passing the unused bits along in the codingparams object so that future calls to bitalloc have these bits added in. This algorithm can certainly be tweaked to use the spare bits even more effectively, but since the SMR-driven bit allocation in the baseline codec is greedy and uses only the bits it needs, the technique works out. Additionally, our codec does not face the same challenges as the bit reservoir described in the book, because we are not concerned with sending frame headers at regular intervals. Our blocks can therefore vary in size and be stored in the file as-is, with a header read between each block. This avoids all the issues associated with pointing to the location where the mantissas for the current block start.

Figures 12 and 13 show the difference in bit allocation between the baseline coder and our coder with the VBR functionality. In the 128 kbps case there is a large variation in the number of bits allocated per block, and in many cases fewer bits than before. In the 256 kbps case, because the original budget contains more bits, the plot is flatter: the reservoir bits were used less.

6 Results

Coded test files at 128 kbps and 256 kbps can be found in quar decoded 128.wav and quar decoded 256.wav; the compressed versions are included as well. Our compression ratios are as expected: 5.42 for 128 kbps and 2.8 for 256 kbps. As mentioned in Section 3.4, there are some stereo artifacts that we have yet to fix. These audio samples were encoded with only the VBR and stereo features of our codec; we are still working on integrating the Huffman encoding and block switching logic.

7 Future Work

How we might improve the coder going forward.

References

[1] J. Johnston and A. Ferreira, "Sum-difference stereo transform coding," in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing (ICASSP-92), vol. 2, pp. 569-572, Mar. 1992.

Figure 9: Masking level difference factor for Bark frequencies.

Figure 10: Left and right channels: MDCT SPL, overall masking threshold, and max SMRs.

Figure 11: Mid and side channels: MDCT SPL, overall masking threshold, and max SMRs.

Figure 12: Difference in bit allocation at 128 kbps (AAAAC minus baseline).

Figure 13: Difference in bit allocation at 256 kbps (AAAAC minus baseline).