Using Noise Substitution for Backwards-Compatible Audio Codec Improvement
Colin Raffel
AES 129th Convention, San Francisco, CA
February 16, 2011
Outline
Introduction and Motivation
Coding Error
Analysis
Synthesis
Example: row-mp3
Introduction and Motivation
Problem: Many widely used audio codecs lag behind the state of the art because they were not designed to be improved upon in a backwards-compatible way
Observation: Adding the coding process's residual, or coding error, back into the coded audio restores lossless audio quality
Technique: Find a way to represent the coding error with a small amount of data and include that information in the coded audio file
Proposal: Store frame-by-frame, per-critical-band residual levels in the audio codec's metadata and re-synthesize the coding error as colored noise when decoding
Coding Error
Achieving lower data rates requires some information loss
We can define coding error as (original audio) - (coded audio)
Coding error tends to be noisy
Modeling it as colored noise is cheap
[Figure: Normalized sample autocorrelation comparison. Legend: original file, coding error. Axes: autocorrelation vs. correlation lag (x 10^4).]
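The definition above can be sketched in a few lines of NumPy. This is only an illustration: the "codec" here is a coarse quantizer standing in for a real encoder, the function names are hypothetical, and the two signals are assumed to be time-aligned. The second function shows one way the normalized sample autocorrelation in the figure might be computed.

```python
import numpy as np

def coding_error(original, coded):
    """Residual = original - coded, sample by sample (signals assumed time-aligned)."""
    original = np.asarray(original, dtype=float)
    coded = np.asarray(coded, dtype=float)
    n = min(len(original), len(coded))
    return original[:n] - coded[:n]

def normalized_autocorrelation(x):
    """Sample autocorrelation for non-negative lags, scaled so that lag 0 equals 1."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    full = np.correlate(x, x, mode="full")
    acf = full[len(x) - 1:]          # keep lags 0 .. len(x) - 1
    return acf / acf[0]

# Toy stand-ins (assumptions): a noisy sine as the "original",
# coarse quantization as the lossy "codec".
rng = np.random.default_rng(0)
t = np.arange(2048) / 44100.0
original = 0.5 * np.sin(2 * np.pi * 440.0 * t) + 0.01 * rng.standard_normal(t.size)
coded = np.round(original * 16) / 16

error = coding_error(original, coded)
acf_original = normalized_autocorrelation(original)
acf_error = normalized_autocorrelation(error)
```

Adding `error` back to `coded` recovers `original`, which is the observation that motivates the whole approach.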
Residual Analysis: Spectral Flux
Idea: Model only the non-stationary component of the error
Simple method: Spectral flux, defined as
$$SF(n) = \sum_{k=0}^{N-1} \left( |X[n,k]| - |X[n-1,k]| \right)^2$$
Stationary signal components get subtracted out
Roughly speaking, $SF(n) \propto \mathrm{RMS}(x[n])$ (full proof is in the paper)
Proportionality only holds for Gaussian noise and non-overlapping rectangular windows
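A minimal sketch of the flux computation, assuming the magnitude-difference form of the definition and non-overlapping rectangular frames. The function name and the test signals are illustrative, not taken from the paper. The example shows the "stationary components get subtracted out" property: a sinusoid that is identical in every frame yields essentially zero flux, while noise does not.

```python
import numpy as np

def spectral_flux(x, frame_size):
    """SF(n) = sum_k (|X[n,k]| - |X[n-1,k]|)^2 over non-overlapping rectangular frames."""
    x = np.asarray(x, dtype=float)
    n_frames = len(x) // frame_size
    frames = x[:n_frames * frame_size].reshape(n_frames, frame_size)
    mags = np.abs(np.fft.fft(frames, axis=1))   # |X[n, k]| for k = 0 .. N-1
    diffs = np.diff(mags, axis=0)               # |X[n, k]| - |X[n-1, k]|
    return np.sum(diffs ** 2, axis=1)           # one value per frame n >= 1

frame_size = 256
n = np.arange(frame_size * 8)

# Stationary sinusoid with exactly 8 cycles per frame, so all frames match.
sine = np.sin(2 * np.pi * 8 * n / frame_size)
noise = np.random.default_rng(0).standard_normal(n.size)

flux_sine = spectral_flux(sine, frame_size)     # essentially zero
flux_noise = spectral_flux(noise, frame_size)   # clearly non-zero
```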
Residual Analysis: Spectral Flux
Coding error does not satisfy the proportionality criterion
The proportionality still roughly holds in practice
[Figure: Comparison of RMS and Spectral Flux. Legend: coding error RMS, spectral flux. Axes: RMS/spectral flux vs. time (seconds).]
Residual Analysis: Spectral Flux
To determine coloring, evaluate the flux on a per-band basis
Band levels tended to change too rapidly from frame to frame
However, RMS proportionality holds in practice and makes this technique useful
[Figure: Spectra of coding error and flux-based synthesized noise. Legend: error, noise coded. Axes: magnitude (dB) vs. frequency (log).]
Residual Analysis: Smoothed Cepstrum
Obtain the spectral envelope by windowing the real cepstrum and taking the DFT
$$c[n] = \Re\left( \frac{1}{N} \sum_{k=0}^{N-1} \log\left(|X(k)|\right) e^{j 2\pi n k / N} \right)$$
$$E[k] = \Re\left( \sum_{n=0}^{N-1} w[n]\, c[n]\, e^{-j 2\pi n k / N} \right)$$
Works well for relatively peak-free spectra
Per-band level can be found by averaging over bins in band
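The two steps above (real cepstrum, then a windowed DFT) can be sketched as follows. Several details are assumptions made for illustration: the window w[n] is taken to be a symmetric rectangular lifter that keeps `n_keep` low-quefrency coefficients, a small `eps` guards against log(0), and the band edges are uniform placeholders rather than true critical bands.

```python
import numpy as np

def smoothed_cepstrum_envelope(x, n_keep=20, eps=1e-12):
    """Log-magnitude envelope E[k] from a low-quefrency-windowed real cepstrum."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    log_mag = np.log(np.abs(np.fft.fft(x)) + eps)   # log|X(k)|
    c = np.real(np.fft.ifft(log_mag))               # real cepstrum c[n]
    w = np.zeros(N)                                 # symmetric rectangular window w[n]
    w[:n_keep] = 1.0
    w[N - n_keep + 1:] = 1.0
    return np.real(np.fft.fft(w * c))               # E[k], still a log magnitude

def band_levels(envelope_log, band_edges):
    """Average the linear-magnitude envelope over the bins of each band."""
    env = np.exp(envelope_log)
    return np.array([env[lo:hi].mean()
                     for lo, hi in zip(band_edges[:-1], band_edges[1:])])

rng = np.random.default_rng(0)
frame = rng.standard_normal(1024)                   # stand-in for one frame of coding error
envelope = smoothed_cepstrum_envelope(frame)
edges = np.linspace(0, 512, 26).astype(int)         # 25 uniform placeholder bands
levels = band_levels(envelope, edges)
```

A smaller `n_keep` gives a smoother envelope; keeping every coefficient reproduces the raw log spectrum.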
Residual Analysis: Smoothed Cepstrum
Generally results in band levels which are smooth from band to band and frame to frame
[Figure: Smoothed cepstrum of coding error spectrum. Legend: coding error spectrum, smoothed cepstrum. Axes: magnitude (dB) vs. frequency (Hz).]
Residual Analysis: Comparison
Flux is analytically clean, but varies rapidly because it is intentionally uncorrelated
Smoothed cepstrum provides a reasonable estimate which is smoother in time and band
[Figure: Critical band level estimates. Legend: spectral flux, smoothed cepstrum, spectral mean. Axes: magnitude (linear) vs. band number.]
Residual Synthesis
Generate the coding error representation by applying critical band envelopes to a random spectrum
Envelope differences from frame to frame cause coloring discontinuities
We can generate any amount of colored noise by generating a larger spectrum
So, create additional noise per frame and crossfade
Transients in the residual result in frames of noise in the error representation
Traditional methods for detecting and representing transients are not effective
The envelopes of the coded audio and the coding error are similar
We can modulate the residual representation with the coded audio's envelope
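The first half of this slide (shape a random spectrum per band, generate a little extra noise per frame, crossfade) might look like the sketch below. It rests on several assumptions not stated in the slides: the random spectrum is the FFT of white Gaussian noise, band edges are given as fractions of the Nyquist frequency so the same bands apply to any spectrum length, and the crossfade uses equal-power (square-root) ramps. All names are hypothetical.

```python
import numpy as np

def colored_noise(levels, band_edges, n_samples, rng):
    """Scale a random spectrum by per-band levels and return real time-domain noise."""
    spectrum = np.fft.rfft(rng.standard_normal(n_samples))
    freqs = np.linspace(0.0, 1.0, spectrum.size)    # 0 = DC, 1 = Nyquist
    band = np.searchsorted(band_edges, freqs, side="right") - 1
    band = np.clip(band, 0, len(levels) - 1)        # band index for every bin
    return np.fft.irfft(spectrum * np.asarray(levels)[band], n=n_samples)

def synthesize(levels_per_frame, band_edges, frame_size, overlap, rng):
    """Overlap-add per-frame noise; each frame carries `overlap` extra samples."""
    n_frames = len(levels_per_frame)
    out = np.zeros(n_frames * frame_size + overlap)
    ramp = np.linspace(0.0, 1.0, overlap, endpoint=False)
    fade_in, fade_out = np.sqrt(ramp), np.sqrt(1.0 - ramp)   # equal-power (assumed)
    for i, levels in enumerate(levels_per_frame):
        seg = colored_noise(levels, band_edges, frame_size + overlap, rng)
        if i > 0:
            seg[:overlap] *= fade_in                # fade in over the previous tail
        if i < n_frames - 1:
            seg[frame_size:] *= fade_out            # fade out under the next frame
        start = i * frame_size
        out[start:start + frame_size + overlap] += seg
    return out

rng = np.random.default_rng(0)
edges = np.linspace(0.0, 1.0, 26)                          # 25 placeholder bands
levels_per_frame = rng.uniform(0.05, 0.5, size=(10, 25))   # made-up band levels
noise = synthesize(levels_per_frame, edges, frame_size=1024, overlap=256, rng=rng)
```

Because each frame's noise is generated at length `frame_size + overlap`, the coloring changes gradually across the overlap instead of jumping at the frame boundary.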
Residual Synthesis
We can parametrize the amount of envelope modulation by
$$y[n] = \left( (1 - \alpha) + \alpha\, l[n] \right) x[n]$$
[Figure: Comparison of error level to matched and unmatched synthesized noise. Legend: original MP3, coding error, modulated error estimate, unmodulated error estimate. Axes: amplitude estimate (linear) vs. time (seconds).]
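A small sketch of the modulation formula, taking x[n] as the synthesized noise and l[n] as the level envelope of the coded audio. The slides do not say how l[n] is computed, so the envelope follower here (a moving average of the rectified signal, normalized to unit mean) is purely an assumption, as are the function names and test signals.

```python
import numpy as np

def amplitude_envelope(x, win=256):
    """Assumed envelope follower: moving average of |x|, normalized to unit mean."""
    env = np.convolve(np.abs(x), np.ones(win) / win, mode="same")
    return env / (env.mean() + 1e-12)

def modulate(noise, envelope, alpha):
    """y[n] = ((1 - alpha) + alpha * l[n]) * x[n]."""
    return ((1.0 - alpha) + alpha * envelope) * noise

rng = np.random.default_rng(0)
n = np.arange(4096)
# Toy "coded audio": a tone that switches on halfway through (a crude transient).
coded = np.where(n >= 2048, 1.0, 0.05) * np.sin(2 * np.pi * 0.01 * n)
noise = 0.1 * rng.standard_normal(n.size)       # stand-in for synthesized error noise

l = amplitude_envelope(coded)
y_unmodulated = modulate(noise, l, alpha=0.0)   # identical to the input noise
y_half = modulate(noise, l, alpha=0.5)          # partial envelope following
y_full = modulate(noise, l, alpha=1.0)          # noise fully follows the envelope
```

With `alpha=0` the noise is left untouched, and with `alpha=1` it is scaled entirely by the coded audio's envelope; intermediate values blend the two.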
Implementation: row-mp3
The MP3 codec is highly pervasive but somewhat out of date
To allow backwards compatibility, we can store information in the ID3 (metadata) tag
row-mp3-aware decoders can use the information, while others will simply ignore it
Including per-frame critical band levels results in a relatively small data overhead
For example, with a 23.2 ms frame size and 8-bit quantized band level values we have
$$\frac{1}{0.0232\ \mathrm{s}} \times 8\ \mathrm{bits} \times 25\ \mathrm{bands} \approx 8.6\ \mathrm{kbit/s/channel}$$
Data overhead can be reduced by using different quantization schemes or compression such as Huffman coding
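The overhead figure follows from frames per second times bits per frame. A short check of the arithmetic, with a hypothetical helper name:

```python
def overhead_bits_per_second(frame_seconds, bits_per_level, n_bands):
    """Side-information rate: (frames per second) * (bits per level) * (bands)."""
    return bits_per_level * n_bands / frame_seconds

# Values from the slide: 23.2 ms frames, 8-bit levels, 25 critical bands.
rate = overhead_bits_per_second(0.0232, 8, 25)
print(f"{rate / 1000:.1f} kbit/s per channel")   # prints "8.6 kbit/s per channel"
```

The same helper makes it easy to see how a coarser quantizer or fewer bands would change the overhead before any entropy coding is applied.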
Implementation: row-mp3
Created a simple MUSHRA-like web-based test to determine the codec's effectiveness
row-mp3 files used the spectral flux method with no envelope modulation
60 subjects tended to rate the row-mp3 version about 150% better for low MP3 bit rates
Further, more controlled testing with all error analysis and synthesis methods is needed
Conclusions
Audio coding error can be effectively modeled as colored noise
Flux provided a theoretically sound coloring estimate
Cepstral smoothing works better in practice
Synthesis by scaling random spectra
Cross-fading and interpolation prevented coloring discontinuities
Level-modulated error estimate helped prevent smeared transients
row-mp3 codec and accompanying listening tests suggest feasibility
Future work
Investigating the optimal number and spacing of bands
Testing the effectiveness of other analysis techniques
Evaluating different methods for dealing with transients
Applying similar techniques to spectral modeling and other processes with a residual
Implementing inclusion schemes in other audio codecs
Generating residual levels solely from the coded audio (as a sound enhancement)
Acknowledgements
Jieun Oh and Isaac Wang for creating the row-mp3 codec
Prof. Marina Bosi for her instruction in the field of audio coding
Prof. Julius Smith for helpful advice and discussion on various topics
Sound examples and code
http://ccrma.stanford.edu/~craffel/software/noise/
http://ccrma.stanford.edu/~craffel/software/rowmp3/