TEMPORAL ENVELOPE CORRECTION FOR ATTACK RESTORATION IN LOW BIT-RATE AUDIO CODING

Similar documents
SPREAD SPECTRUM AUDIO WATERMARKING SCHEME BASED ON PSYCHOACOUSTIC MODEL

Audio-coding standards

5: Music Compression. Music Coding. Mark Handley

Audio encoding based on the empirical mode decomposition

New Results in Low Bit Rate Speech Coding and Bandwidth Extension

Audio-coding standards

Perceptual Coding. Lossless vs. lossy compression Perceptual models Selecting info to eliminate Quantization and entropy encoding

Principles of Audio Coding

Multimedia Communications. Audio coding

ELL 788 Computational Perception & Cognition July November 2015

Lecture 16 Perceptual Audio Coding

I D I A P R E S E A R C H R E P O R T. October submitted for publication

Appendix 4. Audio coding algorithms

Mpeg 1 layer 3 (mp3) general overview

The MPEG-4 General Audio Coder

AUDIOVISUAL COMMUNICATION

Contents. 3 Vector Quantization The VQ Advantage Formulation Optimality Conditions... 48

2.4 Audio Compression

Audio Compression. Audio Compression. Absolute Threshold. CD quality audio:

A NEW DCT-BASED WATERMARKING METHOD FOR COPYRIGHT PROTECTION OF DIGITAL AUDIO

Perceptual coding. A psychoacoustic model is used to identify those signals that are influenced by both these effects.

MPEG-4 General Audio Coding

AUDIOVISUAL COMMUNICATION

Parametric Coding of High-Quality Audio

Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal.

Compressed Audio Demystified by Hendrik Gideonse and Connor Smith. All Rights Reserved.

Figure 1. Generic Encoder. Window. Spectral Analysis. Psychoacoustic Model. Quantize. Pack Data into Frames. Additional Coding.

Perceptual Audio Coders What to listen for: Artifacts of Parametric Coding

ENTROPY CODING OF QUANTIZED SPECTRAL COMPONENTS IN FDLP AUDIO CODEC

Chapter 14 MPEG Audio Compression

19 th INTERNATIONAL CONGRESS ON ACOUSTICS MADRID, 2-7 SEPTEMBER 2007

Audio Watermarking using Colour Image Based on EMD and DCT

EE482: Digital Signal Processing Applications

DRA AUDIO CODING STANDARD

Optical Storage Technology. MPEG Data Compression

Speech and audio coding

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding.

Scalable Perceptual and Lossless Audio Coding based on MPEG-4 AAC

Perceptual Pre-weighting and Post-inverse weighting for Speech Coding

Application Note PEAQ Audio Objective Testing in ClearView

CT516 Advanced Digital Communications Lecture 7: Speech Encoder

ROW.mp3. Colin Raffel, Jieun Oh, Isaac Wang Music 422 Final Project 3/12/2010

signal-to-noise ratio (PSNR), 2

Perceptually motivated Sub-band Decomposition for FDLP Audio Coding

Data Compression. Audio compression

Audio issues in MIR evaluation

No-reference perceptual quality metric for H.264/AVC encoded video. Maria Paula Queluz

MULTIMODE TREE CODING OF SPEECH WITH PERCEPTUAL PRE-WEIGHTING AND POST-WEIGHTING

Module 9 AUDIO CODING. Version 2 ECE IIT, Kharagpur

Audio Fundamentals, Compression Techniques & Standards. Hamid R. Rabiee Mostafa Salehi, Fatemeh Dabiran, Hoda Ayatollahi Spring 2011

Data Hiding in Video

Efficient Signal Adaptive Perceptual Audio Coding

MPEG-4 aacplus - Audio coding for today s digital media world

Speech Modulation for Image Watermarking

<< WILL FILL IN THESE SECTIONS THIS WEEK to provide sufficient background>>

Audio Watermarking using Empirical Mode Decomposition

Performance Analysis of Discrete Wavelet Transform based Audio Watermarking on Indian Classical Songs

Parametric Coding of Spatial Audio

Bluray (

/ / _ / _ / _ / / / / /_/ _/_/ _/_/ _/_/ _\ / All-American-Advanced-Audio-Codec

Efficient Representation of Sound Images: Recent Developments in Parametric Coding of Spatial Audio

Audio Coding Standards

The Sensitivity Matrix

A MULTI-RATE SPEECH AND CHANNEL CODEC: A GSM AMR HALF-RATE CANDIDATE

MPEG-1. Overview of MPEG-1 1 Standard. Introduction to perceptual and entropy codings

JPEG 2000 vs. JPEG in MPEG Encoding

DSP. Presented to the IEEE Central Texas Consultants Network by Sergio Liberman

An investigation of non-uniform bandwidths auditory filterbank in audio coding

MUSIC A Darker Phonetic Audio Coder

FINE-GRAIN SCALABLE AUDIO CODING BASED ON ENVELOPE RESTORATION AND THE SPIHT ALGORITHM

A Novel Audio Watermarking Algorithm Based On Reduced Singular Value Decomposition

Audio Engineering Society. Convention Paper. Presented at the 126th Convention 2009 May 7 10 Munich, Germany

Fundamentals of Perceptual Audio Encoding. Craig Lewiston HST.723 Lab II 3/23/06

Wavelet filter bank based wide-band audio coder

Adaptive Fuzzy Watermarking for 3D Models

Chapter 4: Audio Coding

Ch. 5: Audio Compression Multimedia Systems

DAB. Digital Audio Broadcasting

Blind Measurement of Blocking Artifact in Images

3GPP TS V6.2.0 ( )

Audio coding for digital broadcasting

Modified SPIHT Image Coder For Wireless Communication

the Audio Engineering Society. Convention Paper Presented at the 120th Convention 2006 May Paris, France

A Comparison of Still-Image Compression Standards Using Different Image Quality Metrics and Proposed Methods for Improving Lossy Image Quality

AUDIO. Henning Schulzrinne Dept. of Computer Science Columbia University Spring 2015

Adaptive Quantization for Video Compression in Frequency Domain

Source Coding Basics and Speech Coding. Yao Wang Polytechnic University, Brooklyn, NY11201

JPEG IMAGE CODING WITH ADAPTIVE QUANTIZATION

ON-LINE SIMULATION MODULES FOR TEACHING SPEECH AND AUDIO COMPRESSION TECHNIQUES

DIGITAL IMAGE WATERMARKING BASED ON A RELATION BETWEEN SPATIAL AND FREQUENCY DOMAINS

Proceedings of Meetings on Acoustics

S.K.R Engineering College, Chennai, India. 1 2

HAVE YOUR CAKE AND HEAR IT TOO: A HUFFMAN CODED, BLOCK SWITCHING, STEREO PERCEPTUAL AUDIO CODER

MPEG-4 Version 2 Audio Workshop: HILN - Parametric Audio Coding

Quality versus Intelligibility: Evaluating the Coding Trade-offs for American Sign Language Video

BLOCK MATCHING-BASED MOTION COMPENSATION WITH ARBITRARY ACCURACY USING ADAPTIVE INTERPOLATION FILTERS

Parametric Audio Based Decoder and Music Synthesizer for Mobile Applications

Audio Compression Using Decibel chirp Wavelet in Psycho- Acoustic Model

QUANTIZER DESIGN FOR EXPLOITING COMMON INFORMATION IN LAYERED CODING. Mehdi Salehifar, Tejaswi Nanjundaswamy, and Kenneth Rose

AN AUDIO WATERMARKING SCHEME ROBUST TO MPEG AUDIO COMPRESSION

Transcription:

7th European Signal Processing Conference (EUSIPCO 29) Glasgow, Scotland, August 24-28, 29 TEMPORAL ENVELOPE CORRECTION FOR ATTACK RESTORATION IN LOW BIT-RATE AUDIO CODING Imen Samaali 2, Monia Turki-Hadj Alouane 2, Gaël Mahé Centre de Recherche en Informatique de Paris 5 (CRIP5), Université Paris Descartes, France email: imen.samaali@math-info.univ-paris5.fr, Gael.Mahe@math-info.univ-paris5.fr 2 Unité Signaux et Systèmes (U2S), Ecole Nationale d Ingénieurs de Tunis (ENIT), Tunisia email: m.turki@enit.rnu.tn ABSTRACT At reduced bit rates, the audio compression affects transient parts of signals, which results in pre-echo and loss of attack character. We propose in this paper an attacks restoration method based on the correction of the temporal envelope of the decoded signal, using a small set of coefficients transmitted through an auxiliary channel. The proposed approach is evaluated for single and multiple coding-decoding, using objective perceptual measures. The experimental results for MP3 and AAC coding exhibits an efficient restoration of the attacks and a significant improvement of the audio quality.. INTRODUCTION In perceptual coders at low bit-rates, variations in the masking threshold from one frame to the next leads to different bit assignments. An artefact known as pre-echo may appear in signals with transients (see Figure ). The silence before an attack may be affected by a relatively high quantization noise, since the masking threshold is computed using the part of the frame after the attack [][2]. However, if the pre-echo is short enough, it is post-masked by the attack. The phenomenon of pre-echo is amplified by multiple coding-decoding: the quantization noise piles up at each cycle and may become audible. Transient signals are affected by another artefact when coding at low bit-rates. Only a small low-frequency band of the signal is fully transmitted, whereas medium frequencies are restored at the decoder without phase information and higher frequencies are simply forgotten. This smoothes attacks resulting in a reduce of the percussive quality of sounds. In order to reduce or eliminate pre-echo and enhance the audio quality, some techniques with different complexities are developed in the literature []. These techniques are based on the use of an adaptive window selection algorithm to switch between long and short transform window. Long windows offer higher prediction gain, and better frequency resolution, while short windows reduce the length of pre-echo. Another technology, the Temporal Noise Shaping (TNS) [3], used in the AAC coder, aims at resolving the problem of Temporal Masking. This approach allows the encoder to control the temporal fine structure of the quantization noise. In the same way as temporal predictive coders reshape the quantization noise spectrum, the principle of TNS, is to replace the spectral coefficients to be quantized by the This work is part of the WaRRIS project granted by the French National Research Agency (project n ANR-6-JCJC-9)..6.4.2.2.4.6.8.4.2.2.4.6 5.8 5. 5.2 5.4 5.6 5.8 x 4 5.8 5. 5.2 5.4 5.6 5.8 x 4.5..5.5...5.5 x 4 x 4 Figure : Original signal (top) and pre-echo (bottom) from castanet (left) and triangle (right) coded at 56 kbps using MP3 coder coder with their frequency prediction residual (which flattens the temporal envelope). In the decoder, the inverse prediction filter is applied to the residual. As a consequence, the temporal shape of the quantization error is adapted to the temporal shape of the audio signal. For the MP3 coder, a similar correction is performed by the Temporal Masking (TM) technology [4]. Although these techniques reduce significantly the preecho phenomenon, the problems of temporal masking and attacks smoothing remain harmful for some transient signals, such as castanet, glockenspiel, triangle or certain types of speech signals. In this paper, we propose a novel method aiming at restoring attacks in coded-decoded signals. The idea is to perform a correction of the temporal envelope of the coded-decoded signal, using a small set of parameters (describing the temporal envelope of the ) transmitted over a very low bit-rate ( kbps) auxiliary channel which we suppose to have (an audio watermarking channel for example). This paper is structured as follows: in section 2, we present the new approach dedicated to attack restoration for the quality enhancement of audio decoded signals. Section 3 presents a performance evaluation of the proposed algorithm in the case of simple and multiple successive encodings (codings in tandem). EURASIP, 29 929

Input Audio Signal, x t Audio Core Encoder Bitstream Audio Core Decoder Frame Type Transient Characterization Localization Temporal Envelope Parameters ARMA-LSF conversion x t dec Audio Signal Correction Temporal Envelope Computation LSF-ARMA conversion Restored audio signal, ˆx t Vector Quantization Parameters Coding Codebook Auxiliary Channel Parameters Decoding Figure 2: Block diagram of the proposed approach. 2. ENVELOPE CORRECTION The proposed method is based on the correction of the temporal envelope of the decoded signal. The restored audio signal, ˆx t, is given at time t by: ê t ˆx t x t dec ê t dec where x t dec is the decoded audio signal, ê t is an estimate of the temporal envelope of the, and ê t dec is the temporal envelope of the coded-decoded signal. The correction approach constitutes a post-processing performed at the decoder. For this purpose, crucial parameters required for the temporal envelope estimation are extracted by the encoder and transmitted to the decoder through an auxiliary channel. Figure 2 illustrates the basic structure of the proposed approach. In addition to the standard coder/decoder blocks, the proposed system includes several components: Frame characterization/transient localization: characterizes the frame type as transient or not. In particular, for the transient frames, a localization of the attack s time position is performed. Temporal envelope coding and vector quantization: based on linear prediction in frequency domain. Audio signal correction: for the restoration of the audio signal according to the relation presented in equation. 2. Transient detector Characterization of frame type: To detect transient frames, we use the technique described in [5]. After high-pass IIR filtering with the transfer () function 7548 z H z (2) z 595 each frame of 24 samples is divided into 8 sub-blocks of 28 samples and the energy of each sub-block is computed by summing up the squared samples. An attack is detected if one of these sub-block energies exceeds a sliding average of the previous energies by a constant factor attackratio and is greater than a constant energy level minattacknrg 3 [5]..6.4.2.2.4.6 5 5 Frame Frame 2 Frame 3 3.3 3.35 3.4 3.45 3.5 3.55 3.6 3.65 Samples 3. 3.6 Samples Audio signal x 4 Attack Ratio Figure 3: The audio signal (violon+castanet) and its corresponding attack ratio coefficients Figure 3 illustrates the efficiency of the proposed method. The attack ratio coefficient corresponding to the transient frame (Frame 2) exceeds the threshold fixed to as in [5]. x 4 93

Transient localization: The method used to detect the transient time position is based on the stationarity index which corresponds to the Kolmogorov distance measured between the time frequency representation (TFR) of the signal at different times [6]. The stationarity index is given by: p SI t f NI t;τ f NI 2 t;τ f d f dτ (3) τ NI t;τ f and NI 2 t;τ f represent a normalization of respectively subimages I t;τ f and I 2 t;τ f : I t;τ f T FR t p τ f ; (4) I 2 t;τ f T FR t τ f (5) The parameter p delimits the considered analysis duration at each instant t and allows the selectivity/sensivity control of the SIs: a high value of p leads to smoother SIs. As in [6], p is fixed to 2. As illustrated in Figure 4, the peak of SI corresponds to the attack position. Attack position coding: Since coding directly the transient position would require a high bit-rate, we propose to code the difference between the actual attack position given by the original frame and the attack position computed from the coded/decoded frame. A 6 bits scalar quantization technique is used for this purpose. 2.2 Temporal envelope ARMA modeling The reduced bit-rate of the auxiliary channel implies a compact representation of the transmitted temporal envelope, through a small set of coefficients for each frame. The proposed coding approach is based on linear prediction in frequency domain providing an approximation of the temporal envelope of a signal, specifically the squared Hilbert envelope [7]. The block diagrams of the temporal envelope estimation is depicted in Figure 5. The Discrete Cosine Transform of x t is modeled by an ARMA(p,q) model. The ARMA model was here preferred to the classically used AR model, because it ensures an accurate representation of the spectral envelope by means of a reduced number of coefficients. The autoregressive parameters a i i p, of the ARMA(p,q) model are estimated.8.6.4.2.2.4.6 3 35 4 45 5 55 6 65 7 audio signal xt êt X k DCT ARMA(p,q) H z b i q i b iz i p i a iz i a i q Figure 5: Block diagram of the temporal envelope estimation. by minimizing the mean square prediction error defined as follows: E pre k X k p i p a i X k i (6) where X k DCT x t. The moving average parameters b i i q, are computed by a Prony method. An estimation of the temporal envelope of x t is therefore given by: ê t H e jwt (7) where H z q i b iz i p i a iz i H b z H a z Each frame of 248 samples is divided into two subframes, each modeled with an ARMA(2,3). For non transient frames, each sub-frames has 24 samples. Transient frames are divided according to the transient position. Figure 6 illustrates similarity between the estimated envelope and the original one for a castanet sound sampled at 44. khz. 2.3 ARMA to LSF transformation After the ARMA parameters are estimated, they must be coded and transmitted through the auxiliary communication channel. The autoregressive coefficients (AR) are characterized by large dynamic range and would require many bits per coefficient for accurate coding. In addition, small changes in the AR coefficients may result in instability of the synthesis filter. For these reasons, it is necessary to transform the.6.4.2 audio signal temporal envelope (8).5 Stationary index.2.5.4 3 35 4 45 5 55 6 65 7.6.85.9.95 2 Time step x 4 Figure 4: Transient frame (top), its corresponding stationary index (bottom) Figure 6: The castanet signal and its corresponding temporal envelope estimate 93

AR coefficients into an equivalent representation which ensures the stability : the Line Spectral Frequency representation (LSF) are classically used in predictive coders. The AR filter H a z of order p, given by equation 8, can be represented by :.5.5.5 5.8 5. 5.2 5.4 5.6 5.8 x 4..5.5. x 4. H a z 2 P z Q z (9) where P z and Q z are even and odd symmetric filters respectively. Their roots, z i e jθ i, lie on the unit circle [8]. The θ i i N! represent the Line Spectrum Frequency (LSF) coefficients. To transform the Moving Average (MA) parameters into the LSF coefficients, the MA filter H b z (equation 8) must be modified as follows H b z i z ib" i q () where b" i b i b, i q. In the decoder, the value of b is estimated from the decoded signal. 2.4 Vector quantization For each sub-frame, a vector grouping 5 LSF coefficients is coded using a classical vector quantization technique. The codebook C of a dimension K #L, (K=256 and L=5 here), is obtained by training on a database of approximately #K vectors taken from various kinds of audio signals. It can be computed by the Lloyd-Max algorithm [9]. For the proposed system, the LSF parameters are coded on 8 bits for each sub-frame. We recall that for transient frame, 6 bits are added to code the attack position. Consequently, the auxiliary channel bit-rate varies between 344 and 474 bps. 3. EXPERIMENTAL EVALUATION OF THE PROPOSED APPROACH The experiments aim at validating our approach and comparing the restored audio signal to the original one. We used the PEMO-Q software described in [] as a tool to measure the Objective Difference Grade () and the instantaneous Perceptual Similarity Measure (PSM t ). The is a perceptual audio quality measure, which rates the difference between test and reference signals among a scale from (imperceptible) to -4 (very annoying). The values of PSM t vary in the interval [,], with indicating the similarity between the reference and the test signals, whereas smaller values correspond to larger deviations between them. The perceptual measures are correlated to the Subjective Difference Grade (SDG) for audio quality. At first, the audio quality is evaluated when a single coding is performed. For each experiment, the reference audio signals are coded by MP3 and AAC coders at different bitrates varying from 24 to 96 kbps. The considered reference audio signals for all the simulations are mono castanet and triangle, which exhibit a remarkable transient character. As illustration of the proposed approach, the 56 kbps MP3 coded/decoded castanet and triangle signals and their corrected versions are shown in Figure 7. It can be seen that the MP3 coder introduces a pre-echo and smoothes attacks..5.5 5.6 5.8 5. 5.2 5.4 5.6 5.8 x 4 reconstructed signal.5 5.6 5.8 5. 5.2 5.4 5.6 5.8 x 4.5.5..5.5. x 4 reconstructed signal x 4 Figure 7: Attack restoration for a castanet (left) and triangle (right) signals coded by a MP3 coder at 56 kbps In the reconstructed signal, the pre-echo is considerably reduced and the attack is restored. Figures 8 and 9 compare the variations, over bit-rate, of the and PSM t for both s and their restored versions. As illustrated in Figures 8, for MP3 coder even with TM, the proposed correction provides a significant enhancement of the PSM t and. With AAC coding (Figure 9), the improvement is slighter..9.8.7.6 +Correction.5 2 4 6 8 +Correction 4 2 4 6 8.8.6.4.2.5.5.5 2 4 6 8.5 4 MP3 without TM +Correction MP3 without TM +Correction 4 6 8 Figure 8: Mean of Perceptual Similarity Measure and Objective Difference Grade for castanet (left) and triangle (right) signals (MP3 coder) At a second stage, the perceived quality in multiencoding case is analyzed. Referring to the experimental results of Reiss [2] related to the stereo MP3 coder at 28 kbps, the perceptual audio quality deteriorates with the increase of encoding number. Similar results were obtained for mono MP3 coder at 64 kbps for castanet and triangle sequences. The variations of the perceptual audio quality (PSM/) with the number of encodings are shown in Figures and. Without correction, we observe a fast deterioration. In fact, the quantization noise from each cycle piles up, so that the pre-echo, which was post-masked by the 932

.95.9.85.8.75.7.65.5.5.5.5 35 4 45 5 55 6 35 4 45 5 55 6.8.6.4.2.5.5.5 35 4 45 5 55 6 35 4 45 5 55 6 Figure 9: Mean of Perceptual Similarity Measure and Objective Difference Grade for castanet (left) and triangle (right) signals (AAC coder).95 +Correction.95.9.85.8.75.7.5.5.5.5 +Correction 2 3 4 5 6 2 3 4 5 6 +Correction Figure : Effect of multiple encoding on audio quality for triangle signal (MP3 coder) than computating stationarity indices should be considered. Further study will aim at using the audio watermarking as an auxiliary communication channel and evaluating the influence of the watermark detection error on the proposed system..9.85.8 2 3 4 5 6 2 3 4 5 6 +Correction Figure : Effect of multiple encoding on audio quality for castanet signal (MP3 coder) attack in the first encodings, is not masked anymore in the next encodings and becomes annoying. For the castanet signal, the temporal masking enhances the quality, but the reference temporal envelope for the n th coding is the deteriorated envelope of the signal from the n th coding-decoding. Thus, with only TM, the curve converges towards that without TM, whereas with our correction (with always the same envelope), the quality remains transparent. 4. CONCLUSION Low-bit-rate coding-decoding with standard coders AAC and MP3 smoothes attacks in transients signals and increases the pre-echo despite TNS/TM. We have proposed an attack restoration method, based on temporal envelope correction, using a small set of information transmitted through an auxiliary channel. Our method enhances significantly the audio quality as measured by and PSM t, and avoids the dramatic fall of quality caused by multiple coding-decoding. All components of the proposed system have a reasonable complexity, except of the transient localization: other methods REFERENCES [] Mahieux Y and Petit J.P High-quality audio transform coding at 64 kbps, IEEE Transactions on Communications, vol. 42, pp. 3-39, Nov. 994. [2] J. Reiss and M. Sandler, Audio Issues In MIR Evaluation, ISMIR, Barcelona, Spain, Oct. 24, pp. 28-33. [3] J. Herre, Temporal Noise Shaping, Quantization and Coding Methods in Perceptual Audio Coding: a Tutorial Introduction, AES 7 th International Conference on High Quality Audio Coding, Florence, Italie, Sep 999, pp. 32-325. [4] F. Sinaga, T. Surya Gunawan and E. Ambikairajah, Wavelet Packet Based Audio Coding Using Temporal Masking, ICICSPCM, Dec. 23, pp. 38-383. [5] 3GPP TS 26.43, Advanced Audio Coding (AAC) part Sep. 24. [6] S. Larbi and M. Jaidane, Audio Watermarking: A Way To Stationarize Audio Signals, IEEE Trans. Signal Processing, Vol. 53, pp. 86-823, Feb. 25. [7] M. Athineos and D. P.W.Ellis, Frequency-Domain linear Prediction For Temporal Features, ASRU 3, Nov. 23, pp. 26-266. [8] D. Na, S. Yeon and M. BAE, The High-speed LSF transformation Algorithm in CELP Vocoder, TENCON 2 Proc, vol., pp. 279-282, 2. [9] V.S. Jayanthi, K.S. Marothi, T.M. Ishaq, M. Abbas and A. Shanmugam, Performance Analysis of Vector Quantizer using Modified Generalized Lloyd Algorithm, IJISE, vol., pp. -5, Jan 27. [] R. Huber and B.Kollmeier, PEMO-Q A New Method for Objective Audio Quality Assessment Using a Model of Auditory Perception, IEEE Transactions on audio, speech and language processing, vol. 4, pp. 92-9, Nov 26. 933