Perceptual audio coding schemes based on adaptive signal processing tools

Fernando A. Marengo Rodriguez, Sergio A. Castells, and Gonzalo D. Sad

Citation: Proc. Mtgs. Acoust. 28, (2016)
Published by the Acoustical Society of America

Proceedings of Meetings on Acoustics, Volume 28
International Congress on Acoustics: Acoustics for the 21st Century
Buenos Aires, Argentina, September 2016
Signal Processing in Acoustics: Paper ICA

Perceptual audio coding schemes based on adaptive signal processing tools

Fernando A. Marengo Rodriguez, Sergio A. Castells, and Gonzalo D. Sad
National University of Rosario, Rosario, Santa Fe, Argentina

ABSTRACT

In this paper, new perceptual audio coding schemes based on adaptive processing tools are proposed. They rely on both the empirical mode decomposition (EMD) and the ensemble empirical mode decomposition (EEMD) methods. In comparison with other perceptual coding schemes, the one presented here is simpler: physically meaningful components of the input signal are detected, their local extrema are extracted, and the extracted samples are Golomb-Rice encoded. The proposed scheme is assessed in terms of compression ratio and perceptual quality for various tracks from the European Broadcasting Union Sound Quality Assessment Material (EBU-SQAM) compact disc. The results are compared with those of other perceptual audio coding methodologies.

Published by the Acoustical Society of America. © 2017 Acoustical Society of America. Proceedings of Meetings on Acoustics, Vol. 28 (2017).

1. INTRODUCTION

In order to optimize the use of data transmission channels and the storage capacity of hard disk drives, different lossless and perceptual (lossy) audio coding schemes have been developed. The former [1-3] reduce the size of the input audio file without introducing distortion. Typical compact disc (CD) audio files in PCM format can be packed into between one half and one sixth of their original size, depending on characteristics of the input data such as the dynamic range and spectral content [4]. Perceptual audio encoders [5-8] achieve higher compression gains at the cost of a more complex encoder. In both types of encoders, it is crucial that the decoder be of low complexity, so that low-cost portable devices can play the encoded audio file in real time.

In this paper, new encoding schemes based on adaptive analysis techniques are proposed, and their performance is quantitatively analyzed and compared with previous techniques using tracks from the European Broadcasting Union Sound Quality Assessment Material (EBU-SQAM) CD [9].

This document is organized as follows. The adaptive tools used for our system are described in Section 2, and the proposed encoder and decoder are outlined in Section 3. The criterion used for selecting the audio files to be tested is detailed in Section 4, and numerical results are summarized in Section 5. Conclusions are drawn in Section 6.

2. ADAPTIVE TOOLS FOR THE ENCODER

Unlike other encoding schemes, our method relies on adaptive signal decomposition tools: 1) the empirical mode decomposition (EMD) method and 2) the ensemble empirical mode decomposition (EEMD) algorithm [13]. These tools decompose any one-dimensional (1D) sequence into a reduced set of zero-mean amplitude- and frequency-modulated (AM-FM) signals, each usually related to a physical phenomenon underlying the system under study [12]. Both tools are briefly described below.

A. THE EMPIRICAL MODE DECOMPOSITION METHOD

In the EMD method, details are extracted from the input data progressively, from the finest temporal resolution up to the coarsest one, by means of a sifting process. Intuitively, the input data $x(n)$ (where $n$ is the time index) is seen as a sum of zero-mean oscillatory detail functions $d_k(n)$, each of which is added to slower temporal variations. Each detail is extracted as follows.

1. The local maxima (minima) of the input signal $x(n)$ are computed and then interpolated, resulting in the upper (lower) envelope $e_{\max}(n)$ ($e_{\min}(n)$).
2. The local mean $m_1(n) = [e_{\max}(n) + e_{\min}(n)]/2$ is computed, and the first-order detail function $d_1(n) = x(n) - m_1(n)$ is determined.
3. Step 1 is performed again, using the first residue $r_1(n) = x(n) - d_1(n)$ as the input sequence. The output sequences of this pass are the second-order detail function $d_2(n)$ and the second residue $r_2(n) = r_1(n) - d_2(n)$.
4. The sifting process continues iteratively through steps 1 to 3, using the residues $r_{k-1}(n)$ as input sequences and giving the details $d_k(n)$ and the residues $r_k(n) = r_{k-1}(n) - d_k(n)$ as outputs. The process stops when the residue $r_K(n) = r_{K-1}(n) - d_K(n)$ has no more local extrema and hence no more details to extract.

At this point, the input signal is decomposed as

$$x(n) = \sum_{k=1}^{K} d_k(n) + r_K(n) \qquad (1)$$

Here $d_k(n)$ is known as an intrinsic mode function (IMF), and $r_K(n)$ is the final residue, also denoted as $r(n)$ for simplicity. It has to be stressed that each detail function at the end of step 2 may have nonzero mean, in which case steps 1 and 2 are iterated on it to subtract that mean. This is repeated until the corresponding local mean is sufficiently small [10, 11].

Since each IMF depends solely on the input signal, the EMD algorithm is fully data-driven and adaptive, and it always gives a small number of IMFs that may be described as AM-FM signals, i.e., $d_k(n) = a_k(n)\cos\phi_k(n)$, where $a_k(n)$ is the instantaneous amplitude and the rate of change of the phase $\phi_k(n)$ gives the instantaneous frequency of the $k$-th IMF $d_k(n)$ [12].

The EMD method has been used extensively for robust data analysis [12], and also for data compression in 2D [14] and 1D [15], including audio signals. The present paper introduces further improvements on the method introduced in [16].

B. THE ENSEMBLE EMPIRICAL MODE DECOMPOSITION METHOD

This technique, also known by its acronym EEMD [13], consists of multiple applications of the EMD algorithm to the input signal contaminated by different realizations of finite-power white Gaussian noise (WGN). Such noise adds uniform spectral content to the input sequence, allowing the EMD to work as a dyadic filter bank [19, 20] and to obtain a set of IMFs more concentrated in specific spectral bands. (More precisely, EMD works as a dyadic filter bank only for WGN, since each IMF is then mainly concentrated in one octave band. For other classes of input signals, the spectral band of each IMF can be very difficult to predict, unless the input is contaminated with finite-power WGN.) Finally, the homologous IMFs $d_k^{(l)}(n)$, $l = 1, \dots, L$, are ensemble averaged over all the $L$ realizations, resulting in a set of average IMFs $\bar{d}_k(n)$ given by

$$\bar{d}_k(n) = \frac{1}{L} \sum_{l=1}^{L} d_k^{(l)}(n) \qquad (2)$$

One drawback of the EEMD method is that it does not fulfill completeness, since Eq. (1) is not satisfied. However, this problem is minimized by reducing the power of the WGN added in this method [21], which is viable if the input spectrum is more concentrated at low frequencies [13]. This is true for many classes of audio signals. An additional benefit in this case is that fewer realizations are needed, which speeds up the EEMD algorithm. Also, an important advantage of EEMD over EMD is that each resulting IMF does not contain information regarding two or more different physical phenomena, a problem known in the literature as mode mixing. This is a consequence of adding uniform spectral density via WGN. The advantage is useful for the encoder proposed here, since mode mixing is related to redundancies in two or more IMFs.
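The sifting procedure and the ensemble average of Eq. (2) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the envelopes use linear interpolation with flat ends (EMD implementations commonly use cubic splines), the stopping tolerance, iteration limits and noise level are arbitrary, and all function and parameter names (`emd`, `eemd`, `noise_std`, and so on) are hypothetical.

```python
import math
import random

def local_extrema(x):
    """Indices of strict interior local maxima and minima of x."""
    maxima = [i for i in range(1, len(x) - 1) if x[i - 1] < x[i] > x[i + 1]]
    minima = [i for i in range(1, len(x) - 1) if x[i - 1] > x[i] < x[i + 1]]
    return maxima, minima

def envelope(x, idx):
    """Linear interpolation through the extrema idx (assumption: the paper
    does not state the interpolant used inside EMD). Ends are held flat."""
    knots = [0] + idx + [len(x) - 1]
    vals = [x[idx[0]]] + [x[i] for i in idx] + [x[idx[-1]]]
    env, j = [], 0
    for n in range(len(x)):
        while j < len(knots) - 2 and n > knots[j + 1]:
            j += 1
        t0, t1 = knots[j], knots[j + 1]
        w = (n - t0) / (t1 - t0)
        env.append(vals[j] + w * (vals[j + 1] - vals[j]))
    return env

def sift(x, max_iter=50, tol=1e-3):
    """Extract one detail by repeatedly subtracting the local mean
    (steps 1 and 2), until that mean is small or max_iter is reached."""
    d = list(x)
    for _ in range(max_iter):
        maxima, minima = local_extrema(d)
        if not maxima or not minima:
            break
        upper, lower = envelope(d, maxima), envelope(d, minima)
        mean = [(u + v) / 2 for u, v in zip(upper, lower)]
        d = [a - m for a, m in zip(d, mean)]
        if max(abs(m) for m in mean) < tol:
            break
    return d

def emd(x, max_imfs=10):
    """Return (imfs, residue) with x = sum(imfs) + residue, as in Eq. (1)."""
    imfs, r = [], list(x)
    for _ in range(max_imfs):
        maxima, minima = local_extrema(r)
        if not maxima or not minima:  # residue lacks a maximum or a minimum
            break
        d = sift(r)
        imfs.append(d)
        r = [a - b for a, b in zip(r, d)]
    return imfs, r

def eemd(x, n_real=20, noise_std=0.05, n_imfs=4, seed=0):
    """Average homologous IMFs over n_real noisy realizations, as in Eq. (2).
    A realization that yields fewer than n_imfs IMFs contributes zeros."""
    rng = random.Random(seed)
    avg = [[0.0] * len(x) for _ in range(n_imfs)]
    for _ in range(n_real):
        noisy = [v + rng.gauss(0.0, noise_std) for v in x]
        imfs, _ = emd(noisy, max_imfs=n_imfs)
        for k, imf in enumerate(imfs):
            for n, v in enumerate(imf):
                avg[k][n] += v / n_real
    return avg

if __name__ == "__main__":
    x = [math.sin(2 * math.pi * n / 8) + 0.5 * math.sin(2 * math.pi * n / 64)
         for n in range(256)]
    imfs, residue = emd(x)
    print("number of IMFs:", len(imfs))
```

By construction, `emd` satisfies the completeness relation of Eq. (1) up to rounding error, whereas the averaged output of `eemd` does not, which mirrors the drawback discussed above.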

3. PROPOSED ENCODER/DECODER

The proposed encoder and decoder are explained in the following subsections.

A. ENCODER

The encoder processes the input audio file (PCM data stored in a digital WAV file) on a frame-by-frame basis, according to the block diagram illustrated in Fig. 1.

Figure 1. Block diagram for the proposed encoder.

The encoder detects the sampling rate of the input file, the number of channels and the number of bits per sample. Then, the input data stream is segmented into frames of fixed length (4096 samples in the present case). Each frame is processed via either EMD or EEMD, resulting in a set of IMFs. According to numerical tests, some IMFs may not be relevant, since they carry little information about the input signal [22]. For this reason, a two-stage filtering process is applied to the IMFs to detect the relevant ones (see Fig. 2). In the first stage, the correlation coefficient between the input sequence and the k-th IMF is determined, and those IMFs with correlation lower than a given threshold are ignored [22].

Figure 2. Diagram for detecting the relevant IMFs.

The second stage of the filter uses a masking model to detect and remove perceptually irrelevant information [23], with little distortion for the human auditory system. In order to detect and remove masked components in each critical band, the spreading function used in the ISO/IEC MPEG Psychoacoustic Model 2 (see [24], p. 187) is applied. Then, the intensities of the most relevant IMFs (each associated with the corresponding masking curve in a given critical band) are added up according to [24], p. 192, with the corresponding parameter set to 0.33. The IMFs for an experimental sequence are illustrated in Fig. 3(a), and the relevant ones after the two-stage filtering process are depicted in Fig. 3(b). The proposed filter removes a large amount of information (7 IMFs), allowing more efficient data compression.
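The first filtering stage described above can be sketched as follows. This is only an illustration under stated assumptions: the Pearson correlation coefficient is used, the function names are hypothetical, and the threshold value in the usage example (0.1) is arbitrary, since the value used by the authors is not given here.

```python
import math

def correlation(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 0.0  # a constant sequence carries no correlation information
    return sxy / math.sqrt(sxx * syy)

def relevant_imfs(x, imfs, threshold):
    """Keep only the IMFs whose correlation with the input frame x
    is at least `threshold` (hypothetical parameter name)."""
    return [imf for imf in imfs if correlation(x, imf) >= threshold]

if __name__ == "__main__":
    n_samples = 64
    strong = [math.sin(2 * math.pi * n / 16) for n in range(n_samples)]
    weak = [0.01 * math.sin(2 * math.pi * n / 4) for n in range(n_samples)]
    frame = [a + b for a, b in zip(strong, weak)]
    kept = relevant_imfs(frame, [strong, weak], threshold=0.1)
    print("IMFs kept:", len(kept))
```

The second stage (psychoacoustic masking) is not covered by this sketch.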
The relevant IMFs resulting from the previous step are represented by their local extrema, which is equivalent to sampling each IMF locally at its critical rate (see Fig. 4). Interpolating these extrema reconstructs each IMF with low error. The abscissas and ordinates of the local extrema are encoded separately. The former are differentiated, resulting in a set of smaller numbers. Since each IMF oscillates around zero, the ordinates alternate in sign and are represented by their absolute values $|y_i|$. Just one additional bit is added (at the beginning of the frame) to state the sign of the first ordinate in the corresponding IMF. The median of the absolute ordinates is then subtracted from them (i.e., $\tilde{y}_i = |y_i| - \mathrm{median}(|y|)$), so as to obtain a data set more symmetrical around zero. Both sets, corresponding to the abscissas and the ordinates, are Golomb-Rice encoded separately [25] and finally multiplexed with each other.

Figure 3. (a) Set of IMFs for an audio input signal. (b) Resulting IMFs after the filtering process.
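The extrema coding steps can be sketched in Python as a lossless round trip. Several details are assumptions not stated in the paper: the ordinates are taken as already quantized to integers, signed residuals are mapped to non-negative integers with a zigzag mapping before Rice coding, the Rice parameters `k_x` and `k_y` are arbitrary, the median is stored alongside the bit strings, and the two streams are returned in a dictionary rather than multiplexed. All names are hypothetical.

```python
from statistics import median

def rice_encode(values, k):
    """Golomb-Rice code non-negative integers (divisor 2**k) as a bit string."""
    bits = []
    for v in values:
        q, r = v >> k, v & ((1 << k) - 1)
        bits.append("1" * q + "0")                    # unary quotient
        if k:
            bits.append(format(r, "0{}b".format(k)))  # k-bit remainder
    return "".join(bits)

def rice_decode(bits, k, count):
    """Inverse of rice_encode for `count` values."""
    values, pos = [], 0
    for _ in range(count):
        q = 0
        while bits[pos] == "1":
            q += 1
            pos += 1
        pos += 1                                      # skip terminating 0
        r = int(bits[pos:pos + k], 2) if k else 0
        pos += k
        values.append((q << k) | r)
    return values

def zigzag(v):
    """Map signed to non-negative: 0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * v if v >= 0 else -2 * v - 1

def unzigzag(u):
    return u // 2 if u % 2 == 0 else -(u + 1) // 2

def encode_extrema(xs, ys, k_x=3, k_y=6):
    """xs: increasing sample indices; ys: integer ordinates alternating in sign."""
    dx = [xs[0]] + [b - a for a, b in zip(xs, xs[1:])]  # differentiated abscissas
    mags = [abs(y) for y in ys]
    med = int(median(mags))
    resid = [zigzag(m - med) for m in mags]             # median-centred magnitudes
    return {"first_negative": ys[0] < 0, "median": med, "count": len(xs),
            "x_bits": rice_encode(dx, k_x), "y_bits": rice_encode(resid, k_y)}

def decode_extrema(code, k_x=3, k_y=6):
    dx = rice_decode(code["x_bits"], k_x, code["count"])
    xs, total = [], 0
    for d in dx:                                        # cumulative sum
        total += d
        xs.append(total)
    resid = rice_decode(code["y_bits"], k_y, code["count"])
    mags = [unzigzag(u) + code["median"] for u in resid]
    sign = -1 if code["first_negative"] else 1
    ys = []
    for m in mags:                                      # restore alternating signs
        ys.append(sign * m)
        sign = -sign
    return xs, ys

if __name__ == "__main__":
    xs, ys = [3, 10, 18, 31, 40], [120, -95, 80, -130, 60]
    code = encode_extrema(xs, ys)
    print(len(code["x_bits"]) + len(code["y_bits"]), "payload bits")
    print(decode_extrema(code) == (xs, ys))
```

The decoding half follows the steps given for the decoder in subsection B (Golomb-Rice decoding, cumulative sum, median addition and sign alternation); the final interpolation of the recovered extrema is left out.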

Figure 4. IMF and its local extrema for a specific frame.

B. DECODER

The decoder is represented in Fig. 5 and works as follows. First, a demultiplexer splits the encoded data stream into the data sets corresponding to the abscissas and ordinates. Then, each of these data sets is Golomb-Rice decoded, resulting in the differentiated abscissas $\Delta x_i$ and the processed ordinates $\tilde{y}_i$. The abscissas are recovered via cumulative sum, i.e., $x_i = \sum_{j \le i} \Delta x_j$, and the ordinates $y_i$ are determined by adding the corresponding median and alternating the sign, starting from the sign of the first ordinate. The resulting local extrema $(x_i, y_i)$ are then interpolated via a piecewise cubic Hermite interpolating polynomial (PCHIP), giving the corresponding IMF. This process is performed for all the relevant IMFs, which are then added together, giving the decoded signal.

Figure 5. Block diagram for the decoder.

It has to be stressed that the encoder and, above all, the decoder are quite simple. The simplicity of the decoder is crucial in order to allow low-cost portable devices to play the encoded audio file in real time.

4. PERFORMANCE ANALYSIS

The EMD/EEMD-based audio coding scheme was tested with PCM-coded WAV audio files from the EBU-SQAM CD [9]. Our aim is to test a variety of audio signals according to the following parameters recommended in [26]:

- Transients (pre-echo sensitive, smearing of noise in temporal domain),
- Tonal structure (noise sensitive, roughness),
- Natural speech (distortion sensitive, smearing of attacks),
- Complex sound (stresses the device under test),
- High bandwidth (stresses the device under test, loss of high frequencies, high frequency noise).

Therefore, the files selected from the EBU-SQAM CD for our tests were the following:

- Castanets (file 27.wav),
- Clarinet (file 16.wav),
- Female speech (file 49.wav),
- Soprano (file 44.wav),
- Glockenspiel (file 35.wav).

Each encoded audio file was evaluated in terms of the compression ratio (CR), which is the ratio between the input and the output file sizes. The perceived audio quality of the encoded data was measured via the objective difference grade (ODG), a parameter computed objectively with the algorithm specified in the ITU-R BS recommendation [26]. The ODG ranges from 0 to -4, depending on the impairment produced by data compression (imperceptible for ODG = 0; perceptible, but not annoying for ODG = -1; slightly annoying for ODG = -2; annoying for ODG = -3; and very annoying for ODG = -4). The results were compared with those obtained using the following perceptual audio coding schemes: 1) Ogg Vorbis [8] and 2) the audio coding standard ISO/IEC MPEG Layer 3, or MP3.

5. RESULTS

The numerical values of both the CR and the ODG for the selected audio files are shown in Table 1. For the EEMD algorithm, 100 realizations were performed, and different values of the WGN power were used. For better compression, the value of this parameter was determined according to the audio file under analysis (see Table 1).

Table 1. Compression ratio (CR) and ODG for the audio files processed by different encoding algorithms.

Columns: Input file, Parameter, EMD, EEMD, MP3 VBR, OGG VBR, MP3 64k.
Rows: Castanets (Note 1), Clarinet (Note 1), Female speech (Note 2), Soprano (Note 2), Glockenspiel (Note 1).
(Numerical entries not reproduced here.)

Note 1: WGN power parameter = 0.01 in EEMD.
Note 2: WGN power parameter = 0.05 in EEMD.

For the clarinet, soprano and glockenspiel audio files, the highest compression is achieved by the EMD-based encoder, while for the castanets and female speech audio files the highest compression is achieved by the MP3 64k encoder (see values in bold in Table 1). Besides, the EEMD-based approach gives almost as much compression as the EMD algorithm, but with higher fidelity (the ODG is less negative). This fidelity improvement illustrates the advantage of the EEMD method in providing IMFs with better spectral concentration, i.e., less mode mixing. Finally, it is observed that in most cases the fidelity (ODG) of the audio files encoded with the proposed method is not very different from that of previously developed techniques. For instance, the difference between the ODG produced by the EMD-based encoder and by the MP3 64k encoder is -0.1 for castanets and -0.9 for clarinet; differences were also computed for female speech, soprano and glockenspiel. This issue is currently under study for further improvements.

6. CONCLUDING REMARKS

The EMD/EEMD audio encoding scheme was presented, tested with well-known audio files and compared with other existing encoding algorithms. This encoder is simple and provides higher compression in some cases. Further improvements regarding fidelity and speed are under development. The advantages of the scheme are its simplicity and flexibility, since the decoder is the same regardless of the algorithm (EMD or EEMD) used in the encoder. This property is crucial for low-cost portable devices that perform audio decoding and playing.

ACKNOWLEDGMENTS

The authors are very grateful to Professor Federico Miyara for having inspired this work.

REFERENCES

[1] xiph.org Foundation, FLAC - Free lossless audio codec (2014).
[2] M. T. Ashland, Monkey's Audio (2013).
[3] M. Hans and R. W. Schafer, Lossless compression of digital audio, IEEE Signal Processing Magazine, Vol 18 (4) (2001).
[4] F. A. Marengo Rodriguez, E. A. Roveri, J. M. Rodríguez Guerrero and M. Treffiló, Análisis comparativo de codificadores de audio sin pérdidas y una herramienta gráfica para su selección y predicción de su desempeño, Mecánica Computacional, Vol 30 (41), Acoustics and Mechanical Vibrations (B) (2011).
[5] O. Bonello, Tecnología de radiodifusión para la década del 90, Revista telegráfica electrónica, pp 293 (1990).
[6] O. Bonello, AUDICOM - Un invento argentino, Coordenadas, Vol 85, pp 4-8 (2010).
[7] ISO/IEC, Information technology: Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s, Part 3: Audio (1993).
[8] xiph.org Foundation, Vorbis audio compression (2016).
[9] European Broadcasting Union, Sound Quality Assessment Material, Recordings for subjective tests: Users Handbook for the EBU-SQAM Compact Disc (2008).
[10] N. E. Huang, Z. Shen, S. R. Long, M. C. Wu, H. H. Shih, Q. Zheng, N. Yen, C. C. Tung and H. H. Liu, The empirical mode decomposition and the Hilbert spectrum for nonlinear and non-stationary time series analysis, Proc. of the Royal Soc. of London (A), Vol 454 (1971) (1998).
[11] G. Rilling, P. Flandrin, and P. Gonçalves, On empirical mode decomposition and its algorithms, Proc. of IEEE-EURASIP Workshop on Nonlinear Signal and Image Processing NSIP-03, Grado, Italy (2003).
[12] N. E. Huang and S. S. P. Shen, The Hilbert-Huang Transform and Its Applications (Interdisciplinary Mathematical Sciences), World Scientific Publishing Company (2005).

[13] Z. Wu and N. E. Huang, Ensemble Empirical Mode Decomposition: a Noise-Assisted Data Analysis Method, Advances in Adaptive Data Analysis, Vol 1 (1), pp 1-41 (2009).
[14] A. Linderhed, 2-D empirical mode decompositions - in the spirit of image compression, Proceedings of SPIE, Wavelet and Independent Component Analysis Applications IXI, Orlando, USA, Vol 4738, pp 1-8 (2002).
[15] C. C. Ho, Empirical Mode Decomposition Based Novel Data Compression Algorithm for Wireless Data Transmission in Machine Health Monitoring, Master's Thesis, City University of Hong Kong (2009).
[16] F. A. Marengo Rodriguez and F. Miyara, Representación de Señales de Audio con Descomposición Empírica de Modos y Submuestreo Adaptativo, Primeras Jornadas Regionales de Acústica, Rosario, Argentina, number A056R. In CD-ROM (2009).
[17] K. Khaldi, A. O. Boudraa, M. Turki, I. Samaali and T. Chonavel, Audio encoding based on the empirical mode decomposition, EUSIPCO 09, Glasgow, United Kingdom (2009).
[18] K. Khaldi, A. O. Boudraa, B. Torresani and T. Chonavel, HHT-based audio coding, Signal, Image and Video Processing, Vol 7 (2), pp 1-9 (2013).
[19] Z. Wu and N. E. Huang, A study of the characteristics of white noise using the empirical mode decomposition method, Proc. of the Royal Society of London (A), Vol 460 (2046) (2004).
[20] P. Flandrin, G. Rilling and P. Gonçalves, Empirical mode decomposition as a filter bank, IEEE Signal Processing Letters, Vol 11 (2) (2004).
[21] M. E. Torres, M. A. Colominas, G. Schlotthauer and P. A. Flandrin, A complete ensemble empirical mode decomposition with adaptive noise, ICASSP, Prague, Czech Republic (2011).
[22] Z. K. Peng, P. W. Tse and F. L. Chu, A comparison study of improved Hilbert-Huang transform and wavelet transform: Application to fault diagnosis for rolling bearing, Mechanical Systems and Signal Processing, Vol 19 (5) (2005).
[23] E. Zwicker and H. Fastl, Psychoacoustics: Facts and Models, Springer-Verlag, Berlin (Germany), 3rd edition (2007).
[24] M. Bosi and R. Goldberg, Introduction to Digital Audio Coding and Standards, ser. Kluwer international series in engineering and computer science, Power electronics and power systems. Springer US (2003).
[25] D. Salomon, Data Compression: The Complete Reference, Springer-Verlag, New York (USA), 3rd edition (2004).
[26] ITU, Method for objective measurements of perceived audio quality, Recommendation ITU-R BS (2001).
