Signal processing in sound engineering



Assessment of speech quality in MP3 compression

Stefan Brachmański
Wroclaw University of Technology, Wybrzeże Wyspiańskiego 27, Wrocław

Summary

The development of telecommunication services demands more effective use of the band assigned to transmission. Audio signals, including speech, are therefore transformed before transmission (i.e. compressed or coded). The aim of the presented research was to determine the influence of bit rate on the quality of speech coded with the MP3 technique, and to define the minimum bit rate that gives satisfactory quality of the coded speech signal. Quality was evaluated with the subjective ACR and DCR methods recommended by the International Telecommunication Union and with the objective PESQ method. In the tests, the sentence lists were read by a male and a female speaker. The results obtained with the ACR method indicate that very good speech quality (MOS over 4.5) is reached at a bit rate of at least 128 kb/s, while with the DCR method it is reached at 56 kb/s. Listeners rated the speech transmission quality as good at bit rates over 64 kb/s in the ACR method and over 32 kb/s in the DCR method.

1. Introduction

Telecommunication technology has developed significantly in recent years, among others in the transmission of video and audio signals, including speech. As a result, more effective use of the band assigned to signal transmission is required. Many current solutions transform the speech signal in various ways to make its transmission, acquisition or recognition more efficient. Various algorithms called codecs (a coder on the sending side and a decoder on the receiving side) are applied for this purpose. Characteristic features of a codec include its bandwidth requirements, the delay it introduces, its compression level and the quality of the reproduced speech signal.
All these factors can, however, degrade the quality of transmitted speech, which is one of the crucial elements in the overall assessment of telecommunication services. Quality of service is now fundamental for everyone: end users, operators, service providers and even hardware suppliers, so proper assessment of the quality of the transmitted speech signal plays a significant role. One of the factors influencing the quality of the transmitted signal is the compression technique applied. Among the various compression techniques, this paper focuses on the MP3 standard, which is extremely popular and is also used in DAB digital radio. The presented tests examined the influence of bit rate on speech transmission quality.

Speech transmission quality can be assessed by subjective or objective methods. The big disadvantage of subjective methods is that they are expensive and time consuming, hence for many years there has been a search for an objective method whose results would be consistent with user opinion. Despite considerable progress in developing new objective methods [1],[3],[12],[13],[16],[17],[18],[25],[26], the only way to verify them is still by methods based on subjective tests [4],[5],[6],[7],[8],[9],[15],[20],[21],[22],[24]. Among the many subjective methods for assessing the quality of transmitted and coded speech, the most popular today are those that give a direct rating on a five-point scale, such as the ACR (Absolute Category Rating) method and the DCR (Degradation Category Rating) method recommended by ITU-T in P.800 [15]. In the presented research both methods were applied to assess the quality of MP3-coded speech at various bit rates.

2. MPEG-1 Audio Layer 3 (MP3) Standard

The MPEG-1 Audio Layer 3 standard, known as MP3, was designed at the Fraunhofer Institute in cooperation with the Thomson company in 1991 and approved by ISO/IEC as an international standard [14]. MPEG-1 audio was realized in three development versions named Layer 1, Layer 2 and Layer 3, whose basic parameters in relation to PCM are given in Table 1.

Tab. 1. Basic MPEG audio compression parameters in relation to a CD-quality stereo signal

Compression          Compression ratio    Required bit rate
PCM (CD quality)     1:1                  1.4 Mb/s
MPEG-1 Layer 1       1:4                  384 kb/s
MPEG-1 Layer 2       1:8                  192 kb/s
MPEG-1 Layer 3       1:                   kb/s

All MPEG audio standards use the same idea: the audio stream is reduced by removing the part of the signal that is unimportant from the listener's point of view. The imperfections of the human ear are exploited, in particular the masking effect (Fig. 1). Weaker sounds appearing close to stronger ones can be removed without the ear noticing, and the remaining useful signal then contains less information [23]. The MP3 standard also exploits the fact that, because neural stimuli travel to the brain at limited speed, a listener cannot distinguish weak sounds that appear shortly before or after stronger ones. MP3 uses this to extend the masking range in time: before the masking signal, masking occurs in a very short interval of 2 to 5 ms (some sources give about 20 ms), and after the signal in a much longer interval of 50 ms to 200 ms. The MP3 coder combines two kinds of compression, lossy and lossless, but the lossy part dominates and its algorithm is much more complex [19], [27].

Fig. 1. Masking effect (white bars: sounds which can be masked during compression; gray bar: audible sound).
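The PCM entry in Table 1 can be checked with a few lines of arithmetic. The sketch below assumes the standard CD audio parameters (44.1 kHz, 16 bits, two channels), which the table does not state explicitly, and compares the resulting bit rate with the Layer 1 and Layer 2 rates listed there. The helper names are illustrative only.

```python
# Assumed CD-quality PCM parameters (standard CD audio values, not given in Table 1).
SAMPLE_RATE_HZ = 44_100
BITS_PER_SAMPLE = 16
CHANNELS = 2


def pcm_bit_rate_kbps(sample_rate_hz, bits_per_sample, channels):
    """Uncompressed PCM bit rate in kb/s."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def compression_ratio(pcm_kbps, coded_kbps):
    """How many times smaller the coded stream is than the PCM stream."""
    return pcm_kbps / coded_kbps


pcm_kbps = pcm_bit_rate_kbps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
print(f"PCM (CD quality): {pcm_kbps:.1f} kb/s")

# Bit rates taken from Table 1; the Layer 3 row is incomplete there and is left out.
for name, coded_kbps in [("MPEG-1 Layer 1", 384), ("MPEG-1 Layer 2", 192)]:
    ratio = compression_ratio(pcm_kbps, coded_kbps)
    print(f"{name}: {coded_kbps} kb/s, about 1:{ratio:.1f}")
```

The computed PCM rate is 1411.2 kb/s, i.e. the 1.4 Mb/s given in the table. The computed ratios come out somewhat below 1:4 and 1:8, which suggests the ratios in Table 1 are nominal round figures.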

Fig. 2. Illustration of inaudible sounds in the presence of a strong signal.

The first phase of coding is the division of the digital audio signal into 32 bands and the calculation of a 1024-point Fast Fourier Transform (FFT). Signal components lying outside the human audible range are omitted. In the next phase the Modified Discrete Cosine Transform (MDCT) and the psycho-acoustic model are used. The MDCT is the primary element that distinguishes MPEG-1 Layer 3 from Layers 1 and 2. It is a lapped transform, which avoids artifacts stemming from block boundaries that would otherwise be audible every few milliseconds. The output of this block is a set of MDCT coefficients.

The psycho-acoustic model assumes that, given the characteristics of the human ear and brain, a listener is not able to receive and process all the acoustic information carried by sound. The Layer 3 psycho-acoustic model assumes an audible range from 20 Hz to 16 kHz (Layer 2 takes the range 20 Hz to 20 kHz) and maximum ear sensitivity in the range from 2 kHz to 4 kHz. As a result of the operations in the psycho-acoustic block, a decision is made as to which data should be represented more precisely and which are less relevant. Data that are not consistent with the model are rejected. The signal is then passed to the quantizer and coder. To obtain more efficient coding, nonlinear quantization, adaptive segmentation and Huffman coding are applied.

In the final phase the audio signal is divided into small parts called frames. Each MP3 file consists of frames containing the data for a fraction of the recording to be reconstructed by the decoder. Each frame has a header of 32 bits of additional information describing, among other things, the kind and parameters of the sound (Fig. 3). The header also carries all the information necessary for proper reconstruction of the audio data that follow in the remaining part of the frame.
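The 32-bit frame header described above can be decoded with simple bit operations. The following is a minimal sketch, not a full parser: the field layout and lookup tables are taken from the MPEG-1 audio frame format rather than from this paper, only MPEG-1 Layer 3 headers are accepted, and the function names are illustrative.

```python
# Lookup tables for MPEG-1 Layer 3 (index 0 = "free format", index 15 = invalid).
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44_100, 48_000, 32_000, None]
CHANNEL_MODES = ["stereo", "joint stereo", "dual channel", "mono"]


def parse_mp3_header(header_bytes):
    """Decode the main fields of a 4-byte MPEG-1 Layer 3 frame header."""
    if len(header_bytes) != 4:
        raise ValueError("an MP3 frame header is exactly 4 bytes long")
    word = int.from_bytes(header_bytes, "big")
    if word >> 21 != 0x7FF:                    # 11 sync bits, all set
        raise ValueError("frame sync not found")
    version_bits = (word >> 19) & 0b11         # 0b11 means MPEG-1
    layer_bits = (word >> 17) & 0b11           # 0b01 means Layer 3
    if version_bits != 0b11 or layer_bits != 0b01:
        raise ValueError("this sketch handles MPEG-1 Layer 3 only")
    return {
        "crc_protected": ((word >> 16) & 1) == 0,
        "bitrate_kbps": BITRATES_KBPS[(word >> 12) & 0xF],
        "sample_rate_hz": SAMPLE_RATES_HZ[(word >> 10) & 0b11],
        "padding": (word >> 9) & 1,
        "channel_mode": CHANNEL_MODES[(word >> 6) & 0b11],
    }


def frame_length_bytes(header):
    """Frame length in bytes for MPEG-1 Layer 3 (1152 samples per frame)."""
    bits_per_second = header["bitrate_kbps"] * 1000
    return 144 * bits_per_second // header["sample_rate_hz"] + header["padding"]


example = parse_mp3_header(bytes.fromhex("fffb9000"))
print(example, frame_length_bytes(example))
```

The example header `ff fb 90 00` decodes to 128 kb/s, 44.1 kHz, stereo, no CRC, with a frame length of 417 bytes. Headers with a free-format or invalid bit rate index would need extra handling that is omitted here.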

Fig. 3. Frame block in MP3

3. Absolute Category Rating (ACR) method

The Absolute Category Rating (ACR) method is recommended by the International Telecommunication Union (ITU) [15] for assessing the quality of a speech signal. Test lists comprise simple, short (2-3 s), semantically unrelated sentences. A test list is divided into groups of five sentences. The experimenter must decide how many sentences are required in each group to constitute a speech sample; a minimum of two and a maximum of five are recommended. The test material should be properly prepared and recorded. The speaker should pronounce the sentences fluently and should not have any speech defects. To reduce the influence of the individual characteristics of a speaker's voice on the result, several speakers should take part in the experiment.

The ITU-T P.800 recommendation permits the utterances to be recorded in advance on high-quality equipment such as a conventional two-track tape recorder (high-grade tape), a two-channel digital audio processor, a high-quality video cassette recorder, a digital tape recorder (DAT) or a computer with acoustic input and output. Speech is recorded from a linear microphone and a low-noise amplifier with a flat frequency response. The recording should be carried out in a room of volume m³, with a reverberation time of less than 0.5 s (preferably in the range s) and a noise level no higher than 30 dBA. The recording level should be set 20-30 dB below the overload level of the recording system. At the beginning of each recording, a 20-second calibration tone of known level is recorded. Usually a 1000 Hz sinusoidal signal is used as the calibration tone, unless the system is sensitive to this frequency (i.e. the frequency is used for other purposes), in which case a tone of a different frequency can be used. The recorded calibration tone can be used to set the listening level.
Listening is carried out in a room with the same parameters as for the recording of the test lists. It is recommended to measure the noise level at least twice, i.e. at the beginning and at the end of the measurements. If there is a large difference between the two readings, the person leading the experiment should judge whether it can influence the final scores. The listening part of the experiment should take place in a room with a noise level below 30 dBA. Listeners are chosen at random from the normal telephone-using population.

Listeners read the instructions for the experiment before the measurements begin. Various scales may be used for different purposes. The operator presents one of the following opinion scales recommended by ITU:
a) listening-quality scale (Excellent - 5, Good - 4, Fair - 3, Poor - 2, Bad - 1),
b) listening-effort scale (Complete relaxation possible; no effort required - 5, Attention necessary; no appreciable effort required - 4, Moderate effort required - 3, Considerable effort required - 2, No meaning understood with any feasible effort - 1),
c) loudness-preference scale (Much louder than preferred - 5, Louder than preferred - 4, Preferred - 3, Quieter than preferred - 2, Much quieter than preferred - 1).

Listeners listen to the sentences and give their opinions on the five-point scale. The mean value should be calculated over listeners and speakers for each speech transmission condition. Male and female voices have different characteristics, which is why both types of voice should be included and the results analyzed separately. When the differences between the results for the two types of voice are insignificant, the final results can be averaged.

4. Degradation Category Rating (DCR)

The Degradation Category Rating (DCR) method is recommended by ITU [15] as an alternative to the ACR method, which is not accurate enough for high-quality systems. The measurement consists of comparing the tested system with a high-quality reference system. Speech samples, i.e. different sentences (the same sentence lists as in ACR), are selected from a larger, balanced test list and presented to the listeners in single pairs (A-B) or repeated pairs (A-B-A-B), where A is the reference sample and B is the tested sample. Each pair is rated separately. It is recommended to include several null pairs (A-A) to verify the quality and sensitivity of the ratings given by the participants. Samples A and B should be separated by a s interval.
In a repeated pair procedure (A-B-A-B), the separation between the two pairs should be s. The listeners evaluate the degree of deterioration in the quality of the second sentence in comparison with the first one, using a five-point degradation scale. In the DCR method the listeners answer the question "Please rate the degradation of the second sample relative to the first". Listeners hear two sentences (original and transmitted) and give their opinions on a five-point scale (degradation is inaudible - 5, degradation is perceived but not annoying - 4, degradation is slightly annoying - 3, degradation is annoying - 2, degradation is very annoying - 1). The average rating (Degradation Mean Opinion Score, DMOS) is calculated over the listeners and the speakers for each tested speech transmission condition. The requirements for the room acoustics, recording, listening sessions, selection of the listening group and test material are similar to those for ACR.

5. Comparison Category Rating (CCR)

The Comparison Category Rating (CCR) method [15] is similar to DCR. The process of recording and replaying the lists is the same, but the reference and tested samples are played in random order (the A-B pairs are created randomly). The listeners' task is to compare two samples A and B and to assess whether the quality of the first signal in comparison to the second one is the same or different. A seven-grade scale from 3 to -3 is used for the quality of the first signal in comparison to the second one (3 - much better, 2 - better, 1 - slightly better, 0 - about the same, -1 - slightly worse, -2 - worse, -3 - much worse). The resulting assessment is the mean of all partial marks and is called CMOS (Comparison Mean Opinion Score).

6. Perceptual Evaluation of Speech Quality Method

The Perceptual Evaluation of Speech Quality (PESQ) method [17] is an improved version of PSQM (Perceptual Speech Quality Measure) [16], based on a transformation of physical signals using psychoacoustic modeling [2]. PSQM was recommended by ITU-T as P.861 [16], and in 2001 ITU-T accepted PESQ as the new standard P.862 [17], which replaced the previously recommended PSQM method. Like PSQM, PESQ measurement is based on a so-called internal representation, which reflects a theoretical form of the speech signal in the human brain. Previously recorded male and female voices (one sentence per voice) are used as the reference signal. The original signal prepared in this way is transmitted through the telecommunication channel under investigation, and at the output of the channel the signal is distorted (degraded). The two signals are then compared in a psychoacoustic domain that reflects the human impression of speech.

The transformation from the physical form into the psychoacoustic representation takes place in three stages: time-frequency mapping, frequency scaling into critical bands, and scaling of the signal levels. In the first stage the time signals are mapped to the time-frequency domain using a short-term FFT with a Hann window of 32 ms (N=256 samples at a sampling frequency of 8 kHz, or N=512 samples at a sampling frequency of 16 kHz). The overlap between successive time windows is 50%. The second stage takes into account the fact that the hearing system has worse frequency discrimination at higher frequencies than in the lower band. Together with signal-by-noise masking, this is modeled by a filter bank of critical bands on the Bark scale, which represents the hearing process for a particular stimulus. The resulting continuous spectrum represents the distribution of stimuli over the nerves connected to the basilar membrane, and reflects complex phenomena such as nonlinear smoothing in critical bands.
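The framing used in the first stage can be sketched as follows. This illustrates only the 32 ms Hann window with 50% overlap described above, not the PESQ algorithm itself. The choice of the periodic form of the Hann window is an assumption, the FFT of each frame is omitted, and all function names are illustrative.

```python
import math


def frame_parameters(sample_rate_hz, window_ms=32.0, overlap=0.5):
    """Return (frame length, hop size) in samples."""
    frame_len = int(round(sample_rate_hz * window_ms / 1000))
    hop = int(frame_len * (1 - overlap))
    return frame_len, hop


def hann_window(n):
    """Periodic Hann window of n samples (assumed variant)."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / n) for i in range(n)]


def split_into_frames(signal, frame_len, hop):
    """Cut the signal into overlapping, Hann-weighted frames."""
    window = hann_window(frame_len)
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        chunk = signal[start:start + frame_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# One second of a 1 kHz test tone sampled at 8 kHz (synthetic input, for illustration).
fs = 8000
tone = [math.sin(2 * math.pi * 1000 * t / fs) for t in range(fs)]
frame_len, hop = frame_parameters(fs)
frames = split_into_frames(tone, frame_len, hop)
print(frame_len, hop, len(frames))
```

At 8 kHz this gives frames of 256 samples with a hop of 128 samples, and at 16 kHz frames of 512 samples, matching the values of N quoted in the text. Each frame would then be passed to an FFT to obtain the short-term spectrum.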
The third stage of the transformation is the scaling of signal levels in dB into loudness levels in phons, and finally into loudness in sones [28]. This step reflects the nonlinear relation between signal level and the impression of loudness. At the end of the processing chain the cognitive model is applied, and the final decision results from a comparison between the two internal representations (spectra) of the tested and reference signals. The output value is the PESQ score, which ranges from -0.5 to 4.5. The PESQ score can be transformed into a subjective listening-quality MOS-like scale between 1.0 and 4.5, the normal range of MOS values found in an ACR experiment.

7. Experiment

The aims of the tests were:
- checking the influence of bit rate on the quality evaluation of an MP3-coded speech signal,
- determining the minimum bit rate that gives satisfactory quality of the coded speech signal,
- examining the differences between the results obtained with the ACR and DCR methods,
- checking whether a relation can be determined between the subjective ACR and DCR methods and the objective PESQ method.

The ACR and DCR measurements were carried out in a room of the Chair of Acoustics and Multimedia of Wroclaw University of Technology. The room fulfilled the ITU-T P.800 recommendation [15]. The listeners were from 20 to 30 years old. Two native speakers of Polish (a woman and a man) participated in the study. Listeners were students at the Wroclaw University of Technology whose age ranged from 20 to 23 years. The test material consisted of phonetically balanced sentence lists (Fig. 4), which were read by the male and female speakers. The reference sentence lists were recorded on a digital tape recorder with a sampling frequency of 44.1 kHz and 16 bits. The test lists were created with the GX Transcoder converter, which converts between the most popular audio formats. In the tests carried out for the MP3 format the bit rate was varied from 24 kb/s to 320 kb/s. For each measurement point (bit rate) there were two lists of 50 sentences each, one read by the male and one by the female speaker. The listeners gave their marks on a five-point scale from 1 to 5.

The test signals were presented to the listeners through earphones using computer software created in the Laboratory of Analysis and Processing of Acoustic Signals at the Chair of Acoustics and Multimedia, Faculty of Electronics, Wroclaw University of Technology. The software allows tests to be carried out with the ACR and DCR methods. Each listener has a screen, keyboard and mouse. Following instructions from the person leading the experiment, the listener chooses the measurement option (ACR or DCR). With the ACR method only one tested signal is presented to the listener, whereas with the DCR method two signals are presented: the original (first) and the tested one (second). Before the measurements start, instructions on the measurement method and on how to give a mark are displayed on the screen. After listening to a test sequence, the listener gives a mark on the five-point scale. The mark must be given within a set time limit, and the passage of time is shown to the listener on the screen. If the listener does not give a mark within the planned time, the program assigns the lowest possible mark, i.e. 1. The software records the marks of each listener separately and, after the end of the session, gives the mean MOS for each listener and each tested transmission condition (bit rate).

All computers taking part in the experiment can be connected over a network, and the result can then be presented as the mean mark of each individual listener or as the mean value over all listeners, with full statistical information. Measurement over the network requires all listeners to be gathered at the same time, because the evaluation is done simultaneously by all of them. Single-station measurement, by contrast, allows tests to be carried out at different times, but the same measurement conditions must be kept. The MOS results obtained with the ACR method for male and female voices are presented in Fig. 5, and the results obtained with the DCR method in Fig. 6. The measurements presented below were partially carried out within the framework of a diploma project at the Faculty of Electronics of Wroclaw University of Technology [11].
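The scoring rule described above, including the fallback to the lowest mark when a listener runs out of time, can be sketched as follows. This is an illustration under stated assumptions and not the laboratory's software: the vote format, the use of `None` for a missed answer and the function names are hypothetical, and the sample votes are made up purely to exercise the code.

```python
from statistics import mean

LOWEST_MARK = 1  # mark assigned when no answer is given within the time limit


def effective_mark(mark):
    """Replace a missing answer (None) with the lowest possible mark."""
    return LOWEST_MARK if mark is None else mark


def mos_per_condition(votes):
    """Mean opinion score per bit rate.

    votes: iterable of (bit_rate_kbps, listener_id, mark), where mark is
    an integer from 1 to 5, or None if the listener ran out of time.
    """
    marks_by_rate = {}
    for bit_rate_kbps, _listener_id, mark in votes:
        marks_by_rate.setdefault(bit_rate_kbps, []).append(effective_mark(mark))
    return {rate: mean(marks) for rate, marks in marks_by_rate.items()}


# Made-up votes for illustration only; these are not results from the paper.
sample_votes = [
    (64, "listener_a", 4),
    (64, "listener_b", None),   # timed out, counted as 1
    (128, "listener_a", 5),
    (128, "listener_b", 5),
]
print(mos_per_condition(sample_votes))
```

Because a timeout counts as the lowest mark, a slow listener pulls the mean down in the same way as a genuinely poor rating, which is worth keeping in mind when reading MOS values for individual listeners.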

Fig. 4. Sample sentence list

The analysis of the results showed no significant differences between the marks obtained for male and female voices with either the ACR or the DCR method. According to the P.800 recommendation [15], in such a case the results for male and female voices can be averaged. The mean results obtained with the ACR and DCR methods are presented in Fig. 7. The measurements with the objective PESQ method were carried out with the same sentence lists and under identical transmission conditions. The results are also presented in Fig. 7.
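To place raw PESQ scores on the same axis as MOS values, a mapping function is needed. The paper does not state which mapping was used, so the sketch below is an assumption: it uses the logistic function from ITU-T P.862.1, which converts a raw P.862 score into a listening-quality MOS estimate (MOS-LQO).

```python
import math


def pesq_to_mos_lqo(pesq_score):
    """Map a raw PESQ score (-0.5 to 4.5) to MOS-LQO using the P.862.1 function.

    Assumption: the paper may have used a different mapping.
    """
    return 0.999 + 4.0 / (1.0 + math.exp(-1.4945 * pesq_score + 4.6607))


for raw in (-0.5, 1.0, 2.0, 3.0, 4.0, 4.5):
    print(f"PESQ {raw:4.1f} -> MOS-LQO {pesq_to_mos_lqo(raw):.2f}")
```

The function is monotonic and compresses the raw range of -0.5 to 4.5 into roughly 1.0 to 4.5, which corresponds to the MOS-like range mentioned in the description of PESQ.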

Fig. 5. MOS speech signal quality evaluation obtained with the ACR method as a function of bit rate (ACR-M: male voice, ACR-F: female voice)

Fig. 6. MOS speech signal quality evaluation obtained with the DCR method as a function of bit rate (DCR-M: male voice, DCR-F: female voice)

Fig. 7. Mean MOS speech quality evaluation obtained with the ACR, DCR and objective PESQ methods as a function of bit rate.

8. Summary

The analysis of the results shows that, as could be expected, in both subjective methods, Absolute Category Rating (ACR) and Degradation Category Rating (DCR), increasing the bit rate improves speech quality. There were no significant differences between the results for male and female voices, which allowed the results for both voices to be averaged. The results for the ACR method indicate that very good speech quality (MOS over 4.5) can be achieved at a minimum bit rate of 128 kb/s, whereas for the DCR method the minimum is 56 kb/s. Listeners rated the speech transmission quality as good at bit rates from 64 kb/s to 128 kb/s in the ACR method, and from 32 kb/s to 56 kb/s in the DCR method. Comparing the two methods, the MOS increases more with bit rate in the DCR method than in the ACR method.

Similar research was carried out to evaluate speech transmission in the DAB+ digital radio system. The results were presented at the 134th Convention of the Audio Engineering Society in Rome [4]. That research used sound samples taken after transmission through the chain multiplex, radio transmitter, radio receiver. The MPEG-2/4 AAC and HE AAC v.1 systems were used for coding during an experimental broadcast carried out in Wroclaw. The tested samples (male and female voices) were transmitted at six different bit rates (136 kbit/s, 128 kbit/s, 96 kbit/s, 64 kbit/s, 48 kbit/s and 24 kbit/s) with a sampling frequency of 48 kHz. For each bit rate two versions of the signals were recorded: with the SBR processor switched on and switched off. CD recordings, which were the audio source, were used as the reference samples. The tested samples were recorded from the analog output of the DAB CLINT Audio 01 radio receiver with a Tascam DA 30 DAT digital tape recorder, with a sampling frequency of 44.1 kHz and 16 bits. That research confirmed that the DCR method gives a very good rating for bit rates from 96 kb/s, and the ACR method for bit rates from 128 kb/s.

The results obtained with the ACR method in the two experiments are convergent. In both cases a very good speech transmission rating was obtained for bit rates of at least 128 kb/s, and the results agree despite the different speech coding techniques (MP3 and AAC). The differences lie in the experimental results obtained with the DCR method. In the presented research (MP3 coding) a very good rating was obtained from a bit rate of 56 kb/s, whereas in the experiments with DAB+ digital radio transmission (AAC coding) it was obtained from a bit rate of 96 kb/s. In the DAB+ study it was also found that the results obtained with the ACR method were higher than with the DCR method, the opposite of what was observed in this paper for MP3 coding. Clarifying this requires further research with a bigger group of listeners and more diverse test material.

The measurements carried out with the objective PESQ method are burdened with an error for bit rates exceeding 128 kb/s. PESQ allows a maximum sampling frequency of 16 kHz with 16 bits, which gives 128 kb/s. It can be expected that for higher bit rates the rating will not change and will stay at its maximum value.

After the speech transmission quality measurements were finished, the listeners were asked to share their impressions of the particular methods. They emphasized the lack of a well-defined marking scale in the ACR method and the absence of a reference sample to which the tested sample could be compared, which made marking difficult during the evaluation. In the DCR method a reference signal exists, but the listeners still had problems with marking the tested sample, and for this method too there were complaints that the marking scale was not defined precisely enough.

Literature

[1] ANSI S 3.5, (1997), Methods for the calculation of the speech intelligibility index (SII).
[2] Barbedo J.G.A., Lopes A., (2005), A new cognitive model for objective assessment of audio quality, J. Audio Eng. Soc., 53, 1/2, 22-31.
[3] Beerends J.G., Stemerdink J.A., (1994), A Perceptual Speech-Quality Measure Based on a Psychoacoustic Sound Representation, J. Audio Eng. Soc., 42, 3.
[4] Brachmański S., Kin M., (2013), Assessment of speech quality in Digital Audio Broadcasting (DAB+) system, 134th Convention Audio Engineering Society, Convention paper 8829, Rome, Italy.
[5] Brachmański S., (2012), Automation of Subjective Measurements of Logatom Intelligibility in Classrooms, Automation, ed. by Florian Kongoli, InTech.
[6] Brachmański S., (2008), Automation of subjective measurements of speech intelligibility in analogue telecommunication channels, Archives of Acoustics, 33, 3.
[7] Brachmański S., Kula S., (2003), Badanie jakości mowy w połączeniach głosowych. Stara usługa - nowe problemy, Przegląd Telekomunikacyjny i Wiadomości Telekom., 8-9.
[8] Brachmański S., (1999), Subiektywne metody oceny jakości transmisji mowy w cyfrowych kanałach telekomunikacyjnych, Krajowe Sympozjum Telekom. 1999, Tom B, Bydgoszcz, Poland.
[9] Brachmański S., (2001), Automatyzacja subiektywnych pomiarów jakości transmisji mowy metodą ACR, XLVIII Open Seminar on Acoustics, Wrocław-Polanica Zdrój, Poland.
[10] Brachmański S., (2001), Fonetyczna struktura materiału testowego stosowanego w pomiarach jakości transmisji mowy metodą ACR, XLVIII Open Seminar on Acoustics, Wrocław-Polanica Zdrój, Poland.
[11] Dończyk R., (2013), Wpływ techniki kompresji na jakość mowy, Praca dyplomowa, Politechnika Wrocławska, Wrocław.

[12] French N.R., Steinberg J.C., (1947), Factors governing the intelligibility of speech sounds, J. Acoust. Soc. Am., 19.
[13] Houtgast T., Steeneken H.J.M., (1973), The Modulation Transfer Function in room acoustics as a predictor of speech intelligibility, Acustica, 28.
[14] ISO/IEC, (1993), Information Technology - Coding of moving pictures and associated audio for digital storage media at up to about 1.5 Mbit/s, Part 3: Audio; Standard ISO/IEC.
[15] ITU-T Recommendation P.800, (1996), Method for subjective determination of transmission quality.
[16] ITU-T Recom. P.861, (1996), Objective Quality Measurement of Telephone-band ( Hz) Speech Codecs.
[17] ITU-T Recom. P.862, (2007), Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs.
[18] ITU-T Recom. P.863, (2011), Methods for Objective and Subjective Assessment of Speech Quality. Perceptual Objective Listening Quality Assessment.
[19] Li Z.N., Drew M.S., (2004), Fundamentals of multimedia, Pearson Education Inc.
[20] Majewski W., Myślecki W., Baściuk K., Brachmański S., (1998), Application of modified logatom intelligibility test in telecommunications, audiometry and room acoustics, Proc. 9th Mediterranean Electrotechnical Conf. Melecon 98, 25-28, Tel-Aviv, Israel.
[21] Polska Norma PN-90 / T 05100, (1990), Analogowe łańcuchy telefoniczne. Wymagania i metody pomiaru wyrazistości logatomowej, Warszawa.
[22] Polska Norma PN V , (1999), Cyfrowe łańcuchy telefoniczne. Wymagania i metoda pomiaru wyrazistości logatomowej, Wyd. Norm., Warszawa.
[23] Rabiner L.R., Schafer R.W., (2011), Theory and applications of digital speech processing, Pearson Education Inc.
[24] Sotschek J., (1976), Methoden zur Messung der Sprachgüte I: Verfahren zur Bestimmung der Satz- und der Wortverständlichkeit, Der Fernmelde Ingenieur, 10.
[25] Voran S., (1999), Objective Estimation of Perceived Speech Quality, Part I: Development of the Measuring Normalizing Block Technique, IEEE Trans. Speech Audio Process., 7.
[26] Voran S., (1999), Objective Estimation of Perceived Speech Quality, Part II: Evaluation of the Measuring Normalizing Block Technique, IEEE Trans. Speech Audio Process., 7.
[27] Zölzer U., (2008), Digital Audio Signal Processing, John Wiley & Sons Ltd.
[28] Zwicker E., Feldtkeller R., (1967), Das Ohr als Nachrichtenempfänger, S. Hirzel Verlag, Stuttgart.
