ROBUST SPEECH CODING WITH EVS

Anssi Rämö, Adriana Vasilache and Henri Toukomaa
Nokia Technologies, Tampere, Finland
anssi.ramo@nokia.com, adriana.vasilache@nokia.com, henri.toukomaa@nokia.com

ABSTRACT

This paper discusses the voice and audio quality characteristics of EVS, the recently standardized 3GPP codec. Frame erasure conditions in particular were evaluated. Comparisons were made to the industry-standard voice codecs 3GPP AMR and AMR-WB, as well as to direct signals at varying bandwidths. Speech quality was evaluated with two subjective listening tests containing clean and noisy speech in Finnish. Five different random frame erasure rates were evaluated: 0 %, 3 %, 6 %, 10 % and 15 %. A nine-scale subjective mean opinion score was calculated for all tested conditions.

Index Terms: speech coding, listening testing, multi-bandwidth testing, mean opinion score, frame erasure

1. INTRODUCTION

In August 2014, 3GPP SA4 accepted the EVS (Enhanced Voice Services) codec as the next-generation conversational codec for 3GPP Release 12 onwards [1], [2]. The requirements for the EVS codec performance were quite strict [3], and thorough listening tests were performed by three independent laboratories during the summer of 2014. These results are available in the EVS selection phase global analysis (GAL) report [4]. Further official listening test results were published in the characterization test report [5]. However, in these reports all listening tests in frame erasure conditions were conducted with a single bandwidth per test. Therefore, in this work we performed two multi-bandwidth (narrowband, wideband, super-wideband and fullband) characterization listening tests in order to compare the EVS and previous-generation AMR [6] and AMR-WB [7] voice codecs against each other in noisy channel conditions with varying signal bandwidth. A similar clean-channel characterization was performed in our earlier paper [8]. A modified 9-scale absolute category rating (ACR) test methodology was used for all experiments [9], [10].

The EVS codec supports four input and output sampling rates (8, 16, 32 and 48 kHz). There are twelve bitrates ranging from 5.9 kbit/s to 128 kbit/s. The 5.9 kbit/s mode uses variable bit rate (VBR) coding with discontinuous transmission (DTX) always enabled, while all other bitrates are constant bit rate (CBR), where DTX may optionally be enabled. Frame error robustness has also been optimized to a great degree, providing significantly better frame error concealment performance than, for example, AMR-WB or G.718 [11], [12]. Audio and speech coding modes are switched internally in real time by the EVS codec depending on the input signal characteristics. An AMR-WB interoperable mode with enhanced voice quality is also integrated into the EVS codec. More technical details can be found in the EVS specification as well as in the papers from the ICASSP 2015 special session [13], [14], [15].

Not all of these features could be incorporated into a single listening test. It was decided to test the most interesting EVS primary mode bitrate range for all signal bandwidths with both clean and noisy speech. Robustness to frame erasures was the most interesting aspect of this listening evaluation. The evaluated frame erasure rates were 0 % (i.e. no frame erasures, clean channel), 3 %, 6 %, 10 % and 15 %. The tested frame erasure range is wider than usual, so that the benefits of the EVS codec's robustness features can be shown in full.
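As a minimal illustration of how such random frame erasure conditions can be produced, the following C sketch drops each decoded frame independently with a probability equal to the target erasure rate. The function and variable names are assumptions made for illustration only; the actual 3GPP error insertion tools used in codec testing are not described in this paper.

#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Returns true if the current 20 ms frame should be erased; an erased
   frame is replaced by the decoder's concealment output. */
static bool erase_frame(double fer_rate)
{
    return (double)rand() / (double)RAND_MAX < fer_rate;
}

int main(void)
{
    const double fer = 0.10;        /* e.g. the 10 % frame erasure condition */
    const int frames = 3000;        /* one minute of 20 ms frames            */
    int erased = 0;

    for (int n = 0; n < frames; n++)
        if (erase_frame(fer))
            erased++;

    /* The realized rate approaches the target for long enough runs. */
    printf("target %.2f, realized %.3f\n", fer, (double)erased / frames);
    return 0;
}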
Section 2 details some EVS speech-core-specific robustness features that have not been explained in any earlier conference paper. Section 3 describes the listening test methodology and the conditions included in the listening evaluation. Section 4 presents the subjective listening test results in several figures. Finally, conclusions are drawn in Section 5.

2. FRAME ERROR CONCEALMENT IN THE EVS CODEC

An overview of the packet loss concealment methods of the EVS codec is presented in [11]. Additional details concerning the time-domain concealment are given in [16]. Since most of the analysis in this paper focuses on speech data, this section concentrates on the concealment techniques related to the speech core of the codec, more precisely those aspects related to the linear prediction coefficients (LPC) that have not been presented in the above-mentioned works.

The speech operating mode of the speech/audio switched EVS codec is based on a code-excited linear prediction (CELP) approach. The encoding of the LPC parameters is done in the line spectral frequency (LSF) domain. Details of the actual encoding technology of the LSF parameters and the variable bit budget are presented in [17]. In order to increase the compression efficiency, prediction is used in the quantization of the LSF parameters.

Within the speech core the signal is classified as voiced, unvoiced, transition, generic, inactive or audio-like. Based on the signal type, a purely predictive, a purely non-predictive (safety-net) or a switched safety-net/predictive quantizer is used. The purely predictive quantizer uses a moving average (MA) predictor. Auto-regressive (AR) prediction has a higher coding gain but also a longer recovery time after a frame loss. It is therefore used for the signal types where it brings the largest coding gain advantage, for instance the voiced signal type. However, in order to limit the sensitivity to frame losses, the AR predictive quantizer is used in conjunction with the safety net. Table 1 indicates, for each signal type, the prediction mode used in the LSF quantizer.

                     I    UV   V    GE   T    A
  NB                 1    1    2    2    0    2
  WB < 9.6 kbit/s    1    1    2    2    0    2
  WB ≥ 9.6 kbit/s    1    1    2    1    0    1
  WB2                1    -    2    1    0    1

Table 1. Predictor allocation for each of the signal types: inactive (I), unvoiced (UV), voiced (V), generic (GE), transition (T) and audio (A). The values correspond to: safety net only - 0, MA prediction - 1, switched safety net/AR prediction - 2. The UV mode is not used for WB2. WB2 stands for wideband signals encoded with a core working at a 16 kHz sampling rate.

For the modes where switched safety-net/predictive quantization is allowed, the selection between the two is done as follows. For frame error concealment reasons, the safety net is imposed at core or bitrate switching. In addition, the safety net is imposed for the next frame after voiced-class signals if the frame erasure (FE) mode LSF estimate of the next frame, based on the current frame, is far from the current frame LSF vector. Far means that the distance is larger than 0.25. The distance, or stability factor, sf, is calculated as

    sf = 1.25 - (256 · D) / (400000 · L),

where L is the frame length in samples of the current frame and D is the Euclidean distance between the current frame LSF vector and the FE mode LSF estimate for the next frame. In this case the safety-net decision is forced for the subsequent frame. The FE mode LSF estimate is calculated as a linear combination of an adaptive LSF mean vector and preset values; the combination factors depend on the coding mode and signal class. Details can be found in [14].

When safety-net usage is not forced, the decision is made in closed loop (CL), after computing the quantization errors in the AR predictive and safety-net modes, Err[1] and Err[0] respectively. If

    Err[0] < at    or    Err[0] · SL < 1.05 · Err[1],        (1)

then the safety net is selected. Thus the safety-net mode is selected if the quantization distortion (weighted Euclidean distance) of the quantized safety-net codevector is smaller than at, an absolute threshold of 41000 for narrowband or 45000 for wideband frames. For these relatively low error values the quantization is already transparent with respect to the original LSF values, and from the error recovery point of view it makes sense to use the safety net as often as possible. Finally, the safety-net quantization error is compared to the predictive quantization error, with a scaling of 1.05 to prefer safety-net usage. The streak-limit factor (SL) is initially 1 and is subsequently multiplied by 0.8 at each consecutive predictive frame once the streak limit has been passed. In voiced mode streak limiting starts after 6 frames, in other modes after 3 frames. The preference for predictive frames thus gets smaller as the streak of consecutive predictive frames gets longer. This is done in order to restrict very long usage streaks of predictive frames, for frame-erasure concealment reasons. For voiced speech longer predictive streaks are allowed than for the other speech types.
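The following C sketch summarizes the selection logic described above. It is an illustrative reconstruction, not the 3GPP reference code: the state structure, function names and the reset behaviour of the streak-limit factor are assumptions, while the thresholds (41000/45000), the 1.05 scaling, the 0.8 multiplier and the 6/3 frame streak limits follow the text.

#include <stdbool.h>

#define ABS_THR_NB 41000.0f    /* absolute distortion threshold, NB frames */
#define ABS_THR_WB 45000.0f    /* absolute distortion threshold, WB frames */

typedef struct {
    float streaklimit;   /* SL in Eq. (1); starts at 1.0                  */
    int   pred_streak;   /* consecutive frames quantized predictively     */
} lsf_sel_state;

/* Closed-loop choice between the safety-net (SN) and AR-predictive LSF
   quantizer. err_sn and err_ar are the weighted Euclidean quantization
   errors Err[0] and Err[1]. Returns true when SN is selected. */
static bool select_safety_net(lsf_sel_state *st, float err_sn, float err_ar,
                              bool wideband, bool voiced, bool forced_sn)
{
    const float at = wideband ? ABS_THR_WB : ABS_THR_NB;
    const int streak_limit = voiced ? 6 : 3;   /* streak limiting starts later for voiced */
    bool use_sn;

    if (forced_sn) {
        /* SN imposed at core/bitrate switching, or when the FE-mode LSF
           estimate of the next frame is far from the current LSF vector. */
        use_sn = true;
    } else {
        /* Eq. (1): Err[0] < at  OR  Err[0]*SL < 1.05*Err[1]. */
        use_sn = (err_sn < at) || (err_sn * st->streaklimit < 1.05f * err_ar);
    }

    if (use_sn) {
        st->pred_streak = 0;
        st->streaklimit = 1.0f;   /* assumed reset; the paper only states the initial value */
    } else {
        st->pred_streak++;
        if (st->pred_streak > streak_limit)
            st->streaklimit *= 0.8f;   /* predictive mode becomes progressively less preferred */
    }
    return use_sn;
}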
For bitrates larger than or equal to 16.4 kbit/s, the decision between the safety-net and predictive modes is made in open loop, based on the energy of the prediction residual. However, the predictive streak limiter is still active in these modes.

Table 2 presents objective results in terms of spectral distortion (SD), its average and its outlier distribution, for clean and frame-erasure-prone channels. The SD is measured between the unquantized LSFs and their decoded versions.

                          CL SN/AR                 CL limited SN/AR
                    SD      [2,4]   >4       SD      [2,4]   >4
                    (dB)    (%)     (%)      (dB)    (%)     (%)
  10 % FER   All    1.612   18.8    4.64     1.61    18.6    4.51
             V      1.634   21.9    4.95     1.61    21.2    4.41
             GE     1.599   17.8    4.76     1.60    17.8    4.77
  0 % FER    All    1.20    5.26    0.27     1.21    5.43    0.27
             V      1.138   5.67    0.10     1.16    6.31    0.12
             GE     1.199   4.47    0.34     1.20    4.47    0.34

Table 2. Average SD and SD outlier distribution (percentage of frames having SD between 2 and 4 dB, [2,4], and percentage of frames having SD larger than 4 dB, >4) for clean channels (0 % frame error rate, FER) and 10 % FER channels, using a closed-loop decision between SN and AR prediction, and a restricted predictive usage within the closed-loop decision between SN and AR prediction.

The results for the SN/AR closed-loop decision based on the weighted quantization error are presented, as well as those for the closed-loop decision with the restrictions on selecting the predictive path described above. Only the generic (GE) and voiced (V) signal types are considered separately because, as seen in Table 1, these are the modes where the switched SN/AR quantizer is used. The imposed restrictions do not significantly decrease the quality in the clean channel, but there is an increase in quality in the channel with errors, illustrated at 10 % FER. Even though the quality increase shown by the objective measures is not very large, the restricted decision improves the subjective quality by eliminating artefacts present for consecutive lost frames. The results of the subjective listening tests are presented in Section 4.
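For reference, the per-frame spectral distortion behind Table 2 can be computed as sketched below in C, assuming the unquantized and decoded LSF vectors have already been converted back to direct-form LP coefficients. The frequency grid size, function names and outlier bookkeeping are illustrative choices, not taken from the EVS reference implementation.

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

#define M      16      /* LP order               */
#define NGRID  256     /* frequency grid points  */

/* 10*log10 |1/A(e^{jw})|^2 on a uniform frequency grid (naive DFT). */
static void lp_log_spectrum(const double a[M + 1], double logspec[NGRID])
{
    for (int k = 0; k < NGRID; k++) {
        double w = M_PI * (k + 0.5) / NGRID;
        double re = 0.0, im = 0.0;
        for (int i = 0; i <= M; i++) {
            re += a[i] * cos(w * i);
            im -= a[i] * sin(w * i);
        }
        logspec[k] = -10.0 * log10(re * re + im * im);
    }
}

/* Per-frame SD in dB: RMS difference between the reference and the
   quantized LP log spectra. */
static double frame_sd(const double a_ref[M + 1], const double a_q[M + 1])
{
    double s_ref[NGRID], s_q[NGRID], acc = 0.0;
    lp_log_spectrum(a_ref, s_ref);
    lp_log_spectrum(a_q, s_q);
    for (int k = 0; k < NGRID; k++) {
        double d = s_ref[k] - s_q[k];
        acc += d * d;
    }
    return sqrt(acc / NGRID);
}

/* Running statistics reported in Table 2: average SD and the two
   outlier percentages (SD in [2,4] dB and SD > 4 dB). */
typedef struct { double sum; long n, out_2_4, out_4; } sd_stats;

static void sd_update(sd_stats *s, double sd)
{
    s->sum += sd;
    s->n++;
    if (sd > 4.0)      s->out_4++;
    else if (sd > 2.0) s->out_2_4++;
}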

3. LISTENING TESTING

A modified version of the ACR [9] mean opinion score (MOS) method was used for the multi-bandwidth listening tests [18]. The MOS scale was extended to 9 categories in order to obtain more accurate results with relatively high-quality speech and audio signals of wider than narrowband or wideband bandwidth. Only the extreme categories were given verbal descriptions: 1 "Very bad" and 9 "Excellent". The assessment is not free-sliding, but nine values still give the listener more ways to discriminate between the samples than five. For example, a seven-category ACR scale was found in an independent study to give more accurate results than a five-category assessment [19]. In practice a 9-scale MOS test is also much faster to conduct with naive listeners than, for example, the MUSHRA methodology. By coincidence, narrowband test results often fall within the traditional MOS range of 1-5, as they also do in this test. The listening test procedure and result presentation are similar to those used for the speech codec evaluations in [20], [21] and [22]. The 9-scale MOS scores also correlate well with objective measures such as POLQA and WB-PESQ [23].

3.1. Test conditions

The following test conditions were included in the evaluation:

- Direct reference conditions with limited audio bandwidth but no speech coding. Four lowpass cutoff frequencies were evaluated: 4 kHz, 8 kHz, 10 kHz and 20 kHz.
- MNRU reference conditions with artificially added distortion. NB used Q = 16 dB and WB used Q = 18 dB, both with P.810 [24]. FB used Q = 24 dB and Q = 16 dB with a modified MNRU using P.50-shaped noise [25].
- AMR narrowband codec [6], commonly employed in mobile networks. The 12.2 kbit/s bitrate was evaluated.
- AMR-WB wideband codec [7], supported in an increasing number of mobile networks [26]. Bitrates evaluated: 12.65 and 23.85 kbit/s.
- EVS, the latest 3GPP voice and audio codec [1]. For NB, 8.0 and 13.2 kbit/s; for WB, 9.6, 13.2 and 24.4 kbit/s; for SWB, 13.2 and 32 kbit/s; and for FB, 16.4 and 48 kbit/s were evaluated.
- FER, frame erasure rates. Five frame erasure rates (0 %, 3 %, 6 %, 10 % and 15 %) were tested with all of the above-mentioned voice codecs.

All tested conditions can be seen in Figure 3.

3.2. Listening tests

Two listening tests were organized:

- Clean speech: 4 talkers (2 female, 2 male), 4 sentence pairs of about 6 seconds from each speaker. The clean speech test had DTX enabled.
- Noisy speech: 4 talkers (2 female, 2 male), 4 sentence pairs of about 7 seconds from each speaker. The noise types were street, cafeteria and car noise as well as classical music, all with a signal-to-noise ratio (SNR) of 15 dB. The noisy speech test was conducted with DTX disabled.

The tests took place in sound-proof booths in the listening test laboratory of Nokia Technologies [27]. Subjects listened to the samples diotically through Sennheiser HD-650 headphones. The listening level was set to a sound pressure level (SPL) of 76 dB and could not be adjusted by the listeners. Twenty-four native naive Finnish listeners participated in each test.
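As a minimal sketch of the statistics behind the result figures, the per-condition MOS and the 95 % confidence intervals shown in Figure 3 can be computed from the raw 9-scale votes as follows. The normal-approximation interval (1.96·s/√n) and the function name are assumptions; the paper does not state which interval estimator was used.

#include <math.h>

typedef struct {
    double mos;     /* mean opinion score of one condition        */
    double ci95;    /* half-width of the 95 % confidence interval */
} mos_result;

/* votes: the raw 9-scale ratings (1..9) given to one condition,
   n: number of votes (listeners x samples). */
static mos_result condition_mos(const int *votes, int n)
{
    double sum = 0.0, sumsq = 0.0;
    for (int i = 0; i < n; i++) {
        sum   += votes[i];
        sumsq += (double)votes[i] * votes[i];
    }
    double mean = sum / n;
    double var  = (sumsq - n * mean * mean) / (n - 1);   /* sample variance */
    mos_result r = { mean, 1.96 * sqrt(var / n) };
    return r;
}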
4. RESULTS

4.1. Clean Speech Results

The clean speech results in Figure 1 show that EVS is significantly more robust than either AMR or AMR-WB at all tested operating points. Especially impressive is that EVS-WB and EVS-SWB at 13.2 kbit/s with a 15 % frame erasure rate provide approximately the same quality as AMR 12.2 kbit/s at 3 % FER and AMR-WB 12.65 kbit/s at 6 % FER. This means that even in heavily congested networks EVS provides usable voice quality. Also worth noting is that EVS-FB at 48 kbit/s provides better than direct NB voice quality even at the maximum tested FER of 15 %.

Fig. 1. Clean speech MOS scores with increasing frame erasure rate (in FER-%).

4.2. Noisy Speech Results

The noisy speech results in Figure 2 are very similar to the clean speech results. For some reason, at the 10 % FER rate EVS seems to work somewhat better with noisy speech than with clean speech; probably the background noise masks some artefacts that are audible in clean speech. Overall, the quality drops quite linearly with increasing frame erasure rate.

Fig. 2. Noisy speech MOS scores with increasing frame erasure rate (in FER-%).

4.3. Combined Results

Finally, the results of both listening tests, comprising all 48 listeners, were combined, and a single overall results bar diagram, Figure 3, was generated. From these overall results in Figure 3 it can be seen that EVS is better than or equivalent to AMR or AMR-WB at all bitrates and at all respective frame erasure rates. Even EVS-NB 8.0 kbit/s at 15 % FER is better than AMR-WB 23.85 kbit/s at 15 % FER. In Figure 4 the EVS quality is shown for the NB 8.0, WB 9.6, SWB 13.2, FB 16.4, SWB 32 and FB 48 kbit/s bitrates. As can be seen, EVS at a 6 % FER rate conveniently provides better quality than any clean-channel AMR / AMR-WB coding mode. Overall it can be estimated that EVS provides an additional 5-6 percentage points of robustness margin compared to AMR-WB and about 10 percentage points more robustness compared to AMR 12.2 kbit/s. Thus EVS provides the same voice quality as the earlier-generation voice codecs, at the same bitrate, even though the channel contains significantly more channel errors.

Fig. 3. All combined results with confidence intervals.

5. CONCLUSIONS

A subjective quality evaluation was conducted with two listening tests in the Nokia Technologies listening facilities. From the results it can be seen that the 3GPP EVS codec produces state-of-the-art voice and audio quality across all tested bitrates, bandwidths and frame erasure rates. Compared to the previous-generation AMR-WB codec, EVS provides the same quality with about 5-6 percentage points of additional FER margin. Compared to the AMR codec the additional FER margin is about 10 percentage points. Comparing clean channel performance in this listening test, EVS-NB 8.0 kbit/s is better than AMR 12.2 kbit/s. Also, EVS-WB 9.6 kbit/s provides similar voice quality to AMR-WB 12.65 kbit/s, and EVS-SWB 13.2 kbit/s is already about 1 MOS point better. AMR-WB 23.85 kbit/s is similarly at least 1.2 MOS points worse than EVS at 16.4 kbit/s. Finally, EVS-FB 48 kbit/s provides statistically equivalent quality to the direct FB signal, and even at the extremely high FER rate of 15 %, EVS-FB 48 kbit/s is better than the direct narrowband signal or AMR-WB 23.85 kbit/s at a 6 % FER rate.

Fig. 4. EVS performance (NB 8 kbit/s, WB 9.6 kbit/s, SWB 13.2 kbit/s, FB 16.4 kbit/s, SWB 32 kbit/s, and FB 48 kbit/s) at all frame erasure rates together with AMR and AMR-WB.

6. REFERENCES

[1] Stefan Bruhn, Harald Pobloth, Markus Schnell, Bernhard Grill, Jon Gibbs, Lei Miao, Kari Järvinen, Lasse Laaksonen, Noboru Harada, Nobuhiko Naka, Stéphane Ragot, Stéphane Proust, Takako Sanda, Imre Varga, Craig Greer, Milan Jelinek, Minjie Xie, and Paolo Usai, "Standardization of the new 3GPP EVS codec," in Proc. ICASSP, Brisbane, Australia, Apr. 2015.

[2] Kari Järvinen, Imed Bouazizi, Lasse Laaksonen, Pasi Ojala, and Anssi Rämö, "Media coding for the next generation mobile system LTE," Elsevier Computer Communications, vol. 33, no. 16, pp. 1916-1927, Oct. 2010.

[3] 3GPP Tdoc S4-130522, "EVS Permanent Document (EVS-3): EVS performance requirements," Version 1.4, 3GPP, Apr. 2013, online: http://www.3gpp.org/ftp/tsg_sa/wg4_CODEC/TSGS4_73/Docs/S4-130522.zip.

[4] 3GPP Tdoc S4-141065, "Report of the Global Analysis Lab for the EVS Selection Phase," 3GPP, Aug. 2014, online: http://www.3gpp.org/ftp/tsg_sa/wg4_CODEC/TSGS4_80bis/Docs/S4-141065.zip.

[5] 3GPP TR 26.952, "Performance Characterization of the EVS codec," 3GPP, Dec. 2014, online: http://www.3gpp.org/dynareport/26952.htm.

[6] 3GPP TS 26.090, "Adaptive multi-rate (AMR) speech codec; Transcoding functions," 3GPP, Sept. 2012.

[7] 3GPP TS 26.190, "Adaptive multi-rate wideband (AMR-WB) speech codec; Transcoding functions," 3GPP, Sept. 2012.

[8] Anssi Rämö and Henri Toukomaa, "Subjective quality evaluation of the 3GPP EVS codec," in Proc. ICASSP, Brisbane, Australia, Apr. 2015, pp. 5157-5161.

[9] ITU-T P.800, "Methods for subjective determination of transmission quality," ITU, Aug. 1996, online: http://www.itu.int/rec/t-rec-p.800-199608-i/en.

[10] Anssi Rämö, "Voice quality evaluation of various codecs," in Proc. ICASSP, Dallas, TX, USA, Mar. 2010, pp. 4662-4665.

[11] Jérémie Lecomte, Tommy Vaillancourt, Stefan Bruhn, Hosang Sung, Ke Peng, Kei Kikuiri, Bin Wang, Shaminda Subasingha, and Julien Fauré, "Packet loss concealment technology advances in EVS," in Proc. ICASSP, Brisbane, Australia, Apr. 2015.

[12] Venkatraman Atti, Daniel J. Sinder, Shaminda Subasingha, Vivek Rajendran, Duminda Dewasurendra, Venkata Chebiyyam, Imre Varga, Venkatesh Krishnan, Benjamin Schubert, Jeremie Lecomte, Xingtao Zhang, and Lei Miao, "Improved error resilience for VoLTE and VoIP with 3GPP EVS channel aware coding," in Proc. ICASSP, Brisbane, Australia, Apr. 2015.

[13] 3GPP TS 26.441, "EVS Codec General Overview," 3GPP, Aug. 2014, online: http://www.3gpp.org/dynareport/26441.htm.

[14] 3GPP TS 26.445, "Codec for Enhanced Voice Services (EVS); Detailed algorithmic description," 3GPP, Sept. 2014, online: http://www.3gpp.org/dynareport/26445.htm.

[15] Martin Dietz, Markus Multrus, Vaclav Eksler, Vladimir Malenovsky, Erik Norvell, Harald Pobloth, Lei Miao, Zhe Wang, Lasse Laaksonen, Adriana Vasilache, Yutaka Kamamoto, Kei Kikuiri, Stéphane Ragot, Hiroyuki Ehara, Vivek Rajendran, Venkatraman Atti, Hosang Sung, Eunmi Oh, Hao Yuan, and Changbao Zhu, "Overview of the EVS codec architecture," in Proc. ICASSP, Brisbane, Australia, Apr. 2015.

[16] Jeremie Lecomte, Adrian Tomasek, Goran Markovic, Michael Schnabel, Kimitaka Tsutsumi, and Kei Kikuiri, "Enhanced time domain packet loss concealment in switched speech/audio codec," in Proc. ICASSP, Brisbane, Australia, Apr. 2015.

[17] Adriana Vasilache, Anssi Rämö, Hosang Sung, Sangwon Kang, Jonghyeon Kim, and Eunmi Oh, "Flexible spectrum coding in the 3GPP EVS codec," in Proc. ICASSP, Brisbane, Australia, Apr. 2015.
[18] Anssi Rämö and Henri Toukomaa, "On comparing speech quality of various narrow- and wideband speech codecs," in Proc. ISSPA, Sydney, Australia, Aug. 2005, pp. 603-606.

[19] Kerrie Lee, Phillip Dermody, and Daniel Woo, "Evaluation of a method for subjective assessment of speech quality in telecommunication applications," in ASTA, Apr. 1996, pp. 199-203, online: http://www.assta.org/?q=sst-1996-proceedings.

[20] Anssi Rämö and Henri Toukomaa, "Voice quality evaluation of recent open source codecs," in Proc. Interspeech, Tokyo, Japan, Sept. 2010, pp. 2390-2393.

[21] Anssi Rämö and Henri Toukomaa, "Voice quality characterization of IETF Opus codec," in Proc. Interspeech, Florence, Italy, Aug. 2011, pp. 2541-2544.

[22] Hannu Pulakka, Anssi Rämö, Ville Myllylä, Henri Toukomaa, and Paavo Alku, "Subjective voice quality evaluation of artificial bandwidth extension: Comparing different audio bandwidths and speech codecs," in Proc. Interspeech, Singapore, Sept. 2014, pp. 2804-2808.

[23] Hannu Pulakka, Ville Myllylä, Anssi Rämö, and Paavo Alku, "Speech quality evaluation of artificial bandwidth extension: Comparing subjective judgements and instrumental predictions," in Proc. Interspeech, Dresden, Germany, Sept. 2015.

[24] ITU-T P.810, "Telephone transmission quality: Methods for objective and subjective assessment of quality. Modulated noise reference unit (MNRU)," ITU, Feb. 1996.

[25] ITU-T P.50, "Telephone transmission quality, telephone installations, local line networks: Objective measuring apparatus," ITU, Sept. 1999, online: http://www.itu.int/rec/t-rec-p.50-199909-i.

[26] Global mobile Suppliers Association (GSA), "Mobile HD voice: Global update report," April 2015.

[27] Mikko Kylliäinen, Heikki Helimäki, Nick Zacharov, and John Cozens, "Compact high performance listening spaces," in Proc. Euronoise, Naples, Italy, May 2003.