ROBUST SPEECH CODING WITH EVS
Anssi Rämö, Adriana Vasilache and Henri Toukomaa
Nokia Technologies, Tampere, Finland
2015-12-16

OUTLINE
- Very short introduction to EVS
- Robustness
- EVS LSF robustness features
- Listening test results
- More results
- Summary
- Questions?

INTRODUCTION TO EVS
- EVS stands for Enhanced Voice Services
- Latest-generation voice and audio codec for 3GPP and VoIP networks
- Introduces super-wideband (SWB) and fullband (FB) at low bitrates of 9.6 and 16.4 kbit/s
- Also supports legacy narrowband (NB) and wideband (WB) bandwidths
- Supports internal resampling between all supported sampling frequencies: 8, 16, 32 and 48 kHz
- Bitrates from 5.9 to 128 kbit/s
- State-of-the-art quality with both speech and generic audio
- Communications codec with delay less than 32 ms
- Very robust against frame loss
- DTX available for all bitrates and bandwidths

ROBUSTNESS
- Robustness is needed in real communication networks.
- Whenever a frame is lost in the communication channel, the decoder has to replace it in real time with the best possible approximation.
- If nothing is done, the result is either silent gaps or, at the other extreme, loud bangs when the current signal model is not stable.
- EVS has several novel methods in several different domains to enhance robustness.
- This paper discusses the spectral modelling robustness features related to LSF quantization.
- The listening test results account for all of the EVS robustness-enhancing methods.
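
To illustrate why a lost frame needs some substitute, the sketch below conceals lost frames by repeating the last good frame with a decaying gain. This is a generic repeat-and-attenuate scheme, not the EVS concealment algorithm; the function name `conceal_stream`, the frame representation (lists of samples, `None` for a lost frame) and the 0.7 attenuation factor are assumptions made for the example.

```python
def conceal_stream(frames, frame_len, attenuation=0.7):
    """Replace lost frames (None) with a faded copy of the last good frame.

    Generic repeat-and-attenuate concealment, NOT the EVS algorithm.
    `attenuation` is an illustrative value, not taken from any standard.
    """
    output = []
    last_good = [0.0] * frame_len  # silence until the first good frame arrives
    gain = 1.0
    for frame in frames:
        if frame is None:
            # Lost frame: reuse the last good frame, fading a bit more each time
            gain *= attenuation
            output.append([gain * s for s in last_good])
        else:
            # Good frame: pass it through and reset the fade
            last_good = list(frame)
            gain = 1.0
            output.append(list(frame))
    return output
```

Fading towards silence during a long loss avoids both extremes named on the slide: an abrupt gap and an unstable signal model that keeps ringing at full level.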

LSF ROBUSTNESS FEATURES.

Mode        NB      WB at bitrates <9.6 kbps   WB at bitrates ≥9.6 kbps
Inactive    MA      MA                         MA
Unvoiced    MA      MA                         MA
Voiced      SN/AR   SN/AR                      SN/AR
Generic     SN/AR   SN/AR                      MA
Transition  SN      SN                         SN
Audio       SN/AR   SN/AR                      MA

MA = moving-average predictive, AR = auto-regressive predictive, SN = safety net (non-predictive)
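
The table above can be read as a simple lookup, sketched below. The names `LSF_PREDICTOR` and `lsf_predictor` are hypothetical, and the sketch assumes that the third column covers WB at bitrates of 9.6 kbps and above (the comparison sign is missing in the extracted slide).

```python
# Predictor choice per coding mode, transcribed from the table above.
# "SN/AR" means switched safety-net / AR-predictive coding.
# Tuple order: (NB, WB below 9.6 kbps, WB at 9.6 kbps and above -- assumed).
LSF_PREDICTOR = {
    "inactive":   ("MA",    "MA",    "MA"),
    "unvoiced":   ("MA",    "MA",    "MA"),
    "voiced":     ("SN/AR", "SN/AR", "SN/AR"),
    "generic":    ("SN/AR", "SN/AR", "MA"),
    "transition": ("SN",    "SN",    "SN"),
    "audio":      ("SN/AR", "SN/AR", "MA"),
}


def lsf_predictor(mode, bandwidth, bitrate_kbps):
    """Look up the LSF quantizer type for a coding mode (hypothetical helper)."""
    nb, wb_low, wb_high = LSF_PREDICTOR[mode]
    if bandwidth == "NB":
        return nb
    if bandwidth == "WB":
        return wb_low if bitrate_kbps < 9.6 else wb_high
    raise ValueError("the table only covers NB and WB")
```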

LSF ROBUSTNESS FEATURES..
- The purely predictive quantizer uses a moving-average (MA) predictor.
- Auto-regressive (AR) prediction has higher coding gain but also a longer recovery time after a frame loss.
- To limit sensitivity to frame losses, the AR-predictive quantizer is used in conjunction with the safety net (SN), i.e. the non-predictive quantizer.
- Transition mode always uses the non-predictive quantizer, because the signal is by definition changing rapidly.
- Unvoiced and inactive modes always use MA-predictive coding.
- Voiced, audio and generic modes use switched non-predictive/AR-predictive LSF coding at low bitrates. At higher bitrates the MA predictor is used for generic and audio modes.
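
The difference in recovery time between AR and MA prediction can be seen in a toy model. The sketch below uses first-order scalar predictors with a single coefficient `rho`, which is a strong simplification of the vector LSF quantizers in EVS; all function names are hypothetical, and a lost residual is simply treated as zero.

```python
def decode_ar(residuals, rho, lost):
    """AR prediction: each output depends on the previous *output*."""
    out, prev = [], 0.0
    for n, r in enumerate(residuals):
        if n in lost:
            r = 0.0  # residual unavailable; toy decoder assumes zero
        prev = rho * prev + r
        out.append(prev)
    return out


def decode_ma(residuals, rho, lost):
    """MA prediction: each output depends on the previous *residual* only."""
    out, prev_r = [], 0.0
    for n, r in enumerate(residuals):
        if n in lost:
            r = 0.0
        out.append(r + rho * prev_r)
        prev_r = r
    return out


def error_after_loss(decoder, residuals, rho, lost_frame):
    """Absolute decoder error per frame caused by losing one frame."""
    clean = decoder(residuals, rho, lost=set())
    damaged = decoder(residuals, rho, lost={lost_frame})
    return [abs(c - d) for c, d in zip(clean, damaged)]


# Example: constant residuals, frame 2 lost
ar_err = error_after_loss(decode_ar, [1.0] * 8, 0.5, 2)
ma_err = error_after_loss(decode_ma, [1.0] * 8, 0.5, 2)
```

With `rho = 0.5`, the AR error is scaled by 0.5 in every later frame and so decays only gradually, while the MA error is gone two frames after the loss. This is the trade-off behind pairing the AR predictor with a safety net.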

LSF ROBUSTNESS FEATURES...
In switched coding, the choice between predictive and non-predictive quantization is made in closed loop, based on several criteria:
- If the non-predictive quantizer is good enough (spectral distortion, SD, below about 1.0 dB), use it.
- If prediction helps only a little, use the non-predictive quantizer.
- If there is already a very long streak of predictive frames, prefer a non-predictive frame from time to time.
In practice this means that predictive coding is used most of the time (over 85%) in stable signal segments, but when the signal becomes less stable the quantizer automatically inserts non-predictive LSF codebook entries.
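
The three criteria above can be sketched as a small decision function. Only the "good enough" threshold of roughly 1.0 comes from the slide; the function name, the `min_gain` and `max_streak` values, and the choice to force a non-predictive frame once the streak limit is reached are placeholders chosen for illustration, not the values or exact logic used in EVS.

```python
def choose_lsf_quantizer(sd_safety_net, sd_predictive, predictive_streak,
                         sd_good_enough=1.0, min_gain=0.1, max_streak=8):
    """Return "SN" (non-predictive safety net) or "AR" (predictive).

    sd_safety_net / sd_predictive: spectral distortion obtained by trial
    quantization with each quantizer (closed loop).
    predictive_streak: number of consecutive predictive frames so far.
    min_gain and max_streak are hypothetical placeholder thresholds.
    """
    # 1. Non-predictive is already good enough
    if sd_safety_net < sd_good_enough:
        return "SN"
    # 2. Prediction helps only a little
    if sd_safety_net - sd_predictive < min_gain:
        return "SN"
    # 3. Long streak of predictive frames: break it with a safety-net frame
    if predictive_streak >= max_streak:
        return "SN"
    return "AR"
```

A caller would keep the streak counter itself: reset it to zero whenever "SN" is returned and increment it after each "AR" frame.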

LSF OBJECTIVE RESULTS
- Even at a high frame erasure rate of 10%, fewer than 5% of frames have a spectral distortion larger than 4 dB.
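
For reference, the sketch below computes spectral distortion with the usual root-mean-square log-spectral difference and counts the share of frames above a threshold. The slides do not state the exact definition or frequency range used, so this standard form is an assumption, and both function names are hypothetical.

```python
import math


def spectral_distortion_db(power_ref, power_test):
    """RMS difference in dB between two power spectra on the same frequency grid."""
    if len(power_ref) != len(power_test) or not power_ref:
        raise ValueError("spectra must be non-empty and of equal length")
    squared = [(10.0 * math.log10(p / q)) ** 2
               for p, q in zip(power_ref, power_test)]
    return math.sqrt(sum(squared) / len(squared))


def outlier_fraction(sd_values, threshold_db=4.0):
    """Fraction of frames whose spectral distortion exceeds the threshold."""
    return sum(1 for sd in sd_values if sd > threshold_db) / len(sd_values)
```

In these terms, the result on the slide says that `outlier_fraction` with a 4 dB threshold stays below 0.05 at 10% frame erasures.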

LISTENING TESTING
- AMR, AMR-WB and EVS were compared against each other.
- Tested frame erasure rates (FER): 0%, 3%, 6%, 10% and 15%.
- Two listening tests: clean speech (DTX enabled) and noisy speech (DTX disabled).
- ACR9 test methodology: a scale from 1 (very bad) to 9 (excellent) without a reference, i.e. a MOS test.
- Tested bitrates: around 12.2-13.2 kbit/s for AMR, AMR-WB and EVS. Additional test points at around 24 kbit/s (comparison to AMR-WB). EVS was also tested at 8, 9.6, 16.4, 32 and 48 kbit/s at various bandwidths.
- 24 naïve listeners in both tests; Finnish language; Sennheiser HD-650 headphones; diotic listening.
- Noise types: street, cafeteria, car, and classical music at -15 dB.
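
A frame erasure condition such as 6% FER can be simulated by marking a fixed share of frames as lost. The sketch below places the erasures at uniformly random positions; the slides do not say how the erasure patterns in the test were generated (they may well have been bursty), so this is only one possible way to do it, and `erasure_pattern` is a hypothetical name.

```python
import random


def erasure_pattern(num_frames, fer, seed=0):
    """Return a list of booleans, True where the frame is erased.

    Exactly round(num_frames * fer) frames are erased, at uniformly random
    positions. Real channels often lose frames in bursts; that is not modelled.
    """
    num_lost = round(num_frames * fer)
    rng = random.Random(seed)  # fixed seed gives a reproducible pattern
    lost = set(rng.sample(range(num_frames), num_lost))
    return [n in lost for n in range(num_frames)]
```

Using the same pattern for every codec under test keeps the comparison fair, since all codecs then lose the same frames.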

CLEAN SPEECH RESULTS. [figure]

CLEAN SPEECH RESULTS.. [figure]

CLEAN SPEECH RESULTS
- EVS is significantly more robust than either AMR or AMR-WB at all bitrates.
- Especially impressive is that EVS-WB and EVS-SWB at 13.2 kbit/s with a 15 % frame erasure rate provide approximately the same quality as AMR 12.2 at 3 % FER and AMR-WB 12.65 at 6 % FER.
- Also worth noting is that EVS-FB at 48 kbit/s provides better quality than the direct NB signal even at the maximum tested FER of 15 %.

NOISY SPEECH RESULTS. [figure]

NOISY SPEECH RESULTS.. [figure]

NOISY SPEECH RESULTS
- Noisy speech results are very similar to the clean speech results.
- At 10 % FER, EVS seems to work somewhat better with noisy speech than with clean speech. Background noise likely masks some artifacts that are audible in clean speech.
- Overall, quality drops close to linearly with increasing frame erasure rate.

COMBINED RESULTS AT LOW RATES [figure]

RESULTS AT LOW BITRATES..
- As can be seen, EVS-SWB at 13.2 kbit/s with 6 % FER provides better quality than any AMR or AMR-WB coding mode in a clean channel.
- Overall, it can be estimated that EVS provides an additional 5-6 percentage points of FER robustness margin compared to AMR-WB, and about 10 percentage points compared to AMR 12.2 kbit/s.
- Thus EVS provides the same voice quality as the earlier-generation voice codecs at the same bitrate, even when the channel contains significantly more errors.
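
One way to arrive at a robustness margin of this kind is to ask at which FER each codec's score curve falls to a common target score, and to take the difference. The sketch below does this by linear interpolation between test points. It is an assumed procedure, not necessarily how the authors made their estimate, and the function names are hypothetical.

```python
def fer_at_score(curve, target):
    """FER (in %) at which a score-vs-FER curve crosses `target`.

    curve: list of (fer_percent, score) pairs sorted by increasing FER,
    with scores decreasing. Returns None if the target is never crossed.
    """
    for (f0, s0), (f1, s1) in zip(curve, curve[1:]):
        if s0 >= target >= s1 and s0 != s1:
            # Linear interpolation between the two neighbouring test points
            return f0 + (s0 - target) * (f1 - f0) / (s0 - s1)
    return None


def robustness_margin(curve_new, curve_old, target):
    """Extra FER (percentage points) the new codec tolerates at equal score."""
    fer_new = fer_at_score(curve_new, target)
    fer_old = fer_at_score(curve_old, target)
    if fer_new is None or fer_old is None:
        return None
    return fer_new - fer_old
```

The curves passed in would be the measured mean scores at 0%, 3%, 6%, 10% and 15% FER for each codec; no measured values are reproduced here.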

COMBINED RESULTS AT HIGH RATES [figure]

RESULTS AT HIGH BITRATES
- AMR-WB at 23.85 kbit/s is at least 1.2 MOS points worse than EVS at 16.4 kbit/s at all tested FERs.
- EVS-FB at 48 kbit/s provides quality statistically equivalent to the direct FB signal at 0% FER.
- Even at the extremely high FER of 15 %, EVS-FB at 48 kbit/s is better than the direct narrowband signal or AMR-WB 23.85 kbit/s at 6 % FER.
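
Statements such as "statistically equivalent" are commonly backed by comparing mean opinion scores together with their 95% confidence intervals. The sketch below computes a MOS with a normal-approximation interval and checks whether two intervals overlap. The slides do not say which statistical test was used, so this is an assumed, rough check, and the function names are hypothetical.

```python
import math
import statistics


def mos_with_ci(votes, z=1.96):
    """Mean opinion score and half-width of its ~95% confidence interval.

    Uses the normal approximation (z = 1.96); for small vote counts a
    Student-t quantile would give a slightly wider interval.
    """
    mean = statistics.mean(votes)
    half_width = z * statistics.stdev(votes) / math.sqrt(len(votes))
    return mean, half_width


def intervals_overlap(a, b):
    """True if two (mean, half_width) confidence intervals overlap."""
    (mean_a, half_a), (mean_b, half_b) = a, b
    return abs(mean_a - mean_b) <= half_a + half_b
```

Overlapping intervals are only an informal indication that two conditions cannot be told apart; a paired test on the per-listener votes would be the stricter choice.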

DEMO SAMPLES
- AMR 12.2, 0% FER
- AMR-WB 12.65, 0% FER
- EVS 13.2, 0% FER
- AMR 12.2, 10% FER
- AMR-WB 12.65, 10% FER
- EVS 13.2, 10% FER
- AMR-WB 23.85, 10% FER
- EVS 48, 0% FER
- EVS 48, 10% FER

SUMMARY
- EVS is extremely robust against frame erasures.
- In a clean channel, it is transparent to the original (FB at 48 kbit/s).

QUESTIONS?

BACKUP SLIDES
- Combined results in full screen by FER
- Combined results in full screen by bitrate

COMBINED CURVES [figure]

COMBINED CURVES [figure]