Missing Frame Recovery Method for G.723.1 Based on Neural Networks

JARI TURUNEN & PEKKA LOULA
Information Technology, Pori
Tampere University of Technology
Pohjoisranta 11, P.O. Box 300, FIN Pori
FINLAND

Abstract: The quality of speech is an essential issue in low bit rate video conferencing and voice over IP networking applications. In this paper we present a lost frame recovery mechanism for the ITU-T G.723.1 recommendation. The basic idea was to find a mechanism that can recover missing frames from past frames while keeping the speech quality as high as possible and adding no delay. The results show that the proposed multilayer perceptron neural network system can recover missing speech frames from the previous frames accurately enough, without having to wait for the next good frame as linear interpolation must.

Key-Words: Frame estimation, multilayer perceptron, speech coding, error concealment, simulation

1 Introduction

Most videoconferencing systems are based on the International Telecommunication Union (ITU) H.323 standard [2], which specifies that audio and video stream packets are transferred over the Internet by using a real-time protocol (RTP) that is only intended to report packet losses and speed to the sending and receiving ends. The H.323 standard also uses the ordinary Internet protocol, Transmission Control Protocol (TCP), for signaling and non-real-time data transmission, for example document transfer.

Although network technology and information transfer mechanisms are being developed all the time, speech transfer with several transform mechanisms suffers packet losses in low bit rate communications. According to [1] the global average packet loss in the Internet is 3 %. In some network servers the packet loss can be much higher than that during rush hours. The rush-hour traffic optimization algorithms in servers may themselves cause, for example, packet discards. TCP will take care of missed packets,
but the packet loss indication, re-sending request and packet retransmission take too much time in real-time applications. If TCP is used instead of RTP, the delay is seen as a jammed video stream and heard as silence, which is much more uncomfortable for the conference or call participant than RTP packet loss. Network packet loss is nevertheless harmful in real-time voice communications and videoconferencing systems even when RTP is used.

An H.323 based videoconferencing system uses the ITU-T speech and audio coders G.711, G.722, G.723.1, G.728 and G.729, and MPEG audio layer 1. We selected and examined the G.723.1 dual rate speech coder described in [3]. It can operate at a maximum rate of 6.3 kbit/s, producing good quality sound, and if silence compression is included the bitrate varies between kbit/s. This codec is especially suitable for future mobile multimedia systems that are very sensitive to bandwidth capacity.

Some recovery mechanisms for packet losses in G.723.1 are already available; this paper describes a proposed neural network based frame estimator. The G.723.1 decoder itself is equipped with a bursty recovery mechanism that estimates the missing LPC parameters and the residual vector by using information from the previous frame. If the next two frames are lost, the estimation continues and the output signal level is attenuated by 2.5 dB. After three consecutive losses the output is muted completely. An improvement to the G.723.1 recovery mechanism was made in [4], where the lost frame was interpolated from the preceding and following frames. This improvement removes the high-energy spikes and metallic sound artifacts and can recover

up to 15 % of the lost frame rate. Unfortunately this comes at the cost of increased algorithmic delay. Linear interpolation is a common recovery mechanism in many speech coders, for example in [5], and in their improvements. However, nonlinear processing of speech has proved to be a good mechanism, for example in [6], where the CELP coder parameters are estimated more efficiently with radial-basis neural networks than with a linear predictor. Another experiment in which nonlinear estimators worked well was made in [7]. In that experiment the G.728 residual information was estimated with different techniques, and the tests showed that the neural network based estimator produced the most intelligible sounding packets when compared to linear estimation methods.

In this paper we concentrate on estimating the missing G.723.1 packets from past data so that the recovery delay does not increase, as it does with linear interpolators. The encoder and decoder have been left unchanged.

2 Methods

The idea of using a neural network to estimate the lost frame information was derived from the multilayer perceptron's nonlinear properties and its possible ability to adapt to speech properties better than other methods. Linear interpolation is a good and simple way to estimate lost packets, but there are two reasons why neural networks may be better. First, if the neural network can learn and estimate the basic trends of the speech, it can predict the lost packet. In this way the lost packet delay and the resulting jitter problems are avoided or at least reduced. Second, the linear estimator smooths out the missing packet too much, because it calculates the interpolation from the previous and the next arriving packets, so the output sound is mild. It is impossible to predict speech parameters exactly, because that would practically mean predicting the future. But if the neural network estimator can
predict the speech parameters accurately enough, it has fulfilled the demands set on it.

We designed a packet recovery mechanism that can be added to the current G.723.1 coder/decoder in such a way that the base encoding/decoding system does not need any modifications. A schematic diagram of the system is presented in Fig. 1.

Fig. 1: The recovery structure (blocks: x[n], G.723.1 coder, transmission channel, recovery mechanism, G.723.1 decoder)

G.723.1 encoded speech stream frames consist of 189 bits of encoded speech at 6.3 kbit/s and 158 bits at 5.3 kbit/s. If silence compression is switched on, the silent frames are transferred to the destination in 32 bits, corresponding roughly to 1.0 kbit/s. The encoder processes frames of 30 milliseconds, or 240 samples of speech, at once and calculates 10 Linear Predictive Code (LPC) parameters from the whole frame. The LPC information is compressed with the Line Spectrum Pair (LSP) mechanism and quantized with three 8-bit codebooks. The final LPC values are then sent using only 24 bits [3]. The frame is divided into 4 subframes of 60 samples each. Then the parameters common to both coding rates (the adaptive codebook lags and gain values) are extracted. The pulse position and pulse sign values are encoder-dependent values. Finally the grid index values are added to the frame. The structure of the G.723.1 codec is presented in Fig. 2.

Fig. 2: The simplified diagram of the G.723.1 codec (blocks: input speech x[n], LPC analyzer, frame division, adaptive codebook lags & gain values, pulse position & sign values, grid indices, encoded frame)

The total number of bits is then either 189 for 6.3 kbit/s or 158 for 5.3 kbit/s.

Fig. 3: Neural network input/output structure (input: six previous frames of information; 3 hidden layers with 9 x 6 x 3 rows; output layer: estimated frame values)

The recovery mechanism presented in Fig. 1 consists of a missing packet monitoring/decision mechanism, a memory that will maintain essential information of six previous packets for current

packet estimation, and a multilayer perceptron neural network estimator for the indicated missing packet. The base system was built from the source code distributed with ITU-T G.723.1. The neural network was designed and converted to the C programming language by using the University of Stuttgart neural network simulator software (SNNS v4.1) running on the Linux operating system.

The multilayer perceptron neural network is designed to predict estimates for the critical speech frame parameters (LSP parameters, amplitude, gain and pulse positions) from six previous frames. The rest of the missing parameters, which are not so critical for the estimation process, were copied directly from the previous frame. A diagram of the neural network is presented in Fig. 3. As a reference, a linear interpolation method that interpolates the missing information for the lost frame from the previous and following frames was included in the experiment.

3 Experiments

Approximately 10 minutes of spoken Finnish and English by several male, female and child speakers were collected to serve as training data so that everyone pronounced a sentence. The spoken sentences were sampled at an 8000 Hz sampling frequency. A minority of the sentences were spoken in an environment with background noise and distortion. The training data was then encoded with the G.723.1 codec and it consisted of frames of coded speech. The coded frames were organized as training data so that six consecutive frames formed the input data and the succeeding seventh frame the output data. The whole training set was introduced to the neural network estimator in a random order.

Our goal in the experiment was to find a system that would predict the missing packet from past information, so that there is no need to wait for the next future value for interpolation. This system must contain an adequate amount of information from the history for estimating the future transition. The proposed experiment system will
estimate the next missing frame from the past six frames.

Fig. 4: Schematic diagram of the neural network topology (input layer: the past six frames, with LSP parameters (6 x 10 nodes), gains and amplitudes (6 x 12 nodes) and pulse positions (6 x 4 nodes); hidden layers 1, 2 and 3; output layer: the predicted frame (1 x 26 nodes) with LSP parameters, gains and amplitudes, and pulse positions)

The size of the neural network was obtained by using the network growing method described in [8]. The initial network was selected to be a small network. New neurons were added to the layers, and eventually new hidden layers were added to the network. The number of nodes in the layers was selected experimentally. The selection criterion for the final network size was to make the overall prediction error as small as possible. The adding of new nodes was stopped when bigger networks no longer yielded considerably better results. The feedforward network was chosen to have 1 input layer with 156 input nodes (6 x 26 values), 3 hidden layers with 9 x 26, 6 x 26 and 3 x 26 hidden nodes respectively, and 1 output layer with 26 output nodes, with a nonlinear sigmoidal function

implemented in all layers. After preliminary tests the network was divided into three parts to reduce the number of weights and thus the training time: the first part is dedicated to the LSP parameters (the input layer is 6 x 10 nodes), the second part to gains and amplitudes (6 x 12 nodes) and the third part to pulse positions (6 x 4 nodes). Each unit is fully connected between the unit layers. The structure of the multilayer perceptron network is presented in Fig. 4. The selected network was trained by using backpropagation together with a momentum term. The learning rate parameter η was kept at 0.2.

In the simulation experiments the neural network was implemented directly in the G.723.1 decoder. The speech data stream was manipulated with a random data loss module with a predefined average loss value. After the place of the lost packet is found in the stream, the monitor module requests an estimate from the neural network based on the six previous values. In order to ensure the quality of the prediction output in the test phase, the network output was smoothed by averaging it with the previous frame value. After these operations the predicted and enhanced frame was decoded as a part of the speech stream.

Three tests were conducted to study random packet losses: 5 %, 10 % and 15 % of the data. In all cases the lost frame recovery was done by using neural networks and linear interpolation.

Fig. 5: Speech stream loss experiments (rows: 5 % average loss, 10 % average loss, 15 % average loss, 50 % loss (every second frame), 50 % average loss; legend: lost frame, arrived frame)

As a case study, the 50 % frame loss situations were tested so that in the first study every second frame was lost. Another experiment was done with randomly selected 1-2 consecutive frame losses in every second, third or fourth frame. The places were also randomly selected. We then tried to recover these damages with neural networks and linear extrapolation/interpolation. This means that the lost and recovered frames will be a part of
the next lost frame's input information, so there will be a cumulative effect in the recovery process. An example of the speech stream loss simulation experiments is presented in Fig. 5.

4 Results

Ten people aged participated in the Mean Opinion Score (MOS) test described in [9]. The listeners were first introduced to the reference signal and then they could listen to the test signals in the order they desired. This test was done at Tampere University of Technology, Pori unit, by using the Intranet. The test persons could listen to these signals in their own environments and judge all the signals by using the Pori unit's web server.

The objective quality measure of the study was the segmental signal-to-noise ratio (SNR_SEG):

\[ \mathrm{SNR}_{SEG} = \frac{1}{M} \sum_{j=0}^{M-1} 10 \log_{10} \left( \frac{\sum_{n=m_j-N+1}^{m_j} s^2(n)}{\sum_{n=m_j-N+1}^{m_j} \left( s(n) - \hat{s}(n) \right)^2} \right) \qquad (1) \]

where s(n) is the original speech sample at time n, ŝ(n) is the encoded/decoded speech sample at time n, M is the number of segments, m_j is the end of the current segment and N is the segment length.

In Tables 1 and 2 the MOS test results and the SNR_SEG results are presented. The segment length compared between the reference and test signals is 1000 samples, and the silent sections are left out when the energy level is close to zero. The segmental SNR values are averaged and presented in decibels. In Tables 1 and 2, Ref is an abbreviation for the reference signal. In Table 1, 5 % means an average 5 % frame loss that is recovered with neural networks, and 10 % lin means a 10 % average frame loss recovered with linear interpolation.

Table 1: The MOS and SNR results for 5 %, 10 % and 15 % frame loss recovery

       Ref    5 %    5 % lin    10 %    10 % lin    15 %    15 % lin
MOS
SNR
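The segmental SNR of Eq. (1) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the evaluation code used in the study: the function name `segmental_snr`, the silence threshold `energy_floor` and the choice to skip segments that decode without any error are assumptions introduced here. The 1000-sample segment length follows the text.

```python
import math


def segmental_snr(reference, decoded, segment_len=1000, energy_floor=1e-8):
    """Average per-segment SNR in dB between two equal-length sample lists."""
    if len(reference) != len(decoded):
        raise ValueError("signals must have the same length")
    values = []
    # Walk over complete, non-overlapping segments of segment_len samples.
    for start in range(0, len(reference) - segment_len + 1, segment_len):
        ref = reference[start:start + segment_len]
        dec = decoded[start:start + segment_len]
        signal = sum(s * s for s in ref)
        noise = sum((s - d) ** 2 for s, d in zip(ref, dec))
        # Assumption: skip near-silent segments (mean energy at or below the
        # floor) and segments with zero error, whose SNR would be infinite.
        if signal / segment_len <= energy_floor or noise == 0.0:
            continue
        values.append(10.0 * math.log10(signal / noise))
    if not values:
        raise ValueError("no segment with usable energy")
    # Average of the per-segment values, i.e. the 1/M sum in Eq. (1).
    return sum(values) / len(values)
```

Averaging the logarithms per segment, rather than taking one SNR over the whole signal, keeps loud passages from dominating the result, which is the usual reason for preferring the segmental form for speech.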

In Table 2, 50 % fr means the 50 % frame loss in the case study where every second packet is lost and recovered with neural networks. 50 % rand linear means that randomly selected 1-2 consecutive frames are lost and interpolated/extrapolated with linear methods.

Table 2: The MOS and SNR values for 50 % frame loss experiments

       Ref    50 % fr    50 % fr linear    50 % rand    50 % rand linear
MOS
SNR

5 Discussion

The neural network results are comparable with the linearly predicted results, and the neural network preserves intelligibility as well as or even better than linear prediction at 5 %, 10 % and 15 % frame losses. The test showed that the neural network could estimate the parameters accurately enough in real-time situations. It should also be noted that only past information is available to the neural network estimator, while the linear interpolator can use future information. For every lost frame the neural network saves 30 milliseconds of recovery delay. This is especially important in real-time applications and videoconferencing systems where the quality of speech is a concern. Another thing that should be taken into account is that the original coder and decoder can be kept unchanged, which is extremely important in systems that are built to comply with certain standards, such as H.323.

The losses near 50 % gave similar results with both recovery methods. The sound quality was poor in the 50 % experiment because every second frame was lost, but the speech reconstructed by the neural network was still more understandable than the linearly predicted speech. This is expected, because linear prediction averages the missing packet to join together the previous and the next frame, while the neural network tries to predict what will come next. In the 50 % frame loss case the SNR favors neural networks over linear interpolation and supports the understandability finding. On the other hand, the MOS scale gave a slightly better result to the
linear interpolation than to neural networks when every second frame was lost. This is probably because the human ear prefers smooth and steady sounds to coarse sounds that contain slightly more information.

The packet loss limit is % of the stream where the understandability starts to weaken radically. Only parts of the sentences are understandable without enhancement, and nonredundant information is totally lost. These facts raised the idea of testing the ability of neural networks to solve the prediction problem in a very difficult situation. However, as mentioned before, a neural network cannot predict the next frame precisely, because that is basically artificial prediction of the future.

The total reconstruction delay is smaller with neural networks than with linear interpolation because of the designed structure: the neural network predicts the future solely from past information, so there is no need to wait for the next good frame to arrive. The algorithmic calculation time is longer in the neural network approach (21650 multiplications and 504 summations per predicted frame) than with linear interpolation (26 divisions and 26 summations per interpolated frame), but the results are better and there is no waiting time for the next good frame.

The network topology is also our concern. What makes the feedforward multilayer perceptron network better than other topologies in our case?
Compared with, for example, recurrent or time-delayed networks, the difference is that a basic recurrent network must be updated continuously with the stream information. In our experiment the idea was to call the neural network only when necessary and give it all the information needed for the estimation. Self-Organizing Maps are designed for classification, and they are suitable only if the output is limited to a finite number of possibilities. Radial-basis functions can provide the right type of output estimate, but the full nonlinearity of the feedforward multilayer perceptron network was preferred in the experiment.

Speech has certain long-term trends that a neural network can track and estimate. However, the optimal size of the network for frame recovery is very difficult to determine, as is the correct number of training data samples. The speech samples that are compressed to a finite G.723.1 parameter form still have a huge number of possible combinations when it comes to the recovery process and tracking. This raised a further vision of a finite number of solutions, where the network learns only a certain number of output possibilities that have been generalized from the training data with vector quantization methods. This approach will reduce the network size and thus the number of calculations.
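The call-on-demand use of the estimator described in the Methods and Experiments sections (a memory of six previous frames, an estimate requested only when a frame is missing, smoothing with the previous frame, and recovered frames feeding later estimates) can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation built into the decoder: the class name `FrameRecovery`, the `predictor` callable standing in for the trained network, and the fallback used when fewer than six frames are stored are all hypothetical.

```python
from collections import deque


class FrameRecovery:
    """Keeps the last six frames and estimates a frame only when one is lost."""

    def __init__(self, predictor, history=6):
        self.predictor = predictor           # hypothetical stand-in for the trained MLP
        self.memory = deque(maxlen=history)  # parameters of the most recent frames

    def receive(self, frame):
        """Return the frame to decode; pass None to signal a lost frame."""
        if frame is None:
            if not self.memory:
                raise ValueError("no previous frame to recover from")
            previous = self.memory[-1]
            if len(self.memory) < self.memory.maxlen:
                # Assumption: with too little history, repeat the previous frame.
                frame = list(previous)
            else:
                estimate = self.predictor(list(self.memory))
                # Smooth the estimate by averaging it with the previous frame.
                frame = [(e + p) / 2.0 for e, p in zip(estimate, previous)]
        # Recovered frames also enter the memory, hence the cumulative effect.
        self.memory.append(frame)
        return frame
```

Because the estimator is a plain function of the stored frames, nothing has to be updated while frames arrive normally, which is the property that separates this design from a continuously updated recurrent network.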

The other neural network topologies must then be taken into account.

The adaptive codebook causes the worst problems in the recovery, because it is impossible to estimate the missing adaptive codebook index. The only solution was to copy the adaptive codebook index from the previous frame.

References:
[1]
[2] ITU-T H.323: Packet based multimedia communication systems. 2/ pages
[3] ITU-T G.723.1 Annex A: Dual rate speech coder for multimedia communications transmitting at 5.3 kbit/s and 6.3 kbit/s. 11/96, 27 pages
[4] Grant Ho, Suat Yeldener & Marion Baraniecki: Improved Lost Frame Recovery Techniques for ITU-T G.723.1 Speech Coding System. Proceedings of EUSIPCO98, Greece, pp
[5] Aamir Husain & Vladimir Cuperman: Reconstruction of missing packets for CELP-based speech coders. Proceedings of IEEE ICASSP, Detroit 1995, volume 1, pp
[6] Fernando Díaz-de-Maria & Aníbal R. Figueiras-Vidal: Radial Basis Functions for Nonlinear Prediction of Speech in Analysis-by-Synthesis Coders. IEEE ICASSP, Detroit 1995, Topics in speech analysis, pp 788
[7] Jari Turunen & Pekka Loula: Assessment of Various Frame Recovery Mechanisms for G.728. Proceedings of ISCOM99 Conference, Kaohsiung, Taiwan, 1999
[8] Simon Haykin: Neural Networks. Prentice Hall International, Inc., 1999
[9] IEEE recommended practice for speech quality measurements. IEEE Transactions on Audio and Electroacoustics, September 1969
