On Improving the Performance of an ACELP Speech Coder


ARI HEIKKINEN, SAMULI PIETILÄ, VESA T. RUOPPILA, AND SAKARI HIMANEN
Nokia Research Center, Speech and Audio Systems Laboratory
P.O. Box, FIN-337 Tampere, Finland

Abstract: - In this paper we evaluate the performance of a variety of techniques for improving the parameter analysis in CELP speech coders. These methods include using an extended cost horizon in the fixed codebook search process, as well as joint optimization and delayed decision coding of the adaptive and fixed codebook parameters. Based on our simulations for the IS-641 speech coder, substantial improvements in objective performance are achieved, especially by using delayed decision coding, while the subjective improvements are more marginal. This paper also presents the justification for efficient coding methods based on the distribution of adaptive and algebraic codebook indices in the modified IS-641 coders, and demonstrates the performance improvements achieved by using a shaped lattice structure and adaptive pulse positioning to encode the adaptive and algebraic codebook indices. While the simulations were made using the IS-641 speech coder or a modified version of it, the results and observations can be generalized to most ACELP and CELP coders. At lower bit rates the importance of each approach described in this paper is expected to increase.

Key-Words: - algebraic code excited linear prediction

1 Introduction

In recent years, code excited linear prediction (CELP) [1] has been the most popular approach for high quality speech coding at bit rates approximately above kbps. This is especially true for a derivative of CELP coders called algebraic CELP (ACELP), and different ACELP coders have been widely accepted in recent speech coding standardization processes in 3GPP, ITU-T, ETSI and TIA. One example of such a coder is the 7.4 kbps IS-641 speech coder adopted by TIA [2].
However, at bit rates below kbps the quality of CELP coders in general deteriorates rapidly, as is partly evidenced by the recent efforts in ITU-T to standardize a high quality kbps speech coder [3]. To improve the performance of CELP coders and at the same time make them more amenable to lower bit rates, methods including relaxed waveform matching [4] and phase dispersion [5] have been suggested, which efficiently exploit the properties of the human speech perception mechanism. On the other hand, to tackle the limitations of the parameter analysis in CELP coding, extended cost horizon [6], joint optimization [7] and delayed decision coding [8] have been proposed. For increased coding efficiency, methods exploiting the uneven distribution of the excitation parameters of a CELP coder have also been presented, see e.g. [9, 10, 11].

In this paper, we evaluate the performance of using extended cost horizon, joint estimation and delayed decision coding in the excitation search process of the IS-641 speech coder. Furthermore, the justification for the enhanced methods employing the uneven distribution of the adaptive and algebraic codebook indices is given, together with the simulation results of the proposed approaches for their efficient coding.

This paper is organized as follows. In Section 2 the structure of the IS-641 speech coder is briefly described. The simulation results for using the extended cost horizon, joint optimization and delayed decision coding are presented in Section 3. The empirically found distributions for the adaptive and algebraic codebook indices are shown in Section 4. The concepts of shaped lattice and adaptive pulse positioning for efficient coding of adaptive and fixed codebook indices are also briefly described, together with the simulation results. Finally, conclusions are drawn.

V.T. Ruoppila is presently with VoiceAge Corp. in Montreal, Canada. S. Pietilä is presently with Nokia Mobile Phones in Tampere, Finland.
2 IS-641 Speech Coder

In the ACELP speech coder, a cascade of a time-variant pitch predictor and a linear prediction (LP) filter is used to filter an excitation signal, see Fig. 1.

[Fig. 1. Block diagrams of ACELP encoder (left) and decoder (right).]

An all-pole LP filter

$$H(z) = \frac{1}{A(z)} = \frac{1}{1 + a_1 z^{-1} + a_2 z^{-2} + \dots + a_p z^{-p}} \qquad (1)$$

where $a_1, \dots, a_p$ are the coefficients, is used to model the short-time spectral envelope of the speech signal. A pitch predictor of the form

$$\frac{1}{B(z)} = \frac{1}{1 - b z^{-\tau}} \qquad (2)$$

utilizes the pitch periodicity of speech to model the fine structure of the spectrum. The gain $b$ is bounded to the interval 0 to 1.2, and the pitch period, or equivalently the pitch lag, $\tau$ to the interval of 20 to 143 samples (the sampling frequency is 8 kHz). The pitch predictor is also referred to as the long-term predictor (LTP) filter. In Fig. 1, the LTP filter is represented by the feedback loop consisting of the delay $z^{-\tau}$ and the gain $b$. The LTP memory can also be seen as a codebook consisting of overlapping codevectors. This codebook is usually referred to as the LTP or adaptive codebook.

An algebraic excitation signal, or more generally a fixed excitation signal, $u_c(n)$ is multiplied by a gain $g$ to form an input signal to the filter cascade. The algebraic excitation signal is composed of pulses having a value of ±1 and zeros, and the corresponding codebook is called the algebraic codebook. The output of the filter cascade is a synthesized speech signal $\hat{s}(n)$. An error signal $e(n)$ is computed by subtracting the synthesized speech signal $\hat{s}(n)$ from the original speech signal $s(n)$. The optimal adaptive and algebraic codevectors are selected sequentially by minimizing the weighted sum-squared error. The purpose of the weighting filter $W(z)$ is to shape the spectrum of the error signal so that it is less audible.

The frame length used in the IS-641 coder is 20 ms, and a frame is further divided into four subframes of equal length. One set of LP coefficients is derived for each frame and it is encoded with 26 bits. The other parameters are derived subframe-wise.
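As an illustration of the synthesis model described in this section, the following minimal sketch passes a gain-scaled algebraic excitation through the LTP feedback loop and the all-pole LP filter. It is not the IS-641 reference implementation: the function name `synthesize`, the history arguments, the restriction to integer lags, and the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p are assumptions made for this example.

```python
from typing import List, Sequence, Tuple


def synthesize(
    u_c: Sequence[float],     # algebraic excitation for one subframe (pulses of +/-1 and zeros)
    g: float,                 # algebraic codebook gain
    b: float,                 # pitch (LTP) gain
    tau: int,                 # integer pitch lag in samples (assumption: no fractional lags)
    a: Sequence[float],       # LP coefficients a_1..a_p, assuming A(z) = 1 + sum_i a_i z^-i
    u_hist: Sequence[float],  # past total excitation, at least tau samples
    s_hist: Sequence[float],  # past synthesized speech, at least p samples
) -> Tuple[List[float], List[float]]:
    """Return (synthesized speech, total excitation) for one subframe."""
    if tau < 1 or len(u_hist) < tau or len(s_hist) < len(a):
        raise ValueError("invalid lag or history buffers too short")
    u = list(u_hist)
    s = list(s_hist)
    for x in u_c:
        # LTP feedback loop 1/(1 - b z^-tau): u(n) = g*u_c(n) + b*u(n - tau)
        u_n = g * x + b * u[-tau]
        # LP synthesis 1/A(z): s(n) = u(n) - sum_i a_i * s(n - i)
        s_n = u_n - sum(a_i * s[-i] for i, a_i in enumerate(a, start=1))
        u.append(u_n)
        s.append(s_n)
    return s[len(s_hist):], u[len(u_hist):]


if __name__ == "__main__":
    # Hypothetical toy values: a single pulse, a one-tap LP filter, LTP switched off.
    speech, excitation = synthesize(
        [1.0, 0.0, 0.0], g=2.0, b=0.0, tau=1, a=[-0.5], u_hist=[0.0], s_hist=[0.0]
    )
    print(speech, excitation)
```

An encoder would run this synthesis for each candidate codevector and compare the result with the original speech through the weighting filter; the weighting step is left out here to keep the sketch short.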
The pitch lag is encoded by 26 bits (8+5+8+5) while 68 bits (4×17) are used to code the pulse positions together with their signs. The pitch gain and the algebraic codebook gains are vector quantized by 28 bits (4×7).

The decoder receives the parameters from the channel, see Fig. 1, and determines the algebraic excitation signal from the received index and gain. The algebraic excitation signal is filtered through the LTP-LP filter cascade to produce the synthesized speech signal. Finally, a postfilter P(z) is employed to enhance the perceptual speech quality.

3 Modified Parameter Analysis

In a typical CELP coder, there are two important limitations in the parameter estimation process, both of which can partly be justified by reduced complexity. Firstly, the different parameters are optimized sequentially instead of jointly. Secondly, the cost function used to find the excitation signals (adaptive and fixed) minimizes the sum-squared error within the current subframe, but it does not take into account the effect that the excitation signal has on the subsequent subframes. One result of subframe-based error minimization is that, due to LP filtering, the excitation samples at the first positions of the subframe contribute more to the cost function than the samples at the last positions.

To alleviate these problems, it has been proposed in [6] that the cost function of the fixed codebook search is extended to cover the beginning of the next subframe. In the presented approach, the target signal and the synthesized speech signal are extended by concatenating their free evolutions (the filter output for a zero-valued excitation) to the original signals. In [7], the adaptive and fixed codebook parameters were searched jointly instead of sequentially. A solution described in [8] is the delayed decision method, where a predetermined number of fixed and adaptive codebook parameter candidates are chosen for each subframe in the current frame. After the last subframe, the parameter combination that gives the best total performance over the whole frame is chosen. The advantages of this approach include simultaneous optimization of the adaptive and fixed codebook excitation parameters, as well as taking into account the influence of the current subframe parameters on the successive subframes.

In delayed decision coding various kinds of tree coding algorithms can be used, which are mainly classified by the decision timing. In the first of the two most typical methods, a decision is made simultaneously for all subframes in a frame by selecting the best path in the tree. In the other widely used method, the decision is made for each subframe s by considering the cumulative distortion from the s-th to the (s+N)-th subframe. In our simulations the second approach was used with N set to one, resulting in an additional coder delay of one frame. This delay is needed to determine the excitation parameters for the last subframe of the current frame. To evaluate the performance of the three methods described above, we implemented them in the IS-641 speech coder.

[Fig. 2. Simulation results for joint optimization and delayed decision coding of adaptive and algebraic codebook parameters in the IS-641 speech coder. Panels: whole speech, voiced speech, unvoiced speech. Vertical axis: SegSNR; horizontal axis: max{NUM_ALG, NUM_ADA}. Curves: joint optimization and three delayed decision configurations.]
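The idea behind delayed decision coding can be sketched with a small M-best tree search. The sketch below is a toy under stated assumptions, not the configuration used in the simulations: it implements the first of the two methods mentioned above (one decision per frame by selecting the best path), and the one-pole synthesis filter, the single-pulse candidates, and the names `m_best_search` and `make_step` are all hypothetical. With `m=1` the search reduces to the conventional subframe-by-subframe minimization.

```python
from typing import Callable, List, Sequence, Tuple

Candidate = Tuple[int, float]  # hypothetical excitation: (pulse position, sign)
Step = Callable[[float, Candidate, Sequence[float]], Tuple[float, float]]


def m_best_search(
    targets: Sequence[Sequence[float]],
    candidates: Sequence[Candidate],
    step: Step,
    init_state: float,
    m: int,
) -> Tuple[float, List[Candidate]]:
    """Keep the m best partial paths after each subframe and return the best full path."""
    paths = [(0.0, init_state, [])]  # (cumulative error, filter state, chosen candidates)
    for target in targets:
        expanded = []
        for cost, state, chosen in paths:
            for cand in candidates:
                c, new_state = step(state, cand, target)
                expanded.append((cost + c, new_state, chosen + [cand]))
        expanded.sort(key=lambda path: path[0])
        paths = expanded[:m]  # prune to limit the growth of the tree
    total_cost, _, chosen = paths[0]
    return total_cost, chosen


def make_step(a: float, length: int) -> Step:
    """Toy subframe model: one pulse filtered by y(n) = x(n) + a*y(n-1)."""
    def step(state: float, cand: Candidate, target: Sequence[float]) -> Tuple[float, float]:
        pos, sign = cand
        y, cost = state, 0.0
        for n in range(length):
            x = sign if n == pos else 0.0
            y = x + a * y                  # filter memory carries over between subframes
            cost += (target[n] - y) ** 2   # sum-squared error within the subframe
        return cost, y
    return step


if __name__ == "__main__":
    targets = [[0.0, 0.0, 0.0, 1.0], [1.5, 0.5, 0.0, 0.0], [0.0, -1.0, -0.8, -0.6]]
    cands = [(p, s) for p in range(4) for s in (1.0, -1.0)]
    step = make_step(a=0.8, length=4)
    sequential = m_best_search(targets, cands, step, 0.0, m=1)
    delayed = m_best_search(targets, cands, step, 0.0, m=len(cands) ** len(targets))
    print("sequential error:", sequential[0], "delayed decision error:", delayed[0])
```

Because the filter state at the end of one subframe feeds into the next, a candidate that is slightly worse in the current subframe can give a lower total error over the frame; keeping several paths alive is what allows the search to find such combinations.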
Based on our simulation results, a maximum increase of. dB in segmental SNR was achieved by using the extended cost horizon approach for the algebraic codebook search. This improvement was achieved with an extension length of eight samples, while the other extension lengths in the range of - samples performed approximately.-. dB better than the original coder. In general, the improvements were larger for voiced than for unvoiced speech. In computing the extended excitation signal, no pitch sharpening was applied to the extended algebraic excitation segment.

In Fig. 2, the simulation results for different delayed decision configurations in the IS-641 speech coder are shown. In the figure, the number of adaptive and algebraic codebook parameter sets derived at each stage is denoted by NUM_ADA and NUM_ALG, respectively. The growth in the number of paths was restricted by considering only the NUM_ADA NUM_ALG best candidates at each stage in the tree. Unquantized gain values were used in the simulations. In addition to the different delayed decision configurations, the performance of joint optimization of the adaptive and algebraic codebook parameters within each subframe is illustrated in Fig. 2.

As can be observed from Fig. 2, clear improvements in segmental SNR can be achieved by using delayed decision coding. Improvements can also be achieved by joint optimization of the adaptive and algebraic codebook excitation parameters, although better performance is achieved by delayed decision coding. In informal listening experiments, the improvements achieved by all tested methods were judged to be rather marginal. At lower bit rates, however, the subjective importance of these methods is expected to be higher.

4 Distribution of Codebook Indices

4.1 Adaptive Codebook Indices

In the IS-641 speech coder, the smooth evolution of the pitch contour during voiced speech is exploited by using differential coding for every other pitch value. The absolute pitch period is searched from the range of 19 1/3 to 143 samples for the first and third subframes. In the range of 19 1/3 to 84 2/3 samples, a resolution of 1/3 is used, while integer values are used in the range of 85 to 143 samples. For the second and fourth subframes the pitch periods are searched from the neighborhood of the pitch period in the previous subframe. The range of the search for the delta pitch periods is - / 3 to 5 / 3 samples using a resolution of 1/3.

Generally speaking, coding of n successive delta pitch periods can be described as an n-dimensional lattice where each dimension represents a pitch period in a corresponding subframe [11]. In a typical lattice coding of delta periods, attention is paid only to the selection of the boundary values, while the rectangular shape of the lattice is maintained. No further care is taken to choose a set of points that covers only the most likely combinations. Since the pitch period usually evolves smoothly during voiced speech, the rectangular lattice also covers points that are rarely used. Thus, the coding efficiency can be increased by shaping the lattice to eliminate unlikely pitch period combinations from the resulting coding scheme.

[Fig. 3. The differences between successive pitch periods in the modified IS-641 speech coder.]

[Fig. 4. A three-dimensional lattice for delta periods in the modified IS-641 speech coder.]
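The absolute lag grid described above can be enumerated to check that it fits the 8-bit lag index of the first and third subframes. This is a minimal sketch that assumes the IS-641 lag ranges of 19 1/3 to 84 2/3 samples at 1/3 resolution and 85 to 143 samples at integer resolution; the function names are hypothetical, and the 32-point delta range in the usage lines is only an example matching a 5-bit delta index per subframe.

```python
from fractions import Fraction
from typing import List


def absolute_lag_grid() -> List[Fraction]:
    """Candidate absolute pitch lags in samples, in increasing order."""
    fine = [Fraction(k, 3) for k in range(58, 255)]   # 19 1/3 ... 84 2/3 in steps of 1/3
    coarse = [Fraction(k) for k in range(85, 144)]    # 85 ... 143 in integer steps
    return fine + coarse


def index_bits(n_points: int) -> int:
    """Smallest number of bits that can index n_points distinct values."""
    return (n_points - 1).bit_length()


def rectangular_lattice_bits(points_per_dim: int, n_dims: int) -> int:
    """Bits needed to index every point of a rectangular n-dimensional lattice."""
    return index_bits(points_per_dim ** n_dims)


if __name__ == "__main__":
    grid = absolute_lag_grid()
    print(len(grid), "absolute lags need", index_bits(len(grid)), "bits")
    # Example: three delta periods with 32 points each, coded on a rectangular lattice.
    print("rectangular 3-D delta lattice:", rectangular_lattice_bits(32, 3), "bits")
```

A rectangular lattice spends index space on every combination of delta values, including unlikely ones, which is the inefficiency that lattice shaping is meant to remove.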

In [11] we proposed a shaped lattice structure derived from the empirically found distribution of delta periods in a modified IS-641 coder. In the modified coder, the absolute pitch period is used only for the first subframe while delta pitch periods are used for the other subframes. The distribution of delta periods over a large database is shown in Fig. 3, where the difference between the pitch periods of the (i+1)th subframe and the ith subframe is denoted by $d_i$. The proposed shaped lattice is given in Fig. 4, and is composed of a union of non-overlapping hypercubes $D_i$, which are defined by the delta period range and the resolution used in each dimension. Different hypercubes are marked by the dashed lines in the figure, and can be defined by their unique edges. For example, the hypercube D is defined by the edges a, b and c in the figure.

The lattice structure used for the simulations was symmetric with respect to the axes $d_1$, $d_2$ and $d_3$. The point distribution in the last three dimensions was uniform and a resolution of 1/3 was used. Because of the symmetry, the three-dimensional lattice can be unambiguously defined by one corner point of the projection of D onto the axes $d_1$ and $d_2$, see Fig. 4. In the optimal index search from the lattice, a single open-loop pitch estimate was first derived jointly for the last three subframes. The closed-loop pitch was then derived from the neighborhood of the derived open-loop pitch.

In the simulations, three different shaped lattices $S_A$, $S_B$, and $S_C$ were implemented for the modified IS-641 coder with corner points ( / 3, / 3 ), ( / 3, / 3 ), and ( / 3, / 3 ), respectively. As a reference, two cubic lattices $L_A$ and $L_B$ with maximum delta periods of / 3 and / 3 were used. These ranges were selected based on the distributions presented in Fig. 3. The simulation results are presented in Table 1.
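To make the bit saving of a shaped lattice concrete, the sketch below counts the points in a union of non-overlapping boxes on a 1/3-sample grid and compares the index size with that of the enclosing cube. The box layout (a central cube plus six narrow arms along the axes) is purely hypothetical and is not one of the lattices $S_A$, $S_B$ or $S_C$; the names `box_points`, `lattice_points` and `index_bits` are likewise assumptions of this example.

```python
from itertools import product
from typing import Sequence, Set, Tuple

# A box is given by inclusive (lo, hi) bounds per dimension, in units of 1/3 sample.
Box = Tuple[Tuple[int, int], ...]
Point = Tuple[int, ...]


def box_points(box: Box) -> Set[Point]:
    """All grid points inside one axis-aligned box."""
    return set(product(*(range(lo, hi + 1) for lo, hi in box)))


def lattice_points(boxes: Sequence[Box]) -> Set[Point]:
    """Union of the boxes; raises if two boxes share a point."""
    points: Set[Point] = set()
    for box in boxes:
        new = box_points(box)
        if points & new:
            raise ValueError("boxes overlap")
        points |= new
    return points


def index_bits(n_points: int) -> int:
    """Smallest number of bits that can index n_points distinct values."""
    return (n_points - 1).bit_length()


# Hypothetical reference: cubic lattice with delta periods of at most 2 samples per axis.
cube = [((-6, 6),) * 3]

# Hypothetical shaped lattice: small deltas in all subframes, or one larger delta
# while the other two deltas stay close to zero.
centre: Box = ((-3, 3),) * 3
arms = [
    tuple((lo, hi) if d == axis else (-1, 1) for d in range(3))
    for axis in range(3)
    for lo, hi in ((4, 6), (-6, -4))
]
shaped = [centre] + arms

if __name__ == "__main__":
    for name, boxes in (("cubic", cube), ("shaped", shaped)):
        n = len(lattice_points(boxes))
        print(name, "lattice:", n, "points,", index_bits(n), "bits")
```

In this toy layout the shaped lattice keeps the full delta range along each axis but needs fewer index bits than the cube, because combinations of several large deltas are excluded.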
The results are expressed as segmental SNRs between the voiced sections of the prefiltered input speech and the synthesized and postfiltered speech, together with the number of bits needed for the coding of the delta periods in each frame. As can be seen from Table 1, the coding efficiency of successive pitch periods can be increased by using the shaped lattice structure.

Scheme                 SegSNR (dB)   Bits
Lattice L_A            8..
Lattice L_B
Shaped Lattice S_A
Shaped Lattice S_B     8..
Shaped Lattice S_C

Table 1. Segmental SNRs and the number of bits needed for different three-dimensional lattices.

4.2 Algebraic Codebook Indices

In low bit rate CELP coders, the target signal for the fixed codebook search is highly periodic due to the inability of the adaptive codebook to model the periodicity of input speech. In ACELP coders, periodicity is thus introduced to the algebraic excitation signal by the pitch sharpening procedure, where the gain-scaled algebraic excitation is repeated at the pitch interval. To further exploit the periodicity of the target signal, an adaptive algebraic codebook was presented in [10]. The presented approach was based on the assumption that the distribution of the pulses in the algebraic codebook is related to the locations of pitch pulses during voiced speech.

In our experiments, we first wanted to verify the assumption that the pulses of the algebraic excitation are located in the vicinity of pitch pulses during voiced speech in the IS-641 coder. We first located the pitch pulses in the voiced regions of speech using the time domain energy contour of the LP residual signal. Subsequently, we encoded the same signal with a modified IS-641 coder. In the modification, all excitation pulse combinations instead of the tabulated positions were used in the coder in order to give more reliable results about the desired pulse positions. Finally, we compared the pitch pulse locations and the excitation pulse positions. Fig. 5 depicts the relative distribution of the excitation pulses with respect to the pitch pulse locations. As can be seen from the figure, the pitch pulse position and its vicinity clearly dominate the graph. In addition, it was observed in the experiments that positive pulses dominated this region over negative pulses.

[Fig. 5. Histogram of excitation pulse locations with respect to pitch pulse locations. Horizontal axis: distance from closest pitch pulse in normalized pitch periods; vertical axis: percentage.]

Based on these observations, a simplistic approach derived from the one described in [10] was taken to generate an adaptive algebraic codebook for simulation purposes. In the original IS-641 coder, 17 bits are used to code four positive or negative pulses per subframe (indices 0, 5, ..., 35; 1, 6, ..., 36; 2, 7, ..., 37; 3, 4, 8, 9, ..., 38, 39). In our modification, we replaced the positions 4, 9, ..., 39 of the fourth pulse by adaptive locations centered on the largest energy peak of the adaptive codebook excitation, which typically indicates a pitch pulse. After this modification, an increase of. dB in segmental SNR during voiced speech was achieved compared to the original method. It should be noted that the improvements from adaptive pulse positioning are expected to be higher at lower bit rates due to the sparser algebraic codebook. It is also likely that further improvements can be achieved by using more sophisticated methods for defining the adaptive pulse positions.

5 Conclusion

In this paper the performance of different techniques for improving the parameter analysis in CELP speech coders was evaluated using the IS-641 speech coder as the simulation platform. The evaluated methods included using an extended cost horizon in the algebraic excitation search process, as well as joint optimization and delayed decision coding of the adaptive and algebraic codebook parameters. In addition, justification for efficient coding methods based on the distribution of adaptive and algebraic codebook indices in the modified IS-641 coders was given, and the performance of shaped lattice and adaptive pulse positioning for coding the codebook indices was demonstrated. Based on the simulations, substantial improvements in objective performance are achieved, especially by using delayed decision coding, while the improvements in subjective speech quality were found to be more marginal. On the other hand, it is expected that larger subjective improvements are achieved with the described methods when the bit rate is lowered from around 7.4 kbps. While the simulations were made using the IS-641 speech coder or a modified version of it, the conclusions can be generalized to a majority of ACELP and CELP coders.

References:
[1] M.R. Schroeder and B.S. Atal, Code-excited linear prediction (CELP): high-quality speech at very low bit rates, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp , 1985.
[2] T. Honkanen, J. Vainio, K. Järvinen, P. Haavisto, R. Salami, C. Laflamme and J.-P. Adoul, Enhanced full rate speech codec for IS-136 digital cellular system, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp , 1997.
[3] ITU-T, Q./ Rapporteur's Meeting Report, September, 1999.
[4] W.B. Kleijn, P. Kroon and D. Nahumi, The RCELP speech coding algorithm, European Transactions on Telecommunications, Vol. 5, No. 5, pp , 1994.
[5] R. Hagen, E. Ekudden, B. Johansson and W.B. Kleijn, Removal of sparse-excitation artifacts in CELP, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 145-148, 1998.
[6] S. Cucci, M. Fratti and M. Ronchi, On improving performance of analysis by synthesis speech coders, IEEE Transactions on Speech and Audio Processing, Vol., No. 3, pp. 3-7, 99.
[7] L. Zhang, T. Wang and V. Cuperman, A CELP variable rate speech codec with low average rate, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp , 1997.
[8] K. Mano and T. Moriya, 4.8 kbit/s delayed decision CELP coder using tree coding, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. -, 1990.
[9] T. Eriksson and J. Sjöberg, Dynamic bit allocation in CELP excitation coding, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 7 7, 1993.
[10] T. Amada, K. Miseki and M. Akamine, CELP speech coding based on an adaptive pulse position codebook, Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 3-, 1999.
[11] A. Heikkinen, V.T. Ruoppila and S. Pietilä, A shaped lattice quantizer for successive pitch periods, Proceedings of EUROSPEECH, pp ,.
