Speaker Diarization System Based on GMM and BIC

Tantan Liu 1, Xiaoxing Liu 1, Yonghong Yan 1

1 ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080
{tliu, xliu, yyan}@hccl.ioa.ac.cn

Abstract. This paper presents an approach to speaker diarization based on a novel combination of the Gaussian mixture model (GMM) and the standard Bayesian information criterion (BIC). The Gaussian mixture model provides a good description of the feature vector distribution, and BIC supplies a proper merging and stopping criterion. Our system combines the advantages of these two methods and yields favorable performance. Experiments carried out on Mandarin broadcast news data demonstrate the advantage of the proposed approach, which performs better than an approach based on GMM clustering alone.

Keywords: speaker diarization, clustering, GMM, BIC.

1 Introduction

Speaker diarization is the process of detecting the turns in speech caused by speaker changes and clustering the speech from the same speaker together; it thus provides useful information for structuring and indexing audio documents. By separating the input speech according to speaker identity, a diarization system can produce speaker-homogeneous speech clusters that allow more accurate speaker models for speaker recognition in telephone conversations. In contrast to the speaker tracking task, where information about the speakers is already available, in speaker diarization there is no training data for the speakers, and the number of speakers in the input speech is unknown in advance. Existing approaches to speaker diarization differ mainly in the choice of the inter-cluster distance and the stopping criterion. In [1], an adapted Gaussian mixture model (GMM) is used to model speech segments, the inter-cluster distance is computed from the GMM parameters, and a distance threshold also acts as the stopping criterion.
In [2], the Bayesian information criterion is used both as the inter-cluster distance and as the stopping criterion. The cross log-likelihood ratio is proposed in [3] and the generalized likelihood ratio (GLR) in [4] as the inter-cluster distance. A set of anchor models is used in [5] to map segments into a vector set, to which Euclidean distances and an ad hoc occupancy stopping criterion are applied. We propose an approach using a bottom-up clustering scheme that integrates the adapted Gaussian mixture model and the Bayesian information criterion, which both describes the feature vector distribution well and provides a reliable stopping criterion. The system uses a novel grouping criterion and stopping criterion
based on an inter-cluster distance derived from adapted Gaussian mixture models and the Bayesian information criterion. We compare the performance of the proposed method with that of the GMM parameter-distance-based approach. The remainder of this paper is organized as follows: Section 2 describes the principle of the adapted Gaussian mixture model. Section 3 describes the principle of the Bayesian information criterion. Section 4 describes our system in detail and the way the two approaches are integrated. The experimental results are presented in Section 5, followed by some conclusions.

2 Clustering based on adapted Gaussian mixture models

As in [1], the input speech is chopped into small segments in the hope that each segment contains only one speaker. Initially, each segment is a cluster and is modeled by a Gaussian mixture model. A universal background model (UBM), itself a Gaussian mixture model, is trained on the whole input speech, and then a cluster-dependent Gaussian mixture model adapted from the UBM is obtained for each cluster.

2.1 Inter-cluster distance based on adapted Gaussian mixture models

The probability density function of a K-component Gaussian mixture model for a random variable x is defined as:

P(x \mid \Lambda) = \sum_{k=1}^{K} \omega_k b_k(x; m_k, S_k)    (1)

where b_k(\cdot) is a Gaussian density function and \Lambda = \{\omega_k, m_k, S_k\} is the set of parameters; \omega_k is the weight of the k-th component, with the constraint \sum_{k=1}^{K} \omega_k = 1.

The universal background model is trained with the expectation-maximization (EM) algorithm, and each cluster-dependent Gaussian mixture model is adapted [6] from the UBM with the maximum a posteriori (MAP) algorithm. For the purpose of computing the inter-cluster distance, only the means m_k are adapted; the weights \omega_k and variances \sigma_k are left unchanged. The distance between two GMMs is:

D(P_1, P_2) = \sum_{k=1}^{K} \omega_k \sum_{d=1}^{D} \frac{(m_{1,k,d} - m_{2,k,d})^2}{\sigma_{k,d}^2}    (2)

where D is the feature dimension, m_{i,k,d} is the d-th mean coefficient of the k-th component of model P_i, and \sigma_{k,d}^2 is the corresponding UBM variance.
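Because the two adapted models inherit their weights and variances from the UBM and differ only in the means, Eq. (2) reduces to a weighted, variance-normalized squared distance between the two sets of component means. A minimal sketch (the function name and array layout are our own, not from the paper):

```python
import numpy as np

def gmm_mean_distance(weights, means_a, means_b, variances):
    """Inter-cluster distance of Eq. (2) between two GMMs adapted from
    the same UBM: weights (K,) and diagonal variances (K, D) are shared,
    only the component means (K, D) differ between the models."""
    diff_sq = (means_a - means_b) ** 2            # (K, D) squared mean gaps
    return float(np.sum(weights[:, None] * diff_sq / variances))
```

With identical means the distance is zero; it grows with the variance-normalized gap between corresponding component means, which is what makes it usable as a merge ordering.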
2.2 Clustering procedure

During the clustering procedure, the two clusters with the minimum distance are grouped into a new cluster and a new Gaussian mixture model is estimated for the new cluster. When the minimum distance rises above the threshold, the clustering procedure is stopped.

3 Bayesian information criterion

At the beginning of clustering, the short-duration segments may not be able to support the large set of parameters of a Gaussian mixture model, and thus the clusters with the minimum inter-cluster distance may not really come from the same speaker. In addition, the optimal threshold varies from one input speech to another. To solve this problem, we use the Bayesian information criterion [2] as a merging and stopping criterion in the clustering step.

3.1 The principle of the Bayesian information criterion

Generally, let X = \{x_i, i = 1, \ldots, N\} be the feature vectors of the input speech and let M be a candidate parametric model. The BIC criterion is defined as:

BIC(M) = \ln L(X, M) - \lambda \frac{\#M}{2} \log N    (3)

where L(X, M) is the likelihood of the input speech given the model M, \#M is the number of parameters in the model M, and N is the sample size of the input speech.

3.2 Merging criterion

Assume two segments are modeled by Gaussian models N(\mu_1, \Sigma_1) and N(\mu_2, \Sigma_2) separately, with sample sizes N_1 and N_2, and that the two segments merged together are modeled by a single Gaussian model N(\mu, \Sigma) with sample size N_1 + N_2. The increase of the BIC value is:

\Delta BIC = (N_1 + N_2) \log |\Sigma| - N_1 \log |\Sigma_1| - N_2 \log |\Sigma_2| - \lambda P    (4)

where \lambda is the penalty weight and the penalty P is:

P = \frac{1}{2} \left( d + \frac{1}{2} d(d+1) \right) \log N    (5)

where d is the feature vector dimension. Taking N = N_1 + N_2 is referred to as a local BIC penalty, while taking N to be the size of the whole set of clusters, N = \sum_k N_k, is
referred to as a global BIC penalty. According to [3], the local BIC penalty seems to be the better merging criterion. If the increase of BIC in equation (4) is negative, the two segments are deemed to come from the same speaker and should be merged.

4 Clustering based on the combination of GMM and BIC

Our system, shown in Fig. 1, is based on the combination of the Gaussian mixture model and the Bayesian information criterion. The input speech is chopped into small segments and gender-classified; a small GMM is trained for each segment, and clusters with the minimum distance and a negative delta BIC are merged iteratively with GMM re-estimation. Once few enough clusters remain, a large GMM is trained for each cluster, the clusters are recombined, and the segments are re-classified to produce the diarization output.

Fig. 1. Diarization system based on GMM and delta BIC
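The merging test of Section 3.2 follows directly from Eqs. (4) and (5). In the sketch below, each segment is a matrix of feature vectors, each Gaussian uses the maximum-likelihood full covariance, and the local penalty (N = N1 + N2) is applied; the function names are illustrative, not from the paper:

```python
import numpy as np

def delta_bic(seg1, seg2, lam=1.0):
    """Delta BIC of Eqs. (4)-(5) for two segments seg1 (N1, d) and
    seg2 (N2, d), each modeled by a single full-covariance Gaussian.
    A negative value suggests the segments share one speaker (merge)."""
    both = np.vstack([seg1, seg2])
    n1, n2, n = len(seg1), len(seg2), len(both)
    d = both.shape[1]

    def logdet_ml_cov(x):
        # log-determinant of the maximum-likelihood covariance estimate
        return np.linalg.slogdet(np.cov(x, rowvar=False, bias=True))[1]

    # local BIC penalty, Eq. (5), with N = N1 + N2
    p = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (n * logdet_ml_cov(both)
            - n1 * logdet_ml_cov(seg1)
            - n2 * logdet_ml_cov(seg2)
            - lam * p)
```

Same-speaker data drives the value negative (the pooled Gaussian fits about as well as the pair, so the penalty dominates), while well-separated segments inflate the pooled covariance and drive it strongly positive.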
4.1 Segmentation

The aim of this step is to detect the likely change points in the input speech [7]. A pair of sliding windows is applied to the audio feature vector stream extracted from the input speech, and the feature vectors within each window are modeled by two separate single Gaussian models. The distance between the Gaussian models is calculated with the Bhattacharyya distance. Points with a local maximum of the distance are detected as likely change points, and the input speech is segmented into small, acoustically homogeneous segments at those points.

4.2 Gender classification

Classification is done using maximum likelihood classification with GMMs for male, female, noise and music. The GMMs, each with 64 Gaussians, were trained on about 1 hour of acoustic data from CCTV broadcast news.

4.3 Clustering

In the clustering process, each cluster is modeled by a GMM adapted from the UBM as described in Section 2. At each iteration, the pair of clusters with the minimum inter-cluster distance and a negative delta BIC is merged. Here the minimum inter-cluster distance is the smallest of the inter-cluster distances that also have a negative delta BIC; pairs with a positive delta BIC are excluded. If the chosen minimum inter-cluster distance is above the threshold, or all delta BIC values are positive, the clustering process is stopped. The threshold is determined on the development data and used on the test set to make sure the size of the clusters is large enough to support the large GMM.

4.4 Re-clustering

At the beginning of the clustering process, the GMMs used to model clusters are trained on short-duration segments with a limited set of parameters per cluster: a GMM with 32 diagonal components. As the clusters grow during the process, a more complex GMM is needed.
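The merge-and-stop logic of Sections 4.3-4.4 can be sketched as an agglomerative loop in which delta BIC vetoes candidate merges and the distance threshold stops the process. The pair-scoring callables below are placeholders for the GMM distance and delta BIC described earlier; the structure, not the exact implementation, is what the paper specifies:

```python
import numpy as np

def agglomerate(clusters, distance, delta_bic, dist_threshold):
    """Sketch of the merging loop of Section 4.3: repeatedly merge the
    pair of clusters with the smallest inter-cluster distance among the
    pairs whose delta BIC is negative; stop when no such pair remains or
    the smallest admissible distance exceeds the threshold.
    `clusters` is a list of feature arrays; `distance` and `delta_bic`
    are caller-supplied pair-scoring functions (hypothetical names)."""
    while len(clusters) > 1:
        best, best_d = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if delta_bic(clusters[i], clusters[j]) >= 0:
                    continue                      # BIC vetoes this merge
                d = distance(clusters[i], clusters[j])
                if best_d is None or d < best_d:
                    best, best_d = (i, j), d
        if best is None or best_d > dist_threshold:
            break                                 # stopping criterion
        i, j = best
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

In the paper's system a new GMM is re-estimated for each merged cluster before the next iteration; the sketch simply pools the feature vectors and leaves model re-estimation inside the scoring callables.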
Furthermore, the former clustering procedure tends to split a speaker's speech recorded under different background conditions, so cepstral mean normalization is used to mitigate the effects of the background. Here we use a 128-component GMM for each cluster; the clustering and stopping criteria are the same as in Section 4.3, while the stopping threshold is optimized on the development data and used on the test set to determine the expected number of speakers in the input speech.

4.5 Re-classification

In the last step of the system, each speech segment is re-classified using maximum likelihood classification with the final Gaussian mixture model of each cluster.
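The change detection of Section 4.1 scores each candidate point by the Bhattacharyya distance between single Gaussians fitted to the two sliding windows. A sketch for the diagonal-covariance case (this closed form is the standard Gaussian Bhattacharyya distance; the paper does not spell out its exact variant):

```python
import numpy as np

def bhattacharyya_distance(mu1, var1, mu2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians
    N(mu1, var1) and N(mu2, var2); larger values mean the two sliding
    windows are less likely to cover the same acoustic source."""
    var = 0.5 * (var1 + var2)                       # averaged variances
    mean_term = 0.125 * np.sum((mu1 - mu2) ** 2 / var)
    cov_term = 0.5 * np.sum(np.log(var / np.sqrt(var1 * var2)))
    return float(mean_term + cov_term)
```

Identical windows score 0, and a local maximum of this score over time is taken as a likely change point.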
5 Experimental Results

The data for development and evaluation were drawn from Mandarin broadcast news. Each set contains a total of six broadcasts, corresponding to 1 hour and 30 minutes of CCTV broadcast news and 1 hour of Sichuan Television (SCTV) broadcast news.

5.1 Error Measure

Diarization performance is evaluated according to the performance measures defined by NIST for the Rich Transcription 2003 evaluation [8]. The output of a diarization system is a set of hypothesized speakers with the beginning and end times of each speaker's speech. An optimal mapping of the reference speakers to the hypothesis speakers is performed to maximize the overlap of the reference and mapped hypothesis speakers.

5.2 Results

Table 1 shows the results of the two systems on the development set. The first row shows the error rates of the system based on the adapted GMM distance, while the second row shows the error rates of our system. From the table, we can see that the combination of GMM and delta BIC yields a significant improvement over the system based on GMM alone: 7.2% compared to 14.0%. Furthermore, the table also demonstrates that for the system based only on GMM it is difficult to find a global threshold suitable for all input speech: with the threshold chosen for the system, some broadcasts have a low error rate while others have a rather high one. With the BIC criterion included, the system can better determine when to stop the agglomerative clustering.

Table 1. Diarization error rates for each broadcast on the development set achieved by the baseline system and the proposed system

          CCTV1  CCTV2  CCTV3  SCTV1  SCTV2  SCTV3  All
GMM       11.5   21.0   14.7   11.4   10.9   12.1   14.0
GMM&BIC    7.5    4.5    9.7    6.5    8.1    6.7    7.2
Table 2. Diarization error rates for each broadcast on the evaluation set achieved by the baseline system and the proposed system

          CCTV1  CCTV2  CCTV3  SCTV1  SCTV2  SCTV3  All
GMM       30.4   13.0   19.0   19.7   21.6   13.9   19.8
GMM&BIC   11.1   16.0   13.9    9.9    9.7    8.9   12.0

Table 2 shows the results of the two systems on the evaluation set. As on the development set, our system exhibits better performance than the system based on
GMM clustering: 12.0% compared to 19.8%. However, on the CCTV2 broadcast, GMM clustering alone shows better performance. This is because the stopping criterion of our system is still partly based on the minimum inter-cluster distance, so a global threshold is still needed; unfortunately, the global threshold we chose does not suit this broadcast well, causing the high error rate.

6 Conclusions

We proposed an approach based on a novel combination of the BIC criterion and adapted GMM clustering. This approach combines the advantages of the two methods, offering a stopping criterion applicable to the general case at a low computational cost. The method was evaluated on a Mandarin broadcast news corpus, and the results show the advantage of the proposed algorithm. In the future, we will focus on integrating speech recognition information into our system to train more accurate models for the speech segments, and on trying other inter-cluster distances, such as the GLR and the cross-cluster distance, combined with the BIC criterion.

Acknowledgments. This work is (partly) supported by the Chinese 973 Program (2004CB318106), the National Natural Science Foundation of China (10574140, 60535030), and the Beijing Municipal Science & Technology Commission (Z0005189040391).

References

1. Ben, M., Betser, M., Bimbot, F., Gravier, G., "Speaker Diarization using Bottom-up Clustering based on a Parameter-derived Distance between adapted GMMs", Proceedings of the International Conference on Spoken Language Processing, 2004.
2. S. S. Chen and P. S. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion", Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne, VA, Feb. 1998.
3. C. Barras, X. Zhu, S. Meignier, and J.-L. Gauvain, "Improving Speaker Diarization", Proc. DARPA RT-04, 2004.
4. H. Gish, M. Siu and R.
Rohlicek, "Segregation of Speakers for Speech Recognition and Speaker Identification", Proc. International Conference on Acoustics, Speech and Signal Processing, volume 2, pages 873-876, 1991.
5. D. A. Reynolds and P. Torres-Carrasquillo, "The MIT Lincoln Laboratory RT-04F Diarization Systems: Applications to Broadcast Audio and Telephone Conversations", RT-04F Workshop, Nov. 2004.
6. D. Reynolds, T. Quatieri, and R. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, vol. 10, no. 1-3, 2000.
7. Ran Xu, Jielin Pan, Yonghong Yan, "Audio Segmentation Method Via Metric-based Bayesian Information Criterion", the 8th National Conference on Man-Machine Speech Communication, 2005.
8. NIST, "Rich transcription spring 2003 evaluation plan", http://www.nist.gov/speech/tests/rt/rt2003/spring/docs/rt03-spring-eval-plan-v4.pdf