Speaker Diarization System Based on GMM and BIC


Tantan Liu 1, Xiaoxing Liu 1, Yonghong Yan 1
1 ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080
{tliu, xliu, yyan}@hccl.ioa.ac.cn

Abstract. This paper presents an approach to speaker diarization based on a novel combination of the Gaussian mixture model (GMM) and the standard Bayesian information criterion (BIC). The Gaussian mixture model provides a good description of the feature vector distribution, and BIC supplies a proper merging and stopping criterion. Our system combines the advantages of the two methods and yields favorable performance. Experiments carried out on Mandarin broadcast news data demonstrate the advantage of the proposed approach, which performs better than an approach based on GMM clustering alone.

Keywords: speaker diarization, clustering, GMM, BIC.

1 Introduction

Speaker diarization is the process of detecting the turns in speech caused by speaker changes and clustering the speech from the same speaker together; it thus provides useful information for structuring and indexing audio documents. By separating the input speech according to speaker identity, a diarization system can produce speaker-homogeneous speech clusters, allowing more accurate speaker models for the speaker recognition task in telephone conversations. In contrast to the speaker tracking task, where information about the speakers is already available, speaker diarization has no training data for the speakers, and the number of speakers in the input speech is not known in advance either.

The many approaches to speaker diarization differ mainly in the choice of the inter-cluster distance and the stopping criterion. In [1], an adapted Gaussian mixture model (GMM) is used to model the speech segments, the inter-cluster distance is computed from the parameters of the GMMs, and the distance threshold also acts as the stopping criterion.
In [2], the Bayesian information criterion is used both as the inter-cluster distance and as the stopping criterion. The cross log-likelihood ratio is proposed in [3], and the generalized likelihood ratio (GLR) in [4], as the inter-cluster distance. A set of anchor models is used in [5] to map segments into a vector set, with Euclidean distances and an ad hoc occupancy stopping criterion.

We propose a bottom-up clustering scheme integrating the adapted Gaussian mixture model and the Bayesian information criterion, which both describes the feature vector distribution well and provides a reliable stopping criterion. The system uses a novel grouping criterion and stopping criterion

based on an inter-distance derived from the adapted Gaussian mixture model and the Bayesian information criterion. We compare the performance of the proposed method with that of the GMM parameter-distance based approach.

The remainder of this paper is organized as follows: Section 2 describes the principle of the adapted Gaussian mixture model. Section 3 describes the principle of the Bayesian information criterion. Section 4 describes our system in detail and how the former approaches are integrated. Experimental results are presented in Section 5, followed by conclusions.

2 Clustering based on adapted Gaussian models

As in [1], the input speech is chopped into small segments in the hope that each segment contains only one speaker. Initially, each segment is a cluster and is modeled by a Gaussian mixture model. A universal background Gaussian mixture model (UBM) is trained on the whole input speech, and a cluster-dependent Gaussian mixture model, adapted from the UBM, is then obtained for each cluster.

2.1 Inter-distance based on adapted Gaussian mixture models

The probability density function of a K-component Gaussian mixture model for a random variable x is defined as:

    P(x | Λ) = Σ_{k=1}^{K} ϖ_k b_k(x; m_k, S_k)    (1)

where b_k(·) is a Gaussian density function and Λ = {ϖ_k, m_k, S_k} is the set of parameters; ϖ_k is the weight of the k-th component, with the constraint Σ_{k=1}^{K} ϖ_k = 1.

The universal background model is trained with the expectation-maximization (EM) algorithm, and the cluster-dependent Gaussian mixture model is adapted [6] from the UBM with the maximum a posteriori (MAP) algorithm. For the purpose of computing the inter-distance, only the means m_k are adapted; the weights ϖ_k and variances σ_k are left unchanged. The distance between two GMMs is:

    D(P_1, P_2) = Σ_{k=1}^{K} ϖ_k Σ_{d=1}^{D} (m_{1,k,d} − m_{2,k,d})² / σ²_{k,d}    (2)
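Because only the means are MAP-adapted while the weights and variances stay tied to the UBM, the inter-cluster distance reduces to a weighted, variance-normalized comparison of mean vectors. A minimal NumPy sketch (the function name and array layout are our own, not the paper's):

```python
import numpy as np

def gmm_mean_distance(weights, means1, means2, variances):
    """Inter-cluster distance between two GMMs that were MAP-adapted
    (means only) from the same UBM.

    weights   : (K,)   shared mixture weights, summing to 1
    means1/2  : (K, D) adapted mean vectors of the two cluster models
    variances : (K, D) shared diagonal variances from the UBM
    """
    # Per-component, per-dimension squared mean difference, normalized
    # by the shared UBM variance, then weighted by the mixture weight.
    sq = (means1 - means2) ** 2 / variances
    return float(np.sum(weights * sq.sum(axis=1)))
```

Since both models share the UBM's weights and variances, the distance is symmetric and costs only O(KD) per cluster pair.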

2.2 Clustering procedure

During the clustering procedure, the two clusters with the minimum distance are merged into a new cluster, and a new Gaussian mixture model is estimated for it. When the minimum distance rises above a threshold, the clustering procedure stops.

3 Bayesian information criterion

At the beginning of clustering, the short segments may not be able to support the large parameter set of a Gaussian mixture model, so the clusters with the minimum inter-cluster distance may not really come from the same speaker. In addition, the optimal threshold varies from one input speech to another. To address this, we use the Bayesian information criterion [2] as a merging and stopping criterion in the clustering step.

3.1 The principle of the Bayesian information criterion

Generally, let X = {x_i, i = 1, ..., N} be the feature vectors of the input speech, and let M be a candidate parametric model. The BIC criterion is defined as:

    BIC(M) = ln L(X, M) − λ (#M / 2) log N    (3)

where L(X, M) is the likelihood of the input speech given the model M, #M is the number of parameters in M, and N is the sample size of the input speech.

3.2 Merging criterion

Assume two segments are modeled by Gaussian models N(μ_1, Σ_1) and N(μ_2, Σ_2) separately, with sample sizes N_1 and N_2, and the merged segment is modeled by a single Gaussian N(μ, Σ) with sample size N_1 + N_2. The increase of the BIC value is:

    ΔBIC = (N_1 + N_2) log |Σ| − N_1 log |Σ_1| − N_2 log |Σ_2| − λP    (4)

where λ is the penalty weight and the penalty P is:

    P = (1/2) (d + (1/2) d (d + 1)) log N    (5)

where d is the feature vector dimension. Taking N = N_1 + N_2 in (5) is referred to as a local BIC penalty, while in general taking the size of the whole cluster set, N = Σ_{k=1}^{K} N_k, is

referred to as a global BIC penalty. According to [3], the local BIC penalty appears to be the better merging criterion. If the increase of BIC in equation (4) is negative, the two segments are deemed to come from the same speaker and should be merged.

4 Clustering based on the combination of GMM and BIC

Our system, shown in Figure 1, is based on the combination of the Gaussian mixture model and the Bayesian information criterion.

[Figure 1 flowchart: input speech → chop into small segments → gender classification → train a small GMM for each segment → iteratively merge the cluster pair with the minimum distance and negative delta BIC, re-estimating the GMMs, until no negative delta BIC remains or the cluster count is small enough → train a large GMM for each cluster → cluster recombination → re-classification → output diarization.]

Fig. 1. Diarization system based on GMM and delta BIC
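The merging test of equations (4) and (5) can be computed directly from the sample statistics of the two segments. A sketch under the local penalty N = N_1 + N_2, assuming full-covariance single Gaussians (helper name ours):

```python
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """Increase of BIC (equation 4) when merging segments x1, x2 of
    shape (N_i, d), each modeled by a full-covariance Gaussian.
    A negative value suggests both segments come from the same
    speaker and should be merged.  Uses the local penalty N = N1+N2.
    """
    n1, n2 = len(x1), len(x2)
    n, d = n1 + n2, x1.shape[1]

    def logdet_cov(a):
        # log-determinant of the ML (biased) covariance estimate
        return np.linalg.slogdet(np.cov(a, rowvar=False, bias=True))[1]

    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)  # equation (5)
    return (n * logdet_cov(np.vstack([x1, x2]))
            - n1 * logdet_cov(x1) - n2 * logdet_cov(x2)
            - lam * penalty)
```

With λ = 1 this is the standard BIC test; the paper tunes λ on development data, which trades merge aggressiveness against cluster purity.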

4.1 Segmentation

The aim of this step is to detect likely change points in the input speech [7]. A pair of sliding windows is applied to the audio feature vector stream, and the feature vectors within each window are modeled by two separate single Gaussian models. The distance between the two Gaussian models is calculated with the Bhattacharyya distance. Points with a local maximum of this distance are detected as likely change points, and the input speech is segmented by them into small, acoustically homogeneous segments.

4.2 Gender classification

Classification is done by maximum likelihood classification with GMMs for male, female, noise and music. The GMMs, each with 64 Gaussians, were trained on about 1 hour of acoustic data from CCTV broadcast news.

4.3 Clustering

In the clustering process, each cluster is modeled by a GMM adapted from the UBM as described in Section 2. At each iteration, the pair of clusters with the minimum inter-cluster distance and a negative delta BIC is merged. Here the minimum inter-cluster distance is the smallest of the inter-cluster distances that also have a negative delta BIC; distances with positive delta BIC are excluded. If the chosen minimum inter-cluster distance is above the threshold, or all delta BIC values are positive, the clustering process stops. The threshold is determined on the development data and used on the test set, to make sure the size of each cluster is large enough to support the large GMM model.

4.4 Re-clustering

At the beginning of the clustering process, the GMMs used to model the clusters are trained on short segments with a limited parameter set per cluster: a GMM with 32 diagonal components. As the clusters grow during the process, a more complex GMM is needed.
Furthermore, the former clustering procedure tends to split a speaker's speech recorded under different background conditions, so cepstral mean normalization is used to mitigate the background effects. Here we use a 128-component GMM for each cluster; the clustering and stopping criteria are the same as in Section 4.3, except that the stopping threshold is optimized on the development data and used on the test set to determine the expected number of speakers in the input speech.

4.5 Re-classification

In the last step of the system, each speech segment is reclassified by maximum likelihood classification with the final Gaussian mixture model of each cluster.
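The interplay of the GMM distance and the delta BIC veto in Section 4.3 is essentially a greedy agglomeration loop. A simplified sketch of the control flow (the `distance` and `delta_bic` callables, names ours, stand in for equations (2) and (4)):

```python
import numpy as np

def agglomerate(clusters, distance, delta_bic, threshold):
    """Bottom-up clustering in the style of Section 4.3.

    clusters  : list of (N_i, d) feature arrays, one per initial segment
    distance  : callable(c1, c2) -> inter-cluster distance
    delta_bic : callable(c1, c2) -> BIC increase of merging c1 and c2
    threshold : stop when the best mergeable distance exceeds this

    Each iteration merges the closest pair whose delta BIC is negative;
    pairs with positive delta BIC are never merge candidates.  Stops
    when no pair has a negative delta BIC or the best distance is
    above the threshold.
    """
    clusters = list(clusters)
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                if delta_bic(clusters[i], clusters[j]) >= 0:
                    continue                      # BIC vetoes this merge
                d = distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best is None or best[0] > threshold:   # stopping criterion
            return clusters
        _, i, j = best
        merged = np.vstack([clusters[i], clusters[j]])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

In the real system a new GMM would be re-estimated for each merged cluster before the next iteration; here the merged feature array stands in for that model.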

5 Experimental Results

The data for development and evaluation were drawn from Mandarin broadcast news. Each set contains a total of six broadcasts, corresponding to 1 hour and 30 minutes of CCTV broadcast news and 1 hour of Sichuan Television (SCTV) broadcast news.

5.1 Error Measure

Diarization performance is evaluated according to the performance measures defined by NIST for the Rich Transcription 2003 evaluation [8]. The output of a diarization system is a set of hypothesized speakers with the beginning and end times of each speaker's speech. An optimal mapping of the reference speakers to the hypothesis speakers is performed to maximize the overlap of the reference and mapped hypothesis speakers.

5.2 Results

Table 1 shows the results of the two systems on the development set. The first row shows the error rates of the adapted GMM distance baseline; the second row shows the error rates of our system. The combination of GMM and delta BIC yields a significant improvement over the system based on GMM alone: 7.2% compared to 14.0%. Furthermore, the table also shows that for the GMM-only system it is difficult to find a global threshold suiting all input speech: with the threshold chosen in our experiments, some broadcasts have a low error rate while others have a rather high one. With the BIC criterion added, the system can better determine when to stop the agglomerative clustering.

Table 1. Diarization error rates for each broadcast on the development set achieved by the baseline system and the proposed system

           CCTV1  CCTV2  CCTV3  SCTV1  SCTV2  SCTV3   All
GMM         11.5   21     14.7   11.4   10.9   12.1  14.0
GMM&BIC      7.5    4.5    9.7    6.5    8.1    6.7   7.2

Table 2.
Diarization error rates for each broadcast on the evaluation set achieved by the baseline system and the proposed system

           CCTV1  CCTV2  CCTV3  SCTV1  SCTV2  SCTV3   All
GMM         30.4   13     19.0   19.7   21.6   13.9  19.8
GMM&BIC     11.1   16.0   13.9    9.9    9.7    8.9  12.0

Table 2 shows the results of the two systems on the evaluation set. As on the development set, our system performs better than the system based on

GMM clustering: 12.0% compared to 19.8%. However, on the CCTV2 broadcast the GMM-only clustering performs better. This is because the stopping criterion of our system still partly relies on the minimum inter-cluster distance and its global threshold, and the threshold we chose does not suit this broadcast well, causing the high error rate.

6 Conclusions

We proposed an approach based on a novel combination of the BIC criterion and adapted GMM clustering. The approach combines the advantages of the two methods, with a stopping criterion applicable to the general case and a low computational cost. The method was evaluated on a Mandarin broadcast news corpus, and the results show the advantage of the proposed algorithm. In the future, we will focus on integrating speech recognition information into the system to train more accurate models for the speech segments, and on trying other inter-cluster distances, such as GLR and the cross-cluster distance, combined with the BIC criterion.

Acknowledgments. This work is (partly) supported by the Chinese 973 program (2004CB318106), the National Natural Science Foundation of China (10574140, 60535030), and the Beijing Municipal Science & Technology Commission (Z0005189040391).

References

1. M. Ben, M. Betser, F. Bimbot and G. Gravier, "Speaker Diarization using Bottom-up Clustering based on a Parameter-derived Distance between adapted GMMs", Proc. International Conference on Spoken Language Processing, 2004.
2. S.S. Chen and P.S. Gopalakrishnan, "Speaker, Environment and Channel Change Detection and Clustering via the Bayesian Information Criterion", Proc. DARPA Broadcast News Transcription and Understanding Workshop, Landsdowne, VA, Feb. 1998.
3. C. Barras, X. Zhu, S. Meignier and J.-L. Gauvain, "Improving Speaker Diarization", Proc. DARPA RT-04, 2004.
4. H. Gish, M. Siu and R.
Rohlicek, "Segregation of Speakers for Speech Recognition and Speaker Identification", Proc. International Conference on Acoustics, Speech and Signal Processing, vol. 2, pages 873-876, 1991.
5. D.A. Reynolds and P. Torres-Carrasquillo, "The MIT Lincoln Laboratory RT-04F Diarization Systems: Applications to Broadcast Audio and Telephone Conversations", RT-04F Workshop, Nov. 2004.
6. D. Reynolds, T. Quatieri and R. Dunn, "Speaker verification using adapted Gaussian mixture models", Digital Signal Processing, vol. 10, no. 1-3, 2000.
7. Ran Xu, Jielin Pan and Yonghong Yan, "Audio Segmentation Method Via Metric-based Bayesian Information Criterion", the 8th National Conference on Man-Machine Speech Communication, 2005.
8. NIST, "Rich transcription spring 2003 evaluation plan", http://www.nist.gov/speech/tests/rt/rt2003/spring/docs/rt03-spring-eval-plan-v4.pdf