Comparative Evaluation of Feature Normalization Techniques for Speaker Verification


Md Jahangir Alam 1,2, Pierre Ouellet 1, Patrick Kenny 1, Douglas O'Shaughnessy 2
1 CRIM, Montreal, Canada {Janagir.Alam, Pierre.Ouellet, Patrick.Kenny}@crim.ca
2 INRS-EMT, University of Quebec, Montreal, Canada dougo@emt.inrs.ca

Abstract. This paper investigates several feature normalization techniques for use in an i-vector speaker verification system based on a mixture probabilistic linear discriminant analysis (PLDA) model. The objective of feature normalization is to compensate for the effects of environmental mismatch. We study the short-time Gaussianization (STG), short-time mean and variance normalization (STMVN), and short-time mean and scale normalization (STMSN) techniques, and compare their performance on the telephone speech (det5) and microphone speech (det1, det2, det3 and det4) of the NIST SRE 2010 corpora. Experimental results show that the performance of the STMVN and STMSN techniques is comparable to that of the STG technique.

Keywords: Speaker verification, feature normalization, STG, STMVN.

1 Introduction

Most state-of-the-art speaker verification systems perform well in controlled conditions, where data is collected in reasonably clean environments. Acoustic mismatch between training and testing environments can severely degrade their performance, and this degradation has been a barrier to the deployment of speaker recognition technologies. Feature normalization strategies are employed in speaker recognition systems to compensate for the effects of environmental mismatch. These techniques are preferred because they require neither a priori knowledge of the environment nor adaptation to it.
Most normalization techniques are applied as a post-processing scheme on the Mel-frequency cepstral coefficient (MFCC) speech features. They can be classified as model-based or data distribution-based. In model-based normalization, certain statistical properties of speech, such as the mean, variance and higher moments, are normalized to reduce the residual mismatch in the feature vectors. Cepstral mean normalization (CMN), mean and variance normalization (MVN), STMVN and STMSN fall into this category. Data distribution-based techniques aim at normalizing the feature distribution to a reference distribution. STG and histogram normalization fall into this category. A number of feature normalization techniques have been proposed in the past for speaker verification systems, including feature warping [1], STG [2], CMN [3], and RASTA filtering [4-5].

In this paper, our goal is to perform a comparative evaluation of STG, STMVN and a new method called STMSN, which is similar to STMVN, using a mixture PLDA model in the i-vector space for speaker verification. For our experiments, we use the latest NIST 2010 SRE benchmark data. We use a gender-independent i-vector extractor and then form a mixture PLDA model by training and combining two gender-dependent models as in [6], where the gender label is treated as a latent (or hidden) variable.

2 Feature Normalization Technique

In feature normalization techniques, the components of the fixed feature vector are scaled or warped so as to enable more effective modeling of speaker differences. Most normalization techniques are applied in the cepstral domain, as shown in figure 1. Speaker verification systems generally use acoustic front-ends similar to those used in speech recognition systems: pre-processing (which includes DC removal and pre-emphasis), short-time spectrum estimation, Mel-filter bank integration, computation of cepstral coefficients via the DCT, appending delta and double delta coefficients, removing silence frames, and then feature normalization.

Fig. 1. Block diagram of MFCC feature extraction with feature normalization as a post-processing scheme. (Block labels: speech signal, pre-processing, framing and windowing, STFT and spectrum estimation, Mel-filter bank integration, logarithmic nonlinearity, DCT, MFCC, append delta and double delta, silence removal with VAD labels, feature normalization with lag size, features.)

CMN (or MVN) is normally performed over the whole utterance, under the assumption that the channel effect is constant over the entire utterance. Normalizing over the entire utterance is also not feasible in real-time applications, as it causes an unnecessarily long processing delay. To relax this assumption and to reduce the processing delay, MFCC features are normalized over a sliding window of 3-5 s duration. Here we use a 3 s sliding window (i.e., 300 frames for a frame shift of 10 ms) for all three methods. The feature vector to be normalized is located at the centre of the sliding window.

2.1 Short-time Gaussianization (STG)

STG [2] aims at modifying the short-time feature distribution to follow a reference distribution, for example the standard normal distribution. It starts with a global linear transformation of the features, followed by short-time windowed cumulative distribution function (CDF) matching. The linear transformation in the feature space leads to local independence or decorrelation. If $X$ is the original feature set and $A$ is the transformation matrix, then STG can be implemented using the following two steps [2]:

1. Linearly transform the original features using $Y = AX$.
2. Apply short-time windowed feature warping $T$ on $Y$: $\hat{X} = T(Y)$.

In STG each feature vector is warped independently [2]. STG performs better than the feature warping technique [1].

2.2 Short-time Mean and Variance Normalization (STMVN)

In the short-time mean and variance normalization (STMVN) technique, the $k$-th cepstral coefficient of the $m$-th frame, $C(m,k)$, is normalized as

\[ C_{stmvn}(m,k) = \frac{C(m,k) - \mu_{st}(m,k)}{\sigma_{st}(m,k)}, \]

where $m$ and $k$ are the frame index and the cepstral coefficient index, respectively, and $L$ is the sliding window length in frames. $\mu_{st}(m,k)$ and $\sigma_{st}(m,k)$ are the short-time mean and standard deviation, respectively, defined by

\[ \mu_{st}(m,k) = \frac{1}{L} \sum_{j=m-L/2}^{m+L/2} C(j,k), \]

\[ \sigma_{st}^{2}(m,k) = \frac{1}{L} \sum_{j=m-L/2}^{m+L/2} \left( C(j,k) - \mu_{st}(m,k) \right)^{2}. \]

2.3 Short-time Cepstral Mean and Scale Normalization (STMSN)

Given a lower bound $b_l$ and an upper bound $b_u$ of a feature component $x$, scale normalization can be defined as

\[ \hat{x}_{sn} = \frac{x - b_l}{b_u - b_l}. \quad (1) \]

$\hat{x}_{sn}$ is in the range [0,1]. MVN transforms the feature component $x$ into a random variable with zero mean and unit variance:

\[ \hat{x}_{mvn} = \frac{x - \mu}{\sigma}, \quad (2) \]

where $\mu$ and $\sigma$ are the sample mean and standard deviation of the feature, respectively. From (1) and (2) we define a new normalization technique, called the mean and scale normalization (MSN) technique, as

\[ \hat{x}_{msn} = \frac{x - \mu}{b_u - b_l}. \quad (3) \]

In the short-time mean and scale normalization (STMSN) technique, the $k$-th cepstral coefficient of the $m$-th frame, $C(m,k)$, is normalized as

\[ C_{stmsn}(m,k) = \frac{C(m,k) - \mu_{st}(m,k)}{d_{st}(m,k)}, \]

where $\mu_{st}(m,k)$ and $d_{st}(m,k)$ are the short-time mean and the short-time difference between the upper and lower bounds, respectively, defined as

\[ \mu_{st}(m,k) = \frac{1}{L} \sum_{j=m-L/2}^{m+L/2} C(j,k), \]

\[ d_{st}(m,k) = \max_{m-L/2 \le j \le m+L/2} C(j,k) - \min_{m-L/2 \le j \le m+L/2} C(j,k). \]

3 I-vector Framework for Speaker Verification

The i-vector framework for speaker verification has set a new performance standard in the research field. The i-vector extractor converts an entire speech recording into a low-dimensional feature vector called an i-vector [7-9]. The i-vector extractors described in [7-9] are gender dependent and are followed by gender-dependent generative modeling stages. As in [6], we use a gender-independent i-vector extractor and a mixture of male and female probabilistic linear discriminant analysis (PLDA) models, where the gender label is treated as a latent variable. The i-vector framework used in this paper consists of the following stages: i-vector extraction, generative modeling of i-vectors, and scoring (likelihood ratio computation), as described in [6]. A detailed description of mixture PLDA model-based i-vector speaker verification can be found in [6].
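The STMVN and STMSN definitions of Sections 2.2 and 2.3 can be sketched in a few lines of Python. This is an illustration only, not the authors' MATLAB implementation. The function names `stmvn` and `stmsn`, the helper `_window`, the truncation of the window at the utterance edges, the division by the actual number of frames in the window, and the small constant `eps` that guards against division by zero are all assumptions that the paper does not specify.

```python
import math
from typing import List, Sequence


def _window(track: Sequence[float], m: int, half: int) -> Sequence[float]:
    """Frames j with m - half <= j <= m + half, truncated at the edges (assumption)."""
    return track[max(0, m - half): min(len(track), m + half + 1)]


def stmvn(track: Sequence[float], win_len: int = 300, eps: float = 1e-10) -> List[float]:
    """Short-time mean and variance normalization of one cepstral coefficient track."""
    half = win_len // 2
    out = []
    for m, c in enumerate(track):
        w = _window(track, m, half)
        mu = sum(w) / len(w)                                    # short-time mean
        sigma = math.sqrt(sum((x - mu) ** 2 for x in w) / len(w))  # short-time std
        out.append((c - mu) / (sigma + eps))                    # eps is an assumption
    return out


def stmsn(track: Sequence[float], win_len: int = 300, eps: float = 1e-10) -> List[float]:
    """Short-time mean and scale normalization of one cepstral coefficient track."""
    half = win_len // 2
    out = []
    for m, c in enumerate(track):
        w = _window(track, m, half)
        mu = sum(w) / len(w)       # short-time mean
        d = max(w) - min(w)        # short-time range (upper minus lower bound)
        out.append((c - mu) / (d + eps))
    return out


if __name__ == "__main__":
    toy_track = [1.0, 2.0, 3.0, 4.0, 5.0]   # one coefficient over five frames
    print(stmvn(toy_track, win_len=2))
    print(stmsn(toy_track, win_len=2))
```

Each cepstral coefficient index k is processed as a separate track, so a 60-dimensional feature stream would call the function once per dimension. STMSN replaces the square root and the sum of squares with a max and a min over the window, so it needs less arithmetic per frame than STMVN.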

4 Experiments

4.1 Experimental Setup

We conducted experiments on the extended core-core condition of the NIST 2010 SRE extended list. The performance of the feature normalization techniques was evaluated using the following evaluation metrics: the Equal Error Rate (EER), the old normalized minimum detection cost function (DCF Old) and the new normalized minimum detection cost function (DCF New). DCF Old and DCF New correspond to the evaluation metrics for the NIST SRE in 2008 and 2010, respectively.

4.1.1 Feature Extraction & UBM training

For our experiments, we use 20 MFCC features (including log-energy) augmented with their delta and double delta coefficients, making 60-dimensional MFCC feature vectors. The analysis frame length is 25 ms with a frame shift of 10 ms. Delta and double delta coefficients were calculated using a 2-frame window. Silence frames are then removed according to the VAD labels. After that we apply the feature normalization techniques (STG, STMVN and STMSN), which use a 300-frame window. We train a gender-independent, full-covariance Universal Background Model (UBM), a Gaussian Mixture Model (GMM) with 256 components. NIST SRE 2004 and 2005 telephone data were used for training the UBM for our system.

4.1.2 Training and extraction of i-vectors

Our gender-independent i-vector extractor is of dimension 800. After training the gender-independent GMM-UBM, we train the i-vector extractor using the Baum-Welch (BW) statistics extracted from the following data: LDC release of Switchboard II - phase 2 and phase 3, Switchboard Cellular - part 1 and part 2, Fisher data, NIST SRE 2004 and 2005 telephone data, NIST SRE 2005 and 2005 microphone data and NIST SRE 2008 interview development microphone data.
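The Equal Error Rate named among the evaluation metrics in Section 4.1 is the operating point at which the miss rate equals the false-alarm rate. A minimal sketch of how it could be estimated from trial scores follows. It is not the NIST scoring tool. The function name `equal_error_rate`, the acceptance rule (accept when score >= threshold) and the choice to average the two rates at the threshold where they are closest are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def equal_error_rate(target_scores: Sequence[float],
                     nontarget_scores: Sequence[float]) -> Tuple[float, float]:
    """Return (EER estimate, threshold) where miss and false-alarm rates are closest."""
    # Candidate thresholds: every observed score (assumption: no interpolation).
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    best = None
    for t in thresholds:
        # A trial is accepted when its score is >= t.
        p_miss = sum(s < t for s in target_scores) / len(target_scores)
        p_fa = sum(s >= t for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(p_miss - p_fa)
        if best is None or gap < best[0]:
            best = (gap, (p_miss + p_fa) / 2.0, t)
    return best[1], best[2]


if __name__ == "__main__":
    # Toy scores, invented for illustration only.
    targets = [2.0, 3.0, 4.0, 5.0]
    nontargets = [0.0, 1.0, 2.5, 1.5]
    eer, threshold = equal_error_rate(targets, nontargets)
    print(f"EER = {100 * eer:.1f}% at threshold {threshold}")
```

The loop rescans all scores for every threshold, which is quadratic in the number of trials. For evaluation lists as large as the extended NIST trial lists, a single pass over sorted scores would be the more practical design.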
In order to reduce the dimension of the i-vectors, a Linear Discriminant Analysis (LDA) projection matrix is estimated from the BW statistics by maximizing the following objective function:

\[ P_{LDA} = \arg\max_{P} \frac{P^{T} \Sigma_b P}{P^{T} \Sigma_w P}, \]

where $\Sigma_b$ and $\Sigma_w$ represent the between- and within-class scatter matrices, respectively. $\Sigma_b$ is estimated using all telephone training data excluding Fisher data, and $\Sigma_w$ is estimated using all telephone and microphone training data excluding Fisher data. An optimal reduced dimension of 150 is determined empirically. We then extract 150-dimensional i-vectors for all training data excluding Fisher data by applying this transformation matrix to the 800-dimensional i-vectors. For the test data, first BW statistics and then 150-dimensional i-vectors are extracted following a similar procedure, using the same projection matrix. We also normalize the length of the i-vectors, as it has been found that normalizing the length of the i-vectors after mapping by the estimated LDA projection matrix helps the Gaussian PLDA model to give the same results as the heavy-tailed PLDA model [10], i.e., the PLDA model with heavy-tailed prior distributions [8].

4.1.3 Training the PLDA model

We train two PLDA models, one for males and another for females. These models were trained using all the telephone and microphone training i-vectors; we then combine these PLDA models to form a mixture of PLDA models in i-vector space [6].

4.2 Results

Feature normalization techniques are evaluated on the latest NIST SRE 2010 corpora using the i-vector speaker verification system. Results are reported for five evaluation conditions corresponding to det conditions 1-5 (as shown in table 1) in the evaluation plan [11]. Table 2 presents EERs for all normalization techniques. Tables 3 and 4 show the DCF Old and DCF New, respectively. In terms of the EER and DCF Old, STG performs better than STMVN and STMSN. In the case of DCF New, the STMSN technique is found to perform better. The execution time (on a 64-bit MATLAB) of the STG technique is seconds to normalize the features extracted from a speech signal of 139 seconds duration, whereas the execution times of the STMSN and STMVN techniques are 1.71 and 3.01 seconds, respectively. STMVN and STMSN are very simple methods, whereas STG is considerably more complex to implement and takes more time to normalize MFCC features.

Table 1: Evaluation conditions (extended core-core) for the NIST 2010 SRE task.

Condition   Task
det1        Interview in training and test, same Mic.
det2        Interview in training and test, different Mic.
det3        Interview in training and normal vocal effort phone call over Tel channel in test
det4        Interview in training and normal vocal effort phone call over Mic channel in test
det5        Normal vocal effort phone call in training and test, different Tel

Table 2. Male and female det1 to det5 speaker verification results using a mixture PLDA model for the STG, STMVN and STMSN systems, measured by EER (%). For each row the best EER is in boldface.
Columns: STG, STMVN, STMSN. Rows: female and male, det1 to det5 each.

Table 3. Male and female det1 to det5 speaker verification results using a mixture PLDA model for the STG, STMVN and STMSN systems, measured by the normalized minimum DCF (DCF Old). For each row the best DCF Old is in boldface.
Columns: STG, STMVN, STMSN. Rows: female and male, det1 to det5 each.

Table 4. Male and female det1 to det5 speaker verification results using a mixture PLDA model for the STG, STMVN and STMSN systems, measured by the normalized minimum DCF (DCF New). For each row the best DCF New is in boldface.
Columns: STG, STMVN, STMSN. Rows: female and male, det1 to det5 each.

5 Conclusion

In this paper a simple feature normalization method, called STMSN, was introduced and its performance, in the context of an i-vector speaker verification system, was compared to that of the STG and STMVN techniques. Both the STMVN and STMSN methods provide speaker verification results comparable to those of STG when using i-vectors. STG is considerably more complex and takes longer to normalize MFCC feature vectors than STMSN and STMVN.

References

1. Pelecanos, J., Sridharan, S.: Feature warping for robust speaker verification. In: Proc. Speaker Odyssey: the Speaker Recognition Workshop, Crete, Greece, pp (2001).
2. Xiang, B., Chaudhari, U., Navratil, J., Ramaswamy, G., Gopinath, R.: Short-time Gaussianization for robust speaker verification. In: IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 1, Orlando, Florida, USA, pp (2002).
3. Furui, S.: Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoustics, Speech Signal Process., 29 (2), (1981).
4. Atal, B.: Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification. JASA, 55 (6), (1974).
5. Hermansky, H., Morgan, N.: RASTA processing of speech. IEEE Trans. Speech Audio Process., 2 (4), (1994).
6. Senoussaoui, M., Kenny, P., Brümmer, N., de Villiers, E., Dumouchel, P.: Mixture of PLDA models in i-vector space for gender independent speaker recognition. In: Interspeech 2011 (to appear), Florence, Italy, August (2011).
7. Dehak, N., Kenny, P., Dehak, R., Dumouchel, P., Ouellet, P.: Front-end factor analysis for speaker verification. IEEE Trans. on Audio, Speech and Language Processing, vol. 19(4), pp (2011).
8. Kenny, P.: Bayesian speaker verification with heavy tailed priors. In: The Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, June (2010).
9. Brümmer, N., de Villiers, E.: The speaker partitioning problem. In: The Odyssey Speaker and Language Recognition Workshop, Brno, Czech Republic, June (2010).
10. Garcia-Romero, D., Espy-Wilson, C. Y.: Analysis of i-vector length normalization in speaker recognition systems. In: Interspeech 2011 (to appear), Florence, Italy, August (2011).
11. National Institute of Standards and Technology, NIST Speaker Recognition Evaluation,

The Approach of Mean Shift based Cosine Dissimilarity for Multi-Recording Speaker Clustering

The Approach of Mean Shift based Cosine Dissimilarity for Multi-Recording Speaker Clustering The Approach of Mean Shift based Cosine Dissimilarity for Multi-Recording Speaker Clustering 1 D. Jareena Begum, 2 K Rajendra Prasad, 3 M Suleman Basha 1 M.Tech in SE, RGMCET, Nandyal 2 Assoc Prof, Dept

More information

PDF hosted at the Radboud Repository of the Radboud University Nijmegen

PDF hosted at the Radboud Repository of the Radboud University Nijmegen PDF hosted at the Radboud Repository of the Radboud University Nijmegen The following full text is a publisher's version. For additional information about this publication click this link. http://hdl.handle.net/2066/94752

More information

SUT Submission for NIST 2016 Speaker Recognition Evaluation: Description and Analysis

SUT Submission for NIST 2016 Speaker Recognition Evaluation: Description and Analysis The 2017 Conference on Computational Linguistics and Speech Processing ROCLING 2017, pp. 276-286 The Association for Computational Linguistics and Chinese Language Processing SUT Submission for NIST 2016

More information

arxiv: v1 [cs.sd] 8 Jun 2017

arxiv: v1 [cs.sd] 8 Jun 2017 SUT SYSTEM DESCRIPTION FOR NIST SRE 2016 Hossein Zeinali 1,2, Hossein Sameti 1 and Nooshin Maghsoodi 1 1 Sharif University of Technology, Tehran, Iran 2 Brno University of Technology, Speech@FIT and IT4I

More information

IMPROVED SPEAKER RECOGNITION USING DCT COEFFICIENTS AS FEATURES. Mitchell McLaren, Yun Lei

IMPROVED SPEAKER RECOGNITION USING DCT COEFFICIENTS AS FEATURES. Mitchell McLaren, Yun Lei IMPROVED SPEAKER RECOGNITION USING DCT COEFFICIENTS AS FEATURES Mitchell McLaren, Yun Lei Speech Technology and Research Laboratory, SRI International, California, USA {mitch,yunlei}@speech.sri.com ABSTRACT

More information

Vulnerability of Voice Verification System with STC anti-spoofing detector to different methods of spoofing attacks

Vulnerability of Voice Verification System with STC anti-spoofing detector to different methods of spoofing attacks Vulnerability of Voice Verification System with STC anti-spoofing detector to different methods of spoofing attacks Vadim Shchemelinin 1,2, Alexandr Kozlov 2, Galina Lavrentyeva 2, Sergey Novoselov 1,2

More information

Speaker Verification with Adaptive Spectral Subband Centroids

Speaker Verification with Adaptive Spectral Subband Centroids Speaker Verification with Adaptive Spectral Subband Centroids Tomi Kinnunen 1, Bingjun Zhang 2, Jia Zhu 2, and Ye Wang 2 1 Speech and Dialogue Processing Lab Institution for Infocomm Research (I 2 R) 21

More information

Improving Robustness to Compressed Speech in Speaker Recognition

Improving Robustness to Compressed Speech in Speaker Recognition INTERSPEECH 2013 Improving Robustness to Compressed Speech in Speaker Recognition Mitchell McLaren 1, Victor Abrash 1, Martin Graciarena 1, Yun Lei 1, Jan Pe sán 2 1 Speech Technology and Research Laboratory,

More information

Trial-Based Calibration for Speaker Recognition in Unseen Conditions

Trial-Based Calibration for Speaker Recognition in Unseen Conditions Trial-Based Calibration for Speaker Recognition in Unseen Conditions Mitchell McLaren, Aaron Lawson, Luciana Ferrer, Nicolas Scheffer, Yun Lei Speech Technology and Research Laboratory SRI International,

More information

SPEECH FEATURE EXTRACTION USING WEIGHTED HIGHER-ORDER LOCAL AUTO-CORRELATION

SPEECH FEATURE EXTRACTION USING WEIGHTED HIGHER-ORDER LOCAL AUTO-CORRELATION Far East Journal of Electronics and Communications Volume 3, Number 2, 2009, Pages 125-140 Published Online: September 14, 2009 This paper is available online at http://www.pphmj.com 2009 Pushpa Publishing

More information

FOUR WEIGHTINGS AND A FUSION: A CEPSTRAL-SVM SYSTEM FOR SPEAKER RECOGNITION. Sachin S. Kajarekar

FOUR WEIGHTINGS AND A FUSION: A CEPSTRAL-SVM SYSTEM FOR SPEAKER RECOGNITION. Sachin S. Kajarekar FOUR WEIGHTINGS AND A FUSION: A CEPSTRAL-SVM SYSTEM FOR SPEAKER RECOGNITION Sachin S. Kajarekar Speech Technology and Research Laboratory SRI International, Menlo Park, CA, USA sachin@speech.sri.com ABSTRACT

More information

SRE08 system. Nir Krause Ran Gazit Gennady Karvitsky. Leave Impersonators, fraudsters and identity thieves speechless

SRE08 system. Nir Krause Ran Gazit Gennady Karvitsky. Leave Impersonators, fraudsters and identity thieves speechless Leave Impersonators, fraudsters and identity thieves speechless SRE08 system Nir Krause Ran Gazit Gennady Karvitsky Copyright 2008 PerSay Inc. All Rights Reserved Focus: Multilingual telephone speech and

More information

Multi-modal Person Identification in a Smart Environment

Multi-modal Person Identification in a Smart Environment Multi-modal Person Identification in a Smart Environment Hazım Kemal Ekenel 1, Mika Fischer 1, Qin Jin 2, Rainer Stiefelhagen 1 1 Interactive Systems Labs (ISL), Universität Karlsruhe (TH), 76131 Karlsruhe,

More information

Bo#leneck Features from SNR- Adap9ve Denoising Deep Classifier for Speaker Iden9fica9on

Bo#leneck Features from SNR- Adap9ve Denoising Deep Classifier for Speaker Iden9fica9on Bo#leneck Features from SNR- Adap9ve Denoising Deep Classifier for Speaker Iden9fica9on TAN Zhili & MAK Man-Wai APSIPA 2015 Department of Electronic and Informa2on Engineering The Hong Kong Polytechnic

More information

Introducing I-Vectors for Joint Anti-spoofing and Speaker Verification

Introducing I-Vectors for Joint Anti-spoofing and Speaker Verification Introducing I-Vectors for Joint Anti-spoofing and Speaker Verification Elie Khoury, Tomi Kinnunen, Aleksandr Sizov, Zhizheng Wu, Sébastien Marcel Idiap Research Institute, Switzerland School of Computing,

More information

Client Dependent GMM-SVM Models for Speaker Verification

Client Dependent GMM-SVM Models for Speaker Verification Client Dependent GMM-SVM Models for Speaker Verification Quan Le, Samy Bengio IDIAP, P.O. Box 592, CH-1920 Martigny, Switzerland {quan,bengio}@idiap.ch Abstract. Generative Gaussian Mixture Models (GMMs)

More information

Supervector Compression Strategies to Speed up I-Vector System Development

Supervector Compression Strategies to Speed up I-Vector System Development Supervector Compression Strategies to Speed up I-Vector System Development Ville Vestman Tomi Kinnunen University of Eastern Finland Finland vvestman@cs.uef.fi tkinnu@cs.uef.fi system can be achieved [5

More information

Lec 08 Feature Aggregation II: Fisher Vector, Super Vector and AKULA

Lec 08 Feature Aggregation II: Fisher Vector, Super Vector and AKULA Image Analysis & Retrieval CS/EE 5590 Special Topics (Class Ids: 44873, 44874) Fall 2016, M/W 4-5:15pm@Bloch 0012 Lec 08 Feature Aggregation II: Fisher Vector, Super Vector and AKULA Zhu Li Dept of CSEE,

More information

GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS. Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1)

GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS. Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1) GYROPHONE RECOGNIZING SPEECH FROM GYROSCOPE SIGNALS Yan Michalevsky (1), Gabi Nakibly (2) and Dan Boneh (1) (1) Stanford University (2) National Research and Simulation Center, Rafael Ltd. 0 MICROPHONE

More information

Comparison of Clustering Methods: a Case Study of Text-Independent Speaker Modeling

Comparison of Clustering Methods: a Case Study of Text-Independent Speaker Modeling Comparison of Clustering Methods: a Case Study of Text-Independent Speaker Modeling Tomi Kinnunen, Ilja Sidoroff, Marko Tuononen, Pasi Fränti Speech and Image Processing Unit, School of Computing, University

More information

LARGE-SCALE SPEAKER IDENTIFICATION

LARGE-SCALE SPEAKER IDENTIFICATION 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) LARGE-SCALE SPEAKER IDENTIFICATION Ludwig Schmidt MIT Matthew Sharifi and Ignacio Lopez Moreno Google, Inc. ABSTRACT

More information

A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition

A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition Special Session: Intelligent Knowledge Management A Novel Template Matching Approach To Speaker-Independent Arabic Spoken Digit Recognition Jiping Sun 1, Jeremy Sun 1, Kacem Abida 2, and Fakhri Karray

More information

NON-UNIFORM SPEAKER NORMALIZATION USING FREQUENCY-DEPENDENT SCALING FUNCTION

NON-UNIFORM SPEAKER NORMALIZATION USING FREQUENCY-DEPENDENT SCALING FUNCTION NON-UNIFORM SPEAKER NORMALIZATION USING FREQUENCY-DEPENDENT SCALING FUNCTION S. V. Bharath Kumar Imaging Technologies Lab General Electric - Global Research JFWTC, Bangalore - 560086, INDIA bharath.sv@geind.ge.com

More information

ON THE EFFECT OF SCORE EQUALIZATION IN SVM MULTIMODAL BIOMETRIC SYSTEMS

ON THE EFFECT OF SCORE EQUALIZATION IN SVM MULTIMODAL BIOMETRIC SYSTEMS ON THE EFFECT OF SCORE EQUALIZATION IN SVM MULTIMODAL BIOMETRIC SYSTEMS Pascual Ejarque and Javier Hernando TALP Research Center, Department of Signal Theory and Communications Technical University of

More information

Multifactor Fusion for Audio-Visual Speaker Recognition

Multifactor Fusion for Audio-Visual Speaker Recognition Proceedings of the 7th WSEAS International Conference on Signal, Speech and Image Processing, Beijing, China, September 15-17, 2007 70 Multifactor Fusion for Audio-Visual Speaker Recognition GIRIJA CHETTY

More information

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification

Optimization of Observation Membership Function By Particle Swarm Method for Enhancing Performances of Speaker Identification Proceedings of the 6th WSEAS International Conference on SIGNAL PROCESSING, Dallas, Texas, USA, March 22-24, 2007 52 Optimization of Observation Membership Function By Particle Swarm Method for Enhancing

More information

STC ANTI-SPOOFING SYSTEMS FOR THE ASVSPOOF 2015 CHALLENGE

STC ANTI-SPOOFING SYSTEMS FOR THE ASVSPOOF 2015 CHALLENGE STC ANTI-SPOOFING SYSTEMS FOR THE ASVSPOOF 2015 CHALLENGE Sergey Novoselov 1,2, Alexandr Kozlov 2, Galina Lavrentyeva 1,2, Konstantin Simonchik 1,2, Vadim Shchemelinin 1,2 1 ITMO University, St. Petersburg,

More information

A text-independent speaker verification model: A comparative analysis

A text-independent speaker verification model: A comparative analysis A text-independent speaker verification model: A comparative analysis Rishi Charan, Manisha.A, Karthik.R, Raesh Kumar M, Senior IEEE Member School of Electronic Engineering VIT University Tamil Nadu, India

More information

MULTIPLE WINDOWED SPECTRAL FEATURES FOR EMOTION RECOGNITION

MULTIPLE WINDOWED SPECTRAL FEATURES FOR EMOTION RECOGNITION MULTIPLE WIDOWED SPECTRAL FEATURES FOR EMOTIO RECOGITIO Yazid Attabi 1,, Md Jahangir Alam 1, 3, Pierre Dumouchel, Patrick Kenny 1, Douglas O'Shaughnessy 3 1 Centre de recherche informatique de Montréal,

More information

Discriminative training and Feature combination

Discriminative training and Feature combination Discriminative training and Feature combination Steve Renals Automatic Speech Recognition ASR Lecture 13 16 March 2009 Steve Renals Discriminative training and Feature combination 1 Overview Hot topics

More information

MULTIMODAL PERSON IDENTIFICATION IN A SMART ROOM. J.Luque, R.Morros, J.Anguita, M.Farrus, D.Macho, F.Marqués, C.Martínez, V.Vilaplana, J.

MULTIMODAL PERSON IDENTIFICATION IN A SMART ROOM. J.Luque, R.Morros, J.Anguita, M.Farrus, D.Macho, F.Marqués, C.Martínez, V.Vilaplana, J. MULTIMODAL PERSON IDENTIFICATION IN A SMART ROOM JLuque, RMorros, JAnguita, MFarrus, DMacho, FMarqués, CMartínez, VVilaplana, J Hernando Technical University of Catalonia (UPC) Jordi Girona, 1-3 D5, 08034

More information

AN I-VECTOR BASED DESCRIPTOR FOR ALPHABETICAL GESTURE RECOGNITION

AN I-VECTOR BASED DESCRIPTOR FOR ALPHABETICAL GESTURE RECOGNITION 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) AN I-VECTOR BASED DESCRIPTOR FOR ALPHABETICAL GESTURE RECOGNITION You-Chi Cheng 1, Ville Hautamäki 2, Zhen Huang 1,

More information

Bengt J. Borgström, Student Member, IEEE, and Abeer Alwan, Senior Member, IEEE

Bengt J. Borgström, Student Member, IEEE, and Abeer Alwan, Senior Member, IEEE 1 A Low Complexity Parabolic Lip Contour Model With Speaker Normalization For High-Level Feature Extraction in Noise Robust Audio-Visual Speech Recognition Bengt J Borgström, Student Member, IEEE, and

More information

LibrarY of Complex ICA Algorithms (LYCIA) Toolbox v1.2. Walk-through

LibrarY of Complex ICA Algorithms (LYCIA) Toolbox v1.2. Walk-through LibrarY of Complex ICA Algorithms (LYCIA) Toolbox v1.2 Walk-through Josselin Dea, Sai Ma, Patrick Sykes and Tülay Adalı Department of CSEE, University of Maryland, Baltimore County, MD 212150 Updated:

More information

Pair-wise Distance Metric Learning of Neural Network Model for Spoken Language Identification

Pair-wise Distance Metric Learning of Neural Network Model for Spoken Language Identification INTERSPEECH 2016 September 8 12, 2016, San Francisco, USA Pair-wise Distance Metric Learning of Neural Network Model for Spoken Language Identification 2 1 Xugang Lu 1, Peng Shen 1, Yu Tsao 2, Hisashi

More information

Aditi Upadhyay Research Scholar, Department of Electronics & Communication Engineering Jaipur National University, Jaipur, Rajasthan, India

Aditi Upadhyay Research Scholar, Department of Electronics & Communication Engineering Jaipur National University, Jaipur, Rajasthan, India Analysis of Different Classifier Using Feature Extraction in Speaker Identification and Verification under Adverse Acoustic Condition for Different Scenario Shrikant Upadhyay Assistant Professor, Department
