Feature Extraction and Dimensionality Reduction in SVM Speaker Recognition


Thembisile Mazibuko, thembi@crg.ee.uct.ac.za
Daniel J. Mashao, daniel@ebe.uct.ac.za
Department of Electrical Engineering, University of Cape Town

Abstract: The Support Vector Machine (SVM) is a discriminative classifier which has achieved impressive results in several pattern recognition tasks. Its applicability is, however, limited by its computational expense: SVM training and testing times increase with increasing amounts of data. A possible solution is to reduce the number of computations within the SVM kernel by reducing the dimensionality of the observation vectors. In this paper we apply this concept of reducing the data dimensionality to decrease the complexity of an SVM-based speaker verification task, carried out on a subset of the NIST 2000 speaker recognition evaluation. The dimensionality reduction is performed by applying Principal Component Analysis to the feature vectors. The results show degradation in the performance of the SVM when feature extraction is applied. There is, however, also a significant decrease in the training and testing times.

Key Terms: speaker verification, support vector machine, feature transformation

INTRODUCTION

AUTOMATIC speaker recognition is the task of using machines to recognize people using speech as a biometric. The speech input is converted into a feature vector representation; Linear Predictive Coefficients (LPC) and Mel-Frequency Cepstral Coefficients (MFCC) are popular feature sets. These feature sets are used to create models which represent each speaker. In the testing phase the input speech is compared to the model and a classification engine is used to decide who the test speaker is.

The Support Vector Machine (SVM) is a discriminative classifier which has been successfully applied in different pattern recognition fields. In the speech processing field, the authors of [1] applied the SVM to a speaker and language recognition task on the NIST 2003 database with good results. In [2] the author applied the SVM to a speaker verification task and found that, when combined with a Gaussian Mixture Model (GMM) classifier, the SVM improved on the performance of the more popular classifier.

The power of the SVM lies in the implicit transformation of the input space to a higher-dimensional space. Because the transformation is performed implicitly, no computations are explicitly performed in the high-dimensional space. Intuitively, this means that the dimension of the input dataset does not affect the performance of the SVM. However, according to [3], the performance of the SVM depends on the radius of the data, which increases with the number of features. A major hindrance in the application of the SVM is that the classifier suffers from extremely long training times, especially as the size of the data, i.e. the number of observation vectors, increases.

Research studies have shown an improvement in the accuracy of discriminative classifiers when feature extraction techniques are applied [4,5]. The basis of these studies is that feature extraction can remove redundancies in the feature vectors by extracting and using only those features which are most relevant. Principal Component Analysis (PCA) is a relatively old and well-developed linear feature extraction technique which has been applied to several pattern recognition tasks. Pattern recognition tasks divide into two phases: feature analysis and classification. Feature extraction is the part of the feature analysis phase in which we attempt to reduce redundancy in the feature vectors. An attraction of techniques like PCA (others include Linear Discriminant Analysis and Independent Component Analysis) is their data dimensionality reduction capability.
In [5] the author proposed what he called the Reduced Dimensional Support Vector Machine (RDSVM), which applies feature extraction and dimensionality reduction techniques. The concept behind the RDSVM is that the computational burden of SVM classification can be decreased by reducing the number of computations within the classifier. It is generally neither possible nor desirable to reduce the number of observation vectors, as this might lead to loss of important information. However, SVM performance suffers from long training and testing times when the number of observation vectors is too high. The RDSVM therefore reduces the dimensionality of the observation vectors as a way of reducing the total number of SVM computations. In the experiments discussed in this paper, the effect of feature extraction on the accuracy and on the speed of SVM training and classification in a speaker verification task was studied.

The rest of this paper is organized as follows. Section I gives an introduction to the SVM classifier. In Section II we present the mathematical formulation of Principal Component Analysis. In Section III a discussion of speaker recognition is given. Section IV presents the experimental setup and the results of the experiments conducted. Section V is a discussion of the results, and Section VI is the conclusion and a short discussion of future work.

I. SUPPORT VECTOR MACHINE CLASSIFIER

The Support Vector Machine is a powerful discriminative classifier which maps the input onto a high-dimensional space by a mapping \Phi : \mathbb{R}^d \to I, and then finds an optimal hyperplane to separate the data in that space. This separating hyperplane is found by maximizing the distance to the closest patterns [6]. The new space is often referred to as the feature space. Here we present a simplified, general explanation of the SVM.

Suppose we have a binary classification problem, as shown in Figure 1, where each example belongs to either class +1 or class -1. The SVM seeks to maximize the margin between the two classes by finding the separating hyperplane which lies halfway between the data classes. We can, without loss of generality, consider the case of data that are not linearly separable. When the data are transformed by some non-linear transformation onto a higher-dimensional space, they spread out, allowing a separating hyperplane to be found in the feature space.

Figure 1: Finding the separating hyperplane for non-linearly separable data (a non-linear transformation maps data that are non-linearly separable in the input space into the feature space, where the separating hyperplane lies between the margin surfaces S1 and S2)

The support vectors are those data points that lie on S1 and S2, lie within the margin, or are misclassified. The hyperplane in the high-dimensional transform space corresponds to a complex decision surface in the input data space. The SVM allows for these misclassifications in training through a user-defined cost parameter C. This way the misclassifications are limited while over-fitting of the training data is still avoided [2,7]. In general the SVM is computed using the kernel trick, so that

    f(x) = \sum_{i=1}^{N} \alpha_i y_i K(s_i, x) + b    (1)

where K is some kernel function such that

    K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)    (2)

and the s_i are the support vectors.
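To make equation (1) concrete, the following is a minimal numpy sketch of the decision function, using an RBF kernel as one common choice of K; the function names and toy values are illustrative and not tied to any particular SVM implementation.

    import numpy as np

    def rbf_kernel(a, b, gamma=0.5):
        # K(a, b) = exp(-gamma * ||a - b||^2); gamma here is an arbitrary choice
        return np.exp(-gamma * np.sum((a - b) ** 2))

    def svm_decision(x, support_vectors, alphas, labels, bias, kernel=rbf_kernel):
        # Equation (1): f(x) = sum_i alpha_i * y_i * K(s_i, x) + b.
        # Only the support vectors enter the sum, so the cost per test vector
        # grows with their number and with the work done inside each kernel
        # evaluation, which in turn depends on the vector dimensionality.
        return sum(a * y * kernel(s, x)
                   for a, y, s in zip(alphas, labels, support_vectors)) + bias

    # Toy usage: the predicted class is the sign of f(x).
    sv = [np.array([0.0, 1.0]), np.array([1.0, 0.0])]
    f = svm_decision(np.array([0.5, 0.5]), sv,
                     alphas=[0.8, 0.8], labels=[+1, -1], bias=0.0)
    print(+1 if f >= 0 else -1)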
a.) SVM Characteristics

A distinguishing characteristic of the SVM is its strong foundation in statistical learning theory, which establishes a bound on the generalization error (the error rate of a learning machine on unseen data), thus improving the classification results for unseen patterns [6]. The SVM minimizes this bound by maximizing the margin. Also, since the SVM projects the input onto a higher-dimensional space, the margin maximization is independent of the original dimension of the input space; the SVM thus successfully avoids the curse of dimensionality from which some classifiers suffer.

b.) Challenges in SVM

The SVM is, however, not without its limitations. Some of these are: determining the most appropriate choice of kernel for a particular task; once a kernel has been chosen, there still remains the issue of optimizing the parameters of the kernel (for instance, in [8] the authors conclude that there is an optimum C value for each dataset); the design of the SVM is optimized for binary classification, which can limit its applicability to multi-class classification tasks; and, although the SVM successfully avoids the curse of dimensionality, the technique can be incredibly slow, especially as the size of the data increases. This is true for both the training and test phases.

There are still several challenges which need to be addressed in SVM applications. In [7] the author suggested that there might be a possibility of limiting the computational load of the SVM by reducing the data dimensionality, which would decrease the number of computations that have to be performed within the SVM.
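In practice, the kernel and cost parameter are usually chosen by a cross-validated search over candidate values. As a hedged illustration (a standard recipe, not the tuning procedure used in this paper), a scikit-learn sketch over C and the RBF width gamma might look as follows, with synthetic stand-in data:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.svm import SVC

    # Synthetic two-class data standing in for speaker/impostor vectors.
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, (50, 36)),
                   rng.normal(1.0, 1.0, (50, 36))])
    y = np.array([+1] * 50 + [-1] * 50)

    # Cross-validated search over the cost parameter C and the RBF width
    # gamma; as [8] observes, the best pair is dataset dependent.
    grid = GridSearchCV(SVC(kernel="rbf"),
                        {"C": [0.1, 1, 10, 100],
                         "gamma": [0.001, 0.01, 0.1]},
                        cv=3)
    grid.fit(X, y)
    print(grid.best_params_)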

The following section presents an overview of a popular dimensionality reduction technique.

II. PRINCIPAL COMPONENT ANALYSIS

A possible approach to improving the classification performance of the SVM is to operate the classifier in a feature space in which the classes are inherently separated [7]. This feature space is typically a multi-dimensional space resulting from transforming the input space via some linear or non-linear transformation. Principal Component Analysis (PCA) is one such transform. PCA is also referred to as the discrete Karhunen-Loève Transform (KLT) or the Hotelling transform. The central principle of PCA is to transform the input space onto a feature space in which the data show maximal variance. The following is a brief discussion of the mathematical formulation of PCA, which closely follows that of [2,5].

Let x be an m-dimensional input data vector, x = (x_1, ..., x_m). We estimate the sample mean by

    \mu = \frac{1}{N} \sum_{i=1}^{N} x_i    (3)

where N is the number of samples. The sample covariance matrix C becomes

    C = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)(x_i - \mu)^T    (4)

      = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T - \mu \mu^T    (5)

To perform PCA we find the eigenvalues and eigenvectors of the sample covariance matrix. Arranging the eigenvectors in descending order of their corresponding eigenvalues, a linear transformation matrix T is formed which generates new vectors from x by

    x' = T(x - \mu)    (6)

The eigenvectors of C are the principal components. In the projected space, the new vectors x' are minimally correlated. To exploit the dimensionality reduction capability of PCA we simply choose the top k (k < m) eigenvectors of C to form T; this is the common way of choosing the eigenvectors to include in the transformation matrix. The assumption made in PCA dimensionality reduction is that most of the information contained in the observation vectors can be adequately represented in the subspace spanned by the first k principal components.

a.) Shortcomings of PCA

PCA is a well-established technique in pattern recognition, and research into its application is worthwhile.
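A minimal numpy sketch of equations (3)-(6) follows, assuming the observation vectors are the rows of an array; the function names are illustrative:

    import numpy as np

    def fit_pca(X, k):
        # X is an (N, m) array of observation vectors. Returns the sample
        # mean (equation (3)) and a (k, m) transformation matrix T whose
        # rows are the top-k eigenvectors of the sample covariance matrix.
        mu = X.mean(axis=0)                   # equation (3)
        Xc = X - mu
        C = (Xc.T @ Xc) / len(X)              # equation (4)
        eigvals, eigvecs = np.linalg.eigh(C)  # eigh, since C is symmetric
        order = np.argsort(eigvals)[::-1]     # descending eigenvalues
        T = eigvecs[:, order[:k]].T           # top-k principal components
        return mu, T

    def apply_pca(X, mu, T):
        # Equation (6), x' = T(x - mu), applied row-wise.
        return (X - mu) @ T.T

    # Reduce 36-dimensional vectors to 24 dimensions, as in the experiments.
    X = np.random.randn(500, 36)
    mu, T = fit_pca(X, k=24)
    print(apply_pca(X, mu, T).shape)  # (500, 24)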
III. AUTOMATIC SPEAKER RECOGNITION

Automatic speaker recognition is the task of using a computer to determine who spoke an utterance, using a sample of their speech as a biometric measure. Speaker recognition divides into two categories: speaker identification and speaker verification. In this paper we are concerned with the latter, which, as the name suggests, is the process of authenticating that a speaker is who they claim to be. The degree of similarity between the test speech sample and the claimed speaker's model is compared to some predefined threshold, and the result is used to decide whether to accept or reject the claimant. The task divides further into text-dependent and text-independent speaker verification. In text-independent speaker verification the content of the speech used for the verification is a priori unknown. This type of authentication system is better suited to highly security-sensitive applications, as it reduces the risk of an impostor finding out the required text and gaining access to the system. Figure 2 shows a simplified speaker verification system. A detailed discussion of all the components of the system is beyond the scope of this paper; the interested reader is referred to [9,10].

Figure 2: A simplified speaker verification system (input speech undergoes parameter extraction and feature extraction in the feature analysis stage; pattern classification then scores the result against the speaker models, given an identity claim)

Speaker verification is a two-class problem in that we are trying to determine whether the test speech belongs to the claimed speaker (class +1) or to an impostor set (class -1). The impostor set is generally approximated by speech data from several possible impostors. A test utterance is compared to the speaker's model and to the impostor set, and the classifier then decides whether the claimant is who they claim to be or an impostor.

a.) Measuring System Performance

In speaker verification two types of errors may occur: false acceptances (FA) and false rejections (FR). An FA error refers to the case where an impostor is classified as an authentic system user. Applications that require high security aim to keep these errors to a minimum in order to protect the system from unauthorized use. An FR error occurs when an authorized user is wrongly classified as an impostor and is thus denied access to the system. There is generally a trade-off between FA and FR errors. A common way of measuring the performance of a speaker verification system is to find the point where the FR rate equals the FA rate; this point is known as the Equal Error Rate (EER). The Detection Error Tradeoff (DET) curve [11] is a popular way of graphically representing the expected performance of a speaker verification system. The curve plots the rates of the two errors mentioned above on a normal deviate scale. An advantage of the DET curve is that it lends itself to easy interpretation: the closer a system's DET curve lies to the origin, the better the system's performance.
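As an illustration, the EER can be approximated by sweeping a decision threshold over the genuine and impostor scores; this is a generic sketch, not the NIST evaluation tooling:

    import numpy as np

    def equal_error_rate(genuine_scores, impostor_scores):
        # The FR rate rises and the FA rate falls as the threshold
        # increases; the EER is read off where the two curves cross.
        thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
        best_gap, eer = np.inf, 1.0
        for t in thresholds:
            fr = np.mean(genuine_scores < t)    # true speakers rejected
            fa = np.mean(impostor_scores >= t)  # impostors accepted
            if abs(fr - fa) < best_gap:
                best_gap, eer = abs(fr - fa), (fr + fa) / 2
        return eer

    rng = np.random.default_rng(1)
    genuine = rng.normal(1.0, 1.0, 1000)    # toy score distributions
    impostor = rng.normal(-1.0, 1.0, 1000)
    print(equal_error_rate(genuine, impostor))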

IV. EXPERIMENTAL SETUP AND RESULTS

As mentioned, speaker verification is a two-class problem. This suggests that the SVM, which is inherently a two-class classifier, is a natural choice for the task; it has already been successfully applied to it in [1,2,12]. In this section we present the experimental setup and a discussion of the results obtained.

a.) Experimental Setup

For the purposes of our study we followed very closely the experimental setup used in [1] when building our baseline. That is, we used 12 linear predictive coefficients (LPC), from which 18 cepstral coefficients (LPCC) and their deltas were computed, resulting in a 36-dimensional feature vector. A simple, energy-based voice activity detector was used to remove frames with energy levels below a certain threshold, and mean and variance normalization was applied. We did, however, use an RBF kernel function for the SVM classifier, which differs from [1]. The SVM classifier used is part of the Torch machine learning library from the IDIAP Research Institute. The speaker verification experiments were carried out on the NIST 2000 database. The impostor model was created using the NIST 1999 database, so that none of the testing data was used in the impostor model; this way we avoid introducing any bias into the system. All experiments were conducted on a 3.2 GHz Pentium 4 processor.

b.) Results

The results below compare the performance of the baseline system (which uses the 36-dimensional feature vector) with the results achieved when PCA feature extraction and dimensionality reduction were applied.

TABLE 1: Results of feature extraction and dimensionality reduction on the SVM speaker verification system

    System     Average Training Time [seconds]   Average Testing Time [seconds]
    Baseline   538.45                            2.2
    36-d PCA   7786.45                           9.92
    32-d PCA   244.39                            4.22
    24-d PCA   300.57                            4.90

Figure 3: DET plot comparing the baseline and the 36-d, 32-d and 24-d PCA systems (miss probability against false alarm probability, both in %)
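A rough sketch of how such a timing comparison could be scripted is shown below, with scikit-learn standing in for the Torch library actually used and synthetic data in place of the NIST sets; any numbers such a toy produces are illustrative only.

    import time
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(2000, 36))
    y_train = rng.integers(0, 2, 2000) * 2 - 1   # +1 / -1 labels
    X_test = rng.normal(size=(500, 36))

    for k in (None, 32, 24):
        Xtr, Xte = X_train, X_test
        if k is not None:
            # Fit the PCA transform on training data only, then project both sets.
            pca = PCA(n_components=k).fit(X_train)
            Xtr, Xte = pca.transform(X_train), pca.transform(X_test)
        clf = SVC(kernel="rbf")
        t0 = time.perf_counter()
        clf.fit(Xtr, y_train)
        t1 = time.perf_counter()
        clf.predict(Xte)
        t2 = time.perf_counter()
        name = "baseline" if k is None else f"{k}-d PCA"
        print(f"{name}: train {t1 - t0:.2f}s, test {t2 - t1:.2f}s")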
V. DISCUSSION OF RESULTS

Table 1 shows the average training and testing times in seconds. The baseline system had the longest training time, while the systems which employed PCA feature extraction showed a significant decrease in both training and testing times. Figure 3 shows the resulting DET curves. The baseline outperforms the systems to which PCA was applied. A possible reason for this is that the PCA transformation changes the structure of the data. As mentioned, the choice of kernel and kernel parameters for the SVM is data dependent; it is therefore likely that the transformed data could be better classified by a different kernel function.

VI. CONCLUSIONS AND FUTURE WORK

The results show that applying feature extraction, even without dimensionality reduction, decreases the training and testing times. However, there is also a degradation in performance when PCA is applied. The savings in processing time make it worthwhile to explore the possibility of improving system performance while applying feature extraction. A few possibilities for achieving this are noted below.

The choice of the optimal kernel for any particular SVM task remains a matter of trial and error; there is no formula to determine which kernel is most appropriate for which task. It may thus be necessary to change the kernel parameters, or even the kernel itself, in order to obtain optimum performance on data that has been transformed by PCA or any other transformation algorithm. In this study we used the traditional approach of taking the eigenvalues as the criterion for choosing the eigenvectors with which to build the PCA transformation matrix. In future work we aim to repeat the experiment, having optimized the system parameters, using the Fisher ratio instead, in order to determine whether this method performs as well on an SVM-based system as it did on the VQ-based system in [4]. Other future work includes investigating the performance of other feature extraction techniques, such as Independent Component Analysis, on this task.
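For concreteness, one way such a Fisher-ratio criterion could rank components in a two-class setting is sketched below; the formula is the standard per-dimension Fisher ratio, and the code is an assumption-laden illustration rather than the procedure of [4]:

    import numpy as np

    def fisher_ratio(features, labels):
        # Per-dimension Fisher ratio, F_j = (m1_j - m2_j)^2 / (v1_j + v2_j);
        # higher values mark dimensions that separate the two classes better.
        c1, c2 = features[labels == +1], features[labels == -1]
        num = (c1.mean(axis=0) - c2.mean(axis=0)) ** 2
        den = c1.var(axis=0) + c2.var(axis=0) + 1e-10
        return num / den

    # Rank PCA-projected components by Fisher ratio instead of eigenvalue
    # and keep the k most discriminative ones.
    X_proj = np.random.randn(400, 36)
    y = np.array([+1] * 200 + [-1] * 200)
    order = np.argsort(fisher_ratio(X_proj, y))[::-1]
    print(X_proj[:, order[:24]].shape)  # (400, 24)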

VII. REFERENCES

[1] W. M. Campbell, J. P. Campbell, D. A. Reynolds, E. Singer and P. A. Torres-Carrasquillo, "Support Vector Machines for Speaker and Language Recognition." Computer Speech and Language, August 2005.
[2] V. Wan, "Speaker Verification with Support Vector Machines." PhD thesis, Department of Computer Science, University of Sheffield, June 2003.
[3] L. Wolf and S. Bileschi, "Combining Variable Selection with Dimensionality Reduction." Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory, CBCL Memo 247, March 2005.
[4] P. Ding and L. Zhang, "Speaker Recognition using Principal Component Analysis." In Proceedings of ICONIP 2001, Shanghai, China, November 2001.
[5] X. Wang, "Feature Extraction and Dimensionality Reduction in Pattern Recognition and Their Application in Speech Recognition." PhD thesis, School of Microelectronic Engineering, Griffith University, November 2002.
[6] M. Awad and L. Khan, "Applications and Limitations of Support Vector Machines." Department of Computer Science, University of Texas at Dallas, USA.
[7] A. Ganapathiraju, "Support Vector Machines for Speech Recognition." PhD thesis, Department of Electrical and Computer Engineering, Mississippi State University, May 2002.
[8] P. Watanachaturaporn, P. K. Varshney and M. K. Arora, "Evaluation of Factors Affecting Support Vector Machines for Hyperspectral Classification." In Proceedings of ASPRS 2005, Baltimore, USA, March 2005.
[9] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition." Data Mining and Knowledge Discovery, 1998, vol. 2, pp. 121-167.
[10] F. Bimbot, J.-F. Bonastre, C. Fredouille, G. Gravier, I. Magrin-Chagnolleau, S. Meignier, T. Merlin, J. Ortega-Garcia, D. Petrovska-Delacrétaz and D. A. Reynolds, "A Tutorial on Text-Independent Speaker Verification." EURASIP Journal on Applied Signal Processing, 2004, vol. 4, pp. 430-451.
[11] A. Martin, G. Doddington, T. Kamm, M. Ordowski and M. Przybocki, "The DET Curve in Assessment of Detection Task Performance."
[12] W. M. Campbell, J. P. Campbell, D. A. Reynolds, D. A. Jones and T. R. Leek, "Phonetic Speaker Recognition with Support Vector Machines." MIT Lincoln Laboratory, Lexington.
[13] M. E. Wall, A. Rechtsteiner and L. M. Rocha, "Singular Value Decomposition and Principal Component Analysis." In A Practical Approach to Microarray Data Analysis (D. P. Berrar, W. Dubitzky and M. Granzow, eds.), Kluwer, Norwell, MA, 2003, pp. 91-109.
[14] R. Collobert and S. Bengio, "SVMTorch: Support Vector Machines for Large-Scale Regression Problems." Journal of Machine Learning Research, 2001, vol. 1, pp. 143-160.