Stability Region based Expectation Maximization for Model-based Clustering


Chandan K. Reddy, Hsiao-Dong Chiang
School of Electrical and Computer Engineering, Cornell University, Ithaca, NY

Bala Rajaratnam
Department of Statistical Science, Cornell University, Ithaca, NY

Abstract

In spite of the initialization problem, the Expectation-Maximization (EM) algorithm is widely used for estimating the parameters in several data mining related tasks. Most popular model-based clustering techniques might yield poor clusters if the parameters are not initialized properly. To reduce the sensitivity to initial points, a novel algorithm for learning mixture models from multivariate data is introduced in this paper. The proposed algorithm takes advantage of TRUST-TECH (TRansformation Under STability-reTaining Equilibria CHaracterization) to compute neighborhood local maxima on the likelihood surface using stability regions. Basically, our method coalesces the advantages of traditional EM with the dynamic and geometric characteristics of the stability regions of the nonlinear dynamical system corresponding to the log-likelihood function. Two phases, namely the EM phase and the stability region phase, are repeated alternately in the parameter space to achieve improvements in the maximum likelihood. Though applied to Gaussian mixtures in this paper, our technique can easily be generalized to any other parametric finite mixture model. The algorithm has been tested on both synthetic and real datasets, and the improvements in performance compared to other approaches are demonstrated. The robustness with respect to initialization is also illustrated experimentally.

1 Introduction

Finite mixtures allow a probabilistic model-based approach to unsupervised learning [10], which plays an important role in predictive data mining applications. One of the most popular methods for fitting mixture models to observed data is the Expectation-Maximization (EM) algorithm, which converges locally to a maximum likelihood estimate of the mixture parameters [4, 6].
Corresponding author: email - ckr6@cornell.edu

The usual steepest descent, conjugate gradient, or Newton-Raphson methods are too complicated for use in solving this problem [19]. EM has become a popular method since it takes advantage of problem-specific properties, and EM-based approaches have been used successfully to solve problems that arise in various other applications [12, 2]. In this paper, we consider the problem of learning the parameters of Gaussian Mixture Models (GMM). Fig. 1 shows data generated by three Gaussian components with different means and variances. Note that every data point has a probabilistic (or soft) membership that gives the probability with which it belongs to each of the components. Points that belong to component 1 will have a high probability of membership for component 1. On the other hand, data points belonging to components 2 and 3 are not well separated. The problem of learning mixture models involves not only estimating the parameters of these components but also finding the probabilities with which each data point belongs to them. Given the number of components and an initial set of parameters, the EM algorithm can be applied to compute the optimal estimates of the parameters that maximize the likelihood of the data. However, the main problem with the EM algorithm is that it is a greedy method which is very sensitive to the given initial set of parameters. To overcome this problem, a novel two-phase algorithm based on stability region analysis is proposed. The main research concerns that motivated the new algorithm presented in this paper are:

- The EM algorithm for mixture modeling converges to a local maximum of the likelihood function very quickly.
- There are many other promising locally optimal solutions in the close vicinity of the solutions obtained from methods that provide good initial guesses.
- Model selection criteria usually assume that the global optimum of the log-likelihood function can be obtained. However, achieving this is computationally intractable.

- Some regions of the search space do not contain any promising solutions. The promising and non-promising regions coexist, and it becomes challenging to avoid wasting computational resources searching the non-promising ones.

Of all the concerns mentioned above, the fact that most of the local maxima are not distributed uniformly [16] makes it important to develop algorithms that not only avoid searching in low-likelihood regions but also explore promising subspaces more thoroughly. This subspace search also makes the solution less sensitive to the initial set of parameters. In this paper, we propose a novel two-phase algorithm for estimating the parameters of mixture models. By using concepts from dynamical systems and the EM algorithm simultaneously to exploit the problem-specific features of mixture models, our algorithm obtains the optimal set of parameters by searching for the global maximum on the likelihood surface in a systematic manner.

Figure 1. Data generated by three Gaussian components. The problem of learning mixture models is to obtain the parameters of these Gaussian components and the membership probabilities of each data point.

The rest of this paper is organized as follows: Section 2 gives relevant background on the various methods proposed in the literature for learning mixture models. Section 3 discusses preliminaries on mixture models, the EM algorithm and stability regions. Section 4 presents our new framework, and the details of our implementation are given in Section 5. Section 6 shows the experimental results of our algorithm on synthetic and real datasets. Finally, Section 7 concludes with future research directions.

2 Relevant Background

Although EM and its variants have been used extensively for learning mixture models, several researchers have approached the problem by identifying new techniques that give good initialization. More generic techniques such as deterministic annealing [16] and genetic algorithms [13] have been applied to obtain a good set of parameters.
Though these techniques have asymptotic guarantees, they are very time consuming and hence cannot be used in most practical applications. Some problem-specific algorithms, like split-and-merge EM [17], component-wise EM [6], greedy learning [18], an incremental version for sparse representations [11], and a parameter-space grid [8], have also been proposed in the literature. Some of these algorithms are either computationally very expensive or infeasible when learning mixtures in high-dimensional spaces [8]. In spite of all the expense in these methods, very little effort has been made to explore promising subspaces within the larger parameter space. Most of these algorithms eventually apply the EM algorithm to move to a locally maximal set of parameters on the likelihood surface. Simpler practical approaches, like running EM from several random initializations and then choosing the final estimate that leads to the local maximum with the highest likelihood, are also successful to a certain extent [15]. Though some of these methods apply additional mechanisms (like perturbations [5]) to escape from locally optimal solutions, systematic methods for searching the subspace are yet to be developed. The dynamical system of the log-likelihood function reveals more information about the neighborhood stability regions and their corresponding local maxima [3]. Hence, the difficulty of finding good solutions when the error surface is very rugged can be overcome by adding stability-region-based mechanisms to escape from the convergence zone of a local maximum. Though this method might introduce some additional cost, one has to realize that existing approaches are much more expensive due to their stochastic nature. Specifically, for a problem in this context, where there is a non-uniform distribution of local maxima, it is difficult for most methods to search neighboring regions [20]. For this reason, it is more desirable to apply the TRUST-TECH based Expectation Maximization (TT-EM) algorithm after obtaining some point in a promising region.
The main advantages of the proposed algorithm are that it:

- Explores most of the neighborhood locally optimal solutions, unlike traditional stochastic algorithms.
- Acts as a flexible interface between the EM algorithm and other global methods.
- Allows the user to work with existing clusters obtained from traditional approaches and improves the quality of the solutions based on the maximum likelihood criterion.
- Helps expensive global methods to truncate early.
- Exploits the fact that promising solutions are obtained by faster convergence of the EM algorithm.

3 Preliminaries

We now introduce some necessary preliminaries on mixture models, the EM algorithm and stability regions. First, we describe the notation used in the rest of the paper:

Table 1. Description of the notation used in the paper

  Notation | Description
  d        | number of features
  n        | number of data points
  k        | number of components
  s        | total number of parameters
  Θ        | parameter set
  θ_i      | parameters of the i-th component
  α_i      | mixing weight of the i-th component
  X        | observed data
  Z        | missing data
  Y        | complete data
  t        | timestep for the estimates

3.1 Mixture Models

Let us assume that there are k Gaussians in the mixture model. The probability density function is:

  p(x | Θ) = Σ_{i=1}^{k} α_i p(x | θ_i)        (1)

where x = [x_1, x_2, ..., x_d]^T is the feature vector of d dimensions. The α_i represent the mixing weights, Θ represents the parameter set (α_1, ..., α_k, θ_1, ..., θ_k), and p is a univariate Gaussian density parameterized by θ_i (i.e., µ_i and σ_i):

  p(x | θ_i) = (1 / (√(2π) σ_i)) exp( −(x − µ_i)² / (2σ_i²) )        (2)

Being probabilities, the α_i must satisfy:

  0 ≤ α_i ≤ 1,  i = 1, ..., k,  and  Σ_{i=1}^{k} α_i = 1        (3)

Given a set of n i.i.d. samples X = {x^(1), x^(2), ..., x^(n)}, the log-likelihood corresponding to the mixture is:

  log p(X | Θ) = Σ_{j=1}^{n} log p(x^(j) | Θ) = Σ_{j=1}^{n} log Σ_{i=1}^{k} α_i p(x^(j) | θ_i)        (4)

The goal of learning mixture models is to obtain the parameters Θ from a set of n data points which are samples of a distribution with density given by (1). The Maximum Likelihood Estimate (MLE) is:

  Θ_MLE = arg max_Θ { log p(X | Θ) }        (5)

where the maximization is over the entire parameter space. Since this MLE cannot be found analytically for mixture models, one has to rely on iterative procedures that can find the global maximum of log p(X | Θ). The EM algorithm described in the next section has been used successfully to find a local maximum of such a function [9].

3.2 Expectation Maximization

The EM algorithm treats X as the observed data.
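As a concrete illustration of Eqs. (1), (2) and (4), the following minimal sketch evaluates the log-likelihood of a univariate spherical Gaussian mixture. It is written in Python/NumPy rather than the paper's MATLAB, and all function and variable names are ours:

```python
import numpy as np

def gmm_log_likelihood(x, alphas, mus, sigmas):
    """Log-likelihood of Eq. (4): sum_j log sum_i alpha_i N(x^(j) | mu_i, sigma_i^2)."""
    x = np.asarray(x, dtype=float)[:, None]                       # shape (n, 1)
    # Component densities p(x^(j) | theta_i) of Eq. (2), shape (n, k).
    dens = np.exp(-(x - mus) ** 2 / (2 * sigmas ** 2)) / (np.sqrt(2 * np.pi) * sigmas)
    return float(np.sum(np.log(dens @ alphas)))                   # Eq. (4)
```

For a single standard-normal component, for example, the per-point log-likelihood at x = 0 is −log√(2π) ≈ −0.919.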
The missing part, termed the hidden data, is a set of n labels Z = {z^(1), z^(2), ..., z^(n)} associated with the n samples, indicating which component produced each sample [9]. Each label z^(j) = [z_1^(j), ..., z_k^(j)] is a binary vector where z_i^(j) = 1 and z_m^(j) = 0 for m ≠ i means that sample x^(j) was produced by the i-th component. The complete log-likelihood (i.e., the one from which we would estimate Θ if the complete data Y = {X, Z} were observed) is:

  log p(X, Z | Θ) = log p(Y | Θ) = Σ_{j=1}^{n} Σ_{i=1}^{k} z_i^(j) log [ α_i p(x^(j) | θ_i) ]        (6)

The EM algorithm produces a sequence of estimates {Θ(t), t = 0, 1, 2, ...} by alternately applying the following two steps until convergence:

E-Step: Compute the conditional expectation of the hidden data, given X and the current estimate Θ(t). Since log p(X, Z | Θ) is linear with respect to the missing data Z, we simply have to compute the conditional expectation W ≡ E[Z | X, Θ(t)] and plug it into log p(X, Z | Θ). This gives the Q-function as follows:

  Q(Θ | Θ(t)) ≡ E_Z [ log p(X, Z | Θ) | X, Θ(t) ]        (7)

Since Z is a binary vector, its conditional expectation is given by:

  w_i^(j) ≡ E[ z_i^(j) | X, Θ(t) ] = Pr[ z_i^(j) = 1 | x^(j), Θ(t) ] = α_i(t) p(x^(j) | θ_i(t)) / Σ_{m=1}^{k} α_m(t) p(x^(j) | θ_m(t))        (8)

where the last equality is simply Bayes' law (α_i is the a priori probability that z_i^(j) = 1), while w_i^(j) is the a posteriori probability that z_i^(j) = 1 given the observation x^(j).

M-Step: The estimates of the new parameters are updated using:

  Θ(t+1) = arg max_Θ { Q(Θ | Θ(t)) }        (9)

3.3 EM for GMMs

Several variants of the EM algorithm have been used extensively to solve this problem; the convergence properties of the EM algorithm for Gaussian mixtures are thoroughly discussed in [19]. The Q-function for a GMM is given by:

  Q(Θ | Θ(t)) = Σ_{j=1}^{n} Σ_{i=1}^{k} w_i^(j) [ −log(√(2π) σ_i) − (x^(j) − µ_i)²/(2σ_i²) + log α_i ]        (10)

where, as in (8),

  w_i^(j) = α_i(t) (1/σ_i(t)) exp( −(x^(j) − µ_i(t))²/(2σ_i(t)²) ) / Σ_{m=1}^{k} α_m(t) (1/σ_m(t)) exp( −(x^(j) − µ_m(t))²/(2σ_m(t)²) )        (11)

The maximization step is given by:

  ∂Q(Θ | Θ(t)) / ∂Θ_i = 0        (12)

where Θ_i denotes the parameters of the i-th component. Because of the assumption that each data point comes from a single component, solving the above equation becomes trivial. The updates for the maximization step in the case of GMMs are:

  µ_i(t+1) = Σ_{j=1}^{n} w_i^(j) x^(j) / Σ_{j=1}^{n} w_i^(j)

  σ_i²(t+1) = Σ_{j=1}^{n} w_i^(j) (x^(j) − µ_i(t+1))² / Σ_{j=1}^{n} w_i^(j)

  α_i(t+1) = (1/n) Σ_{j=1}^{n} w_i^(j)

3.4 Stability Regions

This section deals with the transformation of the original log-likelihood function into its corresponding nonlinear dynamical system and introduces the terminology needed to comprehend our algorithm. This transformation gives the correspondence between all the critical points of the s-dimensional likelihood surface and those of its dynamical system. For spherical Gaussian mixtures with k components, the number of unknown parameters is s = 3k − 1. For convenience, the maximization problem is transformed into a minimization problem with the objective function:

  min_Θ f(Θ) = min_Θ { −log p(Y | Θ) } = −max_Θ { log p(Y | Θ) }        (13)

where f(Θ) is assumed to be in C²(R^s, R).
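Before turning to the dynamical-systems view, the E-step (8) and the M-step updates above can be combined into a single EM iteration. The sketch below is ours (Python/NumPy, univariate components; the paper's implementation is in MATLAB). By EM's monotonicity, the log-likelihood (4) never decreases across such iterations:

```python
import numpy as np

def em_step(x, alphas, mus, sigmas):
    """One EM iteration for a univariate GMM: E-step of Eq. (8), then the
    closed-form M-step updates for mu_i(t+1), sigma_i^2(t+1), alpha_i(t+1)."""
    xc = np.asarray(x, dtype=float)[:, None]                      # (n, 1)
    dens = np.exp(-(xc - mus) ** 2 / (2 * sigmas ** 2)) / (np.sqrt(2 * np.pi) * sigmas)
    w = dens * alphas
    w /= w.sum(axis=1, keepdims=True)                             # responsibilities, Eq. (8)
    n_i = w.sum(axis=0)                                           # effective count per component
    mus_new = (w * xc).sum(axis=0) / n_i                          # mu_i(t+1)
    var_new = (w * (xc - mus_new) ** 2).sum(axis=0) / n_i         # sigma_i^2(t+1)
    alphas_new = n_i / xc.shape[0]                                # alpha_i(t+1)
    return alphas_new, mus_new, np.sqrt(var_new)
```

Iterating em_step from an initial guess converges to a stationary point; which one depends entirely on the initialization, which is exactly the sensitivity the stability region phase is designed to address.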
Definition 1: Θ̄ is said to be a critical point of (13) if it satisfies:

  ∇f(Θ̄) = 0        (14)

A critical point Θ̄ ∈ R^s is said to be nondegenerate if d^T ∇²f(Θ̄) d ≠ 0 for all d ≠ 0. We construct the following gradient system in order to locate critical points of the objective function (13):

  Θ̇(t) = −∇f(Θ)        (15)

where the state vector Θ belongs to the Euclidean space R^s, and the vector field −∇f : R^s → R^s satisfies the sufficient condition for the existence and uniqueness of solutions. The solution curve of Eq. (15) starting from Θ at time t = 0 is called a trajectory and is denoted by Φ(Θ, ·) : R → R^s. A state vector Θ is called an equilibrium point of Eq. (15) if ∇f(Θ) = 0. An equilibrium point is said to be hyperbolic if the Jacobian of −∇f at Θ has no eigenvalues with zero real part.

The gradient system for the log-likelihood function in the case of spherical Gaussians is constructed as follows:

  [µ̇_1(t) ... µ̇_k(t)  σ̇_1(t) ... σ̇_k(t)  α̇_1(t) ... α̇_{k−1}(t)]^T = −[ ∂f/∂µ_1 ... ∂f/∂µ_k  ∂f/∂σ_1 ... ∂f/∂σ_k  ∂f/∂α_1 ... ∂f/∂α_{k−1} ]^T        (16)

where, with w_i^(j) as in (8),

  ∂f/∂µ_i = −Σ_{j=1}^{n} w_i^(j) (x^(j) − µ_i)/σ_i²,   i = 1, ..., k
  ∂f/∂σ_i = Σ_{j=1}^{n} w_i^(j) [ 1/σ_i − (x^(j) − µ_i)²/σ_i³ ],   i = 1, ..., k
  ∂f/∂α_i = −Σ_{j=1}^{n} [ w_i^(j)/α_i − w_k^(j)/α_k ],   i = 1, ..., k−1

For simplicity, we show the construction of the gradient system for the case of spherical Gaussians; it can easily be extended to the full-covariance Gaussian mixture case. It should be noted that only (k−1) of the α_i are considered in the gradient system because of the unity constraint; the dependent variable α_k is written as:

  α_k = 1 − Σ_{j=1}^{k−1} α_j        (17)

Definition 2: A hyperbolic equilibrium point is called an (asymptotically) stable equilibrium point (SEP) if all the eigenvalues of its corresponding Jacobian have negative real parts; conversely, it is an unstable equilibrium point if some eigenvalues have a positive real part. An equilibrium point is called a type-k equilibrium point if its corresponding Jacobian has exactly k eigenvalues with positive real part. The stable manifold W^s(x̄) and unstable manifold W^u(x̄) of an equilibrium point x̄ are defined as:

  W^s(x̄) = { x ∈ R^s : lim_{t→∞} Φ(x, t) = x̄ }        (18)
  W^u(x̄) = { x ∈ R^s : lim_{t→−∞} Φ(x, t) = x̄ }        (19)

The task of finding multiple local maxima on the log-likelihood surface is thus transformed into the task of finding multiple stable equilibrium points of the corresponding gradient system. The advantage of our approach is that this transformation yields more knowledge about the various dynamic and geometric characteristics of the original surface and leads to a powerful method for finding improved solutions. In this paper, we are particularly interested in the properties of the local maxima and their one-to-one correspondence to the stable equilibrium points. To comprehend the transformation, we need to define an energy function: a smooth function V(·) : R^s → R satisfying V̇(Φ(Θ, t)) < 0 for all Θ not in the set of equilibrium points E and all t ∈ R⁺ is termed an energy function.
Theorem 3.1 [3]: f(Θ) is an energy function for the gradient system (15).

Definition 3: A type-1 equilibrium point x_d on the practical stability boundary of a stable equilibrium point x_s is called a decomposition point.

Definition 4: The practical stability region of a stable equilibrium point x_s of the nonlinear dynamical system (15), denoted A_p(x_s), is the interior of the closure of the stability region A(x_s), which is given by:

  A(x_s) = { x ∈ R^s : lim_{t→∞} Φ(x, t) = x_s }        (20)

The boundary of the practical stability region is called the practical stability boundary of x_s and is denoted by ∂A_p(x_s).

Theorem 3.2 asserts that the practical stability boundary is contained in the union of the closures of the stable manifolds of all the decomposition points on the practical stability boundary. Hence, if the decomposition points can be identified, an explicit characterization of the practical stability boundary can be established using (21). This theorem gives an explicit description of the geometrical and dynamical structure of the practical stability boundary.

Theorem 3.2 (Characterization of the practical stability boundary) [7]: Consider the negative gradient system described by (15). Let σ_i, i = 1, 2, ..., be the decomposition points on the practical stability boundary ∂A_p(x_s) of a stable equilibrium point x_s. Then:

  ∂A_p(x_s) = ∪_{σ_i ∈ ∂A_p} W^s(σ_i)        (21)
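To make the dynamical-systems view concrete, the sketch below (ours, in Python/NumPy, for the univariate case) evaluates the objective f(Θ) of Eq. (13) and the µ-block of the gradient field in Eq. (16); the analytic gradient can be validated against finite differences of f:

```python
import numpy as np

def neg_loglik(x, alphas, mus, sigmas):
    """f(Theta) = -log p(X | Theta), the minimization objective of Eq. (13)."""
    xc = np.asarray(x, dtype=float)[:, None]
    dens = np.exp(-(xc - mus) ** 2 / (2 * sigmas ** 2)) / (np.sqrt(2 * np.pi) * sigmas)
    return -float(np.sum(np.log(dens @ alphas)))

def grad_mu(x, alphas, mus, sigmas):
    """mu-block of the gradient field in Eq. (16):
    df/dmu_i = -sum_j w_i^(j) (x^(j) - mu_i) / sigma_i^2."""
    xc = np.asarray(x, dtype=float)[:, None]
    dens = np.exp(-(xc - mus) ** 2 / (2 * sigmas ** 2)) / (np.sqrt(2 * np.pi) * sigmas)
    w = dens * alphas
    w /= w.sum(axis=1, keepdims=True)                             # responsibilities, Eq. (8)
    return -np.sum(w * (xc - mus) / sigmas ** 2, axis=0)
```

A forward-Euler integration of Θ̇ = −∇f(Θ) using this field traces the trajectory Φ(Θ, ·) of Eq. (15); every trajectory started inside A(x_s) ends at the stable equilibrium point x_s, i.e., at a local maximum of the log-likelihood.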

Figure 2. Various stages of our algorithm in (a) the parameter space: the solid lines indicate the practical stability boundary; the points σ_1, σ_2 highlighted on the stability boundary are the decomposition points; the dotted lines indicate the convergence of the EM algorithm; the d_j are the promising directions generated at the local maximum LM; the dashed lines indicate the stability region phase; x_1, x_2 and x_3 are the exit points on the practical stability boundary. (b) The same variables in the function space, with their corresponding log-likelihood values.

Our approach takes advantage of TRUST-TECH (TRansformation Under STability-reTaining Equilibria CHaracterization) to compute neighborhood local maxima on the likelihood surface using stability regions. Originally, the basic idea of our algorithm was to find decomposition points on the practical stability boundary. Since each decomposition point connects two local maxima uniquely, it is important to obtain the saddle points from the given local maximum and then move to the next local maximum through this decomposition point [14]. Though this procedure guarantees that a local maximum is not revisited, the computational expense of tracing the stability boundary and identifying the decomposition point is high compared to the cost of applying the EM algorithm directly from the exit point without considering the decomposition point. One can use the saddle-point tracing procedure described in [14] for applications where local methods like EM are much more expensive.

4 Our Algorithm

Our framework consists mainly of two phases which are repeated in the promising subspaces of the parameter search space. It is most effective to use our algorithm only in these promising subspaces, which are usually obtained by stochastic global methods. The first phase is the local phase (or the EM phase), where the promising solutions are refined to the corresponding locally optimal parameter set.
The second phase, which is the main contribution of this paper, is the stability region phase, where the exit points are computed and the neighborhood solutions are systematically explored through these exit points. Fig. 2 shows the different steps of our algorithm in both (a) the parameter space and (b) the function space. This approach can be treated as a hybrid between global methods for initialization and the EM algorithm, which gives the local maxima. One of the main advantages of our approach is that it searches the parameter space more deterministically. It differs from traditional local methods by computing multiple local solutions in the neighborhood region; this also enhances user flexibility by allowing users to choose between different sets of good clusterings. Though global methods identify promising subspaces, it is important to explore such a subspace more thoroughly, especially in problems like parameter estimation. Algorithm 1 describes our approach. In order to escape from a local maximum, our method needs to compute certain promising directions based on the local behaviour of the function. Generating these promising directions is one of the important aspects of our algorithm. Surprisingly, choosing random directions to move out of the local maximum works

well for this problem. One might also use other directions, like eigenvectors of the Hessian, or incorporate domain-specific knowledge (like information about priors, the approximate location of cluster means, or user preferences on the final clusters), depending on the application and the level of computational expense that can be afforded. We used random directions in our work because they are very cheap to compute. Once the promising directions are generated, exit points are computed along them. Exit points are the points of intersection between a given direction and the practical stability boundary of the local maximum along that direction. If the stability boundary is not encountered along a given direction, it is very likely that no new local maximum will be found in that direction. With a new initial guess in the vicinity of an exit point, the EM algorithm is applied again to obtain a new local maximum.

Algorithm 1: Stability Region based EM Algorithm

Input: Parameters Θ, data X, tolerance τ, step S_p
Output: Θ_MLE

    Apply a global method and store the q promising solutions Θ_init = {Θ_1, Θ_2, ..., Θ_q}
    Initialize E = ∅
    while Θ_init ≠ ∅ do
        Choose Θ_i ∈ Θ_init, set Θ_init = Θ_init \ {Θ_i}
        LM_i = EM(Θ_i, X, τ)
        E = E ∪ {LM_i}
        Generate promising direction vectors d_j from LM_i
        for each d_j do
            Compute exit point X_j along d_j starting from LM_i by evaluating the log-likelihood function given by (4)
            New_j = EM(X_j + ε·d_j, X, τ)
            if New_j ∉ E then
                E = E ∪ {New_j}
            end if
        end for
    end while
    Θ_MLE = max{val(E)}

5 Implementation Details

Our program is implemented in MATLAB and runs on a Pentium IV 2.8 GHz machine. The main procedure implemented is TT_EM, described in Algorithm 2. The algorithm takes the mixture data and the initial set of parameters as input, along with the step size for moving out and the tolerance for convergence in the EM algorithm. It returns the set of parameters that correspond to the Tier-1 neighboring locally optimal solutions. The procedure eval returns the log-likelihood score given by (4). The Gen_Dir procedure generates promising directions from the local maximum.
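The core of this search, marching along a direction until the log-likelihood begins to rise again, can be sketched generically. This is our illustrative Python, not the paper's MATLAB code, and it takes an arbitrary log-likelihood callable:

```python
import numpy as np

def exit_point_search(loglik, theta0, direction, step=0.05, eval_max=500):
    """March from a local maximum theta0 along `direction`, evaluating the
    log-likelihood at each step. The exit point is the first point at which
    the log-likelihood starts rising again (the practical stability boundary
    has been crossed). Returns None if no exit point is found within eval_max
    evaluations, mirroring the Eval_MAX cutoff of the TT_EM procedure."""
    theta = np.asarray(theta0, dtype=float)
    d = np.asarray(direction, dtype=float)
    prev = loglik(theta)
    for _ in range(eval_max):
        theta = theta + step * d
        cur = loglik(theta)
        if cur > prev:               # likelihood rising again: boundary crossed
            return theta
        prev = cur
    return None
```

From the returned exit point one would step a further ε along the same direction, to land inside the neighboring stability region, and restart EM from there, as in Algorithm 1.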
Exit points are obtained along these generated directions. The procedure update moves the current parameter set to the next one along the k-th direction Dir[k]. Some of the directions might suffer from one of two problems: (i) an exit point might not be found along the direction; (ii) even if an exit point is found, it might lead to a less promising solution. If no exit point is found along a direction, the search along it is terminated after Eval_MAX function evaluations. For every exit point that is successfully found, the EM procedure is applied and the corresponding neighborhood parameter sets are stored in Params[] (1). Since different parameters have different ranges, care must be taken when multiplying by the step sizes; it is important to use the current estimates to obtain an approximation of the step size with which one should move out along each parameter in the search space. Finally, the solution with the highest likelihood score among the original set of parameters and the Tier-1 solutions is returned.

Algorithm 2: Params[] = TT_EM(Pset, Data, Tol, Step)

    Val = eval(Pset)
    Dir[] = Gen_Dir(Pset)
    Eval_MAX = 500
    for k = 1 to size(Dir) do
        Params[k] = Pset
        ExitPt = OFF
        Prev_Val = Val
        Cnt = 0
        while (!ExitPt) && (Cnt < Eval_MAX) do
            Params[k] = update(Params[k], Dir[k], Step)
            Cnt = Cnt + 1
            Next_Val = eval(Params[k])
            if (Next_Val > Prev_Val) then
                ExitPt = ON
            end if
            Prev_Val = Next_Val
        end while
        if Cnt < Eval_MAX then
            Params[k] = update(Params[k], Dir[k], ε)
            Params[k] = EM(Params[k], Data, Tol)
        else
            Params[k] = NULL
        end if
    end for
    Return max(eval(Params[]))

(1) To ensure that the new initial points are in different stability regions, one should move along the directions ε away from the exit points.

6 Results and Discussion

Our algorithm has been tested on both synthetic and real datasets. The initial values for the centers and the covariances were chosen uniformly at random, and uniform priors were used for initializing the components. For the real datasets, the centers were chosen randomly from the sample points.

Figure 3. Parameter estimates at various stages of our algorithm on the three-component Gaussian mixture model: (a) a poor random initial guess; (b) the local maximum obtained by applying the EM algorithm from the poor initial guess; (c) the exit point obtained by our algorithm; (d) the final solution obtained by applying the EM algorithm from the initial point in the neighboring stability region.

Figure 4. Graph showing likelihood vs. evaluations. A corresponds to the original local maximum, B to the exit point, C to the new initial point in the neighboring stability region (after moving out by ε), and D to the new local maximum.

6.1 Synthetic Datasets

A simple synthetic dataset with 40 samples drawn from 5 spherical Gaussian components was generated and tested with our algorithm. The priors were uniform and all components shared a common standard deviation. The centers of the five components were: µ_1 = [0.3 0.3]^T, µ_2 = [0.5 0.5]^T, µ_3 = [0.7 0.7]^T, µ_4 = [0.3 0.7]^T and µ_5 = [0.7 0.3]^T.

The second dataset is a diagonal-covariance case containing n = 900 data points, generated from a two-dimensional, three-component Gaussian mixture with mean vectors [0 −2]^T, [0 0]^T and [0 2]^T and a common diagonal covariance matrix with values 2 and 0.2 along the diagonal [16]. All three components have uniform priors.

Fig. 3 shows the various stages of our algorithm and demonstrates how the clusters obtained from existing algorithms are improved. The initial clusters are of low quality because of the poor initial set of parameters; our algorithm takes these clusters and applies the stability region step and the EM step alternately to obtain the final result. Fig. 4 shows the value of the log-likelihood during the stability region phase and the EM iterations.

In the third synthetic dataset, more complicated overlapping Gaussian mixtures are considered [6].
The parameters are as follows: µ_1 = µ_2 = [−4 −4]^T, µ_3 = [2 2]^T and µ_4 = [−1 −6]^T; α_1 = α_2 = α_3 = 0.3 and α_4 = 0.1; the covariance matrices, with Σ_1 = Σ_3 and Σ_2 = Σ_4, are as specified in [6].

6.2 Real Datasets

Two real datasets obtained from the UCI Machine Learning repository [1] were also used for testing the performance of our algorithm. The widely used Iris data, with 150 samples, 3 classes and 4 features, was used. The Wine dataset with

samples was also used for testing; the Wine data has 3 classes and 13 features. For these real datasets, the class labels were removed, treating the task as an unsupervised learning problem.

Table 2 summarizes our results over 100 runs; the means and standard deviations of the log-likelihood values are reported. The traditional EM algorithm with random starts is compared against our algorithm on both the synthetic and the real datasets. Our algorithm not only obtains a higher likelihood value but also produces it with high confidence; the low standard deviation of our results indicates the robustness with which the global maximum is obtained. In the case of the Wine data, the improvements are less significant than on the other datasets. This might be because the dataset does not have Gaussian components; our method assumes that the underlying distribution of the data is a mixture of Gaussians. Table 3 compares TRUST-TECH-EM with other methods proposed in the literature, such as split-and-merge EM (SMEM) and k-means+EM.

Table 2. Performance of our algorithm, averaged over 100 runs, on various synthetic and real datasets (log-likelihood, mean ± std)

  Dataset          | Samples | Clusters | Features | EM (mean ± std) | TRUST-TECH-EM (mean ± std)
  Spherical        |         |          |          |                 | ± 0.6
  Elliptical       |         |          |          |                 | ± 0.03
  Full covariance  |         |          |          |                 | ±
  Full covariance  |         |          |          |                 | ± 37.02
  Iris             | 150     | 3        | 4        |                 | ± 11.72
  Wine             |         | 3        | 13       |                 | ± 1.60

Table 3. Comparison of TRUST-TECH-EM with other methods (log-likelihood, mean ± std)

  Method          | Elliptical | Iris
  RS+EM           |            | ± 27
  K-Means+EM      |            | ± 10
  SMEM            |            | ± 6
  TRUST-TECH-EM   |            | ± 11

Table 4 summarizes the average number of iterations taken by the EM algorithm to converge to the locally optimal solution. We can see that the most promising solution produced by our TRUST-TECH methodology converges much faster. In other words, our method can effectively take advantage of the fact that the convergence of the EM algorithm is much faster for high-quality solutions. This is an inherent property of the EM algorithm when applied to the mixture modeling problem, and we exploit it to improve the efficiency of our algorithm. Hence, for obtaining the Tier-1 solutions using our algorithm, the threshold on the number of iterations can be significantly lowered.

Table 4. Number of iterations taken for the convergence of the best solution

  Dataset          | Avg. no. of iterations | No. of iterations for the best solution
  Spherical        |                        |
  Elliptical       |                        |
  Full covariance  |                        |

6.3 Discussion

It is effective to use our algorithm on those solutions that appear promising. Due to the nature of the problem, it is very likely that the nearby solutions surrounding an existing solution will be more promising. One of the primary advantages of our method is that it can be used along with other popular methods to improve the quality of existing solutions. In clustering problems, it is an added advantage to be able to refine the final clusters obtained. Most of the focus in the literature has been on new initialization methods or new clustering techniques, which often do not take advantage of existing results and start the clustering procedure from scratch. Though shown only for the case of multivariate Gaussian mixtures, our technique can be effectively applied to any parametric finite mixture model.

7 Conclusion and Future Work

A novel stability region based EM algorithm has been introduced for estimating the parameters of mixture models. The EM phase and the stability region phase are applied alternately in the context of the well-studied mixture model parameter estimation problem. The concept of a stability region helps us to understand the topology of the original log-likelihood surface. Our method computes the neighborhood

local maxima on the likelihood surface using the stability regions of the corresponding nonlinear dynamical system. The algorithm has been tested successfully on various synthetic and real datasets, and the improvements in performance are clearly manifested. Some properties of the EM algorithm concerning its rate of convergence have been exploited efficiently. Our algorithm can easily be extended to the popular k-means clustering technique. In the future, we plan to apply these stability region based methods to other widely used EM-related parameter estimation problems, such as training Hidden Markov Models, Mixtures of Factor Analyzers, Probabilistic Principal Component Analysis, and Bayesian Networks. We also plan to extend our technique to Markov Chain Monte Carlo strategies such as Gibbs sampling for the estimation of mixture models.

References

[1] C. L. Blake and C. J. Merz. UCI repository of machine learning databases. University of California, Irvine, Dept. of Information and Computer Sciences.
[2] C. Carson, S. Belongie, H. Greenspan, and J. Malik. Blobworld: Image segmentation using expectation-maximization and its application to image querying. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(8), 2002.
[3] H. D. Chiang and C. C. Chu. A systematic search method for obtaining multiple local optimal solutions of nonlinear programming problems. IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, 43(2):99-109, 1996.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1-38, 1977.
[5] G. Elidan, M. Ninio, N. Friedman, and D. Schuurmans. Data perturbation for escaping local maxima in learning. In Proceedings of the Eighteenth National Conference on Artificial Intelligence, 2002.
[6] M. Figueiredo and A. K. Jain. Unsupervised learning of finite mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 2002.
[7] J. Lee and H. D. Chiang. A dynamical trajectory-based methodology for systematically computing multiple optimal solutions of general nonlinear programming problems. IEEE Transactions on Automatic Control, 49(6), 2004.
[8] J. Q. Li. Estimation of Mixture Models. PhD thesis, Department of Statistics, Yale University.
[9] G. McLachlan and T. Krishnan. The EM Algorithm and Extensions. John Wiley and Sons, New York, 1997.
[10] G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York, 1988.
[11] R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse and other variants. In M. I. Jordan, editor, Learning in Graphical Models. Kluwer Academic Publishers, 1998.
[12] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2-3), 2000.
[13] F. Pernkopf and D. Bouchaffra. Genetic-based EM algorithm for learning Gaussian mixture models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 2005.
[14] C. K. Reddy and H. D. Chiang. A stability boundary based method for finding saddle points on potential energy surfaces. Journal of Computational Biology, 13(3), 2006.
[15] S. J. Roberts, D. Husmeier, I. Rezek, and W. Penny. Bayesian approaches to Gaussian mixture modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(11), 1998.
[16] N. Ueda and R. Nakano. Deterministic annealing EM algorithm. Neural Networks, 11(2), 1998.
[17] N. Ueda, R. Nakano, Z. Ghahramani, and G. E. Hinton. SMEM algorithm for mixture models. Neural Computation, 12(9), 2000.
[18] J. J. Verbeek, N. Vlassis, and B. Krose. Efficient greedy learning of Gaussian mixture models. Neural Computation, 15(2), 2003.
[19] L. Xu and M. I. Jordan. On convergence properties of the EM algorithm for Gaussian mixtures. Neural Computation, 8(1), 1996.
[20] B. Zhang, C. Zhang, and X. Yi. Competitive EM algorithm for finite mixture models. Pattern Recognition, 37(1), 2004.


Fitting & Matching. Lecture 4 Prof. Bregler. Slides from: S. Lazebnik, S. Seitz, M. Pollefeys, A. Effros. Fttng & Matchng Lecture 4 Prof. Bregler Sldes from: S. Lazebnk, S. Setz, M. Pollefeys, A. Effros. How do we buld panorama? We need to match (algn) mages Matchng wth Features Detect feature ponts n both

More information

Investigating the Performance of Naïve- Bayes Classifiers and K- Nearest Neighbor Classifiers

Investigating the Performance of Naïve- Bayes Classifiers and K- Nearest Neighbor Classifiers Journal of Convergence Informaton Technology Volume 5, Number 2, Aprl 2010 Investgatng the Performance of Naïve- Bayes Classfers and K- Nearest Neghbor Classfers Mohammed J. Islam *, Q. M. Jonathan Wu,

More information

Topology Design using LS-TaSC Version 2 and LS-DYNA

Topology Design using LS-TaSC Version 2 and LS-DYNA Topology Desgn usng LS-TaSC Verson 2 and LS-DYNA Wllem Roux Lvermore Software Technology Corporaton, Lvermore, CA, USA Abstract Ths paper gves an overvew of LS-TaSC verson 2, a topology optmzaton tool

More information

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces

Range images. Range image registration. Examples of sampling patterns. Range images and range surfaces Range mages For many structured lght scanners, the range data forms a hghly regular pattern known as a range mage. he samplng pattern s determned by the specfc scanner. Range mage regstraton 1 Examples

More information

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data

A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data A Fast Content-Based Multmeda Retreval Technque Usng Compressed Data Borko Furht and Pornvt Saksobhavvat NSF Multmeda Laboratory Florda Atlantc Unversty, Boca Raton, Florda 3343 ABSTRACT In ths paper,

More information

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE

SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE SHAPE RECOGNITION METHOD BASED ON THE k-nearest NEIGHBOR RULE Dorna Purcaru Faculty of Automaton, Computers and Electroncs Unersty of Craoa 13 Al. I. Cuza Street, Craoa RO-1100 ROMANIA E-mal: dpurcaru@electroncs.uc.ro

More information