IDIAP RESEARCH REPORT

EXTENDED BIC CRITERION FOR MODEL SELECTION

Itshak Lapidot    Andrew Morris

IDIAP-RR-0-4

Dalle Molle Institute for Perceptual Artificial Intelligence, P.O. Box 592, Martigny, Valais, Switzerland
phone +41 27 721 77 11, fax +41 27 721 77 12, e-mail secretariat@idiap.ch, internet http://www.idiap.ch
IDIAP Research Report 0-4

EXTENDED BIC CRITERION FOR MODEL SELECTION

Itshak Lapidot    Andrew Morris

OCTOBER 2002

Abstract. Model selection is commonly based on some variation of the BIC or minimum message length criteria, such as MML and MDL. In either case the criterion is split into two terms: one for the model (data code length/model complexity) and one for the data given the model (message length/data likelihood). For problems such as change detection, unsupervised segmentation or data clustering, it is common practice for the model term to comprise only a sum of sub-model terms. In this paper it is shown that the full model complexity must also take into account the number of sub-models and the labels which assign data to each sub-model. From this analysis we derive an extended BIC approach (EBIC) for this class of problem. Results with artificial data are given to illustrate the properties of this procedure.
1. Introduction

In model selection by minimum two-part message length encoding, a penalty term is added to the data message length term to account for model complexity. Change detection, segmentation and clustering are unsupervised applications which can apply the BIC, MDL or MML criteria for model selection [1, 2, 3]. In change detection or segmentation it is required to identify change points in a data sequence, at which the data should be separated and assigned to different models. In clustering, unsequenced data must similarly be assigned to some unspecified number of one or more different models. With the BIC model it is assumed that only the number of model parameters needs to be minimized, not the model code length. This model has been found to be successful both in segmentation [2] and in clustering [1, 3]. This method, however, usually requires some empirical adjustments, and does not usually take into account the number of clusters, but only the number of parameters in the model for each data cluster. In clustering under a minimum duration constraint, by Ajmera et al. [4], the number of model parameters was constant although the number of clusters varied between one and 30, i.e., no penalty was used according to the standard BIC. All these criteria were developed to estimate a model out of a known parametric model class. In applications like clustering and change detection it is required to estimate more than one model from the model class. In this paper it is shown that extra terms for both the number of clusters and the labels which assign the data to each cluster must be added to the usual model code length for optimal model selection. The principle of two-part minimum message length model selection is briefly presented in Section 2. The proposed extension to the model code length is explained in Section 3. Section 4 presents details of the proposed extension to the BIC message length approximation. Section 5 presents some experiments with artificial data, followed by a discussion and conclusion in Section 6.
2. Two-Part Message Length

Minimum message length model selection is based on the principle that the model which best fits the unseen distribution underlying a given set of model training data is the simplest model which is able to fit the training data to some given level of accuracy. It is very closely related to Bayesian model selection, which selects the model with the maximum posterior probability for the given training data. Model selection uses either a one-part or a two-part message length. A one-part message length is used when the model (defined by a parameter vector Θ, which belongs to a known model class) is fixed and known to both coder and decoder. In this case the coder only has to send code for the data given the model:

    MessLen = code_length(X | Θ)

A two-part message length is used when the model parameters Θ are not known to the decoder, so that the coder must estimate and send code for both the model parameters and the data:

    MessLen = code_length(Θ) + code_length(X | Θ)

[5, 6]. Like Bayesian model selection [7], minimum message length model selection arises from information theory as an optimal procedure for model selection. In both cases the model code length (model complexity), as well as the data code length (data likelihood), must normally be taken into account. In fact MDL [5] and BIC [7] converge to the same formula if the data message length term is replaced by the negative log-likelihood and the model parameters are continuous values quantized with a uniform distribution over their range.
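To make the one-part/two-part distinction concrete, the following sketch (not from the report; a minimal illustration assuming a single Gaussian model and the BIC-style cost of 0.5 log2(N) bits per continuous parameter) compares a one-part message for a fixed, known model with a two-part message for a fitted model:

```python
import numpy as np

def two_part_message_length(x, mu, sigma, n_params):
    """Two-part message length in bits: code for the model plus code
    for the data given the model.  Each of the n_params continuous
    parameters is charged 0.5*log2(N) bits (BIC-style approximation);
    n_params = 0 reduces to a one-part message."""
    n = len(x)
    # data code length = negative log2-likelihood under N(mu, sigma^2)
    log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                     - (x - mu)**2 / (2 * sigma**2))
    data_bits = -log_lik / np.log(2)
    model_bits = 0.5 * n_params * np.log2(n)
    return model_bits + data_bits

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=200)
# one-part message: model N(0, 1) is fixed and known to the decoder
one_part = two_part_message_length(x, 0.0, 1.0, n_params=0)
# two-part message: the estimated mean and std must also be encoded
mu_hat, sigma_hat = x.mean(), x.std()
two_part = two_part_message_length(x, mu_hat, sigma_hat, n_params=2)
print(one_part, two_part)
```

The fitted model shortens the data part of the message but pays log2(N) extra bits for its two parameters, which is exactly the trade-off the two-part criterion arbitrates.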
3. Extended Model Code Length

In MDL and BIC model selection, model complexity is approximated as a simple function of the number of model parameters, M. It is easily shown that in clustering, with a cluster model Θ = {θ_k}_{k=1}^{K} (one mixture model pdf for each of the K clusters) and with a fixed combined number of mixture components M = Σ_{k=1}^{K} M_k, a greater K will always result in a greater likelihood p(X | Θ), and hence a smaller data code length. At the same time, the number of model parameters does not change with K. Hence, if model complexity is measured by M alone, the minimum code length clustering will always use as many clusters as possible, which is absurd. It follows that BIC model selection is not sufficient for data clustering unless some extension to the model structure code length (prior probability) is taken into account, as some function of K, in such a way that, when M is constant, a larger number of clusters results in a higher model complexity. One can argue that the full definition of a cluster model requires that the parameter vector Θ be augmented by adding a parameter K to specify the number of clusters, and a set of data labels L = {L_n}_{n=1}^{N} (N is the size of the data and is assumed to be known). To analyze this extended model we should consider two cases. In the first case the data can be rearranged into blocks in the same arbitrary order as the data clusters, but the order of the data within each block is not significant. This would apply, for example, if we wanted to code a number of images divided into themed groups. In this case we only need to send the number of data points in each block, N = {N_k}_{k=1}^{K}, instead of the labels L. As the total number of data points is known, the size of the last block does not need to be included. If we can assume that the probability distributions for K and N_k (possibly uniform) are known to both coder and decoder, then we must add the following extra terms to the model code length:

    code_length(K) + Σ_{k=1}^{K-1} code_length(N_k)
Both terms code_length(K) and code_length(N_k) must be non-redundant prefix codes that satisfy the Kraft inequality, message_length(s) ≥ -log2 P(s), where s is an element of a set S representing either K or N_k. Therefore, if we allow the message length to be fractional, this quantity is given in terms of log-probabilities as:

    -log(P(K)) - Σ_{k=1}^{K-1} log(P(N_k))    (1)

In the second case the order of the data is important. This case is outside the scope of this report and will not be discussed. We only mention that the simplest solution might be to send all the labels instead of the block lengths, so that instead of the term -Σ_{k=1}^{K-1} log(P(N_k)) we would have the term -Σ_{n=1}^{N} log(P(L_n)).
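The extra code length of equation (1) is straightforward to compute under the uniform-prior assumption. The sketch below is illustrative only (the value of k_max and the block sizes are made-up): it shows how describing more blocks lengthens the model code, which is the penalty that discourages over-clustering:

```python
import math

def extra_code_length_bits(block_sizes, k_max, n_total):
    """Extra model code length of equation (1) in bits, assuming
    uniform priors: P(K) = 1/k_max over the allowed numbers of
    clusters and P(N_k) = 1/n_total over the possible block sizes.
    The last block size is implied by n_total, so only K-1 block
    sizes are encoded."""
    k = len(block_sizes)
    assert sum(block_sizes) == n_total
    bits_for_k = -math.log2(1.0 / k_max)
    bits_for_blocks = (k - 1) * -math.log2(1.0 / n_total)
    return bits_for_k + bits_for_blocks

# e.g. 128 points split into 3 clusters, with at most 10 clusters allowed
print(extra_code_length_bits([60, 40, 28], k_max=10, n_total=128))
# the same data split into 5 clusters costs strictly more bits
print(extra_code_length_bits([30, 30, 30, 20, 18], k_max=10, n_total=128))
```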
4. Extended BIC (EBIC) for Multi-Cluster Applications

Given two clustering models based on the same model class, with parameter vectors Θ1 and Θ2 and numbers of parameters M1 and M2, the BIC criterion for determining which clustering model is better, given in terms of the log-likelihoods l(X | Θi), is as follows:

    l(X | Θ1) - l(X | Θ2)  ≷  ((M1 - M2) / 2) log N    (2)

The chosen model is the one with the higher value (model 1 if the left-hand side is greater, model 2 otherwise). The term on the right-hand side is the complexity penalty, determined by the difference between the numbers of parameters in each model and the length of the input data, N. In applying this criterion in clustering applications [2, 3] it has been found necessary to retrospectively introduce an arbitrary, empirically found, positive scaling factor, λ, for the BIC model complexity term. We now show how equation (2) should be extended to take into account the change in the cluster model complexity term given in equation (1). Let us assume that we have two models estimated from a parametric model class of all mixture models of a given distribution family, such as all possible Gaussian mixture models. The model with parameter vector Θi has Ki clusters and Mi = Σ_{k=1}^{Ki} M_{i,k} mixture components (M_{i,k} is the number of mixture components in cluster k). First consider the case where M1 = M2 = M. To understand how equation (2) must change, it is sufficient to find the values of M1 and M2. For simplicity we may assume that the number of parameters of each mixture component is fixed at R. A description of the model according to the standard BIC or MDL then requires the following numbers of parameters: M·R parameters for all the mixture components in all the clusters, and M parameters for the priors of the mixture components in each cluster, {P(m | k), m = 1, ..., M_{i,k}}_{k=1}^{Ki}. This gives Mi = M(R + 1), which is independent of Ki, and the decision is taken only according to the maximum of the likelihoods of the models.
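Equation (2) amounts to a simple threshold test on the log-likelihood difference. A minimal sketch (the log-likelihood and parameter-count values below are hypothetical, chosen only to exercise both outcomes):

```python
import math

def bic_compare(loglik1, m1, loglik2, m2, n):
    """Standard BIC model comparison (equation (2)): returns 1 if
    model 1 is preferred and 2 otherwise.  m1 and m2 are the numbers
    of free parameters; n is the number of data points."""
    threshold = 0.5 * (m1 - m2) * math.log(n)
    return 1 if (loglik1 - loglik2) > threshold else 2

# model 1 fits better but uses 4 extra parameters; with n = 128 a
# log-likelihood gain of 5 nats does not cover the 2*log(128) penalty
print(bic_compare(-300.0, 10, -305.0, 6, 128))
# a gain of 25 nats does cover it, so model 1 wins
print(bic_compare(-280.0, 10, -305.0, 6, 128))
```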
The parameters for the number of clusters, K, and for the block lengths, N_k, are integer valued. They are therefore different in nature from the parameters Θ, which are assumed to be continuous valued, and can be analyzed separately in terms of the probabilities associated with each of these integers. If we write the BIC criterion including the terms for K and N_k, we obtain the following:

    l(X | Θ1) - l(X | Θ2)  ≷  ((M1 - M2) / 2) log N + log( P(K2) / P(K1) ) + log( Π_{k=1}^{K2-1} P(N_{2,k}) / Π_{k=1}^{K1-1} P(N_{1,k}) )    (3)

In many cases it is reasonable to assume that both K and N_k are uniformly distributed over a finite range, P(K) = 1 / (K_max - K_min + 1) and P(N_{i,k}) = P_BL. In this case equation (3) becomes the EBIC criterion:
    l(X | Θ1) - l(X | Θ2)  ≷  ((M1 - M2) / 2) log N + (K2 - K1) log(P_BL)    (4)

If each segment can be of any length in the interval [1, N] (the case of maximum uncertainty), then P_BL = 1/N and the most simplified version of EBIC will be:

    l(X | Θ1) - l(X | Θ2)  ≷  ((M1 - M2) / 2 + K1 - K2) log N    (5)

If M1 = M2, then instead of equation (5) the EBIC criterion becomes:

    l(X | Θ1) - l(X | Θ2)  ≷  ε (K1 - K2) log N,  for some ε in (0, 1]    (6)

Here (K1 - K2) log N is the maximal penalty, and the actual penalty is therefore scaled by ε. As in BIC, the model penalty is a logarithmic function of N, while the data log-likelihood is proportional to N. The model penalty term is therefore more significant for small N and will have a negligible effect when N is large.

5. Experiments

Two experiments were conducted to illustrate the effect of EBIC model selection. In the first experiment data were generated from two Gaussians with the same standard deviation, σ = 1, and with expectations μ_{1,t} = 0.08t (for a range of t values) and μ_{2,t} = -μ_{1,t}. Two sets were generated, with N = 32 and N = 128 points. Each Gaussian generated half of the data. Tests were made under different constraints on the segment length, i.e., it was assumed that several data points were successively generated from the same source and should be kept together. The segment lengths were 1, 2, 4 and 8. It should be mentioned that the higher the segment length, the less optimal a clustering solution, in terms of log-likelihood, can be achieved. On the other hand, more meaningful clusters may be produced. We compare one cluster with two Gaussian mixture components against two clusters with one mixture component each. According to standard BIC no penalty should be used. Figure 1 shows the result of BIC (if the BIC value is less than zero then one cluster is better, otherwise two clusters are better). As can be seen, a two-cluster model was always better. The black line is the EBIC penalty value for ε = 1. It can be seen that there are no big differences between BIC and EBIC except when the ambiguity is high, i.e.,
when there is a small amount of data, N = 32, the Gaussians are close to each other, μ < σ, and there is a large duration constraint, s = 8. This indicates that when two clusters are similar, EBIC tends to prefer one more accurate cluster with more mixture components.
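The comparison in this experiment can be sketched with the M1 = M2 case of equation (6): two clusters with one Gaussian each (K1 = 2) against one cluster whose pdf is an equal-weight two-component mixture (K2 = 1). This is only an illustration, not the report's experimental code; for simplicity the component parameters are taken as known rather than fitted by EM:

```python
import math
import numpy as np

def log_gauss(x, mu, sigma):
    # element-wise log N(x; mu, sigma^2)
    return -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)

def ebic_prefers_two_clusters(x1, x2, mu1, mu2, sigma, eps=1.0):
    """EBIC decision of equation (6) for equal total numbers of mixture
    components (M1 = M2): prefer the two-cluster model (K1 = 2) over the
    one-cluster mixture model (K2 = 1) only if the log-likelihood gain
    exceeds eps * (K1 - K2) * log(N)."""
    n = len(x1) + len(x2)
    # two clusters: each block is coded by its own Gaussian
    ll_two = log_gauss(x1, mu1, sigma).sum() + log_gauss(x2, mu2, sigma).sum()
    # one cluster: all data coded by an equal-weight two-component mixture
    x = np.concatenate([x1, x2])
    ll_one = np.logaddexp(np.log(0.5) + log_gauss(x, mu1, sigma),
                          np.log(0.5) + log_gauss(x, mu2, sigma)).sum()
    return bool((ll_two - ll_one) > eps * (2 - 1) * math.log(n))

rng = np.random.default_rng(1)
sigma = 1.0
# well-separated sources: the likelihood gain easily covers the penalty
print(ebic_prefers_two_clusters(rng.normal(-3, sigma, 64),
                                rng.normal(+3, sigma, 64), -3, 3, sigma))
# nearly identical sources (mu << sigma): EBIC tends to keep one cluster,
# whereas unpenalized BIC would always pick two
print(ebic_prefers_two_clusters(rng.normal(-0.05, sigma, 64),
                                rng.normal(+0.05, sigma, 64), -0.05, 0.05, sigma))
```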
Fig. 1: Difference between BIC and EBIC for different expectation values, segment lengths s, and amounts of data (a: N = 32, b: N = 128).

In the second experiment μ = 0 for both Gaussians, σ1 = 1, σ2 = 0.7, 0.8, 0.9, 1.0, and all the other parameters are as in the first experiment. The results are presented in Figure 2. We can see that a small μ in the first experiment, and σ close to one in the second experiment, lead the two data sets to have similar statistical properties. So, for small N and large segment length the resulting clusters become similar. While with BIC the decision will be that the two-cluster model is better, EBIC will prefer a one-cluster model. As mentioned, the scale factor ε in all the experiments was equal to one. If the scale factor were smaller, the system would be more biased towards the two-cluster model. This parameter can be found empirically (in the same way as the scale factor λ that is used with the BIC criterion), or calculated according to some prior knowledge of the block length distribution.
Fig. 2: Difference between BIC and EBIC for different standard deviation values, segment lengths s, and amounts of data (a: N = 32, b: N = 128).

6. Discussion

It was shown that the clustering model complexity is not only a function of the number of parameters and their values in the parameter vector Θ, but also of the number of clusters, K, and of the information about the labeling of each data vector, {L_n}_{n=1}^{N}. The labels need not be coded in a direct way, but in a compact way which is just sufficient to permute the data into the blocks as required (in order to minimize the number of parameters to be sent). The code length of this extra information will increase with K. It was shown that when there is a small amount of data, or some ambiguity due to the compact nature of the data or clustering constraints, the importance of the additional penalty terms increases.

Acknowledgment

The authors want to thank the Swiss Federal Office for Education and Science (OFES), in the framework of the EC/OFES Multimodal Meeting Manager (M4) project, and the Swiss National Science Foundation, through the National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (IM2), for supporting this work.
References

[1] J. J. Oliver, R. A. Baxter, and C. S. Wallace, "Unsupervised learning using MML", Proc. 13th Int. Conf. on Machine Learning, 1996.

[2] M. Cettolo, "Segmentation, classification and clustering of an Italian broadcast news corpus", Proc. 6th RIAO Conf., April 2000, pp. 37-38.

[3] S. S. Chen and P. S. Gopalakrishnan, "Clustering via the Bayesian information criterion with applications in speech recognition", Proc. ICASSP '98, vol. 2, 1998, pp. 645-648.

[4] J. Ajmera, H. Bourlard, I. Lapidot, and I. McCowan, "Unknown-multiple speaker clustering using HMM", Proc. ICSLP 2002, USA, 2002, pp. 573-576.

[5] J. Rissanen, "A universal prior for integers and estimation by minimum description length", The Annals of Statistics, vol. 11, no. 2, pp. 416-431, 1983.

[6] J. Oliver and R. Baxter, "MML and Bayesianism: similarities and differences (introduction to minimum encoding inference, Part II)", Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, Tech. Rep. TR-206, December 1994.

[7] G. Schwarz, "Estimating the dimension of a model", The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.