EXTENDED BIC CRITERION FOR MODEL SELECTION

IDIAP RESEARCH REPORT

EXTENDED BIC CRITERION FOR MODEL SELECTION

Itshak Lapidot, Andrew Morris

IDIAP-RR-0-4

Dalle Molle Institute for Perceptual Artificial Intelligence, P.O. Box 592, Martigny, Valais, Switzerland
phone +41 27 721 77 11, fax +41 27 721 77 12, e-mail secretariat@idiap.ch, internet http://www.idiap.ch

IDIAP Research Report 0-4

EXTENDED BIC CRITERION FOR MODEL SELECTION

Itshak Lapidot, Andrew Morris

OCTOBER 2002

Abstract. Model selection is commonly based on some variation of the BIC or minimum message length criteria, such as MML and MDL. In either case the criterion is split into two terms: one for the model (model code length/model complexity) and one for the data given the model (message length/data likelihood). For problems such as change detection, unsupervised segmentation or data clustering it is common practice for the model term to comprise only a sum of sub-model terms. In this paper it is shown that the full model complexity must also take into account the number of sub-models and the labels which assign data to each sub-model. From this analysis we derive an extended BIC approach (EBIC) for this class of problem. Results with artificial data are given to illustrate the properties of this procedure.

1. Introduction

In model selection by minimum two-part message length encoding, a penalty term is added to the data message length term to account for model complexity. Change detection, segmentation and clustering are unsupervised applications which can apply the BIC, MDL or MML criteria for model selection [1, 2, 3]. In change detection or segmentation it is required to identify the change points in a data sequence at which the data should be separated and assigned to different models. In clustering, unsequenced data must similarly be assigned to some unspecified number of one or more different models. With the BIC model it is assumed that only the number of model parameters needs to be minimized, not the model code length. This approach has been found to be successful both in segmentation [2] and in clustering [1, 3]. This method, however, usually requires some empirical adjustments, and does not usually take into account the number of clusters, but only the number of parameters in the model for each data cluster. In clustering under a minimum duration constraint, by Ajmera et al. [4], the number of model parameters was constant although the number of clusters varied between one and 30, i.e., no penalty was applied according to the standard BIC. All these criteria were developed to estimate a single model out of a known parametric model class, whereas in applications like clustering and change detection it is required to estimate more than one model from the model class. In this paper it is shown that extra terms for both the number of clusters and the labels which assign the data to each cluster must be added to the usual model code length for optimal model selection.

The principle of two-part minimum message length model selection is briefly presented in Section 2. The proposed extension to the model code length is explained in Section 3. Section 4 presents details of the proposed extension to the BIC message length approximation. Section 5 presents some experiments with artificial data, followed by a discussion and conclusion in Section 6.

2. Two-Part Message Length

Minimum message length model selection is based on the principle that the model which best fits the unseen distribution underlying a given set of model training data is the simplest model which is able to fit the training data to some given level of accuracy. It is very closely related to Bayesian model selection, which selects the model with the maximum posterior probability for the given training data. Model selection uses either a one-part or a two-part message length. A one-part message length is used when the model (defined by a parameter vector $\Theta$ which belongs to a known model class) is fixed and known to both coder and decoder. In this case the coder only has to send code for the data given the model, $\mathrm{MessLen} = \mathrm{code\_length}(X \mid \Theta)$. A two-part message length is used when the model parameters $\Theta$ are not known to the decoder, so that the coder must estimate $\Theta$ and send code for both the model parameters and the data, $\mathrm{MessLen} = \mathrm{code\_length}(\Theta) + \mathrm{code\_length}(X \mid \Theta)$ [5, 6]. Like Bayesian model selection [7], minimum message length model selection arises from information theory as an optimal procedure for model selection. In both cases the model code length (model complexity), as well as the data code length (data likelihood), must normally be taken into account. In fact MDL [5] and BIC [7] converge to the same formula if we replace the data message length term by the log-likelihood and the model parameters are continuous values quantized with a uniform distribution over their range.
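To make the two-part scheme concrete, the following minimal sketch (our illustration, not code from the report) computes a two-part message length in bits for a single-Gaussian model, approximating the model term in BIC style as $(N_\Theta / 2)\log_2 N$:

import numpy as np

def gaussian_two_part_message_length(x):
    """Two-part message length (in bits) for a single-Gaussian model of x,
    with the model term approximated as 0.5 * N_theta * log2(N)."""
    n = len(x)
    mu, sigma = np.mean(x), np.std(x)
    # Data term: negative log2-likelihood of the data given the fitted model.
    log_lik = np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                     - (x - mu)**2 / (2 * sigma**2))
    data_bits = -log_lik / np.log(2)
    # Model term: two free continuous parameters (mu, sigma).
    model_bits = 0.5 * 2 * np.log2(n)
    return model_bits + data_bits

rng = np.random.default_rng(0)
print(gaussian_two_part_message_length(rng.normal(0.0, 1.0, 128)))

Comparing this quantity across candidate models is the selection rule that BIC approximates.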

3. Extended Model Code Length

In MDL and BIC model selection, model complexity is approximated as a simple function of the number of model parameters, $N_\Theta$. It is easily shown that in clustering with a $K$ cluster model $\Theta = \{\theta_k\}_{k=1}^{K}$ (with parameter vector $\theta_k$ for cluster $k$), with a mixture model pdf for each of the $K$ clusters, and with a fixed combined number of mixture components, a greater $K$ will always result in a greater likelihood $p(X \mid \Theta)$, and hence a smaller data code length. At the same time, the number of model parameters does not change with $K$. Hence, if model complexity is measured by $N_\Theta$ alone, the minimum code length clustering will always use as many clusters as possible, which is absurd. It follows that BIC model selection is not sufficient for data clustering unless some extension to the model structure code length (prior probability) is taken into account, as some function of $K$, in such a way that, when $N_\Theta$ is constant, a larger number of clusters results in a higher model complexity.

One can argue that the full definition of a $K$ cluster model requires that the parameter vector $\Theta$ be augmented by adding a parameter $K$ to specify the number of clusters, and a set of data labels $L = \{L_n\}_{n=1}^{N}$ ($N$ is the size of the data and assumed to be known). To analyze this extended model we should consider two cases. In the first case the data can be rearranged into blocks in the same arbitrary order as the data clusters, but the order of the data within each block is not significant. This would apply, for example, if we want to code a number of images divided into themed groups. In this case we only need to send the number of data points in each block, $B = \{B_k\}_{k=1}^{K}$, instead of the labels $L$. As the total number of data points is known, the size of the last block does not need to be included. If we can assume that the probability distributions for $K$ and $B$ (possibly uniform) are known to both coder and decoder, then we must add the extra terms $\mathrm{code\_length}(K) + \sum_{k=1}^{K-1} \mathrm{code\_length}(B_k)$ to the model code length. Both $\mathrm{code\_length}(K)$ and $\mathrm{code\_length}(B_k)$ must be non-redundant prefix codes that satisfy the Kraft inequality $\sum_{s \in S} 2^{-\mathrm{message\_length}(s)} \le 1$ (where $s$ is an element of a set $S$ representing either $K$ or $B_k$). Therefore, if we allow the message length to be fractional, this extra code length is given in terms of log-probabilities as:

$-\log_2 P(K) - \sum_{k=1}^{K-1} \log_2 P(B_k)$    (1)

In the second case the order of the data is important. This case is outside the scope of this report and will not be discussed. We only mention that the simplest solution might be to send all the labels instead of the block lengths, and then instead of the term $-\sum_{k=1}^{K-1} \log_2 P(B_k)$ there should appear the term $-\sum_{n=1}^{N} \log_2 P(L_n)$.
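A minimal sketch of the extra structure cost of equation (1), assuming uniform priors over both the number of clusters and the block lengths (the ranges k_max and n_total are our own illustrative assumptions, not values from the report):

import math

def extra_structure_bits(block_sizes, n_total, k_max):
    """Extra model code length of equation (1), in bits, under uniform priors.
    The last block size is implied by n_total and is therefore not coded."""
    k = len(block_sizes)
    bits_k = math.log2(k_max)                    # -log2 P(K), with P(K) = 1/k_max
    bits_blocks = (k - 1) * math.log2(n_total)   # -(K-1) * log2 P(B_k), P(B_k) = 1/n_total
    return bits_k + bits_blocks

# Example: 128 points split into two blocks; only the first block size is coded.
print(extra_structure_bits([64, 64], n_total=128, k_max=10))

Each additional cluster adds a further $\log_2 N$ bits here, which is precisely the structure cost that the plain parameter-count penalty of BIC omits.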

4. Extended BIC (EBIC) for Multi-Cluster Applications

Given two clustering models based on the same model class, with parameter vectors $\Theta_1$ and $\Theta_2$ and numbers of parameters $N_{\Theta_1}$ and $N_{\Theta_2}$, the BIC criterion for choosing which clustering model is better is given in terms of the log-likelihood $l(X \mid \Theta)$ as follows:

$l(X \mid \Theta_1) - l(X \mid \Theta_2) \gtrless \frac{1}{2}\left(N_{\Theta_1} - N_{\Theta_2}\right)\log N$    (2)

The chosen model is the one with the higher value. The term on the right side is the complexity penalty, according to the difference between the numbers of parameters in each model and the length of the input data, $N$. In applying this criterion in clustering applications [2, 3] it has been found necessary to retrospectively introduce an arbitrary, empirically found, positive scaling factor, $\lambda$, for the BIC model complexity term. We now show how equation (2) should be extended to take into account the change in cluster model complexity given in equation (1).

Let us assume that we have two estimated models from a parametric model class of all mixture models of a given distribution family, such as all possible Gaussian mixture models. The model with parameter vector $\Theta_i$ has $K_i$ clusters and $M_i = \sum_{k=1}^{K_i} M_{k,i}$ mixture components ($M_{k,i}$ is the number of mixture components in cluster $k$). First consider the case where $M_1 = M_2$. To understand how equation (2) must change it will be sufficient to find the values of $N_{\Theta_1}$ and $N_{\Theta_2}$. For simplicity we may assume that the number of parameters of each mixture component is fixed at $R$. A description of the model according to the standard BIC or MDL requires the following numbers of parameters:

- $R M_i$ parameters for all the mixture components in all the clusters.
- The priors of the mixture components in each cluster, $\{P_m^k, m = 1, \ldots, M_{k,i}\}_{k=1}^{K_i}$: $M_i$ parameters.

This gives $N_{\Theta_i} = (R + 1) M_i$, which is independent of $K_i$, so the decision is taken only according to the maximum of the likelihoods of the models. The number of clusters, $K_i$, and the block lengths, $B_{k,i}$, are integer valued; they have to be treated differently from the parameters $\Theta$, which are assumed to be continuous, and can be analyzed separately in terms of the probabilities associated with each of these integers. If we write the BIC criterion including terms for $K_i$ and $B_{k,i}$, then we get the following:

$l(X \mid \Theta_1) - l(X \mid \Theta_2) \gtrless \frac{1}{2}\left(N_{\Theta_1} - N_{\Theta_2}\right)\log N + \log \frac{P(K_2)}{P(K_1)} + \log \frac{\prod_{k=1}^{K_2 - 1} P(B_{k,2})}{\prod_{k=1}^{K_1 - 1} P(B_{k,1})}$    (3)

In many cases it is reasonable to assume that both $K_i$ and $B_{k,i}$ are uniformly distributed over finite ranges, $P(K_i) = \frac{1}{K_{\max} - K_{\min} + 1} = P_K$ and $P(B_{k,i}) = P_{BL}$. In this case equation (3) becomes the EBIC criterion:

$l(X \mid \Theta_1) - l(X \mid \Theta_2) \gtrless \frac{1}{2}\left(N_{\Theta_1} - N_{\Theta_2}\right)\log N + (K_2 - K_1)\log P_{BL}$    (4)

If each segment can be any length in the interval $[1, N]$ (the case of maximum uncertainty), then $P_{BL} = 1/N$ and the most simplified version of EBIC will be:

$l(X \mid \Theta_1) - l(X \mid \Theta_2) \gtrless \frac{1}{2}\left(N_{\Theta_1} - N_{\Theta_2}\right)\log N + (K_1 - K_2)\log N$    (5)

If $N_{\Theta_1} = N_{\Theta_2}$, then instead of equation (5) the EBIC will be:

$l(X \mid \Theta_1) - l(X \mid \Theta_2) \gtrless \varepsilon (K_1 - K_2)\log N$ for some $\varepsilon \in (0, 1]$    (6)

Here $(K_1 - K_2)\log N$ is the maximal penalty; the actual penalty is scaled by $\varepsilon$ as $\varepsilon (K_1 - K_2)\log N$. As in BIC, the model penalty is a logarithmic function of $N$, while the data log-likelihood is proportional to $N$. The model penalty term is therefore more significant for small $N$ and will have a negligible effect when $N$ is large.

5. Experiments

Two experiments were conducted to illustrate the effect of EBIC model selection. In the first experiment data was generated from two Gaussians with the same standard deviations, $\sigma_1 = \sigma_2 = 1$, and with expectations $\mu_{1,t} = 0.08t$ and $\mu_{2,t} = -\mu_{1,t}$, so that the separation between the means was varied through $t$. Two sets were generated: with $N = 32$ and $N = 128$ points. Each Gaussian generated half of the data. Tests were made under different constraints on the segment length, i.e., it was assumed that several data points were successively generated from the same source and should be kept together. The segment lengths were $s = 1, 2, 4$ and $8$. It should be mentioned that the higher the segment length, the less optimal a clustering solution can be achieved in terms of log-likelihood; on the other hand, more meaningful clusters may be produced. We compare one cluster with two Gaussian mixture components against two clusters with one mixture component each. According to the standard BIC no penalty should be used, as both models have the same number of parameters. Figure 1 shows the result of BIC (if the BIC values are less than zero then one cluster is better, otherwise two clusters are better). As can be seen, a two-cluster model was always better. The black line is the EBIC penalty value for $\varepsilon = 1$. It can be seen that there are no big differences between BIC and EBIC except when the ambiguity is high, i.e. when there is a small amount of data ($N = 32$), the Gaussians are close to each other ($\mu < \sigma$), and there is a large duration constraint ($s = 8$). This indicates that when two clusters are similar, EBIC tends to prefer one more accurate cluster with more mixture components.
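The comparison in this experiment can be sketched as follows (our illustration: scikit-learn's GaussianMixture is a modern stand-in for the report's models, the constants are illustrative, and the known labels replace an actual clustering step). Under the report's parameter counting both models have the same $N_\Theta$, so standard BIC compares raw log-likelihoods, while EBIC adds the penalty of equation (6):

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
n, mu = 128, 0.5                      # sample size and +/- mean of the two sources
x = np.concatenate([rng.normal(-mu, 1.0, n // 2),
                    rng.normal(+mu, 1.0, n // 2)]).reshape(-1, 1)

# Model 1: one cluster modelled by a two-component Gaussian mixture (K1 = 1).
l1 = GaussianMixture(n_components=2, random_state=0).fit(x).score(x) * n

# Model 2: two clusters with one Gaussian each (K2 = 2), using the known split.
halves = [x[:n // 2], x[n // 2:]]
l2 = sum(GaussianMixture(n_components=1).fit(h).score(h) * len(h) for h in halves)

# EBIC, equation (6): l1 - l2 is compared against eps * (K1 - K2) * log(N);
# standard BIC would compare it against zero here.
eps = 1.0
penalty = eps * (1 - 2) * np.log(n)   # negative, favouring the one-cluster model
print("two clusters" if l1 - l2 < penalty else "one cluster")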

Fig. 1: Difference between BIC and EBIC for different expectation values, segment lengths $s$, and amounts of data (a: $N = 32$; b: $N = 128$).

In the second experiment $\mu = 0$ for both Gaussians, $\sigma_1 = 1$, $\sigma_2 = 0.7, 0.8, 0.9, 1.0$, and all the other parameters are as in the first experiment. The results are presented in Figure 2. We can see that a small $\mu$ in the first experiment, and a $\sigma$ close to one in the second experiment, lead the two data sets to have similar statistical properties. So, for a small $N$ and a large segment length the resulting clusters become similar. While under BIC the decision will be that the two-cluster model is better, EBIC will prefer a one-cluster model. As was mentioned, the scale factor $\varepsilon$ in all the experiments was equal to one. If the scale factor were smaller, the system would be more biased towards the two-cluster model. This parameter can be found empirically (in the same way as the scale factor $\lambda$ that is used with the BIC criterion), or calculated according to some prior knowledge of the block length distribution.

Fig. 2: Difference between BIC and EBIC for different standard deviation values, segment lengths $s$, and amounts of data (a: $N = 32$; b: $N = 128$).

6. Discussion

It was shown that the clustering model complexity is not only a function of the number of parameters and their values in the parameter vector $\Theta$, but also of the number of clusters $K$ and the information about the labeling of each data vector, $L = \{L_n\}_{n=1}^{N}$. The labels need not be coded in a direct way, but in a compact way which is just sufficient to permute the data into blocks as required (in order to minimize the number of parameters to be sent). The code length of this extra information will increase with $K$. It was shown that when there is a small amount of data, or some ambiguity due to the compact nature of the data or the clustering constraints, the importance of the additional penalty terms increases.

Acknowledgment

The authors want to thank the Swiss Federal Office for Education and Science (OFES), in the framework of both the EC/OFES MultiModal Meeting Manager (M4) project and the Swiss National Science Foundation through the National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (IM2), for supporting this work.

References

[1] J. J. Oliver, R. A. Baxter, and C. S. Wallace, "Unsupervised learning using MML", Proc. 13th Int. Conf. on Machine Learning, 1996.

[2] M. Cettolo, "Segmentation, classification and clustering of an Italian broadcast news corpus", Proc. 6th RIAO Conf., April 2000.

[3] S. S. Chen and P. S. Gopalakrishnan, "Clustering via the Bayesian information criterion with applications in speech recognition", Proc. ICASSP 98, vol. 2, 1998, pp. 645-648.

[4] J. Ajmera, H. Bourlard, I. Lapidot, and I. McCowan, "Unknown-multiple speaker clustering using HMM", Proc. ICSLP 02, USA, 2002, pp. 573-576.

[5] J. Rissanen, "A universal prior for integers and estimation by minimum description length", The Annals of Statistics, vol. 11, no. 2, pp. 416-431, 1983.

[6] J. Oliver and R. Baxter, "MML and Bayesianism: similarities and differences (introduction to minimum encoding inference, Part II)", Dept. of Computer Science, Monash University, Clayton, Victoria 3168, Australia, Tech. Rep. TR-206, December 1994.

[7] G. Schwarz, "Estimating the dimension of a model", The Annals of Statistics, vol. 6, no. 2, pp. 461-464, 1978.