MIXMOD : a software for model-based classification with continuous and categorical data

Similar documents
Model-Based Cluster and Discriminant

BoxPlot++ Zeina Azmeh, Fady Hamoui, Marianne Huchard. To cite this version: HAL Id: lirmm

Model-based Cluster and Discriminant Analysis with the MIXMOD software

Regularization parameter estimation for non-negative hyperspectral image deconvolution:supplementary material

mixmod User s Guide MIXture MODelling Software

lambda-min Decoding Algorithm of Regular and Irregular LDPC Codes

Setup of epiphytic assistance systems with SEPIA

Fuzzy sensor for the perception of colour

Efficient implementation of interval matrix multiplication

Natural Language Based User Interface for On-Demand Service Composition

Simulations of VANET Scenarios with OPNET and SUMO

How to simulate a volume-controlled flooding with mathematical morphology operators?

Rankcluster: An R package for clustering multivariate partial rankings

Spectral Active Clustering of Remote Sensing Images

KeyGlasses : Semi-transparent keys to optimize text input on virtual keyboard

Partially-supervised learning in Independent Factor Analysis

Real-Time Collision Detection for Dynamic Virtual Environments

SDLS: a Matlab package for solving conic least-squares problems

Study on Feebly Open Set with Respect to an Ideal Topological Spaces

Selection of Scale-Invariant Parts for Object Class Recognition

A Practical Approach for 3D Model Indexing by combining Local and Global Invariants

NP versus PSPACE. Frank Vega. To cite this version: HAL Id: hal

Real-time FEM based control of soft surgical robots

Computing and maximizing the exact reliability of wireless backhaul networks

The Connectivity Order of Links

Branch-and-price algorithms for the Bi-Objective Vehicle Routing Problem with Time Windows

Comparison of spatial indexes

HySCaS: Hybrid Stereoscopic Calibration Software

THE COVERING OF ANCHORED RECTANGLES UP TO FIVE POINTS

Linux: Understanding Process-Level Power Consumption

Kernel perfect and critical kernel imperfect digraphs structure

Moveability and Collision Analysis for Fully-Parallel Manipulators

The optimal routing of augmented cubes.

Traffic Grooming in Bidirectional WDM Ring Networks

The Proportional Colouring Problem: Optimizing Buffers in Radio Mesh Networks

DANCer: Dynamic Attributed Network with Community Structure Generator

QuickRanking: Fast Algorithm For Sorting And Ranking Data

Every 3-connected, essentially 11-connected line graph is hamiltonian

Multimedia CTI Services for Telecommunication Systems

From medical imaging to numerical simulations

Workspace and joint space analysis of the 3-RPS parallel robot

Fault-Tolerant Storage Servers for the Databases of Redundant Web Servers in a Computing Grid

Blind Browsing on Hand-Held Devices: Touching the Web... to Understand it Better

Deformetrica: a software for statistical analysis of anatomical shapes

Robust IP and UDP-lite header recovery for packetized multimedia transmission

Acyclic Coloring of Graphs of Maximum Degree

An FCA Framework for Knowledge Discovery in SPARQL Query Answers

An Efficient Numerical Inverse Scattering Algorithm for Generalized Zakharov-Shabat Equations with Two Potential Functions

A Generic Architecture of CCSDS Low Density Parity Check Decoder for Near-Earth Applications

Continuous Control of Lagrangian Data

Synthesis of fixed-point programs: the case of matrix multiplication

Tacked Link List - An Improved Linked List for Advance Resource Reservation

Sliding HyperLogLog: Estimating cardinality in a data stream

Relabeling nodes according to the structure of the graph

Mokka, main guidelines and future

Quasi-tilings. Dominique Rossin, Daniel Krob, Sebastien Desreux

Representation of Finite Games as Network Congestion Games

Efficient Gradient Method for Locally Optimizing the Periodic/Aperiodic Ambiguity Function

An Object Oriented Finite Element Toolkit

Comparison of radiosity and ray-tracing methods for coupled rooms

RETIN AL: An Active Learning Strategy for Image Category Retrieval

DSM GENERATION FROM STEREOSCOPIC IMAGERY FOR DAMAGE MAPPING, APPLICATION ON THE TOHOKU TSUNAMI

Prototype Selection Methods for On-line HWR

An Experimental Assessment of the 2D Visibility Complex

Clustering using EM and CEM, cluster number selection via the Von Mises-Fisher mixture models

A N-dimensional Stochastic Control Algorithm for Electricity Asset Management on PC cluster and Blue Gene Supercomputer

Primitive roots of bi-periodic infinite pictures

A Voronoi-Based Hybrid Meshing Method

Taking Benefit from the User Density in Large Cities for Delivering SMS

Syrtis: New Perspectives for Semantic Web Adoption

Periodic meshes for the CGAL library

Structuring the First Steps of Requirements Elicitation

X-Kaapi C programming interface

Multi-atlas labeling with population-specific template and non-local patch-based label fusion

FIT IoT-LAB: The Largest IoT Open Experimental Testbed

Learning Object Representations for Visual Object Class Recognition

SLMRACE: A noise-free new RACE implementation with reduced computational time

ASAP.V2 and ASAP.V3: Sequential optimization of an Algorithm Selector and a Scheduler

BugMaps-Granger: A Tool for Causality Analysis between Source Code Metrics and Bugs

[Demo] A webtool for analyzing land-use planning documents

Service Reconfiguration in the DANAH Assistive System

Stream Ciphers: A Practical Solution for Efficient Homomorphic-Ciphertext Compression

Tools and methods for model-based clustering in R

Lossless and Lossy Minimal Redundancy Pyramidal Decomposition for Scalable Image Compression Technique

Assisted Policy Management for SPARQL Endpoints Access Control

SIM-Mee - Mobilizing your social network

Very Tight Coupling between LTE and WiFi: a Practical Analysis

Comparator: A Tool for Quantifying Behavioural Compatibility

XML Document Classification using SVM

YAM++ : A multi-strategy based approach for Ontology matching task

Scale Invariant Detection and Tracking of Elongated Structures

Is GPU the future of Scientific Computing?

Linked data from your pocket: The Android RDFContentProvider

A Methodology for Improving Software Design Lifecycle in Embedded Control Systems

Change Detection System for the Maintenance of Automated Testing

MUTE: A Peer-to-Peer Web-based Real-time Collaborative Editor

MLweb: A toolkit for machine learning on the web

Fuzzy interpolation and level 2 gradual rules

Real-time tracking of multiple persons by Kalman filtering and face pursuit for multimedia applications

A Robust Approach for Multivariate Binary Vectors Clustering and Feature Selection

Transcription:

MIXMOD : a software for model-based classification with continuous and categorical data Christophe Biernacki, Gilles Celeux, Gérard Govaert, Florent Langrognet To cite this version: Christophe Biernacki, Gilles Celeux, Gérard Govaert, Florent Langrognet. MIXMOD : a software for model-based classification with continuous and categorical data. 2008. <hal-00469522> HAL Id: hal-00469522 https://hal.archives-ouvertes.fr/hal-00469522 Submitted on 1 Apr 2010 HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L archive ouverte pluridisciplinaire HAL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

MIXMOD: a software for model-based classification with continuous and categorical data C. Biernacki, G. Celeux, G. Govaert and F. Langrognet Abstract The MIXMOD (MIXture MODeling) program fits mixture models to a given data set for the purposes of density estimation, clustering or discriminant analysis. A large variety of algorithms to estimate the mixture parameters are proposed (EM, Classification EM, Stochastic EM), and it is possible to combine these to yield different strategies for obtaining a sensible maximum for the likelihood (or complete-data likelihood) function. MIXMOD is currently intended to be used for multivariate Gaussian mixtures and also for latent class models, respectively devoted to continuous and categorical data. In both situations, numerous meaninful and parsimonious models are proposed. Moreover, different information criteria for choosing a parsimonious model (the number of mixture components, for instance) are included, their suitability depending on the particular perspective (cluster analysis or discriminant analysis). Written in C++, MIXMOD is interfaced with SCILAB and MATLAB. The program, the statistical documentation and the user guide are available on the internet at the following address: http://www-math.univ-fcomte.fr/mixmod/index.php C. Biernacki Université Lille 1 & CNRS, Villeneuve d Ascq (France), e-mail: biernack@math.univ-lille1.fr G. Celeux INRIA, Orsay e-mail: Gilles.Celeux@inria.fr G. Govaert UTC & CNRS, Compigne (France) e-mail: Gerard.Govaert@utc.fr F. Langrognet CNRS, Besançon (France) e-mail: Florent.Langrognet@univ-fcomte.fr 1

2 C. Biernacki, G. Celeux, G. Govaert and F. Langrognet 1 Overview of MIXMOD Because of their high flexibility, finite mixture distributions are become a very popular approach to model a wide variety of random phenomena. In particular, it is recognized as being a powerful tool for density estimation, clustering and discriminant analysis. Consequently, fields which are potentially concerned by the mixture modelling approach are extremely varied, including astronomy, biology, genetics, economics, etc. Thus, softwares implementing the most recent evolutions in modelbased cluster and discriminant analysis are welcome for various categories of people as researchers, engineers and teachers. mixmod is a software having for goal to meet these particular needs. MIXMOD is publicly available under the GPL license and is distributed for different platforms (Linux, Unix, Windows). It is an object-oriented package built around C++ language but it is interfaced with the widely used mathe- matical softwares Matlab and Scilab. It was developed jointly by INRIA & CNRS, by several laboratories of mathematics (university of Besançon and of Lille 1), and by the Heudiasyc laboratory of Compiègne. In its present version, MIXMOD proposes multivariate Gaussian mixture models for continuous data and also multivariate latent class models for categorical data. The main features of the present version of the software are the following: three levels of use from the beginner to the expert; fourteen geometrically meaningful Gaussian mixture models from different variance matrices parameterizations; five multinomial meaningful models; estimation of mixture parameters with EM and EM-like algorithms, pro- vided with different initialization strategies and possibility of combining such algorithms; possibility of partial labeling of individuals (semi-supervised situation); criteria to select a model which depends on the cluster or the discriminant analysis purpose; numerous displays including densities, iso-densities, discriminant rules, ob- servations, labels, etc. in canonical or PCA (Principal Component Analysis) axes and for several dimensions (1D, 2D and 3D). 2 MIXMOD for continuous data Cluster analysis is concerned with discovering a group structure in a n by d matrix x = {x 1,...,x n }, where x i is an individual of R d. Consequently, the structure to be discovered by clustering is typically a partition of x into K groups defined by the labels z = {z 1,...,z n }, with z i = (z i1,...,z ik ), z ik = 1 or 0, according to the fact that x i belongs to the kth class or not. In the Gaussian mixture model, each x i is assumed to arise independently from a mixture with density

Title Suppressed Due to Excessive Length 3 f(x i θ) = K k=1 p k h(x i µ k,σ k ) (1) where p k is the mixing proportion (0 < p k < 1 for all k = 1,...,K and p 1 +...+ p K = 1) of the kth component and h( µ k,σ k ) denotes the d-dimensional Gaussian density with mean µ k and variance matrix Σ k. The vector of mixture parameters is also denoted by θ = (p 1,..., p K 1, µ 1,..., µ K,Σ 1,...,Σ K ). In MIXMOD, a partition of the data can directly be derived from the maximum likelihood estimates ˆθ of the mixture parameters obtained, for instance, by the EM [10] or the SEM [6] algorithm, by assigning each x i to the component providing the largest conditional probability that x i arises from it using a MAP (Maximum A Posteriori) principle: { 1 if k = argmaxl=1...,k t ẑ ik = l (x i ˆθ) 0 if not where t k (x i ˆθ) = ˆp kh(x i ˆµ k, ˆΣ k ) l ˆp l h(x i ˆµ l, ˆΣ l ). (2) Alternatively, an estimate ˆθ can be retained as being the maximum completed likelihood estimate obtained, for instance, by the CEM algorithm [8]. Many strategies for using and combining these algorithms are available in MIXMOD for helping to improve the optimisation process [4]. Following Celeux and Govaert [9], the software integrates also a parameterization of the variance matrices of the mixture components consisting of expressing the variance matrix Σ k in terms of its eigenvalue decomposition Σ k = λ k D k A k D k where λ k = Σ k 1/d,D k is the matrix of eigenvectors of Σ k and A k is a diagonal matrix, such that A k = 1, with the normalized eigenvalues of Σ k on the diagonal in a decreasing order. The parameter λ k determines the volume of the kth cluster, D k its orientation and A k its shape. By allowing some but not all of these quantities to vary between clusters, we obtain parsimonious and easily interpreted models which are appropriate to describe various clustering situations, including standard methods as kmeans for instance. It is of high interest to automatically select one of these Gaussian models and/or the number K of mixture components. However, choosing a sensible mixture model is highly dependent of the modelling purpose. Three criteria are available in an unsupervised setting: BIC [13], ICL [3] and NEC [2]. In a density estimation perspective, BIC must be preferred. But in a cluster analysis perspective, ICL and NEC can provide more parsimonious answers. Nevertheless, NEC is essentially devoted to choose the number of mixture components K, rather that the model parameterization. When the labels z are known, discriminant analysis is concerned. In this situation, the aim is to estimate the group z n+1 of any new individual x n+1 of R d with unknown label. In MIXMOD, the n couples (x i,z i ),..., (x n,z n ) are supposed to be n i.i.d. realizations of the following joint distribution:

4 C. Biernacki, G. Celeux, G. Govaert and F. Langrognet f(x i,z i θ) = K k=1 p z ik k [h(x i µ k,σ k )] z ik. (3) An estimate ˆθ of θ is obtained by the maximum likelihood method (from complete data (x,z) and, then, the MAP principle is involved to class any new individual x n+1. In the software, two criteria are proposed in this supervised setting for selecting one of the previous Gaussain models: BIC and cross-validation. 3 MIXMOD for categorical data We consider now that data are n objects described by d categorical variables, with respective number of categories m 1,...,m d. The data can be represented by n binary vectors x i = (x jh i ; j = 1,...,d;h = 1,...,m j ) (i = 1,...,n) where x jh i = 1 if the object i belongs to the category h of the variable j and 0 otherwise. Denoting m = d j=1 m j the total number of categories, the data are defined by the matrix x = (x 1,...,x n ) with n rows and m columns. Binary data can be seen as a particular case of categorical data with d dichotomous variables, i.e. m j = 2 for any j = 1,...,d. The latent class model assumes that the d ordinal variables are independent given the latent variable. Formulated in mixture terms [11], each x i arises independently from a mixture of multivariate multinomial distributions defined by f(x i θ) = K k=1 p k h(x i α k ) where h(x i α k ) = d m j (α jh jh k )x j=1 h=1 i (4) with α k = (α jh k ; j = 1,...,d;h = 1,...,m j). We recognize the product of d conditionally independent multinomial distributions of parameters α j k. In this situation, the mixture parameter is denoted by θ = (p 1,..., p K 1,α 1,...,α K ). This model may present problems of identifiability [12] but most situations of interest are identified. sc mixmod proposes four parsimonious models declined from the previous one by following extension of the parameterization of Bernoulli distributions used by [7] for clustering and also by [1] for kernel discriminant analysis. All models, algorithms an criteria presented previously in the Gaussian situation are also implanted in MIXMOD for the latent class model. Both the clustering and the supervised classification purposes can be involved by the user in this context. 4 An extension coming soon for high dimensional data In order to deal with high dimensional data, Mixture of Factor Analyzers have been considered by several authors including [5]. In MIXMOD, a family of eight Gaussian

Title Suppressed Due to Excessive Length 5 mixture models introduced by [5] are being implemented for discriminant analysis in high dimensional spaces. References 1. Aitchison, J. and Aitken, C.G. (1976). Multivariate binaery discrimination by the kernel method, Biometrika, 63, 413 420. 2. Biernacki, C. Celeux, G. et Govaert, G. (1999). An improvement of the NEC criterion for assessing the number of components arising from a mixture, Pattern Recognition letters, 20, 267 272. 3. Biernacki, C. Celeux, G. et Govaert, G. (2000). Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood, IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 7, 719 725. 4. Biernacki, C., Celeux, G., Govaert, G. and Langrognet, F. (2006). Model-bases cluster and discriminant analysis with MIXMOD software. Computational Statistics & Data Analysis, 51, 2, 587 600. 5. Bouveyron, C., Girard, G. and Schmid, C. (2007). High dimensional discriminant analysis, Communications in Statistics: Theory and Methods, 36, 2607 2623. 6. Celeux, G. et Diebolt, J. (1985). The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comp. Statis. Quaterly, 2, 73 82. 7. Celeux, G. et Govaert, G. (1991). Clustering Criteria for Discrete Data and Latent Class Models, Journal of Classification, 8, 157 176. 8. Celeux, G. and Govaert, G. (1992). A classification EM algorithm for clustering and two stochastic versions. Computational Statistics & Data Analysis, 14, 315 332. 9. Celeux, G. et Govaert, G. (1995). Gaussian Parsimonious Clustering Models. Pattern Recognition, 28, 781 793. 10. Dempster, A.P., Laird, N.M. et Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion), Journal of the Royal Statistical Society B, 39, 1 38. 11. Everitt, B. (1984). An Introduction to Latent Variable Models. London, Chapman and Hall. 12. Goodman, L.A. (1974). Exploratory Latent Structure Analysis Using Both Identifiable and Unidentifiable Models, Biometrika, 61, 215 231. 13. Schwarz, G. (1978). Estimating the Dimension of a Model, Annals of Statistics, 6, 461 464.