Package MixSim. April 29, 2017

Similar documents
Package ManlyMix. November 19, 2017

Package pmclust. February 2, 2018

Package EMCluster. February 1, 2018

Package longclust. March 18, 2018

Package ClustGeo. R topics documented: July 14, Type Package

Package robets. March 6, Type Package

Package sure. September 19, 2017

Package clustvarsel. April 9, 2018

Package rknn. June 9, 2015

Package bayesdp. July 10, 2018

Package mixphm. July 23, 2015

Package OLScurve. August 29, 2016

Package esabcv. May 29, 2015

Package pairsd3. R topics documented: August 29, Title D3 Scatterplot Matrices Version 0.1.0

Package optimus. March 24, 2017

Package RCA. R topics documented: February 29, 2016

Package flexcwm. May 20, 2018

Package ECctmc. May 1, 2018

Package ICSOutlier. February 3, 2018

Set up of the data is similar to the Randomized Block Design situation. A. Chang 1. 1) Setting up the data sheet

Package r2d2. February 20, 2015

Package texteffect. November 27, 2017

Package DTRlearn. April 6, 2018

Package MIICD. May 27, 2017

Package lcc. November 16, 2018

Package SSRA. August 22, 2016

Package permuco. February 14, Type Package

Package LVGP. November 14, 2018

Package RaPKod. February 5, 2018

Package sspline. R topics documented: February 20, 2015

Package manet. September 19, 2017

Package RobustGaSP. R topics documented: September 11, Type Package

Package clustering.sc.dp

Package nprotreg. October 14, 2018

Package glassomix. May 30, 2013

Package RSKC. R topics documented: August 28, Type Package Title Robust Sparse K-Means Version Date Author Yumi Kondo

Package sigqc. June 13, 2018

Package clusterpower

Package ConvergenceClubs

Package DPBBM. September 29, 2016

Package simsurv. May 18, 2018

Package OptimaRegion

Package lhs. R topics documented: January 4, 2018

Package simr. April 30, 2018

The nor1mix Package. August 3, 2006

Package mmeta. R topics documented: March 28, 2017

Package datasets.load

Package GLDreg. February 28, 2017

Package mrm. December 27, 2016

Package radiomics. March 30, 2018

Package mgc. April 13, 2018

Package flam. April 6, 2018

Package JOP. February 19, 2015

Package alphashape3d

Package bayescl. April 14, 2017

Package svmpath. R topics documented: August 30, Title The SVM Path Algorithm Date Version Author Trevor Hastie

Package manet. August 23, 2018

Package intccr. September 12, 2017

The nor1mix Package. June 12, 2007

The Curse of Dimensionality

Package tclust. May 24, 2018

Package MatchIt. April 18, 2017

Package areaplot. October 18, 2017

Package acebayes. R topics documented: November 21, Type Package

Package gibbs.met. February 19, 2015

Medical Image Analysis

Package bisect. April 16, 2018

Package cwm. R topics documented: February 19, 2015

Package pomdp. January 3, 2019

Package clustmixtype

Package ipft. January 4, 2018

Package exporkit. R topics documented: May 5, Type Package Title Expokit in R Version Date

Package pwrrasch. R topics documented: September 28, Type Package

Package subspace. October 12, 2015

Package RPEnsemble. October 7, 2017

Package weco. May 4, 2018

Package basictrendline

Package TVsMiss. April 5, 2018

Package snn. August 23, 2015

Package frequencyconnectedness

Package rereg. May 30, 2018

Package binmto. February 19, 2015

Package edrgraphicaltools

Package FisherEM. February 19, 2015

Package RAMP. May 25, 2017

Package ssa. July 24, 2016

Package cgh. R topics documented: February 19, 2015

Package quickreg. R topics documented:

Package CLA. February 6, 2018

Package r.jive. R topics documented: April 12, Type Package

Package OutlierDC. R topics documented: February 19, 2015

Package glmnetutils. August 1, 2017

Package GFD. January 4, 2018

Package fusionclust. September 19, 2017

Package ecp. December 15, 2017

Package gems. March 26, 2017

Package SSLASSO. August 28, 2018

Package swcrtdesign. February 12, 2018

I How does the formulation (5) serve the purpose of the composite parameterization

Package dalmatian. January 29, 2018

Transcription:

Version 1.1-3 Date 2017-04-22 Package MixSim April 29, 2017 Title Simulating Data to Study Performance of Clustering Algorithms Depends R (>= 3.0.0), MASS Enhances mclust, cluster LazyLoad yes LazyData yes The utility of this package is in simulating mixtures of Gaussian distributions with different levels of overlap between mixture components. Pairwise overlap, defined as a sum of two misclassification probabilities, measures the degree of interaction between components and can be readily employed to control the clustering complexity of datasets simulated from mixtures. These datasets can then be used for systematic performance investigation of clustering and finite mixture modeling algorithms. Among other capabilities of 'MixSim', there are computing the exact overlap for Gaussian mixtures, simulating Gaussian and non-gaussian data, simulating outliers and noise variables, calculating various measures of agreement between two partitionings, and constructing parallel distribution plots for the graphical display of finite mixture models. License GPL (>= 2) NeedsCompilation yes Author Volodymyr Melnykov [aut], Wei-Chen Chen [aut, cre], Ranjan Maitra [aut], Robert Davies [ctb] (quadratic form probabilities), Stephen Moshier [ctb] (eigenvalue calculations), Rouben Rostamian [ctb] (memory allocation) Maintainer Wei-Chen Chen <wccsnow@gmail.com> Repository CRAN Date/Publication 2017-04-29 17:05:22 UTC 1

2 MixSim-package R topics documented: MixSim-package...................................... 2 ClassProp.......................................... 3 MixGOM.......................................... 4 MixSim........................................... 6 overlap........................................... 8 overlapgom........................................ 9 pdplot............................................ 10 perms............................................ 12 print.object......................................... 12 RandIndex.......................................... 13 simdataset.......................................... 15 VarInf............................................ 17 Index 18 MixSim-package Simulation of Gaussian Finite Mixture Models Details Simulation of Gaussian finite mixture models for prespecified levels of average or/and maximum overlap. Pairwise overlap is defined as the sum of two misclassification probabilities. Package: MixSim Type: Package Date: 2012-08-12 License: GPL (>= 2) Function MixSim simulates a finite mixture model for a prespecified level of average or/and maximum overlap. Function overlap computes all misclassification probabilities for a finite mixture model. Function pdplot constructs a parallel distribution plot for a finite mixture model. Function simdataset simulates a dataset from a finite mixture model. Maintainer: Volodymyr Melnykov <vmelnykov@cba.ua.edu>

ClassProp 3 Maitra, R. and Melnykov, V. (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms, The Journal of Computational and Graphical Statistics, 2:19, 354-376. Melnykov, V., Chen, W.-C., and Maitra, R. (2012) MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms, Journal of Statistical Software, 51:12, 1-25. Davies, R. (1980) The distribution of a linear combination of chi-square random variables, Applied Statistics, 29, 323-333. Meila, M. (2006) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895. # Simulate parameters of a mixture model A <- MixSim(BarOmega = 0.01, MaxOmega = 0.10, K = 10, p = 5) # Display the mixture via the parallel distribution plot pdplot(a$pi, A$Mu, A$S, MaxInt = 0.5) ClassProp Classification Proportion Computes the agreement proportion between two classification vectors. ClassProp(id1, id2) id1 id2 first partitioning vector. second partitioning vector. Returns the value of the proportion of agreeing elements.

4 MixGOM Meila, M. (2006) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895. RandIndex, and VarInf. id1 <- c(rep(1, 50), rep(2,100)) id2 <- rep(1:3, each = 50) ClassProp(id1, id2) MixGOM Mixture Simulation based on generalized overlap of Maitra Generates a finite mixture model with Gaussian components for a prespecified level of gomega (generalized overlap of Maitra). MixGOM(goMega = NULL, K, p, sph = FALSE, hom = FALSE, ecc = 0.90, PiLow = 1.0, int = c(0.0, 1.0), resn = 100, eps = 1e-06, lim = 1e06) gomega value of desired generalized overlap of Maitra. K number of components. p number of dimensions. sph covariance matrix structure (FALSE - non-spherical, TRUE - spherical). hom heterogeneous or homogeneous clusters (FALSE - heterogeneous, TRUE - homogeneous). ecc maximum eccentricity. PiLow value of the smallest mixing proportion (if PiLow is not reachable with respect to K, equal proportions are taken; PiLow = 1.0 implies equal proportions by default). int mean vectors are simulated uniformly on a hypercube with sides specified by int = (lower.bound, upper.bound). resn maximum number of mixture resimulations. eps error bound for overlap computation. lim maximum number of integration terms (Davies, 1980).

MixGOM 5 Details Returns mixture parameters satisfying the prespecified level of gomega. Pi Mu S gomega fail vector of mixing proportions. matrix consisting of components mean vectors (K * p). set of components covariance matrices (p * p * K). value of generalized overlap of Maitra. flag value; 0 represents successful mixture generation, 1 represents failure. Maitra, R. (2010) A re-defined and generalized percent-overlap-of-activation measure for studies of fmri reproducibility and its use in identifying outlier activation maps, NeuroImage, 50, 124-135. Maitra, R. and Melnykov, V. (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms, The Journal of Computational and Graphical Statistics, 2:19, 354-376. Melnykov, V., Chen, W.-C., and Maitra, R. (2012) MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms, Journal of Statistical Software, 51:12, 1-25. Davies, R. (1980) The distribution of a linear combination of chi-square random variables, Applied Statistics, 29, 323-333. overlapgom, MixSim, and simdataset. set.seed(1234) # controls average and maximum overlaps (ex.1 <- MixGOM(goMega = 0.05, K = 4, p = 5)) # controls maximum overlap (ex.2 <- MixGOM(goMega = 0.15, K = 4, p = 5, sph = TRUE))

6 MixSim MixSim Mixture Simulation Generates a finite mixture model with Gaussian components for prespecified levels of maximum and/or average overlaps. MixSim(BarOmega = NULL, MaxOmega = NULL, K, p, sph = FALSE, hom = FALSE, ecc = 0.90, PiLow = 1.0, int = c(0.0, 1.0), resn = 100, eps = 1e-06, lim = 1e06) BarOmega MaxOmega K p sph hom ecc PiLow int resn eps Details value of desired average overlap. value of desired maximum overlap. number of components. number of dimensions. covariance matrix structure (FALSE - non-spherical, TRUE - spherical). heterogeneous or homogeneous clusters (FALSE - heterogeneous, TRUE - homogeneous). maximum eccentricity. value of the smallest mixing proportion (if PiLow is not reachable with respect to K, equal proportions are taken; PiLow = 1.0 implies equal proportions by default). mean vectors are simulated uniformly on a hypercube with sides specified by int = (lower.bound, upper.bound). maximum number of mixture resimulations. error bound for overlap computation. lim maximum number of integration terms (Davies, 1980). If BarOmega is not specified, the function generates a mixture solely based on MaxOmega ; if MaxOmega is not specified, the function generates a mixture solely based on BarOmega. Pi Mu S vector of mixing proportions. matrix consisting of components mean vectors (K * p). set of components covariance matrices (p * p * K).

MixSim 7 OmegaMap BarOmega MaxOmega rcmax fail matrix of misclassification probabilities (K * K); OmegaMap[i,j] is the probability that X coming from the i-th component is classified to the j-th component. value of average overlap. value of maximum overlap. row and column numbers for the pair of components producing maximum overlap MaxOmega. flag value; 0 represents successful mixture generation, 1 represents failure. Maitra, R. and Melnykov, V. (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms, The Journal of Computational and Graphical Statistics, 2:19, 354-376. Melnykov, V., Chen, W.-C., and Maitra, R. (2012) MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms, Journal of Statistical Software, 51:12, 1-25. Davies, R. (1980) The distribution of a linear combination of chi-square random variables, Applied Statistics, 29, 323-333. overlap, pdplot, and simdataset. set.seed(1234) # controls average and maximum overlaps (ex.1 <- MixSim(BarOmega = 0.05, MaxOmega = 0.15, K = 4, p = 5)) summary(ex.1) # controls average overlap (ex.2 <- MixSim(BarOmega = 0.05, K = 4, p = 5, hom = TRUE)) summary(ex.2) # controls maximum overlap (ex.3 <- MixSim(MaxOmega = 0.15, K = 4, p = 5, sph = TRUE)) summary(ex.3)

8 overlap overlap Overlap Computes misclassification probabilities and pairwise overlaps for finite mixture models with Gaussian components. Overlap is defined as sum of two misclassification probabilities. overlap(pi, Mu, S, eps = 1e-06, lim = 1e06) Pi vector of mixing proprtions (length K). Mu matrix consisting of components mean vectors (K * p). S set of components covariance matrices (p * p * K). eps error bound for overlap computation. lim maximum number of integration terms (Davies, 1980). OmegaMap BarOmega MaxOmega rcmax matrix of misclassification probabilities (K * K); OmegaMap[i,j] is the probability that X coming from the i-th component is classified to the j-th component. value of average overlap. value of maximum overlap. row and column numbers for the pair of components producing maximum overlap MaxOmega. Maitra, R. and Melnykov, V. (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms, The Journal of Computational and Graphical Statistics, 2:19, 354-376. Melnykov, V., Chen, W.-C., and Maitra, R. (2012) MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms, Journal of Statistical Software, 51:12, 1-25. Davies, R. (1980) The distribution of a linear combination of chi-square random variables, Applied Statistics, 29, 323-333. MixSim, pdplot, and simdataset.

overlapgom 9 data("iris", package = "datasets") p <- ncol(iris) - 1 id <- as.integer(iris[, 5]) K <- max(id) # estimate mixture parameters Pi <- prop.table(tabulate(id)) Mu <- t(sapply(1:k, function(k){ colmeans(iris[id == k, -5]) })) S <- sapply(1:k, function(k){ var(iris[id == k, -5]) }) dim(s) <- c(p, p, K) overlap(pi = Pi, Mu = Mu, S = S) overlapgom Generalized overlap of Maitra Computes the generalized overlap as defined by R. Maitra. overlapgom(pi, Mu, S, eps = 1e-06, lim = 1e06) Pi vector of mixing proprtions (length K). Mu matrix consisting of components mean vectors (K * p). S set of components covariance matrices (p * p * K). eps error bound for overlap computation. lim maximum number of integration terms (Davies, 1980). Returns the value of gomega.

10 pdplot Maitra, R. (2010) A re-defined and generalized percent-overlap-of-activation measure for studies of fmri reproducibility and its use in identifying outlier activation maps, NeuroImage, 50, 124-135. Melnykov, V., Chen, W.-C., and Maitra, R. (2012) MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms, Journal of Statistical Software, 51:12, 1-25. Davies, R. (1980) The distribution of a linear combination of chi-square random variables, Applied Statistics, 29, 323-333. MixSim, MixGOM, and overlap. data("iris", package = "datasets") p <- ncol(iris) - 1 id <- as.integer(iris[, 5]) K <- max(id) # estimate mixture parameters Pi <- prop.table(tabulate(id)) Mu <- t(sapply(1:k, function(k){ colmeans(iris[id == k, -5]) })) S <- sapply(1:k, function(k){ var(iris[id == k, -5]) }) dim(s) <- c(p, p, K) overlapgom(pi = Pi, Mu = Mu, S = S) pdplot Parallel Distribution Plot Constructs a parallel distribution plot for Gaussian finite mixture models. pdplot(pi, Mu, S, file = NULL, Nx = 5, Ny = 5, MaxInt = 1, marg = c(2,1,1,1)) Pi Mu S file vector of mixing proportions. matrix consisting of components mean vectors (K * p). set of components covariance matrices (p * p * K). name of.pdf-file.

pdplot 11 Nx Ny MaxInt marg number of color levels for smoothing along the x-axis. number of color levels for smoothing along the y-axis. maximum color intensity. plot margins. Details If file is specified, produced plot will be saved as a.pdf-file. Maitra, R. and Melnykov, V. (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms, The Journal of Computational and Graphical Statistics, 2:19, 354-376. Melnykov, V., Chen, W.-C., and Maitra, R. (2012) MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms, Journal of Statistical Software, 51:12, 1-25. MixSim, overlap, and simdataset. data("iris", package = "datasets") p <- ncol(iris) - 1 id <- as.integer(iris[, 5]) K <- max(id) # estimate mixture parameters Pi <- prop.table(tabulate(id)) Mu <- t(sapply(1:k, function(k){ colmeans(iris[id == k, -5]) })) S <- sapply(1:k, function(k){ var(iris[id == k, -5]) }) dim(s) <- c(p, p, K) pdplot(pi = Pi, Mu = Mu, S = S)

12 print.object perms Permutations Returns all possible permutations given the number of elements. perms(n) n Number of elements. Returns a matrix containing all possible permutations. ClassProp. perms(3) print.object Functions for Printing or Summarizing Objects A MixSim and MixGOM classes are declared, and these are functions to print and summarize objects. ## S3 method for class 'MixSim' print(x,...) ## S3 method for class 'MixSim' summary(object,...) ## S3 method for class 'MixGOM' print(x,...)

RandIndex 13 x Details object an object with the MixSim (or MixGOM ) class attributes. an object with the MixSim (or MixGOM ) class attributes.... other possible options. These are useful functions for summarizing and debugging. For other functions, they only show summaries of objects. Use names or str to explore the details. The results will cat or print on the STDOUT by default. Maitra, R. and Melnykov, V. (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms, The Journal of Computational and Graphical Statistics, 2:19, 354-376. Melnykov, V., Chen, W.-C., and Maitra, R. (2012) MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms, Journal of Statistical Software, 51:12, 1-25. MixSim. ## Not run: # Functions applied by directly type the names of objects. ## End(Not run) RandIndex Rand s Index Computes Rand, adjusted Rand, Fowlkes and Mallows, and Merkin indices. RandIndex(id1, id2)

14 RandIndex id1 id2 first partitioning vector. second partitioning vector. R AR F M Rand s index. adjusted Rand s index. Fowlkes and Mallows index. Mirkin metric. Rand, W.M. (1971) Objective criteria for the evaluation of clustering methods, Journal of the American Statistical Association, 66:336, 846-850. Maitra, R. and Melnykov, V. (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms, The Journal of Computational and Graphical Statistics, 2:19, 354-376. Meila, M. (2006) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895. Melnykov, V., Chen, W.-C., and Maitra, R. (2012) MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms, Journal of Statistical Software, 51:12, 1-25. MixSim, pdplot, simdataset, ClassProp, and VarInf. id1 <- c(rep(1, 50), rep(2,100)) id2 <- rep(1:3, each = 50) RandIndex(id1, id2)

simdataset 15 simdataset Dataset Simulation Simulates a datasets of sample size n given parameters of finite mixture model with Gaussian components. simdataset(n, Pi, Mu, S, n.noise = 0, n.out = 0, alpha = 0.001, max.out = 100000, int = NULL, lambda = NULL) n Pi Mu S n.noise n.out alpha max.out int lambda sample size. vector of mixing proportions (length K). matrix consisting of components mean vectors (K * p). set of components covariance matrices (p * p * K). number of noise variables. number of outlying observations. level for simulating outliers. maximum number of trials to simulate outliers. interval for noise and outlier generation. inverse Box-Cox transformation coefficients. Details The function simulates a dataset of n observations from a mixture model with parameters Pi (mixing proportions), Mu (mean vectors), and S (covariance matrices). Mixture component sample sizes are produced as a realization from a multinomial distribution with probabilities given by mixing proportions. To make a dataset more challenging for clustering, a user might want to simulate noise variables or outliers. Parameter n.noise specifies the desired number of noise variables. If an interval int is specified, noise will be simulated from a Uniform distribution on the interval given by int. Otherwise, noise will be simulated uniformly between the smallest and largest coordinates of mean vectors. n.out specifies the number of observations outside (1 - alpha ) ellipsoidal contours for the weighted component distributions. Outliers are simulated on a hypercube specified by the interval int. A user can apply an inverse Box-Cox transformation providing a vector of coefficients lambda. The value 1 implies that no transformation is needed for the corresponding coordinate. X id simulated dataset (n + n.out) x (p + n.noise); noise coordinates are provided in the last n.noise columns. classification vector (length n + n.out); 0 represents an outlier.

16 simdataset Maitra, R. and Melnykov, V. (2010) Simulating data to study performance of finite mixture modeling and clustering algorithms, The Journal of Computational and Graphical Statistics, 2:19, 354-376. Melnykov, V., Chen, W.-C., and Maitra, R. (2012) MixSim: An R Package for Simulating Data to Study Performance of Clustering Algorithms, Journal of Statistical Software, 51:12, 1-25. MixSim, overlap, and pdplot. ## Not run: set.seed(1234) repeat{ Q <- MixSim(BarOmega = 0.01, K = 4, p = 2) if (Q$fail == 0) break } # simulate a dataset of size 300 and add 10 outliers simulated on (0,1)x(0,1) A <- simdataset(n = 500, Pi = Q$Pi, Mu = Q$Mu, S = Q$S, n.out = 10, int = c(0, 1)) colors <- c("red", "green", "blue", "brown", "magenta") plot(a$x, xlab = "x1", ylab = "x2", type = "n") for (k in 0:4){ points(a$x[a$id == k, ], col = colors[k+1], pch = 19, cex = 0.5) } repeat{ Q <- MixSim(MaxOmega = 0.1, K = 4, p = 1) if (Q$fail == 0) break } # simulate a dataset of size 300 with 1 noise variable A <- simdataset(n = 300, Pi = Q$Pi, Mu = Q$Mu, S = Q$S, n.noise = 1) plot(a$x, xlab = "x1", ylab = "x2", type = "n") for (k in 1:4){ points(a$x[a$id == k, ], col = colors[k+1], pch = 19, cex = 0.5) } ## End(Not run)

VarInf 17 VarInf Variation of Information Computes the variation of information for two classification vectors. VarInf(id1, id2) id1 id2 first partitioning vector. second partitioning vector. Returns the variation of information. It is equal to 0 if and only if two classification vectors are identical. Meila, M. (2006) Comparing clusterings - an information based distance, Journal of Multivariate Analysis, 98, 873-895. ClassProp, and RandIndex. id1 <- c(rep(1, 50), rep(2,100)) id2 <- rep(1:3, each = 50) VarInf(id1, id2)

Index Topic cluster ClassProp, 3 MixGOM, 4 MixSim, 6 overlap, 8 overlapgom, 9 pdplot, 10 perms, 12 RandIndex, 13 simdataset, 15 VarInf, 17 Topic datagen MixGOM, 4 MixSim, 6 simdataset, 15 Topic hplot pdplot, 10 Topic programming print.object, 12 00_MixSim-package (MixSim-package), 2 ClassProp, 3 MixGOM, 4 MixSim, 6, 13 MixSim-package, 2 overlap, 8 overlapgom, 9 pdplot, 10 perms, 12 print.mixgom (print.object), 12 print.mixsim (print.object), 12 print.object, 12 RandIndex, 13 simdataset, 15 summary.mixsim (print.object), 12 VarInf, 17 18