Package lga. February 20, 2015

Version 1.1-1
Date 2008-06-15
Title Tools for linear grouping analysis (LGA)
Author Justin Harrington
Maintainer ORPHANED
Depends R (>= 2.2.1)
Imports boot, lattice
Suggests snow, MASS, maps
Description Tools for linear grouping analysis. Three user-level functions: gap, rlga and lga.
License GPL
X-CRAN-Comment Orphaned on 2012-01-10 as maintainer address bounces and maintainer is untraceable.
Repository CRAN
Date/Publication 2012-01-13 08:52:30
NeedsCompilation no

R topics documented:

    lga-package
    brain
    corridorwalls
    gap
    lga
    nhl94
    ob

lga-package    The lga Package

Description

The lga package is an implementation of the algorithms described in Van Aelst et al (2006) and Garcia-Escudero et al (2008). It has three main functions (with accompanying print and plot methods).

    lga     the core linear grouping analysis
    rlga    the robust version of linear grouping analysis
    gap     performs a gap analysis to find the number of clusters

For more details refer to the relevant help files.

Author(s)

Justin Harrington <harringt@stat.ubc.ca>

References

Garcia-Escudero, L.A., Gordaliza, A., San Martin, R., Van Aelst, S. and Zamar, R.H. (2008) Robust linear clustering. To appear in Journal of the Royal Statistical Society, Series B (accepted June, 2008).


brain    Allometry

Description

Allometry data from Van Aelst et al (2006)

Usage

    data(brain)

Format

A matrix with 282 observations on the following 2 variables:

    BrainWeight.g.        a numeric vector
    OlfactoryBulbs.ml.    a numeric vector

corridorwalls    Corridor Walls data

Description

Corridor Walls data from Garcia-Escudero et al. (2008). A laser device is placed in a corridor of an office building and emits a laser ray that touches a point on the object found at the end of its trajectory. The device produces a three-dimensional measurement of the position of that point with respect to a fixed reference system.

Usage

    data(corridorwalls)

Format

A matrix with 11710 observations on the following 3 variables (3-dimensional coordinates).

    x    a numeric vector
    y    a numeric vector
    z    a numeric vector

References

Garcia-Escudero, L.A., Gordaliza, A., San Martin, R., Van Aelst, S. and Zamar, R.H. (2008) Robust linear clustering. To appear in Journal of the Royal Statistical Society, Series B (accepted June, 2008).


gap    Perform gap analysis

Description

Performs the gap analysis using lga to estimate the number of clusters.

Usage

    ## Default S3 method:
    gap(x, K, B, criteria=c("tibshirani", "DandF", "none"),
        nnode=NULL, scale=TRUE, ...)

Arguments

    x           a numeric matrix.
    K           an integer giving the maximum number of clusters to consider.
    B           an integer giving the number of bootstraps.
    criteria    a character string indicating which criteria to use to evaluate the gap data. One of tibshirani (default), DandF or none. Can be abbreviated.
    nnode       an integer giving the number of CPUs to use for parallel processing. Defaults to NULL, i.e. no parallel processing.
    scale       logical. Should the data be scaled?
    ...         For any other arguments passed from the generic function.

Details

This code performs the gap analysis using lga. The gap statistic is defined as the difference between the log of the Residual Orthogonal Sum of Squared Distances (denoted $\log(W_k)$) and its expected value derived using bootstrapping under the null hypothesis that there is only one cluster. In this implementation, the reference distribution used for the bootstrapping is a random uniform hypercube, transformed by the principal components of the underlying data set. For further details see Tibshirani et al (2001).

For different criteria, different rules apply. With tibshirani (ibid) we calculate the gap statistic for $k = 1, \ldots, K$, stopping when

    $\mathrm{gap}(k) \geq \mathrm{gap}(k+1) - s_{k+1}$

where $s_{k+1}$ is a function of the standard deviation of the bootstrapped estimates. With the DandF criteria from Dudoit et al (2002), we calculate the gap statistic for all values of $k = 1, \ldots, K$, selecting the number of clusters as

    $\hat{k} = \text{smallest } k \geq 1 \text{ such that } \mathrm{gap}(k) \geq \mathrm{gap}(k^*) - s_{k^*}$

where $k^* = \arg\max_{k \geq 1} \mathrm{gap}(k)$.

Finally, for the criteria none, no rules are applied, and just the gap data is returned.

As lga is ostensibly unsupervised in this case, the parameter niter is set to 20 to ensure convergence.

This function is parallel computing aware via the nnode argument, and works with the package snow. In order to use parallel computing, one of MPI (e.g. lamboot) or PVM is necessary. For further details, see the documentation for snow.

Value

An object of class gap with components

    finished    a logical. For the tibshirani criteria, was a solution found?
    nclust      an integer giving the number of clusters estimated. Returns NA if nothing conclusive is found.
    data        the original data set, scaled if specified in the arguments.
    criteria    the criteria used.
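The two selection rules described under Details can be sketched in base R as follows. This is only an illustration under stated assumptions: `select_k` is a hypothetical helper, not a function of the lga package, and it takes the gap values and the spread terms $s_k$ as already computed vectors rather than deriving them by bootstrapping. The numbers in the usage lines are made up.

```r
# Hypothetical helper (not part of lga): choose the number of clusters
# from a vector of gap statistics and their bootstrap spread terms s.
select_k <- function(gap, s, criteria = c("tibshirani", "DandF")) {
  criteria <- match.arg(criteria)
  K <- length(gap)
  stopifnot(length(s) == K, K >= 2)

  if (criteria == "tibshirani") {
    # Stop at the first k with gap(k) >= gap(k + 1) - s(k + 1).
    for (k in seq_len(K - 1)) {
      if (gap[k] >= gap[k + 1] - s[k + 1]) return(k)
    }
    # Nothing conclusive, mirroring the NA described under Value.
    return(NA_integer_)
  }

  # DandF: smallest k whose gap is within s(k*) of the maximum gap.
  k_star <- which.max(gap)
  which(gap >= gap[k_star] - s[k_star])[1]
}

# Made-up gap values and spread terms, for illustration only.
gap_values <- c(0.10, 0.45, 0.50, 0.48)
s_values   <- c(0.03, 0.04, 0.08, 0.05)
select_k(gap_values, s_values, "tibshirani")
select_k(gap_values, s_values, "DandF")
```

The tibshirani rule scans upward and can stop early, whereas the DandF rule needs all K values before it can locate the maximum, so the two may disagree when the gap curve has a late peak.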

Author(s)

Justin Harrington <harringt@stat.ubc.ca>

References

Tibshirani, R. and Walther, G. and Hastie, T. (2001) Estimating the number of clusters in a data set via the gap statistic, J. R. Statist. Soc. B 63, 411-423.

Dudoit, S. and Fridlyand, J. (2002) A prediction-based resampling method for estimating the number of clusters in a dataset, Genome Biology 3.

See Also

lga

Examples

    ## Synthetic example
    ## Make a dataset with 2 clusters in 2 dimensions
    library(MASS)
    set.seed(1234)
    X <- rbind(mvrnorm(n=100, mu=c(1, -2), Sigma=diag(0.1, 2) + 0.9),
               mvrnorm(n=100, mu=c(1, 1), Sigma=diag(0.1, 2) + 0.9))
    gap(X, K=4, B=20)

    ## to run this using parallel processing with 4 nodes, the equivalent
    ## code would be
    ## Not run: gap(X, K=4, B=20, nnode=4)

    ## Quakes data (from package:datasets)
    ## Including the first two dimensions versus three dimensions
    ## yields different results
    set.seed(1234)
    ## Not run:
    gap(quakes[,1:2], K=4, B=20)
    gap(quakes[,1:3], K=4, B=20)
    ## End(Not run)

    library(maps)
    lgaout1 <- lga(quakes[,1:2], k=3)
    plot(lgaout1)
    lgaout2 <- lga(quakes[,1:3], k=2)
    plot(lgaout2)

    ## Let's put this in context
    par(mfrow=c(1,2))
    map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
    points(quakes[,2], quakes[,1], pch=lgaout1$cluster, col=lgaout1$cluster)
    map("world", xlim=range(quakes[,2]), ylim=range(quakes[,1])); box()
    points(quakes[,2], quakes[,1], pch=lgaout2$cluster, col=lgaout2$cluster)


lga    Perform LGA/RLGA

Description

Linear Grouping Analysis

Usage

    ## Default S3 method:
    lga(x, k, biter = NULL, niter = 10, showall = FALSE, scale = TRUE,
        nnode = NULL, silent = FALSE, ...)

    ## Default S3 method:
    rlga(x, k, alpha = 0.9, biter = NULL, niter = 10, showall = FALSE,
         scale = TRUE, nnode = NULL, silent = FALSE, ...)

Arguments

    x          a numeric matrix.
    k          an integer for the number of clusters.
    alpha      a numeric value between 0.5 and 1. For the robust estimate of LGA, the proportion of points in the best subset.
    biter      an integer for the number of different starting hyperplanes to try.
    niter      an integer for the number of iterations to attempt for convergence.
    showall    logical. If TRUE then display all the outcomes, not just the best one.
    scale      logical. Allows you to scale the data, dividing each column by its standard deviation, before fitting.
    nnode      an integer giving the number of CPUs to use for parallel processing. Defaults to NULL, i.e. no parallel processing.
    silent     logical. If TRUE, produces no text output during processing.
    ...        For any other arguments passed from the generic function.

Details

This code tries to find k clusters using the lga algorithm described in Van Aelst et al (2006). For each attempt, it has up to niter steps to reach convergence, and it does this from biter different starting hyperplanes. It then selects the clustering with the smallest Residual Orthogonal Sum of Squares (ROSS).

If biter is left as NULL, then it is selected via the equation given in Van Aelst et al (2006).

The function rlga is the robust equivalent of lga, and is introduced in Garcia-Escudero et al (2008).

Both functions are parallel computing aware via the nnode argument, and work with the package snow. In order to use parallel computing, one of MPI (e.g. lamboot) or PVM is necessary. For further details, see the documentation for snow.

Associated with the lga and rlga functions are a print method and a plot method (see the examples). In the plot method, the fitted hyperplanes are also shown as dashed lines when there are only two dimensions.

Value

An object of class lga. The list contains

    cluster     a vector containing the cluster memberships.
    ROSS        the Residual Orthogonal Sum of Squares for the solution.
    converged   a logical. TRUE if at least one solution has converged.
    nconverg    the number of converged solutions (out of biter starts).
    x           the (scaled if selected) dataset.

and the attributes include

    scaled      logical. Is the data scaled?
    k           the number of clusters to be found.
    biter       the biter setting used.
    niter       the niter setting used.

Author(s)

Justin Harrington <harringt@stat.ubc.ca>

References

Garcia-Escudero, L.A., Gordaliza, A., San Martin, R., Van Aelst, S. and Zamar, R.H. (2008) Robust linear clustering. To appear in Journal of the Royal Statistical Society, Series B (accepted June, 2008).

See Also

gap
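To make the selection criterion in Details more concrete, here is a minimal base R sketch of a ROSS-style quantity for a fixed partition. It assumes that ROSS is the sum, over clusters, of squared orthogonal distances from each point to its cluster's best-fitting hyperplane; `ross` is a hypothetical helper written for illustration and is not the package's internal code, which may differ (for example in how scaling is handled).

```r
# Hypothetical helper (not an lga function): sum of squared orthogonal
# distances to the best-fitting hyperplane of each cluster.
ross <- function(x, cluster) {
  x <- as.matrix(x)
  total <- 0
  for (g in unique(cluster)) {
    xg <- x[cluster == g, , drop = FALSE]
    # Centre the group; the best-fitting hyperplane passes through its mean.
    centred <- sweep(xg, 2, colMeans(xg))
    # The smallest eigenvalue of the scatter matrix equals the sum of
    # squared distances to the orthogonal-regression hyperplane.
    ev <- eigen(crossprod(centred), symmetric = TRUE,
                only.values = TRUE)$values
    total <- total + min(ev)
  }
  total
}

# Two noisy line segments in the plane (simulated data, illustration only).
set.seed(1)
t1 <- runif(50)
t2 <- runif(50)
x <- rbind(cbind(t1, 2 * t1), cbind(t2, 1 - t2)) + rnorm(200, sd = 0.01)
truth <- rep(1:2, each = 50)
ross(x, truth)          # small: each group lies close to its own line
ross(x, rep(1, 100))    # larger: a single line cannot fit both groups
```

Comparing this quantity across candidate partitions is the idea behind keeping the clustering with the smallest ROSS among the biter starts.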

Examples

    ## Synthetic Data
    ## Make a dataset with 2 clusters in 2 dimensions
    library(MASS)
    set.seed(1234)
    X <- rbind(mvrnorm(n=100, mu=c(1,-1), Sigma=diag(0.1,2)+0.9),
               mvrnorm(n=100, mu=c(1,1), Sigma=diag(0.1,2)+0.9))
    lgaout <- lga(X, 2)
    plot(lgaout)
    print(lgaout)

    ## Robust equivalent
    rlgaout <- rlga(X, 2, alpha=0.75)
    plot(rlgaout)
    print(rlgaout)

    ## nhl94 data set
    data(nhl94)
    plot(lga(nhl94, k=3, niter=30))

    ## Allometry data set
    data(brain)
    plot(lga(log(brain, base=10), k=3))

    ## Second Allometry data set
    data(ob)
    plot(lga(log(ob[,2:3]), k=3), pch=as.character(ob[,1]))

    ## Corridor Walls data set
    ## To obtain the results reported in Garcia-Escudero et al. (2008):
    data(corridorwalls)
    rlgaout <- rlga(corridorwalls, k=3, biter = 100, niter = 30, alpha=0.85)
    pairs(corridorwalls, col=rlgaout$cluster+1)
    plot(rlgaout)

    ## Parallel processing case
    ## In this example, running using 4 nodes.
    ## Not run:
    set.seed(1234)
    X <- rbind(mvrnorm(n=1e6, mu=c(1,-1), Sigma=diag(0.1,2)+0.9),
               mvrnorm(n=1e6, mu=c(1,1), Sigma=diag(0.1,2)+0.9))
    abc <- lga(X, k=2, nnode=4)
    ## End(Not run)


nhl94    Player Performance in NHL for 1994-1995

Description

This data set gives four variables for each of 871 players for the NHL 1994-1995 season. From Van Aelst et al (2006)

Usage

    data(nhl94)

Format

A matrix with 871 observations on the following 4 variables:

    PTS    Points scored, a numeric vector
    PM     plus/minus average rating, a numeric vector
    PIM    Total penalty time in minutes, a numeric vector
    PP     Power play goals, a numeric vector


ob    Allometry Data

Description

Allometry data from Van Aelst et al (2006)

Usage

    data(ob)

Format

A data frame with 83 observations on the following 3 variables.

    Group                 a factor with levels a c h H i m p t
    BrainWeight.g.        a numeric vector
    OlfactoryBulbs.ml.    a numeric vector
