Package clustMixType

Package clustMixType                                          October 17, 2017

Version 0.1-29
Date 2017-10-15
Title k-Prototypes Clustering for Mixed Variable-Type Data
Author Gero Szepannek
Maintainer Gero Szepannek <gero.szepannek@web.de>
Imports RColorBrewer
Description Functions to perform k-prototypes partitioning clustering for mixed variable-type data according to Z. Huang (1998): Extensions to the k-means Algorithm for Clustering Large Data Sets with Categorical Variables, Data Mining and Knowledge Discovery 2, 283-304, <DOI:10.1023/A:1009769707641>.
License GPL (>= 2)
RoxygenNote 5.0.1
NeedsCompilation no
Repository CRAN
Date/Publication 2017-10-17 11:19:07 UTC

R topics documented:
clprofiles ........................................... 2
kproto .............................................. 3
lambdaest ........................................... 5
predict.kproto ...................................... 6
summary.kproto ...................................... 8
Index                                               10

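The topics documented below fit together as a small workflow: lambdaest() suggests a value for lambda, kproto() fits the clustering, and clprofiles(), summary() and predict() inspect and apply the result. The following sketch is not part of the package's own examples; it simply uses the built-in iris data as a convenient mixed-type data frame (four numeric columns plus the factor Species) to illustrate the typical sequence.

library(clustMixType)

dat <- iris                                # any data frame with both numeric and factor columns

lambda <- lambdaest(dat)                   # heuristic trade-off between numeric and factor distances
kpres  <- kproto(dat, k = 3, lambda = lambda)   # k-prototypes clustering with k = 3 clusters
clprofiles(kpres, dat)                     # per-cluster boxplots / barplots of all variables
summary(kpres)                             # per-cluster summaries
pred   <- predict(kpres, newdata = dat)    # memberships and distances for (new) data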
clprofiles                Profile k-prototypes clustering

Description

Visualization of a k-prototypes clustering result for cluster interpretation.

Usage

clprofiles(object, x, vars = NULL, col = NULL)

Arguments

object    Object resulting from a call of kproto(). Other kmeans-like objects with object$cluster and object$size are also possible.
x         Original data.
vars      Vector of either column indices or variable names.
col       Palette of cluster colours to be used for the plots. By default RColorBrewer's brewer.pal(max(unique(object$cluster)), "Set3") is used for k > 2 clusters, and lightblue and orange otherwise.

Details

For numeric variables boxplots and for factor variables barplots of each cluster are generated.

Author(s)

<gero.szepannek@web.de>

Examples

# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("a","b"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("a","b"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("a","b"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("a","b"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1,x2,x3,x4)

# apply k-prototypes and profile the result
kpres <- kproto(x, 4)
clprofiles(kpres, x)

# in real world data clusters are often not as clear cut
# by variation of lambda the emphasis is shifted towards factor / numeric variables
kpres <- kproto(x, 2)
kpres <- kproto(x, 2, lambda = 0.1)
kpres <- kproto(x, 2, lambda = 25)

kproto                    k-prototypes clustering

Description

Computes k-prototypes clustering for mixed-type data.

Usage

kproto(x, ...)

## Default S3 method:
kproto(x, k, lambda = NULL, iter.max = 100, nstart = 1, keep.data = TRUE, ...)

Arguments

x         Data frame with both numerics and factors.
k         Either the number of clusters, a vector specifying indices of initial prototypes, or a data frame of prototypes with the same columns as x.
lambda    Parameter > 0 to trade off between the Euclidean distance of numeric variables and the simple matching coefficient between categorical variables. A vector of variable-specific factors is also possible, where the order must correspond to the order of the variables in the data. In this case all variables' distances are multiplied by their corresponding lambda value.
iter.max  Maximum number of iterations if there is no convergence before.
nstart    If > 1, repetitive computations with random initializations are performed and the result with minimum tot.dist is returned.
keep.data Logical indicating whether the original data should be included in the returned object.
...       Currently not used.

Details

Like k-means, the algorithm iteratively recomputes cluster prototypes and reassigns clusters. Clusters are assigned using d(x, y) = d_euclid(x, y) + λ · d_simplematching(x, y). Cluster prototypes are computed as cluster means for numeric variables and as modes for factors (cf. Huang, 1998).

Value

kmeans-like object of class kproto:

cluster       Vector of cluster memberships.
centers       Data frame of cluster prototypes.
lambda        Distance parameter lambda.
size          Vector of cluster sizes.
withinss      Vector of summed distances to the cluster prototype per cluster.
tot.withinss  Target function: sum of all distances to cluster prototypes.
dists         Matrix with distances of observations to all cluster prototypes.
iter          Prespecified maximum number of iterations.
trace         List with two elements (vectors) tracing the iteration process: tot.dists and the number of moved observations over all iterations.

Author(s)

<gero.szepannek@web.de>

References

Z. Huang (1998): Extensions to the k-means Algorithm for Clustering Large Data Sets with Categorical Variables, Data Mining and Knowledge Discovery 2, 283-304.

Examples

# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("a","b"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("a","b"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("a","b"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("a","b"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1,x2,x3,x4)

# apply k-prototypes
kpres <- kproto(x, 4)

# in real world data clusters are often not as clear cut
# by variation of lambda the emphasis is shifted towards factor / numeric variables
kpres <- kproto(x, 2)
kpres <- kproto(x, 2, lambda = 0.1)
kpres <- kproto(x, 2, lambda = 25)

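To make the distance definition in the Details section concrete, the following sketch recomputes d(x, y) = d_euclid(x, y) + λ · d_simplematching(x, y) by hand for a single observation against the fitted prototypes and compares it to the dists matrix returned by kproto(). It is an illustrative check only, not part of the package; it continues the toy data x and fitted object kpres from the examples above and assumes that, as in k-means, the numeric part is the squared Euclidean distance.

num   <- c("x3", "x4")                    # numeric columns of the toy data
fac   <- c("x1", "x2")                    # factor columns of the toy data
obs   <- x[1, ]                           # one observation
proto <- kpres$centers                    # cluster prototypes

# squared Euclidean part over the numeric variables (assumption: squared distances)
d_num <- rowSums(sweep(as.matrix(proto[, num]), 2, unlist(obs[num]))^2)
# simple matching part over the factor variables (number of disagreements)
d_fac <- rowSums(sapply(fac, function(v) as.character(proto[[v]]) != as.character(obs[[v]])))
d_mixed <- d_num + kpres$lambda * d_fac

d_mixed                                   # should (under the above assumption) match ...
kpres$dists[1, ]                          # ... the first row of the returned distance matrix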
lambdaest                 Compares the variation of all variables

Description

Investigation of the variables' variances/variation to specify lambda for k-prototypes clustering.

Usage

lambdaest(x, num.method = 1, fac.method = 1, outtype = "numeric")

Arguments

x           Original data.
num.method  Integer 1 or 2. Specifies the heuristic used for numeric variables.
fac.method  Integer 1 or 2. Specifies the heuristic used for factor variables.
outtype     Specifies the desired output: either "numeric", "vector" or "variation".

Details

The variance (num.method = 1) or standard deviation (num.method = 2) of the numeric variables and 1 − Σᵢ pᵢ² (fac.method = 1) or 1 − maxᵢ pᵢ (fac.method = 2) of the categorical variables is computed.

Value

lambda    Ratio of the averages over all numeric/factor variables is returned. In case of outtype = "vector" a separate lambda for each variable is returned as the inverse of the single variable's variation as specified by the method arguments. outtype = "variation" returns these variation values themselves and is not meant to be passed directly to kproto().

Author(s)

<gero.szepannek@web.de>

Examples

# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("a","b"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("a","b"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("a","b"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("a","b"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1,x2,x3,x4)

lambdaest(x)
res <- kproto(x, 4, lambda = lambdaest(x))

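The default heuristic described in the Details and Value sections can be reproduced by hand. The sketch below computes, for the toy data x from the examples above, the variance of each numeric variable and 1 − Σᵢ pᵢ² of each factor variable, and then takes the ratio of the two averages. This is only an illustration under the default num.method = 1 and fac.method = 1 and one plausible reading of "ratio of averages"; compare the result with lambdaest(x).

num <- sapply(x, is.numeric)

# variation of numeric variables: variance (num.method = 1)
v_num <- sapply(x[, num, drop = FALSE], var)
# variation of factor variables: 1 - sum(p_i^2) (fac.method = 1)
v_fac <- sapply(x[, !num, drop = FALSE],
                function(f) 1 - sum(prop.table(table(f))^2))

mean(v_num) / mean(v_fac)   # ratio of averages over numeric / factor variables
lambdaest(x)                # compare with the package's heuristic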
predict.kproto            k-prototypes clustering prediction

Description

Predicts k-prototypes cluster memberships and distances for new data.

Usage

## S3 method for class 'kproto'
predict(object, newdata, ...)

Arguments

object    Object resulting from a call of kproto().
newdata   New data frame (of the same structure) for which cluster memberships are to be predicted.
...       Currently not used.

Details

Like k-means, the algorithm iteratively recomputes cluster prototypes and reassigns clusters. Clusters are assigned using d(x, y) = d_euclid(x, y) + λ · d_simplematching(x, y). Cluster prototypes are computed as cluster means for numeric variables and as modes for factors (cf. Huang, 1998).

Value

kmeans-like object of class kproto:

cluster   Vector of cluster memberships.
dists     Matrix with distances of observations to all cluster prototypes.

Author(s)

<gero.szepannek@web.de>

Examples

# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("a","b"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("a","b"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("a","b"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("a","b"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1,x2,x3,x4)

# apply k-prototypes
kpres <- kproto(x, 4)
predicted.clusters <- predict(kpres, x)

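The documented example predicts on the training data itself. A more typical use is scoring genuinely new observations; the sketch below is a hypothetical variation that splits the toy data x from the examples above into training and test parts before predicting.

set.seed(1)
idx   <- sample(nrow(x), size = 0.8 * nrow(x))   # hypothetical 80/20 split
train <- x[idx, ]
test  <- x[-idx, ]

kpres <- kproto(train, 4)
pred  <- predict(kpres, newdata = test)

table(pred$cluster)   # cluster memberships of the new observations
head(pred$dists)      # distances of new observations to each prototype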
summary.kproto            Summary method for kproto cluster results

Description

Summarizes the variables' distributions per cluster for a k-prototypes clustering result.

Usage

## S3 method for class 'kproto'
summary(object, data = NULL, pct.dig = 3, ...)

Arguments

object    Object of class kproto.
data      Optional data set to be analyzed. If !is.null(data), clusters for data are assigned by predict(object, data). If not specified, the clusters of the original data are analyzed; this is only possible if kproto() has been called using keep.data = TRUE.
pct.dig   Number of digits for rounding percentages of factor variables.
...       Further arguments to be passed to the internal call of summary() for numeric variables.

Details

For numeric variables, statistics are computed for each cluster using summary(). For categorical variables, distribution percentages are computed.

Value

List where each element corresponds to one variable. Each row of any element corresponds to one cluster.

Author(s)

<gero.szepannek@web.de>

Examples

# generate toy data with factors and numerics
n <- 100
prb <- 0.9
muk <- 1.5
clusid <- rep(1:4, each = n)
x1 <- sample(c("a","b"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x1 <- c(x1, sample(c("a","b"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x1 <- as.factor(x1)
x2 <- sample(c("a","b"), 2*n, replace = TRUE, prob = c(prb, 1-prb))
x2 <- c(x2, sample(c("a","b"), 2*n, replace = TRUE, prob = c(1-prb, prb)))
x2 <- as.factor(x2)
x3 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x4 <- c(rnorm(n, mean = -muk), rnorm(n, mean = muk), rnorm(n, mean = -muk), rnorm(n, mean = muk))
x <- data.frame(x1,x2,x3,x4)

res <- kproto(x, 4)
summary(res)
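A short, hypothetical addition to the example above showing the optional arguments: as documented, pct.dig controls the rounding of factor percentages and data assigns clusters for a (new) data set via predict().

summary(res, pct.dig = 2)   # round factor percentages to 2 digits
summary(res, data = x)      # assign clusters for the supplied data via predict()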

Index

clprofiles, 2
kmeans, 4, 7
kproto, 3
lambdaest, 5
predict.kproto, 6
summary.kproto, 8