Package 'densityClust'


October 24, 2017

Type Package
Title Clustering by Fast Search and Find of Density Peaks
Version 0.3
Date 2017-10-24
Maintainer Thomas Lin Pedersen <thomasp85@gmail.com>
Description An improved implementation (based on k-nearest neighbors) of the density peak clustering algorithm, originally described by Alex Rodriguez and Alessandro Laio (Science, 2014 vol. 344). It can handle large datasets (> 100,000 samples) very efficiently. It was initially implemented by Thomas Lin Pedersen, with inputs from Sean Hughes, and later improved by Xiaojie Qiu to handle large datasets with kNNs.
License GPL (>= 2)
Suggests testthat
LinkingTo Rcpp
Imports Rcpp, FNN, Rtsne, ggplot2, ggrepel, grDevices, gridExtra, RColorBrewer
RoxygenNote 6.0.1
NeedsCompilation yes
Author Thomas Lin Pedersen [aut, cre], Sean Hughes [aut], Xiaojie Qiu [aut]
Repository CRAN
Date/Publication 2017-10-24 12:52:51 UTC

R topics documented:

    densityClust-package
    clustered
    clusters
    densityClust
    estimateDc

    findClusters
    plotDensityClust
    plotMDS
    plotTSNE


densityClust-package    Clustering by fast search and find of density peaks

Description

This package implements the clustering algorithm described by Alex Rodriguez and Alessandro Laio (2014). It provides the user with tools for generating the initial rho and delta values for each observation, as well as for using these to assign observations to clusters. This is done in two passes, so the user is free to reassign observations to clusters using a new set of rho and delta thresholds without needing to recalculate everything.

Plotting

Two types of plots are supported by this package, and both mimic the types of plots used in the publication for the algorithm. The standard plot function produces a decision plot, with optional colouring of cluster peaks if these are assigned. Furthermore, plotMDS() performs a multidimensional scaling of the distance matrix and plots this as a scatterplot. If clusters are assigned, observations are coloured according to their assignment.

Cluster detection

The two main functions of this package are densityClust() and findClusters(). The former takes a distance matrix and optionally a distance cutoff and calculates rho and delta for each observation. The latter takes the output of densityClust() and makes a cluster assignment for each observation based on user-defined rho and delta thresholds. If the thresholds are not specified, the user is able to supply them interactively by clicking on a decision plot.

Author(s)

Maintainer: Thomas Lin Pedersen <thomasp85@gmail.com>

Authors:

    Sean Hughes
    Xiaojie Qiu <xqiu@uw.edu>

References

Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492-1496. doi:10.1126/science.1242072

See Also

densityClust(), findClusters(), plotMDS()

Examples

    irisdist <- dist(iris[, 1:4])
    irisclust <- densityClust(irisdist, gaussian = TRUE)
    plot(irisclust) # Inspect clustering attributes to define thresholds

    irisclust <- findClusters(irisclust, rho = 2, delta = 2)
    plotMDS(irisclust)
    split(iris[, 5], irisclust$clusters)


clustered    Check whether a densityCluster object has been clustered

Description

This function checks whether findClusters() has been performed on the given object and returns a boolean depending on the outcome.

Usage

    clustered(x)

    ## S3 method for class 'densityCluster'
    clustered(x)

Arguments

    x    A densityCluster object

Value

TRUE if findClusters() has been performed, otherwise FALSE.

clusters    Extract cluster membership from a densityCluster object

Description

This function allows the user to extract the cluster membership of all the observations in the given densityCluster object. The output can be formatted in two ways, as described below. Halo observations can optionally be removed from the output.

Usage

    clusters(x, ...)

    ## S3 method for class 'densityCluster'
    clusters(x, as.list = FALSE, halo.rm = TRUE, ...)

Arguments

    x          The densityCluster object. findClusters() must have been performed prior to this call to avoid throwing an error.
    ...        Currently ignored
    as.list    Should the output be in the list format. Defaults to FALSE
    halo.rm    Logical. Should halo observations be removed. Defaults to TRUE

Details

Two formats for the output are available. The first is a vector of integers denoting, for each observation, which cluster the observation belongs to. If halo observations are removed, these are set to NA. The second format is a list with a vector for each group containing the indexes of the member observations in the group. If halo observations are removed, their indexes are omitted. The list format corresponds to the following transform of the vector format: split(1:length(clusters), clusters), where clusters is the cluster information in vector format.

Value

A vector or list with cluster memberships for the observations in the initial distance matrix.
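The relationship between the two output formats can be tried out in base R without the package. This is a minimal sketch on a made-up membership vector (the names membership and membership_list are hypothetical, not part of the package); it only illustrates the split() transform quoted in the Details.

```r
# Hypothetical membership vector in the "vector format";
# NA marks a halo observation that has been removed.
membership <- c(1L, 2L, 1L, NA, 2L, 1L)

# List format: one element per cluster, holding the indexes of its members.
# split() drops positions whose grouping value is NA, so the halo
# observation (index 4) does not appear in any element.
membership_list <- split(seq_along(membership), membership)

print(membership_list)
```

seq_along(membership) is used here in place of 1:length(membership) because it also behaves sensibly for a zero-length vector.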

densityClust    Calculate clustering attributes based on the densityClust algorithm

Description

This function takes a distance matrix and optionally a distance cutoff and calculates the values necessary for clustering based on the algorithm proposed by Alex Rodriguez and Alessandro Laio (see References). The actual assignment to clusters is done in a later step, based on user-defined threshold values. If a distance matrix is passed into distance, the original algorithm described in the paper is used. If a matrix or data.frame is passed instead, it is interpreted as point coordinates and rho is estimated based on the k-nearest neighbors of each point (rho is estimated as exp(-mean(x)), where x is the distance to the nearest neighbors). This can be useful when the data is so large that calculating the full distance matrix would be prohibitive.

Usage

    densityClust(distance, dc, gaussian = FALSE, verbose = FALSE, ...)

Arguments

    distance    A distance matrix or a matrix (or data.frame) for the coordinates of the data. If a matrix or data.frame is used, the distances and local density will be estimated using a fast k-nearest neighbor approach.
    dc          A distance cutoff for calculating the local density. If missing, it will be estimated with estimateDc(distance)
    gaussian    Logical. Should a gaussian kernel be used to estimate the density (defaults to FALSE)
    verbose     Logical. Should the running details be reported
    ...         Additional parameters passed on to get.knn

Details

The function calculates rho and delta for the observations in the provided distance matrix. If a distance cutoff is not provided, it is first estimated using estimateDc() with default values.

The information kept in the densityCluster object is:

    rho          A vector of local density values
    delta        A vector of minimum distances to observations of higher density
    distance     The initial distance matrix
    dc           The distance cutoff used to calculate rho
    threshold    A named vector specifying the threshold values for rho and delta used for cluster detection
    peaks        A vector of indexes specifying the cluster center for each cluster

    clusters     A vector of cluster affiliations for each observation. The clusters are referenced as indexes in the peaks vector
    halo         A logical vector specifying for each observation if it is considered part of the halo
    knn_graph    The kNN graph constructed. Only applicable when coordinates are used as input. Currently it is set to NA.
    nearest_higher_density_neighbor
                 Index of the nearest sample with higher density. Only applicable when coordinates are used as input.
    nn.index     Indices of each cell's k-nearest neighbors. Only applicable when coordinates are used as input.
    nn.dist      Distances to each cell's k-nearest neighbors. Only applicable when coordinates are used as input.

Before running findClusters() the threshold, peaks, clusters and halo data are NA.

Value

A densityCluster object. See Details for a description.

References

Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492-1496. doi:10.1126/science.1242072

See Also

estimateDc(), findClusters()

Examples

    irisdist <- dist(iris[, 1:4])
    irisclust <- densityClust(irisdist, gaussian = TRUE)
    plot(irisclust) # Inspect clustering attributes to define thresholds

    irisclust <- findClusters(irisclust, rho = 2, delta = 2)
    plotMDS(irisclust)
    split(iris[, 5], irisclust$clusters)


estimateDc    Estimate the distance cutoff for a specified neighbor rate

Description

This function calculates a distance cutoff value for a specific distance matrix that makes the average neighbor rate (number of points within the distance cutoff value) fall within the provided range. The authors of the algorithm suggest aiming for a neighbor rate between 1 and 2 percent, but also state that the algorithm is quite robust with regard to more extreme cases.
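The idea of a neighbor rate can be illustrated in base R. The following is a minimal sketch under stated assumptions, not the package's estimateDc() implementation: it uses synthetic data, treats the neighbor rate as the fraction of pairwise distances below the cutoff, and simply picks the 1.5% quantile of the distances as a candidate cutoff. The names pts, neighbor_rate and dc_guess are hypothetical.

```r
set.seed(42)

# Synthetic 2D coordinates (assumption: any numeric data would do here).
pts <- matrix(rnorm(200 * 2), ncol = 2)
d <- as.vector(dist(pts))  # all pairwise distances

# Neighbor rate for a candidate cutoff: the fraction of pairwise
# distances that fall below that cutoff.
neighbor_rate <- function(dc) mean(d < dc)

# Illustrative shortcut: the distance at the 1.5% quantile lies in the
# middle of the suggested 1-2 percent range.
dc_guess <- unname(quantile(d, probs = 0.015))
rate <- neighbor_rate(dc_guess)

cat("candidate dc:", dc_guess, " neighbor rate:", rate, "\n")
```

A quantile is only one way to land inside the target range; a search over candidate cutoffs that stops once the rate falls between the two bounds would serve the same purpose.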

Usage

    estimateDc(distance, neighborRateLow = 0.01, neighborRateHigh = 0.02)

Arguments

    distance            A distance matrix
    neighborRateLow     The lower bound of the neighbor rate
    neighborRateHigh    The upper bound of the neighbor rate

Value

A numeric value giving the estimated distance cutoff value.

Note

If the number of points is larger than 448 (resulting in 100,128 pairwise distances), 100,128 distance pairs will be randomly selected to speed up computation. Use set.seed() prior to calling estimateDc() in order to ensure reproducible results.

References

Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492-1496. doi:10.1126/science.1242072

Examples

    irisdist <- dist(iris[, 1:4])
    estimateDc(irisdist)


findClusters    Detect clusters in a densityCluster object

Description

This function uses the supplied rho and delta thresholds to detect cluster peaks and assign the rest of the observations to one of these clusters. Furthermore, core/halo status is calculated. If either the rho or the delta threshold is missing, the user is presented with a decision plot where they are able to click on the plot area to set the threshold. If either rho or delta is set, this takes precedence over the value found by clicking.

Usage

    findClusters(x, ...)

    ## S3 method for class 'densityCluster'
    findClusters(x, rho, delta, plot = FALSE, peaks = NULL, verbose = FALSE, ...)

Arguments

    x          A densityCluster object as produced by densityClust()
    ...        Additional parameters passed on
    rho        The threshold for local density when detecting cluster peaks
    delta      The threshold for minimum distance to higher density when detecting cluster peaks
    plot       Logical. Should a decision plot be shown after cluster detection
    peaks      A numeric vector giving the indexes of the density peaks used for clustering. This vector should be retrieved from the decision plot with caution; no checking is performed.
    verbose    Logical. Should the running details be reported

Value

A densityCluster object with clusters assigned to all observations.

References

Rodriguez, A., & Laio, A. (2014). Clustering by fast search and find of density peaks. Science, 344(6191), 1492-1496. doi:10.1126/science.1242072

Examples

    irisdist <- dist(iris[, 1:4])
    irisclust <- densityClust(irisdist, gaussian = TRUE)
    plot(irisclust) # Inspect clustering attributes to define thresholds

    irisclust <- findClusters(irisclust, rho = 2, delta = 2)
    plotMDS(irisclust)
    split(iris[, 5], irisclust$clusters)
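To make the two passes concrete, the sketch below computes rho and delta from a distance matrix and then applies thresholds to pick peaks and assign the remaining observations, all in base R. It is a simplified illustration of the density peak idea under stated assumptions, not the package's code: the data, the cutoff dc, the thresholds and every variable name are hypothetical, a Gaussian kernel is assumed for rho, ties are broken by rank order, and the core/halo calculation is left out.

```r
set.seed(1)

# Two well-separated synthetic groups in 2D (hypothetical data).
pts <- rbind(matrix(rnorm(60, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(60, mean = 5, sd = 0.3), ncol = 2))
D <- as.matrix(dist(pts))
n <- nrow(D)
dc <- 0.5  # hypothetical distance cutoff

# Pass 1a: local density with a Gaussian kernel. The self-term
# exp(0) = 1 is subtracted so a point does not count itself.
rho <- rowSums(exp(-(D / dc)^2)) - 1

# Pass 1b: delta is the distance to the nearest observation of higher
# density. Observations are ranked by decreasing rho; the top-ranked
# one gets its largest distance instead.
ord <- order(rho, decreasing = TRUE)
delta <- numeric(n)
nearest_higher <- rep(NA_integer_, n)
delta[ord[1]] <- max(D[ord[1], ])
for (k in 2:n) {
  i <- ord[k]
  higher <- ord[seq_len(k - 1)]
  j <- higher[which.min(D[i, higher])]
  delta[i] <- D[i, j]
  nearest_higher[i] <- j
}

# Pass 2: peaks exceed both thresholds; every other observation
# inherits the cluster of its nearest higher-density neighbor.
rho_thr <- 2
delta_thr <- 3
peaks <- which(rho > rho_thr & delta > delta_thr)
cluster <- rep(NA_integer_, n)
cluster[peaks] <- seq_along(peaks)
for (i in ord) {
  if (is.na(cluster[i])) cluster[i] <- cluster[nearest_higher[i]]
}

table(cluster)
```

Because the second pass only reads rho, delta and nearest_higher, it can be rerun with different thresholds without recomputing the first pass, which is the reason the package splits the work into two functions.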

plotDensityClust    Plot densityCluster results

Description

Generate a single panel of up to three diagnostic plots for a densityCluster object.

Usage

    plotDensityClust(x, type = "all", n = 20, mds = NULL, dim.x = 1,
      dim.y = 2, col = NULL, alpha = 0.8)

Arguments

    x               A densityCluster object as produced by densityClust()
    type            A character vector designating which figures to produce. Valid options include "dg" for a decision graph of δ vs. ρ, "gg" for a gamma graph depicting the decrease of γ (= δ * ρ) across samples, and "mds" for a Multi-Dimensional Scaling (MDS) plot of observations. Any combination of these three can be included in the vector, or to produce all plots, specify type = "all".
    n               Number of observations to plot in the gamma graph.
    mds             A matrix of scores for observations from a Principal Components Analysis or MDS. If omitted, and an MDS plot has been requested, one will be calculated.
    dim.x, dim.y    The numbers of the dimensions to plot on the x and y axes of the MDS plot.
    col             Vector of colors for clusters.
    alpha           Value between 0 and 1 controlling the transparency of points in the decision graph and MDS plot.

Value

A panel of the figures specified in type is produced. If designated, clusters are color-coded and labelled. If present in x, the rho and delta thresholds are designated in the decision graph by a set of solid black lines.

Author(s)

Eric Archer <eric.archer@noaa.gov>

Examples

    data(iris)
    data.dist <- dist(iris[, 1:4])
    pca <- princomp(iris[, 1:4])

    # Run initial density clustering
    dens.clust <- densityClust(data.dist)

    op <- par(ask = TRUE)

    # Show the decision graph
    plotDensityClust(dens.clust, type = "dg")

    # Show the decision graph and the gamma graph
    plotDensityClust(dens.clust, type = c("dg", "gg"))

    # Cluster based on rho and delta
    new.clust <- findClusters(dens.clust, rho = 4, delta = 2)

    # Show all graphs with clustering
    plotDensityClust(new.clust, mds = pca$scores)

    par(op)


plotMDS    Plot observations using multidimensional scaling and colour by cluster

Description

This function produces an MDS scatterplot based on the distance matrix of the densityCluster object (if only coordinate information is available, a distance matrix is calculated first) and, if clusters are defined, colours each observation according to cluster affiliation. Observations belonging to a cluster core are plotted with filled circles and observations belonging to the halo with hollow circles. This plot is not suitable for large datasets (for example, datasets with > 1000 samples). Users are advised to use other methods, such as t-SNE, to visualize their clustering results in that case.

Usage

    plotMDS(x, ...)

Arguments

    x      A densityCluster object as produced by densityClust()
    ...    Additional parameters. Currently ignored

See Also

densityClust() for creating densityCluster objects, and plotTSNE() for an alternative plotting approach.

Examples

    irisdist <- dist(iris[, 1:4])
    irisclust <- densityClust(irisdist, gaussian = TRUE)
    plot(irisclust) # Inspect clustering attributes to define thresholds

    irisclust <- findClusters(irisclust, rho = 2, delta = 2)
    plotMDS(irisclust)
    split(iris[, 5], irisclust$clusters)


plotTSNE    Plot observations using t-distributed neighbor embedding and colour by cluster

Description

This function produces a t-SNE scatterplot based on the distance matrix of the densityCluster object (if only coordinate information is available, a distance matrix is calculated first) and, if clusters are defined, colours each observation according to cluster affiliation. Observations belonging to a cluster core are plotted with filled circles and observations belonging to the halo with hollow circles.

Usage

    plotTSNE(x, ...)

Arguments

    x      A densityCluster object as produced by densityClust()
    ...    Additional parameters. Currently ignored

See Also

densityClust() for creating densityCluster objects, and plotMDS() for an alternative plotting approach.

Examples

    irisdist <- dist(iris[, 1:4])
    irisclust <- densityClust(irisdist, gaussian = TRUE)
    plot(irisclust) # Inspect clustering attributes to define thresholds

    irisclust <- findClusters(irisclust, rho = 2, delta = 2)
    plotTSNE(irisclust)
    split(iris[, 5], irisclust$clusters)
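For readers who want to see what a distance-based embedding plot involves, the sketch below builds a two-dimensional classical MDS layout with base R's cmdscale() and colours the points by a grouping vector. It is an illustration only and makes assumptions: it does not claim to reproduce how plotMDS() or plotTSNE() are implemented, and the iris species labels stand in for the cluster vector that findClusters() would provide.

```r
# Distance matrix for the iris measurements, as in the manual's examples.
iris_dist <- dist(iris[, 1:4])

# Classical multidimensional scaling into two dimensions.
mds <- cmdscale(iris_dist, k = 2)

# Placeholder grouping used for colour; with the package this would be
# the cluster vector of a clustered densityCluster object instead.
grp <- as.integer(iris$Species)

plot(mds, col = grp, pch = 19,
     xlab = "Dimension 1", ylab = "Dimension 2",
     main = "Classical MDS of iris distances")
```

A t-SNE layout could be substituted for the cmdscale() step when the number of observations makes a full MDS impractical.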
