Package phylogram. June 15, 2017

Similar documents
Package kdtools. April 26, 2018

Package geojsonsf. R topics documented: January 11, Type Package Title GeoJSON to Simple Feature Converter Version 1.3.

Package kdetrees. February 20, 2015

Package ECctmc. May 1, 2018

Package markophylo. December 31, 2015

Package coalescentmcmc

Package tidytree. June 13, 2018

Package gggenes. R topics documented: November 7, Title Draw Gene Arrow Maps in 'ggplot2' Version 0.3.2

Package smoothr. April 4, 2018

Package stapler. November 27, 2017

Package fastdummies. January 8, 2018

Package cattonum. R topics documented: May 2, Type Package Version Title Encode Categorical Features

Package validara. October 19, 2017

Package wrswor. R topics documented: February 2, Type Package

Package graphframes. June 21, 2018

Package treeio. April 1, 2018

Package treeio. January 1, 2019

Package UNF. June 13, 2017

PROTEIN MULTIPLE ALIGNMENT MOTIVATION: BACKGROUND: Marina Sirota

Package docxtools. July 6, 2018

Package semver. January 6, 2017

Package snakecase. R topics documented: March 25, Version Date Title Convert Strings into any Case

Package lvec. May 24, 2018

Package Numero. November 24, 2018

Package fst. June 7, 2018

Package sfc. August 29, 2016

Package labelvector. July 28, 2018

Package readxl. April 18, 2017

Package narray. January 28, 2018

Package redux. May 31, 2018

Package jdx. R topics documented: January 9, Type Package Title 'Java' Data Exchange for 'R' and 'rjava'

Package nmslibr. April 14, 2018

Package rtext. January 23, 2019

Package enpls. May 14, 2018

Package zebu. R topics documented: October 24, 2017

Package datasets.load

Package effectr. January 17, 2018

Package mgc. April 13, 2018

Package fastqcr. April 11, 2017

Package phylotop. R topics documented: February 21, 2018

Package balance. October 12, 2018

Package widyr. August 14, 2017

Package tibble. August 22, 2017

Package customlayout

Package bisect. April 16, 2018

Package barcoder. October 26, 2018

Package diagis. January 25, 2018

Package fst. December 18, 2017

Package strat. November 23, 2016

Package nngeo. September 29, 2018

Package catenary. May 4, 2018

Package ggimage. R topics documented: December 5, Title Use Image in 'ggplot2' Version 0.1.0

Package meme. November 2, 2017

Package optimus. March 24, 2017

Package meme. December 6, 2017

Package spark. July 21, 2017

Package rsppfp. November 20, 2018

Package slam. December 1, 2016

Package nltt. October 12, 2016

Package postgistools

Package splithalf. March 17, 2018

Package rprojroot. January 3, Title Finding Files in Project Subdirectories Version 1.3-2

Package ggrepel. September 30, 2017

Package rucrdtw. October 13, 2017

Package BiocNeighbors

Package styler. December 11, Title Non-Invasive Pretty Printing of R Code Version 1.0.0

Package gtrendsr. August 4, 2018

Package signalhsmm. August 29, 2016

Package gtrendsr. October 19, 2017

Unsupervised Learning

Package SC3. November 27, 2017

Package bigreadr. R topics documented: August 13, Version Date Title Read Large Text Files

Package condformat. October 19, 2017

Package covr. October 18, 2018

Package paralleldist

Package ggimage. R topics documented: November 1, Title Use Image in 'ggplot2' Version 0.0.7

Package reconstructr

Package canvasxpress

Package modules. July 22, 2017

Package muscle. R topics documented: March 7, Type Package

Package Rtsne. April 14, 2017

Package roperators. September 28, 2018

Package bannercommenter

Package indelmiss. March 22, 2019

Package projector. February 27, 2018

Package utf8. May 24, 2018

Package gquad. June 7, 2017

Package textrank. December 18, 2017

Package rngtools. May 15, 2018

Package robotstxt. November 12, 2017

Package rgho. R topics documented: January 18, 2017

Recent Research Results. Evolutionary Trees Distance Methods

Package d3heatmap. February 1, 2018

Package bikedata. April 27, 2018

Package TipDatingBeast

Package transformr. December 9, 2018

Package scmap. March 29, 2019

Package fitur. March 11, 2018

Package brranching. R topics documented:

Package mixsqp. November 14, 2018

Transcription:

Type Package Title Dendrograms for Evolutionary Analysis Version 1.0.1 Author [aut, cre] Package phylogram June 15, 2017 Maintainer <shaunpwilkinson@gmail.com> Contains functions for importing and eporting 'dendrogram' objects in parenthetic tet format, and several functions for command-line tree manipulation. With an emphasis on speed and computational efficiency, the package also includes a suite of tools for rapidly computing distance matrices and building large trees using fast alignment-free 'k-mer' counting and divisive clustering techniques. License GPL-3 LazyData TRUE URL http://github.com/shaunpwilkinson/phylogram BugReports http://github.com/shaunpwilkinson/phylogram/issues Encoding UTF-8 Depends R (>= 3.0.0) Imports openssl, Rcpp (>= 0.12.8), stats, Suggests ape (>= 3.0), knitr, rmarkdown, testthat LinkingTo Rcpp RoygenNote 6.0.1 VignetteBuilder knitr NeedsCompilation yes Repository CRAN Date/Publication 2017-06-14 23:58:19 UTC 1

2 kcount R topics documented: kcount............................................ 2 kdistance.......................................... 4 ladder............................................ 5 mbed............................................ 7 phylogram.......................................... 9 print.mbed.......................................... 10 prune............................................ 11 read.dendrogram...................................... 12 remidpoint.......................................... 13 reposition.......................................... 14 topdown........................................... 15 ultrametricize........................................ 17 write.dendrogram...................................... 17 Inde 19 kcount K-mer counting. Count all k-letter words in a sequence or set of sequences with a sliding window of length k. kcount(, k = 5, residues = NULL, gap = "-") k residues gap a matri of aligned sequences, a list of unaligned sequences, or a vector representing a single sequence. Accepted modes are "character" and "raw" (the latter being applicable for "DNAbin" and "AAbin" objects). integer representing the k-mer size. Defaults to 5. Note that high values of k may be slow to compute and use a lot of memory due to the large numbers of calculations required, particularly when the residue alphabet is also large. either NULL (default; the residue alphabet is automatically detected from the sequences), a case sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors. Specifying the residue alphabet is therefore recommended unless is a "DNAbin" or "AAbin" object. the character used to represent gaps in the alignment matri (if applicable). Ignored for "DNAbin" and "AAbin" objects. Defaults to "-" otherwise.

kcount 3 Details This function computes a vector or matri of k-mer counts from a sequence or set of sequences using a sliding a window of length k. DNA and amino acid sequences can be passed to the function either as a list of non-aligned sequences or a matri of aligned sequences, preferably in the "DNAbin" or "AAbin" raw-byte format (Paradis et al 2004, 2012; see the ape package documentation for more information on these S3 classes). Character sequences are supported; however ambiguity codes may not be recognized or treated appropriately, since raw ambiguity codes are counted according to their underlying residue frequencies (e.g. the 5-mer "ACRGT" would contribute 0.5 to the tally for "ACAGT" and 0.5 to that of "ACGGT"). To minimize computation time when counting longer k-mers (k > 3), amino acid sequences in the raw "AAbin" format are automatically compressed using the Dayhoff-6 alphabet as detailed in Edgar (2004). Note that amino acid sequences will not be compressed if they are supplied as a list of character vectors rather than an "AAbin" object, in which case the k-mer length should be reduced (k < 4) to avoid ecessive memory use and computation time. Returns a matri of k-mer counts with one row for each sequence and n^k columns (where n is the size of the residue alphabet and k is the k-mer size) References Edgar RC (2004) Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research, 32, 380-385. Paradis E, Claude J, Strimmer K, (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290. Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York. See Also kdistance for k-mer distance matri computation. ## compute a matri of k-mer counts for the woodmouse ## data (ape package) using a k-mer size of 3 library(ape) data(woodmouse) <- kcount(woodmouse, k = 3) ## 64 columns for nucleotide 3-mers AAA, AAC,... TTT ## convert to AAbin object and repeat the operation y <- kcount(ape::trans(woodmouse, 2), k = 2) y

4 kdistance ## 400 columns for amino acid 2-mers AA, AB,..., YY kdistance K-mer distance matri computation. Computes the matri of k-mer distances between all pairwise comparisons of a set of sequences. kdistance(, k = 5, method = "edgar", residues = NULL, gap = "-",...) k method residues gap Details a matri of aligned sequences or a list of unaligned sequences. Accepted modes are "character" and "raw" (the latter being applicable for "DNAbin" and "AAbin" objects). integer representing the k-mer size to be used for calculating the distance matri. Defaults to 5. Note that high values of k may be slow to compute and use a lot of memory due to the large numbers of calculations required, particularly when the residue alphabet is also large. a character string giving the k-mer distance measure to be used. Currently the available options are "edgar" (default; see Edgar (2004) for details) and the standard methods available for the base function "dist" ("euclidean", "maimum", "manhattan", "canberra", "binary" and "minkowski"). either NULL (default; the residue alphabet is automatically detected from the sequences), a case sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors. Specifying the residue alphabet is therefore recommended unless is a "DNAbin" or "AAbin" object. the character used to represent gaps in the alignment matri (if applicable). Ignored for "DNAbin" or "AAbin" objects. Defaults to "-" otherwise.... further arguments to be passed to "as.dist". This function computes the n * n k-mer distance matri (where n is the number of sequences), returning an object of class "dist". DNA and amino acid sequences can be passed to the function either as a list of non-aligned sequences or as a matri of aligned sequences, preferably in the "DNAbin" or "AAbin" raw-byte format (Paradis et al 2004, 2012; see the ape package documentation for more information on these S3 classes). Character sequences are supported; however ambiguity codes may not be recognized or treated appropriately, since raw ambiguity codes are counted according to their underlying residue frequencies (e.g. the 5-mer "ACRGT" would contribute 0.5 to the tally for "ACAGT" and 0.5 to that of "ACGGT").

ladder 5 To minimize computation time when counting longer k-mers (k > 3), amino acid sequences in the raw "AAbin" format are automatically compressed using the Dayhoff-6 alphabet as detailed in Edgar (2004). Note that amino acid sequences will not be compressed if they are supplied as a list of character vectors rather than an "AAbin" object, in which case the k-mer length should be reduced (k < 4) to avoid ecessive memory use and computation time. an object of class "dist". References Edgar RC (2004) Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research, 32, 380-385. Paradis E, Claude J, Strimmer K, (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290. Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York. See Also kcount for k-mer counting, and mbed for leaner distance matrices ## compute a k-mer distance matri for the woodmouse ## dataset (ape package) using a k-mer size of 5 library(ape) data(woodmouse) ### subset global alignment by removing gappy ends woodmouse <- woodmouse[, apply(woodmouse, 2, function(v)!any(v == 0f0))] ### compute the distance matri woodmouse.dist <- kdistance(woodmouse, k = 5) ### cluster and plot UPGMA tree woodmouse.tree <- as.dendrogram(hclust(woodmouse.dist, "average")) plot(woodmouse.tree) ladder Reorder tree branches in ladderized pattern. This function ladderizes the branches of a dendrogram object to aid in visual interpretation.

6 ladder ladder(, decreasing = FALSE) decreasing an object of class "dendrogram". logical indicating whether the tree should be ladderized upwards or downwards. Defaults to FALSE (downwards). Details This function is the dendrogram analogue of the ladderize function in the ape package (Paradis et al 2004, 2012). Returns an object of class dendrogram. References Paradis E, Claude J, Strimmer K, (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290. Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York. See Also The ladderize function in the ape package performs a similar operation for objects of class "phylo". <- read.dendrogram(tet = "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);") plot(, horiz = TRUE) <- ladder(, decreasing = TRUE) plot(, horiz = TRUE)

mbed 7 mbed Convert sequences to vectors of distances to a subset of seed sequences. This function computes a matri of distances from each sequence to a subset of seed sequences using the method outlined in Blacksheilds et al (2010). mbed(, seeds = NULL, k = 5, residues = NULL, gap = "-", counts = FALSE) seeds k residues gap counts a matri of aligned sequences or a list of unaligned sequences. Accepted modes are "character" and "raw" (the latter is for "DNAbin" and "AAbin" objects). optional integer vector indicating which sequences should be used as the seed sequences. If seeds = NULL a set of log(n, 2)^2 non-identical sequences is randomly selected from the sequence set (where n is the number of sequences; see Blacksheilds et al. 2010). Alternatively, if seeds = 'all' a standard n * n distance matri is computed. integer representing the k-mer size to be used for calculating the distance matri. Defaults to 5. Note that high values of k may be slow to compute and use a lot of memory due to the large numbers of calculations required, particularly when the residue alphabet is also large. either NULL (default; emitted residues are automatically detected from the sequences), a case sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors. Specifying the residue alphabet is therefore recommended unless is a "DNAbin" or "AAbin" object. the character used to represent gaps in the alignment matri (if applicable). Ignored for "DNAbin" or "AAbin" objects. Defaults to "-" otherwise. logical indicating whether the (usually large) matri of k-mer counts should be returned as an attribute of the returned object. Defaults to FALSE. Details This function computes a n * log(n, 2)^2 k-mer distance matri (where n is the number of sequences), returning an object of class "mbed". If the number of sequences is less than or equal to 19, the full n * n distance matri is produced (since the rounded up value of log(19, 2)^2 is 19). Currently the only distance measure supported is that of Edgar (2004). For maimum information retention following the embedding process it is generally desirable to select the seed sequences based on their uniqueness, rather than simply selecting a random subset

8 mbed (Blackshields et al. 2010). Hence if seeds is set to NULL (the default setting) the the mbed function selects the subset by clustering the sequence set into t groups using the k-means algorithm (k = t), and choosing one representative from each group. Users can alternatively pass an integer vector (as in the above eample) to specify the seeds manually. See Blackshields et al (2010) for other seed selection options. DNA and amino acid sequences can be passed to the function either as a list of non-aligned sequences or as a matri of aligned sequences, preferably in the "DNAbin" or "AAbin" raw-byte format (Paradis et al 2004, 2012; see the ape package documentation for more information on these S3 classes). Character sequences are supported; however ambiguity codes may not be recognized or treated appropriately, since raw ambiguity codes are counted according to their underlying residue frequencies (e.g. the 5-mer "ACRGT" would contribute 0.5 to the tally for "ACAGT" and 0.5 to that of "ACGGT"). To minimize computation time when counting longer k-mers (k > 3), amino acid sequences in the raw "AAbin" format are automatically compressed using the Dayhoff-6 alphabet as detailed in Edgar (2004). Note that amino acid sequences will not be compressed if they are supplied as a list of character vectors rather than an "AAbin" object, in which case the k-mer length should be reduced (k < 4) to avoid ecessive memory use and computation time. Note that agglomerative (bottom-up) tree-building methods such as neighbor-joining and UPGMA depend on a full n * n distance matri. See the kdistance function for details on computing symmetrical distance matrices. Returns an object of class "mbed", whose primary object is an n * log(n, 2)^2 matri (where n is the number of sequences). The returned object contains additional attributes including an integer vector of seed sequence indices ("seeds"), a logical vector identifying the duplicated sequences ("duplicates"), an integer vector giving the matching indices of the non-duplicated sequences ("pointers"), a character vector of MD5 digests of the sequences ("hashes"), an integer vector of sequence lengths ("seqlengths"), and if counts = TRUE, the matri of k-mer counts ("kcounts"; see kcount for details). References Blackshields G, Sievers F, Shi W, Wilm A, Higgins DG (2010) Sequence embedding for fast construction of guide trees for multiple sequence alignment. Algorithms for Molecular Biology, 5, 21. Edgar RC (2004) Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research, 32, 380-385. Paradis E, Claude J, Strimmer K, (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290. Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York.

phylogram 9 See Also kdistance for full n * n distance matri computation. ## compute an embedded k-mer distance matri for the woodmouse ## dataset (ape package) using a k-mer size of 5 library(ape) data(woodmouse) ## randomly select three sequences as seeds set.seed(999) seeds <- sample(1:15, size = 3) ## embed the woodmouse dataset in three dimensions woodmouse.mbed <- mbed(woodmouse, seeds = seeds, k = 5) ## print the distance matri (without attributes) print(woodmouse.mbed[,], digits = 2) phylogram Dendrograms for evolutionary analysis. The phylogram package contains functions for importing and eporting dendrogram objects in the Newick parenthetic tet format, as well as several functions for command-line tree manipulation. With an emphasis on speed and computational efficiency, the package also includes a suite of tools for rapidly computing distance matrices and building large trees using fast alignment-free k-mer counting and divisive clustering techniques. Functions A breif description of the primary phylogram functions are provided with links to their help pages below. File import/eport Tree building read.dendrogram is a tet parser that converts parenthetic tet (Newick strings) into objects of class "dendrogram" write.dendrogram outputs an object of class "dendrogram" to a tet string or file in Newick/New Hampshire format kcount tabulates all of the k-letter words in a sequence or set of sequence kdistance calculates pairwise distances between sequences by k-mer counting mbed embeds sequences as vectors of distances to a set of seed sequences topdown builds a phylogenetic tree by successively splitting a set of sequences (recursive partitioning)

10 print.mbed Tree editing and manipulation prune remove branches from a dendrogram object based on regular epression pattern matching ladder reorders the branches of a dendrogram object to aid visualization remidpoint recursively sets "midpoint" and "members" attributes for a nested list/dendrogram object reposition shifts a dendrogram object up or down (or sideways if plotted horizontally) ultrametricize modifies the "height" attributes of the nodes such that all leaves terminate at zero print.mbed Print summary methods. Print summary methods. ## S3 method for class 'mbed' print(,...) object of various classes.... additional arguments to be passed between methods. NULL (invisibly)

prune 11 prune Remove tree nodes by regular epression pattern matching. "prune" takes an object of class "dendrogram" and removes all branches whose branch labels match a given regular epression. prune(tree, pattern, invert = FALSE, untag = FALSE,...) tree pattern invert untag an object of class "dendrogram". a regular epression. logical indicating whether the branches whose labels match the regular epression provided in "pattern" should be discarded and the others kept (FALSE; default) or vice versa. Nodes without "label" attributes are ignored. logical (used only when invert = TRUE). Indicates whether the specified pattern should be removed from the branch labels in the returned object.... further arguments to be passed to grepl and gsub. Details This function recursively tests the "label" attribute of each dendrogram node (including non-leaf inner nodes if applicable) for the specified pattern, removing those that register a positive hit. Note that positive matching inner nodes are removed along with all of their sub-nodes, regardless of whether the "label" attributes of the sub-nodes match the pattern. Returns an object of class "dendrogram". See Also The drop.tip function in the ape package performs a similar operation for objects of class "phylo". See rege for help with compiling regular epressions.

12 read.dendrogram <- read.dendrogram(tet = "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);") plot(, horiz = TRUE) <- prune(, pattern = "^A$") plot(, horiz = TRUE) read.dendrogram Read a dendrogram from parenthetic tet. read.dendrogram parses a tet file or character string in Newick (New Hampshire) format and creates an object of class "dendrogram". read.dendrogram(file = "", tet = NULL, edges = TRUE,...) file tet Details edges character string giving a valid path to the file from where to read the data. optional character string in lieu of a "file" argument. If a tet argument is provided instead of a file path, the data are read via a tet connection. logical indicating whether edge weights provided in the Newick string should be retained in the returned object (defaults to TRUE).... further arguments to be passed to scan. There are varying interpretations of the Newick/New Hampshire tet format. This function tries to adhere to the Felsenstein standard outlined here. The function supports weighted edges, labels with special metacharacters (enclosed in single quotation marks), comments (enclosed in square brackets; ignored by the parser), multifuricating nodes, and both rooted and unrooted trees. Comments enclosed in square brackets are also discarded. Inner-node labels (for eample "(B:6.0,(A:5.0,C:3.0,E:4.0)Ancestor1:5.0,D:11 are also currently ignored; however, the parsing of label attributes for non-leaf dendrogram nodes will be available in a future version. Returns an object of class "dendrogram".

remidpoint 13 References http://evolution.genetics.washington.edu/phylip/newicktree.html http://evolution. genetics.washington.edu/phylip/newick_doc.html See Also write.dendrogram writes an object of class "dendrogram" to a tet string. The read.tree function in the ape package performs a similar operation for objects of class "phylo" and "multiphylo". newick <- "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);" <- read.dendrogram(tet = newick) plot(, horiz = TRUE) remidpoint Set dendrogram attributes for a nested list. remidpoint is a helper function used for manually creating "dendrogram" objects from nested lists. The function recursively assigns the necessary midpoint and members attributes at each node. remidpoint() a nested list, possibly of class "dendrogram" returns a nested list, or an object of class "dendrogram" depending on the class of the input object.

14 reposition ## manually create a small dendrogram with three members, A, B and C <- list("a", list("b", "C")) attr([[1]], "leaf") <- TRUE attr([[2]][[1]], "leaf") <- TRUE attr([[2]][[2]], "leaf") <- TRUE attr([[1]], "label") <- "A" attr([[2]][[1]], "label") <- "B" attr([[2]][[2]], "label") <- "C" attr(, "height") <- 1 attr([[1]], "height") <- 0 attr([[2]], "height") <- 0.5 attr([[2]][[1]], "height") <- 0 attr([[2]][[2]], "height") <- 0 <- remidpoint() class() <- "dendrogram" plot(, horiz = TRUE) reposition Reset dendrogram height attributes. reposition is a helper function used for manually creating "dendrogram" objects from nested lists. The function recursively reassigns the height attributes at each node by a given constant. reposition(, shift = "reset") shift an object of class "dendrogram". either the character string "reset" (shift the graph so that the height of the farthest leaf from the root is zero), or a numeric value giving the amount to shift the graph along the primary ais. Returns an object of class "dendrogram".

topdown 15 <- read.dendrogram(tet = "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);") plot(, horiz = TRUE) <- reposition() plot(, horiz = TRUE) topdown Top down (divisive) tree-building. Create phylogenetic trees by successively splitting the sequence dataset into smaller and smaller subsets. topdown(, k = 5, residues = NULL, gap = "-",...) k residues gap Details a list or matri of sequences, possibly an object of class "DNAbin" or "AAbin". integer. The k-mer size required. either NULL (default; emitted residues are automatically detected from the sequences), a case sensitive character vector specifying the residue alphabet, or one of the character strings "RNA", "DNA", "AA", "AMINO". Note that the default option can be slow for large lists of character vectors. Specifying the residue alphabet is therefore recommended unless the sequence list is a "DNAbin" or "AAbin" object. the character used to represent gaps in the alignment matri (if applicable). Ignored for "DNAbin" or "AAbin" objects. Defaults to "-" otherwise.... further arguments to be passed to kmeans (not including centers). This function creates a tree by successively splitting the dataset into smaller and smaller subsets (recursive partitioning). This is a divisive, or "top-down" approach to tree-building, as opposed to agglomerative "bottom-up" methods such as neighbour joining and UPGMA. It is particularly useful for large large datasets with many sequences (n > 10, 000) since the need to compute a large n * n distance matri is circumvented. Instead, a matri of k-mer counts is computed, and split recursively row-wise using a k-means clustering algorithm (k = 2). This effectively reduces the time and memory compleity from quadratic to linear, while generally maintaining comparable accuracy. If a more accurate tree is required, users can increase the value of nstart passed to kmeans via the... argument. While this can increase computation time, it can improve tree accuracy considerably. DNA and amino acid sequences can be passed to the function either as a list of non-aligned sequences or a matri of aligned sequences, preferably in the "DNAbin" or "AAbin" raw-byte format

16 topdown (Paradis et al 2004, 2012; see the ape package documentation for more information on these S3 classes). Character sequences are supported; however ambiguity codes may not be recognized or treated appropriately, since raw ambiguity codes are counted according to their underlying residue frequencies (e.g. the 5-mer "ACRGT" would contribute 0.5 to the tally for "ACAGT" and 0.5 to that of "ACGGT"). To minimize computation time when counting longer k-mers (k > 3), amino acid sequences in the raw "AAbin" format are automatically compressed using the Dayhoff-6 alphabet as detailed in Edgar (2004). Note that amino acid sequences will not be compressed if they are supplied as a list of character vectors rather than an "AAbin" object, in which case the k-mer length should be reduced (k < 4) to avoid ecessive memory use and computation time. Returns an object of class "dendrogram". References Edgar RC (2004) Local homology recognition and distance measures in linear time using compressed amino acid alphabets. Nucleic Acids Research, 32, 380-385. Paradis E, Claude J, Strimmer K, (2004) APE: analyses of phylogenetics and evolution in R language. Bioinformatics 20, 289-290. Paradis E (2012) Analysis of Phylogenetics and Evolution with R (Second Edition). Springer, New York. See Also kcount ## Not run: ## Cluster the woodmouse dataset (ape package) library(ape) data(woodmouse) ## trim gappy ends to subset global alignment woodmouse <- woodmouse[, apply(woodmouse, 2, function(v)!any(v == 0f0))] ## build tree divisively set.seed(999) woodmouse.tree <- topdown(woodmouse, nstart = 20) ## plot tree op <- par(no.readonly = TRUE) par(mar = c(5, 2, 4, 8) + 0.1) plot(woodmouse.tree, main = "Woodmouse phylogeny", horiz = TRUE) par(op) ## End(Not run)

ultrametricize 17 ultrametricize Make dendrogram ultrametric. This is a simple function that sets the height attributes of all leaf nodes to zero to aid vizualization. ultrametricize() an object of class "dendrogram". Returns an object of class "dendrogram". <- read.dendrogram(tet = "(A:0.1,B:0.2,(C:0.3,D:0.4):0.5);") plot(, horiz = TRUE) <- ultrametricize() plot(, horiz = TRUE) write.dendrogram Eport a dendrogram object to tet. This function writes a dendrogram object to Newick-style parenthetic tet. write.dendrogram(, file = "", append = FALSE, edges = TRUE,...)

18 write.dendrogram file append edges See Also an object of class "dendrogram". a character string naming a file or connection to write the output to. If no file path is specified or file = "" the result is printed to the console. logical indicating whether the output should be appended to the file. If append = FALSE the contents of the file will be overwritten (the default setting). logical indicating whether edge weights should be included in the output string.... further arguments to be passed to format. Used to specify the numbering style of the edge weights (if edges = TRUE). read.dendrogram to create a "dendrogram" object from a tet file. The write.tree function in the ape package performs a similar operation for "phylo" and "multiphylo" objects. ## build and eport tree for the woodmouse data (ape package) library(ape) data(woodmouse) ## trim gappy ends for global alignment woodmouse <- woodmouse[, apply(woodmouse, 2, function(v)!any(v == 0f0))] ## build topdown tree set.seed(999) <- topdown(woodmouse, nstart = 20) write.dendrogram(, edges = TRUE)

Inde ape, 3, 4, 6, 8, 11, 13, 16, 18 drop.tip, 11 kcount, 2, 5, 8, 9, 16 kdistance, 3, 4, 8, 9 ladder, 5, 10 ladderize, 6 mbed, 5, 7, 9 phylogram, 9 phylogram-package (phylogram), 9 print.mbed, 10 prune, 10, 11 read.dendrogram, 9, 12, 18 read.tree, 13 rege, 11 remidpoint, 10, 13 reposition, 10, 14 topdown, 9, 15 ultrametricize, 10, 17 write.dendrogram, 9, 13, 17 write.tree, 18 19