Package bestnormalize - PDF Free Download

Type Package Title Normalizing Transformation Functions Version 1.0.1 Date 2018-02-05 Package bestnormalize February 5, 2018 Estimate a suite of normalizing transformations, including a new adaptation of a technique based on ranks which can guarantee normally distributed transformed data if there are no ties: ordered quantile normalization (ORQ). ORQ normalization combines a rank-mapping approach with a shifted logit approimation that allows the transformation to work on data outside the original domain. It is also able to handle new data within the original domain via linear interpolation. The package is built to estimate the best normalizing transformation for a vector consistently and accurately. It implements the Bo-Co transformation, the Yeo-Johnson transformation, three types of Lambert WF transformations, and the ordered quantile normalization transformation. URL https://github.com/petersonr/bestnormalize License GPL-3 Depends R (>= 3.1.0) Imports LambertW, nortest, parallel Suggests knitr, MASS, testthat, mgcv VignetteBuilder knitr LazyData true RoygenNote 6.0.1 NeedsCompilation no Author Ryan Andrew Peterson [aut, cre] Maintainer Ryan Andrew Peterson <ryan-peterson@uiowa.edu> Repository CRAN Date/Publication 2018-02-05 18:48:28 UTC 1

2 autotrader R topics documented: bestnormalize-package................................... 2 autotrader.......................................... 2 bestnormalize........................................ 3 binarize........................................... 5 boco............................................ 7 lambert........................................... 8 ordernorm.......................................... 10 plot.bestnormalize..................................... 12 yeojohnson......................................... 13 Inde 15 bestnormalize-package bestnormalize: Fleibly calculate the best normalizing transformation for a vector The bestnormalize package provides several normalizing transformations, and introduces a new transformation based off of the order statistics, ordernorm. Perhaps the most useful function is bestnormalize, which attempts all of these transformations and picks the best one based off of a goodness of fit statistic. Author(s) Maintainer: Ryan Andrew Peterson <ryan-peterson@uiowa.edu> See Also Useful links: https://github.com/petersonr/bestnormalize autotrader Prices of 6,283 cars listed on Autotrader A dataset containing the prices and other attributes of over 6000 cars in the Minneapolis area. autotrader

bestnormalize 3 Format A data frame with 6283 rows and 10 variables: price price, in US dollars Car_Info Raw description from website Link hyperlink to listing (must be appended to https://www.autotrader.com/) Make Car manufacturer Year Year car manufactured Location Location of listing Radius Radius chosen for search mileage mileage on vehicle status used/new/certified model make and model, separated by space Source https://www.autotrader.com/ bestnormalize Calculate and perform best normalizing transformation Performs a suite of normalizing transformations, and selects the best one on the basis of the Pearson P test statistic for normality. The transformation that has the lowest P (calculated on the transformed data) is selected. See details for more information. bestnormalize(, allow_ordernorm = TRUE, out_of_sample = TRUE, cluster = NULL, k = 10, r = 5) ## S3 method for class 'bestnormalize' predict(object, newdata = NULL, inverse = FALSE,...) ## S3 method for class 'bestnormalize' print(,...)

4 bestnormalize Arguments A vector to normalize allow_ordernorm set to FALSE if ordernorm should not be applied out_of_sample cluster k r Details Value object newdata inverse if FALSE, estimates quickly in-sample performance name of cluster set using makecluster number of folds number of repeats an object of class bestnormalize a vector of data to be (reverse) transformed if TRUE, performs reverse transformation... additional arguments bestnormalize estimates the optimal normalizing transformation. This transformation can be performed on new data, and inverted, via the predict function. This function currently estimates the Yeo-Johnson transformation, the Bo Co transformation (if the data is positive), and the Lambert WF Gaussianizing transformation of type "s". If allow_ordernorm == TRUE and if out_of_sample == FALSE then the ordered quantile normalization technique will likely be chosen since it essentially forces the data to follow a normal distribution. More information on the ordernorm technique can be found in the package vignette, or using?ordernorm. Repeated cross-validation is used to estimate the out-of-sample performance of each transformation if out_of_sample = TRUE. While this can take some time, users can speed it up by creating a cluster via the parallel package s makecluster function, and passing the name of this cluster to bestnormalize via the cl argument. For best performance, we recommend the number of clusters to be set to the number of repeats r. Care should be taken to account for the number of observations per fold; to small a number and the estimated normality statistic could be inaccurate, or at least suffer from high variability. NOTE: Only the Lambert technique of type = "s" (skew) ensures that the transformation is consistently 1-1, so it is the only method currently used in bestnormalize(). Use type = "h" or type = hh at risk of not having this estimate 1-1 transform. These alternative types are effective when the data has eceptionally heavy tails, e.g. the Cauchy distribution. A list of class bestnormalize with elements.t norm_stats method transformed original data original data Pearson s Pearson s P / degrees of freedom out-of-sample or in-sample, number of folds + repeats

binarize 5 chosen_transform the chosen transformation (of appropriate class) other_transforms the other transformations (of appropriate class) The predict function returns the numeric value of the transformation performed on new data, and allows for the inverse transformation as well. See Also boco, lambert, ordernorm, yeojohnson Eamples <- rgamma(100, 1, 1) ## Not run: # With Repeated CV BN_obj <- bestnormalize() BN_obj p <- predict(bn_obj) 2 <- predict(bn_obj, newdata = p, inverse = TRUE) all.equal(2, ) ## End(Not run) # Without Repeated CV BN_obj <- bestnormalize(, allow_ordernorm = FALSE, out_of_sample = FALSE) BN_obj p <- predict(bn_obj) 2 <- predict(bn_obj, newdata = p, inverse = TRUE) all.equal(2, ) binarize Binarize This function will perform a binarizing transformation, which could be used as a last resort if the data cannot be adequately normalized. This may be useful when accidentally attempting normalization of a binary vector (which could occur if implementing bestnormalize in an automated fashion). Note that the transformation is not one-to-one, in contrast to the other functions in this package.

6 binarize binarize(, location_measure = "median") ## S3 method for class 'binarize' predict(object, newdata = NULL, inverse = FALSE,...) ## S3 method for class 'binarize' print(,...) Arguments A vector to binarize location_measure which location measure should be used? can either be "median", "mean", "mode", a number, or a function. object newdata inverse an object of class binarize a vector of data to be (reverse) transformed if TRUE, performs reverse transformation... additional arguments Value A list of class binarize with elements.t method location n norm_stat transformed original data original data location_measure used for original fitting estimated location_measure number of nonmissing observations Pearson s P / degrees of freedom The predict function with inverse = FALSE returns the numeric value (0 or 1) of the transformation on newdata (which defaults to the original data). If inverse = TRUE, since the transform is not 1-1, it will create and return a factor that indicates where the original data was cut. Eamples <- rgamma(100, 1, 1) binarize_obj <- binarize() (p <- predict(binarize_obj)) predict(binarize_obj, newdata = p, inverse = TRUE)

boco 7 boco Bo-Co Normalization Perform a Bo-Co transformation and center/scale a vector to attempt normalization boco(,...) ## S3 method for class 'boco' predict(object, newdata = NULL, inverse = FALSE,...) ## S3 method for class 'boco' print(,...) Arguments Details Value A vector to normalize with Bo-Co... Additional arguments that can be passed to the estimation of the lambda parameter (lower, upper, epsilon) object newdata inverse an object of class boco a vector of data to be (reverse) transformed if TRUE, performs reverse transformation boco estimates the optimal value of lambda for the Bo-Co transformation. This transformation can be performed on new data, and inverted, via the predict function. The function will return an error if a user attempt to transform nonpositive data. A list of class boco with elements.t mean sd lambda n norm_stat transformed original data original data mean of vector post-bc transformation sd of vector post-bc transformation estimated lambda value for skew transformation number of nonmissing observations Pearson s P / degrees of freedom The predict function returns the numeric value of the transformation performed on new data, and allows for the inverse transformation as well.

8 lambert References Bo, G. E. P. and Co, D. R. (1964) An analysis of transformations. Journal of the Royal Statistical Society B, 26, 211-252. See Also boco Eamples <- rgamma(100, 1, 1) bc_obj <- boco() bc_obj p <- predict(bc_obj) 2 <- predict(bc_obj, newdata = p, inverse = TRUE) all.equal(2, ) lambert Lambert W F Normalization Perform Lambert s W F transformation and center/scale a vector to attempt normalization via the LambertW package. lambert(, type = "s",...) ## S3 method for class 'lambert' predict(object, newdata = NULL, inverse = FALSE,...) ## S3 method for class 'lambert' print(,...) Arguments type A vector to normalize with Bo-Co a character indicating which transformation to perform (options are "s", "h", and "hh", see details)... Additional arguments that can be passed to the LambertW::Gaussianize function object newdata inverse an object of class lambert a vector of data to be (reverse) transformed if TRUE, performs reverse transformation

lambert 9 Details Value lambert uses the LambertW package to estimate a normalizing (or "Gaussianizing") transformation. This transformation can be performed on new data, and inverted, via the predict function. NOTE: The type = "s" argument is the only one that does the 1-1 transform consistently, and so it is the only method currently used in bestnormalize(). Use type = "h" or type = hh at risk of not having this estimate 1-1 transform. These alternative types are effective when the data has eceptionally heavy tails, e.g. the Cauchy distribution. Additionally, sometimes (depending on the distribution) this method will be unable to etrapolate beyond the observed bounds. In these cases, NaN is returned. A list of class lambert with elements.t mean sd tau.mat n norm_stat transformed original data original data mean of vector sd of vector post-bc transformation estimated parameters of LambertW::Gaussianize number of nonmissing observations Pearson s P / degrees of freedom The predict function returns the numeric value of the transformation performed on new data, and allows for the inverse transformation as well. References Georg M. Goerg (2016). LambertW: An R package for Lambert W F Random Variables. R package version 0.6.4. Georg M. Goerg (2011): Lambert W random variables - a new family of generalized skewed distributions with applications to risk estimation. Annals of Applied Statistics 3(5). 2197-2230. Georg M. Goerg (2014): The Lambert Way to Gaussianize heavy-tailed data with the inverse of Tukey s h transformation as a special case. The Scientific World Journal. See Also Gaussianize Eamples <- rgamma(100, 1, 1) lambert_obj <- lambert() lambert_obj p <- predict(lambert_obj) 2 <- predict(lambert_obj, newdata = p, inverse = TRUE)

10 ordernorm all.equal(2, ) ordernorm Calculate and perform Ordered Quantile normalizing transformation The Ordered Quantile (ORQ) normalization transformation, ordernorm(), is a rank-based procedure by which the values of a vector are mapped to their percentile, which is then mapped to the same percentile of the normal distribution. Without the presence of ties, this essentially guarantees that the transformation leads to a uniform distribution. The transformation is: g() = Φ 1 (rank()/(length() + 1)) Where Φ refers to the standard normal cdf, rank() refers to each observation s rank, and length() refers to the number of observations. By itself, this method is certainly not new; the earliest mention of it that I could find is in a 1947 paper by Bartlett (see references). This formula was outlined eplicitly in Van der Waerden, and epounded upon in Beasley (2009). However there is a key difference to this version of it, as eplained below. Using linear interpolation between these percentiles, the ORQ normalization becomes a 1-1 transformation that can be applied to new data. However, outside of the observed domain of, it is unclear how to etrapolate the transformation. In the ORQ normalization procedure, a binomial glm with a logit link is used on the ranks in order to etrapolate beyond the bounds of the original domain of. The inverse normal CDF is then applied to these etrapolated predictions in order to etrapolate the transformation. This mitigates the influence of heavy-tailed distributions while preserving the 1-1 nature of the transformation. The etrapolation will provide a warning unless warn = FALSE.) However, we found that the etrapolation was able to perform very well even on data as heavy-tailed as a Cauchy distribution (paper to be published). This transformation can be performed on new data and inverted via the predict function. ordernorm(, warn = TRUE) ## S3 method for class 'ordernorm' predict(object, newdata = NULL, inverse = FALSE, warn = TRUE,...) ## S3 method for class 'ordernorm' print(,...)

ordernorm 11 Arguments A vector to normalize warn transforms outside observed range or ties will yield warning object an object of class ordernorm newdata a vector of data to be (reverse) transformed inverse if TRUE, performs reverse transformation... additional arguments Value A list of class ordernorm with elements.t n ties_status fit norm_stat transformed original data original data number of nonmissing observations indicator if ties are present fit to be used for etrapolation, if needed Pearson s P / degrees of freedom The predict function returns the numeric value of the transformation performed on new data, and allows for the inverse transformation as well. References Bartlett, M. S. "The Use of Transformations." Biometrics, vol. 3, no. 1, 1947, pp. 39-52. JSTOR www.jstor.org/stable/3001536. Van der Waerden BL. Order tests for the two-sample problem and their power. 1952;55:453-458. Ser A. Beasley TM, Erickson S, Allison DB. Rank-based inverse normal transformations are increasingly used, but are they merited? Behav. Genet. 2009;39(5): 580-595. pmid:19526352 See Also boco, lambert, bestnormalize, yeojohnson Eamples <- rgamma(100, 1, 1) ordernorm_obj <- ordernorm() ordernorm_obj p <- predict(ordernorm_obj) 2 <- predict(ordernorm_obj, newdata = p, inverse = TRUE) all.equal(2, )

12 plot.bestnormalize plot.bestnormalize Transformation plotting Plots transformation functions for objects produced by the bestnormalize package ## S3 method for class 'bestnormalize' plot(, inverse = FALSE, bounds = NULL, cols = c("green3", 1, 2, 4:6), leg_loc = "top",...) ## S3 method for class 'ordernorm' plot(, inverse = FALSE, bounds = NULL,...) ## S3 method for class 'boco' plot(, inverse = FALSE, bounds = NULL,...) ## S3 method for class 'yeojohnson' plot(, inverse = FALSE, bounds = NULL,...) ## S3 method for class 'lambert' plot(, inverse = FALSE, bounds = NULL,...) Arguments a fitted transformation inverse if TRUE, plots the inverse transformation bounds a vector of bounds to plot for the transformation cols a vector of colors to use for the transforms (see details) leg_loc the location of the legend on the plot... further parameters to be passed to plot and lines Details The plots produced by the individual transformations are simply plots of the original values by the newly transformed values, with a line denoting where transformations would take place for new data. For the bestnormalize object, this plots each of the possible transformations run by the original call to bestnormalize. The first argument in the "cols" parameter refers to the color of the chosen transformation.

yeojohnson 13 yeojohnson Yeo-Johnson Normalization Perform a Yeo-Johnson Transformation and center/scale a vector to attempt normalization yeojohnson(, eps = 0.001,...) ## S3 method for class 'yeojohnson' predict(object, newdata = NULL, inverse = FALSE,...) ## S3 method for class 'yeojohnson' print(,...) Arguments eps Details Value A vector to normalize with Yeo-Johnson A value to compare lambda against to see if it is equal to zero... Additional arguments that can be passed to the estimation of the lambda parameter (lower, upper) object newdata inverse an object of class yeojohnson a vector of data to be (reverse) transformed if TRUE, performs reverse transformation yeojohnson estimates the optimal value of lambda for the Yeo-Johnson transformation. This transformation can be performed on new data, and inverted, via the predict function. The Yeo-Johnson is similar to the Bo-Co method, however it allows for the transformation of nonpositive data as well. The step_yeojohnson function in the recipes package is another useful resource (see references). A list of class yeojohnson with elements.t mean sd lambda transformed original data original data mean of vector post-yj transformation sd of vector post-bc transformation estimated lambda value for skew transformation

14 yeojohnson n norm_stat number of nonmissing observations Pearson s P / degrees of freedom The predict function returns the numeric value of the transformation performed on new data, and allows for the inverse transformation as well. References Yeo, I. K., & Johnson, R. A. (2000). A new family of power transformations to improve normality or symmetry. Biometrika. Ma Kuhn and Hadley Wickham (2017). recipes: Preprocessing Tools to Create Design Matrices. R package version 0.1.0.9000. https://github.com/topepo/recipes Eamples <- rgamma(100, 1, 1) yeojohnson_obj <- yeojohnson() yeojohnson_obj p <- predict(yeojohnson_obj) 2 <- predict(yeojohnson_obj, newdata = p, inverse = TRUE) all.equal(2, )

Inde Topic datasets autotrader, 2 _PACKAGE (bestnormalize-package), 2 autotrader, 2 bestnormalize, 3, 11 bestnormalize-package, 2 binarize, 5 boco, 5, 7, 8, 11 Gaussianize, 9 lambert, 5, 8, 11 ordernorm, 5, 10 plot.bestnormalize, 12 plot.boco (plot.bestnormalize), 12 plot.lambert (plot.bestnormalize), 12 plot.ordernorm (plot.bestnormalize), 12 plot.yeojohnson (plot.bestnormalize), 12 predict.bestnormalize (bestnormalize), 3 predict.binarize (binarize), 5 predict.boco (boco), 7 predict.lambert (lambert), 8 predict.ordernorm (ordernorm), 10 predict.yeojohnson (yeojohnson), 13 print.bestnormalize (bestnormalize), 3 print.binarize (binarize), 5 print.boco (boco), 7 print.lambert (lambert), 8 print.ordernorm (ordernorm), 10 print.yeojohnson (yeojohnson), 13 yeojohnson, 5, 11, 13 15