Package infer. July 11, 2018


Type Package
Title Tidy Statistical Inference
Version 0.3.0
Description The objective of this package is to perform inference using an expressive statistical grammar that coheres with the tidy design framework.
License CC0
Encoding UTF-8
LazyData true
Imports assertive, dplyr (>= 0.7.0), methods, tibble, rlang (>= 0.2.0), ggplot2, magrittr
Depends R (>= 3.1.2)
Suggests broom, devtools (>= 1.12.0), knitr, rmarkdown, nycflights13, stringr, testthat, covr
URL https://github.com/andrewpbray/infer
BugReports https://github.com/andrewpbray/infer/issues
RoxygenNote 6.0.1
VignetteBuilder knitr
NeedsCompilation no
Author Andrew Bray [aut, cre], Chester Ismay [aut], Ben Baumer [aut], Mine Cetinkaya-Rundel [aut], Ted Laderas [ctb], Nick Solomon [ctb], Johanna Hardin [ctb], Albert Kim [ctb], Neal Fultz [ctb], Doug Friedman [ctb], Richie Cotton [ctb]
Maintainer Andrew Bray <abray@reed.edu>
Repository CRAN
Date/Publication 2018-07-11 05:00:03 UTC

R topics documented:

    calculate
    chisq_stat
    chisq_test
    conf_int
    generate
    hypothesize
    infer
    print.infer
    p_value
    rep_sample_n
    set_params
    specify
    t_stat
    t_test
    visualize
    %>%
    Index


calculate    Calculate summary statistics

Usage

    calculate(x, stat = c("mean", "median", "sd", "prop", "diff in means",
      "diff in medians", "diff in props", "Chisq", "F", "slope",
      "correlation", "t", "z"), order = NULL, ...)

Arguments

    x
        the output from generate for computation-based inference, or the output from hypothesize piped in to here for theory-based inference.
    stat
        a string giving the type of the statistic to calculate. Current options include "mean", "median", "sd", "prop", "diff in means", "diff in medians", "diff in props", "Chisq", "F", "t", "z", "slope", and "correlation".
    order
        a string vector specifying the order in which the levels of the explanatory variable should be ordered for subtraction, where order = c("first", "second") means ("first" - "second"). Needed for inference on difference in means, medians, or proportions and for t and z statistics.
    ...
        to pass options like na.rm = TRUE into functions like mean, sd, etc.
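To make the order convention concrete, here is a minimal base R sketch. It does not use infer's internals; the helper name diff_in_means is hypothetical and only illustrates that order = c("first", "second") is read as the statistic for "first" minus the statistic for "second".

```r
# Hypothetical helper illustrating the `order` convention; not part of infer.
diff_in_means <- function(response, explanatory, order) {
  explanatory <- as.character(explanatory)
  # order = c("first", "second") means mean("first") - mean("second")
  mean(response[explanatory == order[1]]) -
    mean(response[explanatory == order[2]])
}

# mpg by transmission type in mtcars: "1" minus "0", then the reverse
d_10 <- diff_in_means(mtcars$mpg, mtcars$am, order = c("1", "0"))
d_01 <- diff_in_means(mtcars$mpg, mtcars$am, order = c("0", "1"))
```

Swapping the two entries of order only flips the sign of the result, which matters when the direction of a one-sided test is chosen later.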

Value

    A tibble containing a stat column of calculated statistics

Examples

    # Permutation test for two binary variables
    mtcars %>%
      dplyr::mutate(am = factor(am), vs = factor(vs)) %>%
      specify(am ~ vs, success = "1") %>%
      hypothesize(null = "independence") %>%
      generate(reps = 100, type = "permute") %>%
      calculate(stat = "diff in props", order = c("1", "0"))


chisq_stat    A shortcut wrapper function to get the observed test statistic for a chisq test. Uses stats::chisq.test, which applies a continuity correction.

Usage

    chisq_stat(data, formula, ...)

Arguments

    data
        a data frame that can be coerced into a tibble
    formula
        a formula with the response variable on the left and the explanatory on the right
    ...
        additional arguments for stats::chisq.test


chisq_test    A tidier version of chisq.test for goodness of fit tests and tests of independence.

Usage

    chisq_test(data, formula, ...)

Arguments

    data
        a data frame that can be coerced into a tibble
    formula
        a formula with the response variable on the left and the explanatory on the right
    ...
        additional arguments for chisq.test

Examples

    # chisq test for comparing number of cylinders against automatic/manual
    mtcars %>%
      dplyr::mutate(cyl = factor(cyl), am = factor(am)) %>%
      chisq_test(cyl ~ am)


conf_int    Compute the confidence interval for (currently only) simulation-based methods

    get_confidence_interval() and get_ci() are both aliases of conf_int()

Usage

    conf_int(x, level = 0.95, type = "percentile", point_estimate = NULL)
    get_ci(x, level = 0.95, type = "percentile", point_estimate = NULL)
    get_confidence_interval(x, level = 0.95, type = "percentile", point_estimate = NULL)

Arguments

    x
        data frame of calculated statistics or containing attributes of theoretical distribution values. Currently dependent on the statistics being stored in the stat column, as created by the calculate() function.
    level
        a numerical value between 0 and 1 giving the confidence level. Default value is 0.95.
    type
        a string giving which method should be used for creating the confidence interval. The default is "percentile", with "se" corresponding to (2 * standard error) as the other option.
    point_estimate
        a numeric value or a 1x1 data frame, set to NULL by default. Must be provided if type = "se".

Value

    a 2 x 1 tibble with values corresponding to lower and upper values in the confidence interval

Examples

    mtcars_df <- mtcars %>%
      dplyr::mutate(am = factor(am))
    d_hat <- mtcars_df %>%
      specify(mpg ~ am) %>%
      calculate(stat = "diff in means", order = c("1", "0"))
    bootstrap_distn <- mtcars_df %>%
      specify(mpg ~ am) %>%
      generate(reps = 100) %>%
      calculate(stat = "diff in means", order = c("1", "0"))
    bootstrap_distn %>% conf_int(level = 0.9)
    bootstrap_distn %>% conf_int(type = "se", point_estimate = d_hat)


generate    Generate resamples, permutations, or simulations based on specify and (if needed) hypothesize inputs

Usage

    generate(x, reps = 1, type = attr(x, "type"), ...)

Arguments

    x
        a data frame that can be coerced into a tbl_df
    reps
        the number of resamples to generate
    type
        currently either bootstrap, permute, or simulate
    ...
        currently ignored

Value

    A tibble containing rep generated datasets, indicated by the replicate column.

Examples

    # Permutation test for two binary variables
    mtcars %>%
      dplyr::mutate(am = factor(am), vs = factor(vs)) %>%
      specify(am ~ vs, success = "1") %>%
      hypothesize(null = "independence") %>%
      generate(reps = 100, type = "permute")
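The bootstrap workflow from the conf_int example (generate, then calculate, then an interval) can be sketched in base R to show what each step produces. This is an illustration under assumptions, not infer's implementation: the seed and the helper diff_means are hypothetical, and the "se" interval is read here as the point estimate plus or minus two standard errors, following the argument description above.

```r
set.seed(1)  # assumed seed, only to make the sketch reproducible

mtcars_df <- transform(mtcars, am = factor(am))

# Statistic of interest: mean(mpg | am == "1") - mean(mpg | am == "0")
diff_means <- function(d) {
  mean(d$mpg[d$am == "1"]) - mean(d$mpg[d$am == "0"])
}
d_hat <- diff_means(mtcars_df)

# "generate": stack `reps` bootstrap resamples, tagged by a replicate column
reps <- 100
resamples <- do.call(rbind, lapply(seq_len(reps), function(i) {
  rows <- sample(nrow(mtcars_df), replace = TRUE)
  cbind(replicate = i, mtcars_df[rows, c("mpg", "am")])
}))

# "calculate": one statistic per replicate
stat <- vapply(split(resamples, resamples$replicate), diff_means, numeric(1))

# Percentile interval at level 0.9: the middle 90% of the bootstrap statistics
level <- 0.9
percentile_ci <- quantile(stat, probs = c((1 - level) / 2, 1 - (1 - level) / 2))

# "se" interval: point estimate +/- 2 * bootstrap standard error (assumed reading)
se_ci <- d_hat + c(-2, 2) * sd(stat)
```

The percentile interval depends only on the resampled statistics, while the "se" interval is centred on the observed statistic, which is why point_estimate has to be supplied in that case.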

hypothesize    Declare a null hypothesis

Usage

    hypothesize(x, null, ...)

Arguments

    x
        a data frame that can be coerced into a tbl_df
    null
        the null hypothesis. Options include "independence" and "point"
    ...
        arguments passed to downstream functions

Value

    A tibble containing the response (and explanatory, if specified) variable data, with parameter information stored as well (a data frame with attributes set)

Examples

    # Permutation test similar to ANOVA
    mtcars %>%
      dplyr::mutate(cyl = factor(cyl)) %>%
      specify(mpg ~ cyl) %>%
      hypothesize(null = "independence") %>%
      generate(reps = 100, type = "permute") %>%
      calculate(stat = "F")


infer    infer: a grammar for statistical inference

    The objective of this package is to perform statistical inference using a grammar that illustrates the underlying concepts and a format that coheres with the tidyverse.

Examples

    # Example usage:
    library(infer)
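As a short illustration of the grammar, the four verbs documented in this manual can be chained into a single pipeline. The sketch below assumes the infer package is installed and behaves as described here; it uses base R's transform() instead of dplyr::mutate() so that only infer needs to be attached.

```r
library(infer)  # also makes the %>% pipe available

# specify -> hypothesize -> generate -> calculate
null_distn <- transform(mtcars, am = factor(am)) %>%
  specify(mpg ~ am) %>%                          # response ~ explanatory
  hypothesize(null = "independence") %>%         # declare the null
  generate(reps = 100, type = "permute") %>%     # 100 permuted datasets
  calculate(stat = "diff in means", order = c("1", "0"))
```

The result has one row per replicate, with the simulated statistic in the stat column, and can be passed on to p_value, conf_int, or visualize.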

print.infer    Print methods

Usage

    ## S3 method for class 'infer'
    print(x, ...)

Arguments

    x
        an object of class infer, i.e. output from specify or hypothesize
    ...
        arguments passed to methods


p_value    Compute the p-value for (currently only) simulation-based methods

    get_pvalue() is an alias of p_value

Usage

    p_value(x, obs_stat, direction)
    get_pvalue(x, obs_stat, direction)

Arguments

    x
        data frame of calculated statistics or containing attributes of theoretical distribution values
    obs_stat
        a numeric value or a 1x1 data frame (as extreme or more extreme than this)
    direction
        a character string. Options are "less", "greater", or "two_sided". Can also specify "left", "right", or "both".

Value

    a 1x1 data frame with value between 0 and 1
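A simulation-based p-value is the share of simulated statistics that are as extreme as or more extreme than the observed one. The base R sketch below illustrates that idea for the direction options listed above. The helper sim_p_value is hypothetical and is not infer's implementation; in particular, doubling the smaller tail for the two-sided case is one common convention assumed here.

```r
# Hypothetical helper; a base R sketch, not infer's implementation.
sim_p_value <- function(stats, obs_stat, direction) {
  direction <- match.arg(
    direction,
    c("less", "left", "greater", "right", "two_sided", "both")
  )
  p_left  <- mean(stats <= obs_stat)  # share at or below the observed value
  p_right <- mean(stats >= obs_stat)  # share at or above the observed value
  switch(direction,
    less = , left = p_left,
    greater = , right = p_right,
    # Assumed convention: double the smaller tail, capped at 1
    two_sided = , both = min(1, 2 * min(p_left, p_right))
  )
}

# Toy null distribution of six simulated statistics
null_stats <- c(-2, -1, 0, 1, 2, 3)
p_right <- sim_p_value(null_stats, obs_stat = 2, direction = "right")
p_left  <- sim_p_value(null_stats, obs_stat = 2, direction = "less")
p_both  <- sim_p_value(null_stats, obs_stat = 2, direction = "both")
```

With only a handful of replicates the p-value can take very few distinct values, which is one reason to use a larger reps in generate for real analyses.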

Examples

    mtcars_df <- mtcars %>%
      dplyr::mutate(am = factor(am))
    d_hat <- mtcars_df %>%
      specify(mpg ~ am) %>%
      calculate(stat = "diff in means", order = c("1", "0"))
    null_distn <- mtcars_df %>%
      specify(mpg ~ am) %>%
      hypothesize(null = "independence") %>%
      generate(reps = 100) %>%
      calculate(stat = "diff in means", order = c("1", "0"))
    null_distn %>%
      p_value(obs_stat = d_hat, direction = "right")


rep_sample_n    Perform repeated sampling

    Perform repeated sampling of samples of size n. Useful for creating sampling distributions.

Usage

    rep_sample_n(tbl, size, replace = FALSE, reps = 1, prob = NULL)

Arguments

    tbl
        data frame of population from which to sample
    size
        sample size of each sample
    replace
        should sampling be with replacement?
    reps
        number of samples of size n = size to take
    prob
        a vector of probability weights for obtaining the elements of the vector being sampled.

Value

    A tibble of size rep times size rows corresponding to rep samples of size n = size from tbl.

Examples

    suppressPackageStartupMessages(library(dplyr))
    suppressPackageStartupMessages(library(ggplot2))

    # A virtual population of N = 10,010, of which 3091 are hurricanes
    population <- dplyr::storms %>%
      select(status)

    # Take samples of size n = 50 storms without replacement; do this 1000 times
    samples <- population %>%
      rep_sample_n(size = 50, reps = 1000)
    samples

    # Compute p_hats for all 1000 samples = proportion hurricanes
    p_hats <- samples %>%
      group_by(replicate) %>%
      summarize(prop_hurricane = mean(status == "hurricane"))
    p_hats

    # Plot sampling distribution
    ggplot(p_hats, aes(x = prop_hurricane)) +
      geom_density() +
      labs(x = "p_hat", y = "Number of samples",
           title = "Sampling distribution of p_hat from 1000 samples of size 50")


set_params    To determine which theoretical distribution to fit (if any)

Usage

    set_params(x)

Arguments

    x
        a data frame that can be coerced into a tibble


specify    Specify the response and explanatory variables

    specify also converts character variables chosen to be factors.

Usage

    specify(x, formula, response = NULL, explanatory = NULL, success = NULL)

Arguments

    x
        a data frame that can be coerced into a tibble
    formula
        a formula with the response variable on the left and the explanatory on the right
    response
        the variable name in x that will serve as the response. This is an alternative to using the formula argument
    explanatory
        the variable name in x that will serve as the explanatory variable
    success
        the level of response that will be considered a success, as a string. Needed for inference on one proportion, a difference in proportions, and corresponding z stats

Value

    A tibble containing the response (and explanatory, if specified) variable data

Examples

    # Permutation test similar to ANOVA
    mtcars %>%
      dplyr::mutate(cyl = factor(cyl)) %>%
      specify(mpg ~ cyl) %>%
      hypothesize(null = "independence") %>%
      generate(reps = 100, type = "permute") %>%
      calculate(stat = "F")


t_stat    A shortcut wrapper function to get the observed test statistic for a t test

Usage

    t_stat(data, formula, ...)

Arguments

    data
        a data frame that can be coerced into a tibble
    formula
        a formula with the response variable on the left and the explanatory on the right
    ...
        pass in arguments to infer functions
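Both specify and t_stat take a formula with the response on the left and the explanatory variable on the right. The base R sketch below shows how the two sides of such a formula can be pulled apart and used to select columns. It is only an illustration of the convention; how infer itself parses the formula is not shown in this manual.

```r
f <- mpg ~ cyl

# A two-sided formula is a call: f[[1]] is `~`, f[[2]] the left-hand side,
# f[[3]] the right-hand side
response    <- deparse(f[[2]])
explanatory <- deparse(f[[3]])

# Keep only the variables named in the formula
selected <- mtcars[, c(response, explanatory)]
```

Under this reading, specify(mpg ~ cyl) and specify(response = mpg, explanatory = cyl) name the same pair of variables, which matches the "alt:" comments in the visualize examples.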

t_test    A tidier version of t.test for two sample tests

Usage

    t_test(data, formula, order = NULL, alternative = "two_sided", mu = 0,
      conf_int = TRUE, conf_level = 0.95, ...)

Arguments

    data
        a data frame that can be coerced into a tibble
    formula
        a formula with the response variable on the left and the explanatory on the right
    order
        a string vector specifying the order in which the levels of the explanatory variable should be ordered for subtraction, where order = c("first", "second") means ("first" - "second")
    alternative
        character string giving the direction of the alternative hypothesis. Options are "two_sided" (default), "greater", or "less".
    mu
        a numeric value giving the hypothesized null mean value for a one sample test and the hypothesized difference for a two sample test
    conf_int
        a logical value for whether to include the confidence interval or not. TRUE by default
    conf_level
        a numeric value between 0 and 1. Default value is 0.95
    ...
        for passing in other arguments to stats::t.test

Examples

    # t test for comparing mpg against automatic/manual
    mtcars %>%
      dplyr::mutate(am = factor(am)) %>%
      t_test(mpg ~ am, order = c("1", "0"), alternative = "less")


visualize    Visualize the distribution of the simulation-based inferential statistics or the theoretical distribution (or both!)

Usage

    visualize(data, bins = 15, method = "simulation", dens_color = "black",
      obs_stat = NULL, obs_stat_color = "red2", pvalue_fill = "pink",
      direction = NULL, endpoints = NULL,
      endpoints_color = "mediumaquamarine", ci_fill = "turquoise", ...)

Arguments

    data
        the output from calculate
    bins
        the number of bins in the histogram
    method
        a string giving the method to display. Options are "simulation", "theoretical", or "both", with "both" corresponding to "simulation" and "theoretical"
    dens_color
        a character or hex string specifying the color of the theoretical density curve
    obs_stat
        a numeric value or 1x1 data frame corresponding to what the observed statistic is
    obs_stat_color
        a character or hex string specifying the color of the observed statistic as a vertical line on the plot
    pvalue_fill
        a character or hex string specifying the color to shade the p-value. In previous versions of the package this was the shade_color argument
    direction
        a string specifying in which direction the shading should occur. Options are "less", "greater", or "two_sided" for p-value. Can also give "left", "right", or "both" for p-value. For confidence intervals, use "between" and give the endpoint values in endpoints
    endpoints
        a 2 element vector or a 1 x 2 data frame containing the lower and upper values to be plotted. Most useful for visualizing confidence intervals.
    endpoints_color
        a character or hex string specifying the color of the observed statistic as a vertical line on the plot
    ci_fill
        a character or hex string specifying the color to shade the confidence interval
    ...
        other arguments passed along to ggplot2

Value

    A ggplot object showing the simulation-based distribution as a histogram or bar graph. Also used to show the theoretical curves.

Examples

    # Permutations to create a simulation-based null distribution for
    # one numerical response and one categorical predictor
    # using t statistic
    mtcars %>%
      dplyr::mutate(am = factor(am)) %>%
      specify(mpg ~ am) %>% # alt: response = mpg, explanatory = am
      hypothesize(null = "independence") %>%
      generate(reps = 100, type = "permute") %>%

      calculate(stat = "t", order = c("1", "0")) %>%
      visualize(method = "simulation") # default method

    # Theoretical t distribution for
    # one numerical response and one categorical predictor
    # using t statistic
    mtcars %>%
      dplyr::mutate(am = factor(am)) %>%
      specify(mpg ~ am) %>% # alt: response = mpg, explanatory = am
      hypothesize(null = "independence") %>%
      # generate() is not needed since we are not doing simulation
      calculate(stat = "t", order = c("1", "0")) %>%
      visualize(method = "theoretical")

    # Overlay theoretical distribution on top of randomized t-statistics
    mtcars %>%
      dplyr::mutate(am = factor(am)) %>%
      specify(mpg ~ am) %>% # alt: response = mpg, explanatory = am
      hypothesize(null = "independence") %>%
      generate(reps = 100, type = "permute") %>%
      calculate(stat = "t", order = c("1", "0")) %>%
      visualize(method = "both")


%>%    Pipe

    Like dplyr, infer also uses the pipe function, %>%, to turn function composition into a series of imperative statements.

Arguments

    lhs, rhs
        Inference functions and the initial data frame
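As a small illustration of what "turning function composition into a series of imperative statements" means, the sketch below compares a piped expression with its nested equivalent. It loads magrittr, which is listed under Imports above and provides %>%; the choice of subset() and nrow() is only for demonstration.

```r
library(magrittr)  # provides %>%

# Piped form: read left to right, one step per function
piped <- mtcars %>%
  subset(cyl == 4) %>%
  nrow()

# Equivalent nested form: read inside out
nested <- nrow(subset(mtcars, cyl == 4))

same_result <- identical(piped, nested)
```

The left-hand side of each %>% is passed as the first argument of the call on the right, which is why every infer verb takes the data (or the previous step's output) as its first argument.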

Index

    %>%
    calculate
    chisq_stat
    chisq_test
    conf_int
    generate
    get_ci (conf_int)
    get_confidence_interval (conf_int)
    get_pvalue (p_value)
    hypothesize
    infer
    infer-package (infer)
    p_value
    print.infer
    rep_sample_n
    set_params
    specify
    t_stat
    t_test
    tbl_df
    tibble
    visualize