Introducing R/Tidyverse to Clinical Statistical Programming

Similar documents
An Introduction to R. Ed D. J. Berry 9th January 2017

Data Wrangling in the Tidyverse

The Tidyverse BIOF 339 9/25/2018

Data Manipulation. Module 5

Lecture 09. Graphics::ggplot I R Teaching Team. October 1, 2018

Lecture 12: Data carpentry with tidyverse

Лекция 4 Трансформация данных в R

Session 3 Nick Hathaway;

STAT 1291: Data Science

Modeling in the Tidyverse. Max Kuhn (RStudio)

Data Import and Formatting

Financial Econometrics Practical

The Benefits of Traceability Beyond Just From SDTM to ADaM in CDISC Standards Maggie Ci Jiang, Teva Pharmaceuticals, Great Valley, PA

A Whistle-Stop Tour of the Tidyverse

Transform Data! The Basics Part I!

DCDISC Users Group. Nate Freimark Omnicare Clinical Research Presented on

Tidy Evaluation. Lionel Henry and Hadley Wickham RStudio

Merge Conflicts p. 92 More GitHub Workflows: Forking and Pull Requests p. 97 Using Git to Make Life Easier: Working with Past Commits p.

Julia Silge Data Scientist at Stack Overflow

Grammar of data. dplyr. Bjarki Þór Elvarsson and Einar Hjörleifsson. Marine Research Institute. Bjarki&Einar (MRI) R-ICES 1 / 29

Metadata and ADaM.

DATA SCIENCE INTRODUCTION QSHORE TECHNOLOGIES. About the Course:

PharmaSUG 2018 Paper #AD-05

Data wrangling. Reduction/Aggregation: reduces a variable to a scalar

A Picture is worth 3000 words!! 3D Visualization using SAS Suhas R. Sanjee, Novartis Institutes for Biomedical Research, INC.

Transform Data! The Basics Part I continued!

Python data pipelines similar to R Documentation

PharmaSUG China Big Insights in Small Data with RStudio Shiny Mina Chen, Roche Product Development in Asia Pacific, Shanghai, China

Lecture 3: Pipes and creating variables using mutate()

Subsetting, dplyr, magrittr Author: Lloyd Low; add:

Package qualmap. R topics documented: September 12, Type Package

Introduction to R Programming

What is the ADAM OTHER Class of Datasets, and When Should it be Used? John Troxell, Data Standards Consulting

Package infer. July 11, Type Package Title Tidy Statistical Inference Version 0.3.0

Importing and visualizing data in R. Day 3

Hands-On ADaM ADAE Development Sandra Minjoe, Accenture Life Sciences, Wayne, Pennsylvania Kim Minkalis, Accenture Life Sciences, Wayne, Pennsylvania

Incident Response Programming with R. Eric Zielinski Sr. Consultant, Nationwide

Creating an ADaM Data Set for Correlation Analyses

K-fold cross validation in the Tidyverse Stephanie J. Spielman 11/7/2017

Clustering algorithms and autoencoders for anomaly detection

Manipulating Statistical and Other Procedure Output to Get the Results That You Need

AN INTRODUCTION TO R FOR MANAGEMENT SCHOLARS

Scalable Machine Learning in R. with H2O

Package tidyimpute. March 5, 2018

Utilization of Python in clinical study by SASPy

ELAC 2018 INTRODUCTION TO DATA SCIENCE. Day 2 Rafael Santos

An Efficient Solution to Efficacy ADaM Design and Implementation

Data Handling: Import, Cleaning and Visualisation

Working with Composite Endpoints: Constructing Analysis Data Pushpa Saranadasa, Merck & Co., Inc., Upper Gwynedd, PA

Introduction to R: Using R for Statistics and Data Analysis. BaRC Hot Topics

Using PROC SQL to Generate Shift Tables More Efficiently

Lecture 1 Getting Started with SAS

Session 5 Nick Hathaway;

Dplyr Introduction Matthew Flickinger July 12, 2017

Beyond Statistics. Namrata Deshpande, Cytel. Barcelona, PhUSE 2016

Now let s take a look

Analyzing Economic Data using R

R: A Gentle Introduction. Vega Bharadwaj George Mason University Data Services

R For Everyone: Advanced Analytics And Graphics (Addison-Wesley Data & Analytics Series) PDF

Package furniture. November 10, 2017

Learning Objectives for Data Concept and Visualization

Hands-On ADaM ADAE Development Sandra Minjoe, Accenture Life Sciences, Wayne, Pennsylvania

Package gggenes. R topics documented: November 7, Title Draw Gene Arrow Maps in 'ggplot2' Version 0.3.2

SDTM Implementation Guide Clear as Mud: Strategies for Developing Consistent Company Standards

Introduction to R: Using R for Statistics and Data Analysis. BaRC Hot Topics

How to write ADaM specifications like a ninja.

Tabular data management. Jennifer Bryan RStudio, University of British Columbia

Introduction to Data Analytics. David Walling

AZ CDISC Implementation

Data cleansing and wrangling with Diabetes.csv data set Shiloh Bradley Webster University St. Louis. Data Wrangling 1

A Macro for Generating the Adverse Events Summary for ClinicalTrials.gov

Yes! The basic principles of ADaM are also best practice for our industry Yes! ADaM a standard with enforceable rules and recognized structures Yes!

Package cattonum. R topics documented: May 2, Type Package Version Title Encode Categorical Features

Package catenary. May 4, 2018

Data Visualization. Module 7

Implementing CDISC Using SAS. Full book available for purchase here.

Traceability in the ADaM Standard Ed Lombardi, SynteractHCR, Inc., Carlsbad, CA

Boost your Analytics with ML for SQL Nerds

BGGN 213 Working with R packages Barry Grant

Making sense of census microdata

AUTOMATED CREATION OF SUBMISSION-READY ARTIFACTS SILAS MCKEE

Introduction to R and the tidyverse

From Implementing CDISC Using SAS. Full book available for purchase here. About This Book... xi About The Authors... xvii Acknowledgments...

pandas: Rich Data Analysis Tools for Quant Finance

Data visualization with ggplot2

Demo yeast mutant analysis

Studio Cloud. Frictionless onboarding to data science with. Mine Çetinkaya-Rundel Duke University + RStudio

SAS and Python: The Perfect Partners in Crime

Getting and Cleaning Data. Biostatistics

Making the Most of Data to Benefit Public Health.

An Introduction to Visit Window Challenges and Solutions

separate representations of data.

Syntax and Grammars 1 / 21

Detecting Treatment Emergent Adverse Events (TEAEs)

SQL Server 2017: Data Science with Python or R?

STA130 - Class #2: Nathan Taback

Package rcv. August 11, 2017

Pure, predictable, pipeable: creating fluent interfaces with R. Hadley Chief Scientist, RStudio

R. Muralikrishnan Max Planck Institute for Empirical Aesthetics Frankfurt. 08 June 2017

Lecture 3. Homework Review and Recoding I R Teaching Team. September 5, 2018

Transcription:

Introducing R/Tidyverse to Clinical Statistical Programming MBSW 2018 Freeman Wang, @freestatman 2018-05-15 Slides available at https://bit.ly/2knkalu

Where are my biases Biomarker Statistician Genomic Data Scientist and Bioinformatician Visualization Engineer R/Shiny Developer Long time Linux/HPC/Vim user 2 / 27

Where are my biases Biomarker Statistician Genomic Data Scientist and Bioinformatician Visualization Engineer R/Shiny Developer Long time Linux/HPC/Vim user SAS Certi ed Base and Advanced Programmer 2 / 27

Disclaimer 1. All the data and info in this talk are public (Twitter, GitHub). CDISC example data were downloaded from: GitHub 2. This talk represents my own views, not those of BSSI. BSSI does not have an opinion of which tool you should use: e.g. SAS vs R, or R/base vs R/Tidyverse. 3 / 27

Tidyverse https://www.tidyverse.org/ 4 / 27

Why? Why so popular (1/2) Not about the good-looking plots, or the fancy manipulation functions Content-driven and communication-focused work ow (logic- ow) Concisely expresses human logic as R code Fast human logic I/O Yourself team / customer Past you present you Seamlessly align multiple layers of logic, across analysis objective, programming, and output 5 / 27

Why? Why so popular (2/2) Structured domains of work ow, and wellde ned verb/vocabulary within each domain Grammar of data manipulation (dplyr) Grammar of data visualization (ggplot2) consistent design: learn it once, use it everywhere 6 / 27

How? Tidy principles 1. Tidy data (Shared data structures) 2. Tidy programming API (Compose simple pieces) 3. The pipe! %>% (functional programming for human logic) 4. Tidy statistics 7 / 27

Tidyverse: EDA (Exploratory Data Analysis) Work ow Clinical programming is one type of Data Science http://r4ds.had.co.nz/ 8 / 27

A "Real" Tidyverse Work ow 9 / 27

What? Tidy data Each row is an observation Each column is a variable Clinical examples Long-format is commonly used in data storage, e.g. SDTM/ADaM Wide-format is commonly used for DEA, modeling, and visualization Align manipulation, statistical and visualization logic with tidy data 10 / 27

What? dplyr: Grammar of data manipulation key verbs select (common verb in SQL) mutate (e.g. case_when) filter group_by summarize arrange Translatable to SQL code, Cheatsheet 11 / 27

Example of Tidyverse vs Base R 12 / 27

What? Tidyverse extended families Tidyverse community goodies making life nice and clean ggplot2 extention packages survminer, cowplot, ggpubr, etc plotly summarytools janitor tidyversity jsmisc More bioconductor packages buy in! 13 / 27

Clinical Programming Example 1/3 library(haven) library(tidyverse) iris <- haven::read_sas('data/iris.sas7bdat') adsl <- Hmisc::sasxport.get("data/adam/cdisc/adsl.xpt") ## Processing SAS dataset ADSL.. adsl %>% select(usubjid, contains('trt'), -starts_with('trt01a')) %>% DT::datatable(options = list(pagelength = 3)) Show 3 entries Search: usubjid trt01p trt01pn trtsdt trtedt trtdur 1 01-701- 1015 Placebo 0 2014-01-02 2014-07-02 182 2 01-701- 1023 Placebo 0 2012-08-05 2012-09-01 28 3 01-701- 1028 Xanomeline High Dose 81 2013-07-19 2014-01-14 180 14 / 27

Clinical Programming Example 2a/3 adsl %>% group_by(trt01p) %>% # treatment-wise manipulation summarize(ave_trtdur = mean(trtdur, na.rm = TRUE), n = n()) %>% knitr::kable(format = 'html') trt01p ave_trtdur n Placebo 149.06977 86 Xanomeline High Dose 99.39286 84 Xanomeline Low Dose 99.02381 84 15 / 27

Clinical Programming Example 2b/3 ae_max <- adae %>% group_by(usubjid) %>% arrange(aesev) %>% slice(n()) %>% ungroup() %>% select(usubjid, aesev) # group by patient: patient-wise manip # sort AE severity from low to high, w # take the most severe level, within p ae_max %>% DT::datatable(options = list(pagelength = 3)) Show 3 entries Search: usubjid aesev 1 01-701-1015 MILD 2 01-701-1023 MODERATE 3 01-701-1028 MILD Showing 1 to 3 of 225 entries Previous 1 2 3 4 5 75 Next 16 / 27

Clinical Programming Example 3a/3 adsl %>% left_join(ae_max) %>% filter(!is.na(aesev)) %>% group_by(aesev) %>% summarize(ave_trtdur = mean(trtdur, na.rm = TRUE), n = n()) %>% ggpubr::ggbarplot(x = 'aesev', y='ave_trtdur', fill='aesev', pale 17 / 27

Clinical Programming Example 3b/3 adsl %>% left_join(ae_max) %>% filter(!is.na(aesev)) %>% group_by(aesev, trt01p) %>% summarize(ave_trtdur = mean(trtdur, na.rm = TRUE), n = n()) %>% ggpubr::ggbarplot(x = 'aesev', y='ave_trtdur', fill='aesev', pale facet_grid(~trt01p) + theme(axis.text.x = element_text(size=8)) 18 / 27

Tidy programming API: Compose simple pieces Tidyverse vs Base R Reduce unnecessary intermediate objects (e.g. index, dummy variables), save your brain memory bandwidth Data lives in structured data.frame, instead of individual value or vector Better default argument values, e.g. stringsasfactors=false in tibble package. 19 / 27

Tidy programming API: Compose simple pieces Tidyverse vs Base R Reduce unnecessary intermediate objects (e.g. index, dummy variables), save your brain memory bandwidth Data lives in structured data.frame, instead of individual value or vector Better default argument values, e.g. stringsasfactors=false in tibble package. R Inferno: An essential guide to the trouble spots and oddities of R 19 / 27

The pipe! %>% Conceptually the same with Unix pipe syntax Push the LHS output into the 1st argument of the RHS function Natural representation of human logic Each processing layer is a function Embrace functional programming Similar philosophy to ggplot2 Grammar of Graphics 20 / 27

#TidyTuesday 1/3 21 / 27

#TidyTuesday 2/3 22 / 27

#TidyTuesday 3/3 23 / 27

Tidy Statistics library(broom) turns tidy output of model objects that are suited to further analysis, manipulation, and visualization. 24 / 27

Discussion R/Tidyverse ecosystem is fast growing Keep adopting new ideas, e.g. non-standard evaluation (NSE) Some deep-level API change may cause pain for R package developers (OK for general users) Function name collision is a common R problem Functions may be over-written by later loaded packages (or the packages' dependency packages...) More robust function calling is to add package namespace: dplyr::select() 25 / 27

Thanks for attending Special thanks to Statistical Inference: A Tidy Approach @old_man_chester tidyverse 101: Simplifying life for users Slides created via the R package xaringan by Yihui Xie HTML document created via the R package rmarkdown by RStudio Slides and source code are available at https://github.com/freestatman/mbsw_2018_tidyverse 26 / 27

Freeman Wang, @freestatman 2018-05-15 27 / 27