Multivariable Regression Modelling

Multivariable Regression Modelling: A review of available spline packages in R
Aris Perperoglou for TG2, ISCB 2015

TG2 Members
- Michal Abrahamowicz, Quebec, Canada
- Willi Sauerbrei, Freiburg, Germany
- Harald Binder, Mainz, Germany
- Frank Harrell, Nashville, USA
- Patrick Royston, London, UK
- Matthias Schmid, Bonn, Germany
- Aris Perperoglou, Colchester, UK

Outline
- Introduction: Topic Group 2 and the R splines project
- Splines: from bases to software
- R packages
- Some results
- Discussion

Introduction: TG2

TG2: Selection of variables and functional forms in multivariable analysis
The main focus of TG2 is to identify influential variables and gain insight into their individual and joint relationship with an outcome of interest.
Which variables are related to the outcome, and how do we:
- choose relevant predictors
- determine the functional form of continuous predictors

General regression model
Consider an outcome of interest $y$ and let $x_1, x_2, \ldots, x_p$ denote information on $p$ covariates:
$$g(y) = f(x_1, \beta_1) + f(x_2, \beta_2) + \ldots + f(x_p, \beta_p)$$
In practice:
- Most often a linear relationship of $x$ with $g(y)$ is assumed.
- People still tend to categorise covariates. See Greenland 1995 and Royston, Altman & Sauerbrei 2006 for reasons against it.
- Sometimes a simple transformation might be considered, such as $\log(x)$, $x^2$, ...
- Fractional polynomials (Royston & Altman 1994)
- Splines

Splines
Splines are piecewise polynomial functions:
- Given the range of a continuous variable, define points on this interval (knots).
- Fit a simple polynomial between these knots.
Depending on the selection of knots and the type of polynomial, the spline is named:
- polynomial, natural, or restricted regression spline (de Boor 1978, Harrell 2013)
- B-spline (de Boor 1978), P-spline (Eilers and Marx 1996), penalized regression spline (Wood 2006)
- smoothing spline (see Generalized Additive Models, Hastie and Tibshirani 1990)
- ...

Decisions
Although splines are a powerful tool, in practice the number of decisions that the user has to make complicates the modelling problem:
- Type of spline
- Number of knots
- Position of knots
- Order of the spline (complexity)

The need
Although the mathematical properties are well understood, the use of splines can be challenging for applied statisticians:
- Lack of statistical education. Most researchers are not taught how to use splines.
- Lack of thorough comparisons between different approaches.
We aim to develop systematic guidance for using splines in applications, focusing on level 2 expertise:
... we point to methodology which is perhaps slightly below state of the art, but doable by every experienced analyst. We should refer to advantages and disadvantages of competing approaches, point to the importance and implications of underlying assumptions, and stress the necessity of sensitivity analyses ... question.
Sufficient guidance about software plays a key role in ensuring that this approach is also used in practice.

R: Identify and assess methods used in practice
1. Available software (focus on R): identify all packages within R that provide the ability to use splines as a functional transformation in univariable and/or multivariable analysis.
2. Collect information on packages.
3. Compile documentation for practical use.
4. Evaluate quality of available packages.
5. Compare approaches, further guidance for use in practice.

Why R
- R is the most widely used programme amongst research statisticians, i.e. people who do research in statistics.
- It also dominates the field of data science.
- It is free and comes with an open software licence, which means it is used at your own risk.
- Although CRAN requires basic documentation of all functions contained in submitted packages, help files are often not detailed enough to fully understand how the implemented methods work.

Spline basics

Preliminary idea
Consider the straight-line regression model
$$y = \beta_0 + \beta_1 x + \varepsilon$$
The corresponding basis for the model is:
$$X = \begin{bmatrix} 1 & x_1 \\ \vdots & \vdots \\ 1 & x_n \end{bmatrix}$$
A polynomial basis can also be:
$$U = \begin{bmatrix} 1 & u_1 & u_1^2 & u_1^3 \\ \vdots & \vdots & \vdots & \vdots \\ 1 & u_n & u_n^2 & u_n^3 \end{bmatrix}$$
where
$$u = \frac{x - \min(x)}{\max(x) - \min(x)}$$

Spline basics: From bases to software

    # Polynomial basis of degree p: columns u^0, u^1, ..., u^p
    pbase <- function(x, p) {
      u <- (x - min(x)) / (max(x) - min(x))  # rescale x to [0, 1]
      u <- 2 * (u - 0.5)                     # shift and stretch to [-1, 1]
      P <- outer(u, seq(0, p, by = 1), "^")  # n x (p + 1) matrix of powers
      P
    }
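
A truncated polynomial basis can be built in the same style as the polynomial basis. The following is a minimal sketch and not code from the talk: the helper name `tpbase`, its arguments, and the choice of equally spaced interior knots are assumptions made for illustration.

```r
# Hypothetical helper (not from the slides): truncated power basis of degree p
# with nknots equally spaced interior knots.
tpbase <- function(x, nknots = 3, p = 3) {
  # Interior knots, equally spaced strictly inside the range of x (an assumption)
  knots <- seq(min(x), max(x), length.out = nknots + 2)[-c(1, nknots + 2)]
  # Global polynomial part: 1, x, ..., x^p
  P <- outer(x, 0:p, "^")
  # Truncated part: (x - k)_+^p for each knot k
  Tr <- outer(x, knots, function(x, k) pmax(x - k, 0)^p)
  cbind(P, Tr)
}

x <- seq(0, 1, length.out = 50)
B <- tpbase(x, nknots = 3, p = 3)
dim(B)  # 50 rows and (p + 1) + nknots = 7 columns
```

Each truncated column is zero to the left of its knot, which is what lets the fitted curve change shape from one interval to the next.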

Spline basics: From bases to software
[Figure: basis functions in two panels, "Polynomial" and "Truncated Polynomial".]

Spline basics: cubic B-splines with 12 knots
B-splines are a common choice.
[Figure: basis functions in two panels, "B splines (R)" and "B splines".]
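
A cubic B-spline basis of this kind can be generated with `bs()` from the `splines` package that ships with R. This is a sketch under assumptions: the slides do not say where the 12 knots were placed, so equally spaced interior knots on [0, 1] are used here.

```r
library(splines)  # ships with R

x <- seq(0, 1, length.out = 200)
# Assumption: 12 equally spaced interior knots
knots <- seq(0, 1, length.out = 14)[2:13]
B <- bs(x, knots = knots, degree = 3, intercept = TRUE)

ncol(B)            # 12 knots + degree + 1 = 16 basis functions
range(rowSums(B))  # with intercept = TRUE the basis functions sum to 1 at every x
```

Unlike the truncated polynomial columns, each B-spline column is non-zero only over a few adjacent knot intervals, which keeps the basis matrix numerically well behaved.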

Spline basics: Natural and Gaussian splines (12 knots)
[Figure: basis functions in two panels, "Natural" and "Gaussian splines".]
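
For the natural spline panel, `ns()` from the `splines` package gives a comparable basis. The sketch below assumes 12 equally spaced interior knots and boundary knots at 0.05 and 0.95, neither of which is stated in the slides. It also checks the defining property of natural splines, linearity beyond the boundary knots. The Gaussian basis is not reproduced because the slides do not define it.

```r
library(splines)

x <- seq(0, 1, length.out = 200)
knots <- seq(0, 1, length.out = 14)[2:13]   # assumed: 12 equally spaced interior knots
N <- ns(x, knots = knots, Boundary.knots = c(0.05, 0.95), intercept = TRUE)
ncol(N)  # 12 interior knots + 2 = 14 basis functions

# Beyond the boundary knots every basis function is linear, so second
# differences on an equally spaced grid are numerically zero
tv <- unclass(predict(N, newx = c(0.96, 0.98, 1.00)))
second_diff <- tv[1, ] - 2 * tv[2, ] + tv[3, ]
max(abs(second_diff))
```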

Least squares fit
A general spline model with $k$ knots forms a basis matrix $X$. The ordinary least squares fit can be written as:
$$\hat{y} = X\hat{\beta}$$
where $\hat{\beta}$ minimizes $\lVert y - X\beta \rVert^2$.
See Ruppert, Wand and Carroll 2003 for a good overview of univariable analysis.
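
The least squares fit can be sketched directly from the basis matrix. The simulated data, the noise level and the choice of `bs()` with 8 columns are assumptions for illustration. The normal equations are solved explicitly and compared with `lm()`.

```r
library(splines)

set.seed(1)  # simulated data, assumed for illustration
x <- sort(runif(100))
y <- sin(1.7 * x * pi) + rnorm(100, sd = 0.3)

X <- bs(x, df = 8, intercept = TRUE)              # basis matrix with an intercept column
beta_hat <- solve(crossprod(X), crossprod(X, y))  # solves (X'X) beta = X'y
y_hat <- X %*% beta_hat

# Same fit through lm(); "- 1" because X already contains the intercept
fit <- lm(y ~ X - 1)
all.equal(as.numeric(y_hat), as.numeric(fitted(fit)))
```

In practice `lm()` is preferable because it uses a QR decomposition rather than forming X'X, which is less sensitive to ill-conditioned bases.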

Choice of splines: Differences when fitting data
Simulated data: $y = \sin(1.7\,x\,\pi) + \varepsilon$
[Figure: fitted curves of y against x on [0, 1]; legend: Poly., Trunc. Poly., B Splines (R), B Splines, Natural, Gaussian.]

Choice of knots: cubic B-splines fit
The number of knots leads to a different fit.
[Figure: y against x on [0, 1] in four panels: "3 Knots (R default)", "6 Knots", "12 Knots", "24 Knots".]
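
The effect of the number of knots can be explored with a short loop. This is a sketch, not the code behind the figure: the simulated data, the quantile-based knot placement and the helper name `fit_rss` are assumptions.

```r
library(splines)

set.seed(1)  # simulated data, assumed for illustration
x <- sort(runif(200))
y <- sin(1.7 * x * pi) + rnorm(200, sd = 0.3)

# Hypothetical helper: fit a cubic B-spline with nknots interior knots
# placed at quantiles of x and report model size and residual sum of squares
fit_rss <- function(nknots) {
  probs <- seq(0, 1, length.out = nknots + 2)[-c(1, nknots + 2)]
  knots <- quantile(x, probs = probs)
  fit <- lm(y ~ bs(x, knots = knots, degree = 3))
  c(knots = nknots, coefficients = length(coef(fit)), rss = sum(resid(fit)^2))
}

res <- t(sapply(c(3, 6, 12, 24), fit_rss))
res
```

More knots add coefficients and flexibility; whether the extra flexibility tracks signal or noise has to be judged from the fitted curves or a criterion such as cross-validation.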

Choice of complexity: B-splines with 12 knots
[Figure: y against x on [0, 1] in four panels: "Degree 1", "Degree 2", "Degree 3 (R default)", "Degree 4".]
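
Varying the degree while holding the number of knots fixed can be sketched as follows. The data are simulated under assumed settings. In `bs()` without an intercept column the number of interior knots is `df - degree`, so `df = 12 + d` keeps 12 interior knots for every degree `d`.

```r
library(splines)

set.seed(1)  # simulated data, assumed for illustration
x <- sort(runif(200))
y <- sin(1.7 * x * pi) + rnorm(200, sd = 0.3)

# One fit per degree, each with 12 interior knots at quantiles of x
fits <- lapply(1:4, function(d) lm(y ~ bs(x, df = 12 + d, degree = d)))

n_coef <- sapply(fits, function(f) length(coef(f)))
n_coef  # intercept + 12 + degree coefficients per fit
```

Degree 1 gives a piecewise linear fit with kinks at the knots; higher degrees give smoother curves at the cost of one extra coefficient each.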

The search for R Packages

R packages
- More than 6200 packages are available on CRAN.
- Even more are available on GitHub, R-Forge etc., which we do not look into.
- A Java programme was used to scan CRAN and identify all packages that have even a vague relation to splines.
- A total of 519 packages were identified (May 2015), of which 229 were flagged as relevant.
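
The talk used a Java programme for the scan; the same idea can be sketched in R as a keyword filter over package metadata. Everything below is an illustrative assumption: the helper name `flag_spline`, the search on name and title only, and the toy data frame, which stands in for real CRAN metadata (obtainable, for example, from `tools::CRAN_package_db()` with network access).

```r
# Toy stand-in for CRAN metadata; "examplepkg" is a made-up row,
# the other two titles are taken from the package tables in this talk
pkgs <- data.frame(
  Package = c("splines", "pbs", "examplepkg"),
  Title = c("Regression Spline Functions and Classes",
            "Periodic B Splines",
            "Some Unrelated Tools"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep packages whose name or title mentions "spline"
flag_spline <- function(db) {
  hit <- grepl("spline", paste(db$Package, db$Title), ignore.case = TRUE)
  db[hit, , drop = FALSE]
}

flagged <- flag_spline(pkgs)
flagged$Package
```

A keyword match of this kind only produces candidates; as in the talk, a manual step is still needed to decide which flagged packages are relevant.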

A subset of 109 packages was selected for further analysis.

Interdependencies
http://drperpo.github.io/docs/rpackages/splines.html

What makes a good package?
- Packages with vignettes
- Packages with a web site companion/book/documentation
- Packages with real life data
- Packages with clear examples of methods
- Packages with clear references

Out of selected: packages with vignettes
- Bayesian: 1
- Boosting: 2
- Classification: 1
- Functions: 4
- Regression: 14
- Splines: 1

Results (1): Spline creators

Packages, downloads and reverse dependencies (RD)

package                downloads  RD   Description
splines                -          124  Regression spline functions and classes
polspline              35419      0    Polynomial spline routines
logspline              19513      3    Logspline density estimation routines
cobs                   11596      2    Constrained B-Splines
pencopula              4892       0    Copula Estimation with P-Splines
orthogonalsplinebasis  3959       0    Orthogonal B-Spline Functions
pbs                    3790       0    Periodic B Splines
bezier                 3546       1    Bezier Curve and Spline Toolkit

Downloads do not mean unique users.

Packages, downloads and reverse dependencies (RD)

package           downloads  RD  Description
gss               109259     5   General Smoothing Splines
pspline           23313      6   Penalized Smoothing Splines
MBA               15689      6   Multilevel B-spline Approximation
crs               11288      2   Categorical Regression Splines
cobs99            5859       0   Constrained B-splines (outdated version)
Kpart             4747       0   Spline fitting
freeknotsplines   4228       0   Free-Knot Splines
bigsplines        4184       1   Smoothing Splines for Large Samples
sspline           3955       0   Smoothing Splines on the Sphere
episplinedensity  1042       0   Density Estimation Exponential Epi-splines

Results (2): Regression packages

General features of popular regression packages

package   downloaded  vignette / book / website  datasets
mgcv      380232      x x                        2
quantreg  347203      x x                        7
survival  243212      x x                        33
VGAM      92091       x x x                      50
gbm       86729       x                          3
gam       62916       x x                        1
gamlss    27084       x x x                      29

VGAM and gamlss have separate data packages.

Regression models per package
Packages compared: mgcv, quantreg, survival, VGAM, gbm, gam, gamlss

response      marks (of 7 packages)
Linear        x x x x x x
Categorical   x x x x x
Count Reg     x x x x x x
Survival      x x x x x x x
Quantile Reg  x x x x
Multivariate  x x x x
Nonlinear     x x x
Reduced Rank  x x x
Other         x x x x x x

Can the package fit penalized splines and random effects (RE)?

package   p-splines / RE
mgcv      x x
quantreg  x x
survival  x x
VGAM      x x
gbm       x
gam       x
gamlss    x x
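
As an illustration of what fitting both components looks like in one of these packages, the sketch below uses `mgcv`, which is marked for both P-splines and random effects. The simulated data, the basis dimension `k = 15` and the use of REML are assumptions for illustration, not settings taken from the talk.

```r
library(mgcv)  # recommended package, installed with standard R distributions

set.seed(1)  # simulated data, assumed for illustration
n <- 300
g <- factor(sample(1:10, n, replace = TRUE))  # grouping factor
b <- rnorm(10, sd = 0.5)                      # simulated random intercepts
x <- runif(n)
y <- sin(1.7 * x * pi) + b[as.integer(g)] + rnorm(n, sd = 0.3)

# s(x, bs = "ps"): P-spline smooth; s(g, bs = "re"): random intercept for g
fit <- gam(y ~ s(x, bs = "ps", k = 15) + s(g, bs = "re"), method = "REML")
summary(fit)$s.table  # one row per smooth term, with effective degrees of freedom
```

In `mgcv` the random intercept is expressed as a penalized smooth, so both terms are handled by the same smoothing parameter estimation.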

Criteria for significance of non-linear effect and variable selection methods

package   non-linear / var selection
mgcv      df / double penalty
quantreg  lasso
survival  residuals
VGAM      -
gbm       boosting
gam       step.gam
gamlss    stepGAIC
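
The double penalty approach listed for `mgcv` is switched on with `select = TRUE` in `gam()`: each smooth receives an extra penalty so that a term can be shrunk towards zero effective degrees of freedom. A minimal sketch on simulated data, where the data-generating model (x2 has no effect) is an assumption for illustration:

```r
library(mgcv)

set.seed(2)  # simulated data, assumed for illustration
n <- 300
x1 <- runif(n)
x2 <- runif(n)
y <- sin(1.7 * x1 * pi) + rnorm(n, sd = 0.3)  # x2 has no effect by construction

# select = TRUE adds a second penalty to each smooth, allowing whole terms
# to be penalized out of the model
fit <- gam(y ~ s(x1) + s(x2), select = TRUE, method = "REML")

edf <- summary(fit)$edf
names(edf) <- c("s(x1)", "s(x2)")
edf  # effective degrees of freedom per smooth term
```

An effective degrees of freedom value close to zero suggests the term contributes little; a value close to one suggests a roughly linear effect.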

Variable selection in other packages
Some packages include functions to perform variable selection: leaps, rms, gamboostLSS, subselect, glmulti, spikeSlabGAM, spikeslab.
The list is not exhaustive.

Post-fit functions: extract coefficients, predictors, plot terms and other useful functions

package   coef/fitted / plot terms / conf.int
mgcv      x x x
quantreg  x x
survival  x x x
VGAM      x x x
gbm       x x
gam       x x x
gamlss    x x x
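
For a concrete picture of these post-fit steps, the sketch below uses `mgcv`, which is marked in all three columns. The simulated data are an assumption, and the interval shown is an approximate pointwise 95% interval built from the standard errors returned by `predict()`.

```r
library(mgcv)

set.seed(3)  # simulated data, assumed for illustration
x <- runif(200)
y <- sin(1.7 * x * pi) + rnorm(200, sd = 0.3)
fit <- gam(y ~ s(x), method = "REML")

head(coef(fit))    # estimated coefficients
head(fitted(fit))  # fitted values

# Predictions with standard errors on a small grid of new x values
nd <- data.frame(x = seq(0, 1, length.out = 5))
p <- predict(fit, newdata = nd, se.fit = TRUE)
ci <- cbind(fit = p$fit,
            lower = p$fit - 1.96 * p$se.fit,
            upper = p$fit + 1.96 * p$se.fit)
ci

# plot(fit) would draw the estimated smooth term with its confidence band
```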

Some more questions:
- Are examples based on real or simulated datasets?
- Are the results of the examples presented clearly and understandably?
- Are details given about the most important default values?
- Is any reason given for the default values?

Outcomes
- A paper with preliminary results within a year
- Further research into the quality of packages, advantages/disadvantages, worked-out examples etc. to follow

References: Functional forms and Splines
Greenland, S. (1995). Avoiding power loss associated with categorization and ordinal scores in dose-response and trend analysis. Epidemiology, 6(4), 450-454.
Royston, P., Altman, D. G., & Sauerbrei, W. (2006). Dichotomizing continuous predictors in multiple regression: a bad idea. Stat Med, 25(1), 127-141.
Royston, P., & Altman, D. G. (1994). Regression using fractional polynomials of continuous covariates: parsimonious parametric modelling. Applied Statistics, 429-467.
De Boor, C. (1978). A practical guide to splines. Mathematics of Computation.
Harrell, F. E. (2013). Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer Science & Business Media.
Eilers, P. H., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 89-102.
Wood, S. (2006). Generalized additive models: an introduction with R. CRC Press.
Hastie, T. J., & Tibshirani, R. J. (1990). Generalized additive models (Vol. 43). CRC Press.
Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric regression (No. 12). Cambridge University Press.

References: Modelling and R packages
Wood, S. N. (2001). mgcv: GAMs and generalized ridge regression for R. R News, 1(2), 20-25.
Wood, S. N. (2003). Thin plate regression splines. Journal of the Royal Statistical Society: B (Statistical Methodology), 65(1), 95-114.
Wood, S. N. (2008). Fast stable direct fitting and smoothness selection for generalized additive models. Journal of the Royal Statistical Society: B (Statistical Methodology), 70(3), 495-518.
Yee, T. W., & Wild, C. J. (1996). Vector generalized additive models. Journal of the Royal Statistical Society: B (Methodological), 481-493.
Yee, T. W. (2010). The VGAM package for categorical data analysis. Journal of Statistical Software, 32(10), 1-34.
Therneau, T. M., & Grambsch, P. M. (2000). Modeling survival data: extending the Cox model. Springer Science & Business Media.
Koenker, R. (2005). Quantile regression (No. 38). Cambridge University Press.
Rigby, R. A., & Stasinopoulos, D. M. (2005). Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society: C (Applied Statistics), 54(3), 507-554.
Stasinopoulos, D. M., & Rigby, R. A. (2007). Generalized additive models for location scale and shape (GAMLSS) in R. Journal of Statistical Software, 23(7), 1-46.