Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016


Outline and introduction
Objectives: prediction or inference?
- Cross-validation
- Bootstrap
- Permutation test
- Monte Carlo simulation
ISLR Chapter 5: James, G. et al. An Introduction to Statistical Learning: with Applications in R (Springer, 2013). This book can be downloaded for free at http://www-bcf.usc.edu/~gareth/isl/getbook.html

Why do regression? Inference
Questions:
- Which predictors are associated with the response?
- How are the predictors associated with the response?
Example: do dietary habits influence the gut microbiome?
Linear regression and generalized linear models are the workhorses. Here we are more interested in interpretability than accuracy: we want interpretable models for inference on coefficients. Relevant resampling methods: bootstrap, permutation tests.

Why do regression? (cont'd) Prediction
Questions:
- How can we predict values of Y based on values of X?
Examples: Framingham Risk Score, OncotypeDX Risk Score
Regression methods are still workhorses, but less-interpretable machine learning methods can also be used. Here we are more interested in accuracy than interpretability, e.g. sensitivity/specificity for a binary outcome, or mean-squared prediction error for a continuous outcome. Relevant resampling method: cross-validation.

Cross-validation

Why cross-validation?
Figure 1 (ISLR Figure 2.9): Under-fitting, over-fitting, and optimal fitting

K-fold cross-validation approach
Create K folds from the sample of size n, K ≤ n:
1. Randomly sample 1/K of the observations (without replacement) as the validation set
2. Use the remaining samples as the training set
3. Fit the model on the training set, and estimate accuracy on the validation set
4. Repeat K times, never reusing the same validation samples
5. Average the validation accuracy across the K validation sets
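The steps above can be sketched in a few lines of R. This is a minimal illustration (not from the slides), assuming a toy simulated data frame `dat`, a linear model, and mean-squared error as the accuracy measure:

```r
set.seed(1)
n <- 100
dat <- data.frame(x = rnorm(n))
dat$y <- 2 * dat$x + rnorm(n)

K <- 5
## randomly assign each observation to one of K folds (without replacement)
fold <- sample(rep(1:K, length.out = n))

mse <- sapply(1:K, function(k) {
  train <- dat[fold != k, ]   # remaining samples: training set
  valid <- dat[fold == k, ]   # 1/K of the observations: validation set
  fit <- lm(y ~ x, data = train)
  mean((valid$y - predict(fit, newdata = valid))^2)  # validation accuracy
})
mean(mse)  # average validation error across the K folds
```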

Variability in cross-validation Figure 3: Variability of 2-fold cross-validation (ISLR Figure 5.2)

Bias-variance trade-off in cross-validation
Key point: we are talking about the bias and variance of the overall accuracy estimate, not variability between the folds.
2-fold CV produces a high-bias, low-variance estimate:
- training on fewer samples causes upward bias in the error rate
- low correlation between the two models means low variance in the average error rate
Leave-one-out CV produces a low-bias, high-variance estimate:
- training on n − 1 samples is almost as good as training on n samples (almost no bias in the prediction error)
- the models are almost identical, so their average has high variance
Computationally, K models must be fitted, so 5- or 10-fold CV are very popular compromises.

Cross-validation summary
- In prediction modeling, we think of data as training or test
- Cross-validation estimates test-set error from a training set
- Training-set error always decreases with more complex (flexible) models
- Test-set error as a function of model flexibility tends to be U-shaped
- The low point of the U represents the optimal bias-variance trade-off, i.e. the most appropriate amount of model flexibility

Cross-validation caveats
Be very careful of information leakage into the test sets, e.g.:
- feature selection using all samples
- human-in-the-loop over-fitting:
  - changing your mind on the accuracy measure
  - trying a different dataset
http://hunch.net/?p=22

Cross-validation caveats (cont'd)
Tuning plus accuracy estimation requires nested cross-validation.
Example: high-dimensional training and test sets simulated from an identical true model, with penalized regression models tuned by 5-fold CV. Tuning by cross-validation does not prevent over-fitting.
Waldron et al.: Optimized application of penalized regression methods to diverse genomic data. Bioinformatics 2011, 27:3399–3406.

Cross-validation caveats (cont'd)
Cross-validation estimates assume that the sample is representative of the population.
Figure 4: Cross-validation vs. cross-study validation in breast cancer prognosis
Bernau C et al.: Cross-study validation for the assessment of prediction algorithms. Bioinformatics 2014, 30:i105–12.

Permutation test

Permutation test
Classical hypothesis testing: the null distribution of the test statistic is derived from assumptions about the underlying data distribution, e.g. the t or χ² distribution.
Permutation testing: the null distribution is determined empirically, using permutations of the data under which H0 is guaranteed to be true.
[Figure: PC2 vs. PC1 scatter plots of the original data and the permuted data]

Steps of a permutation test:
1. Calculate the test statistic (e.g. T) in the observed sample
2. Permutation:
   2.1 Sample the response values (Y) without replacement, keeping the same X
   2.2 Re-compute and store the test statistic T
   2.3 Repeat R times, storing the results as a vector T_R
3. Calculate the empirical p-value: the proportion of permutation statistics T_R that exceed the observed T

Calculating a p-value
P = (sum(abs(Tr) > abs(Tactual)) + 1) / (length(Tr) + 1)
Why add 1? Phipson B, Smyth GK: Permutation P-values should never be zero: calculating exact P-values when permutations are randomly drawn. Stat. Appl. Genet. Mol. Biol. 2010, 9:Article 39.

Permutation test - pros and cons
Pros:
- does not require distributional assumptions
- can be applied to any test statistic
Cons:
- less useful for small sample sizes
- p-values usually cannot be estimated with sufficient precision for heavy multiple-testing correction
- in naive implementations, can produce p-values of 0

Example from the sleep data: the sleep data show the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients.

##      extra        group        ID
##  Min.   :-1.600   1:10   1      :2
##  1st Qu.:-0.025   2:10   2      :2
##  Median : 0.950          3      :2
##  Mean   : 1.540          4      :2
##  3rd Qu.: 3.400          5      :2
##  Max.   : 5.500          6      :2
##                          (Other):8

t-test for difference in mean sleep

##
##  Welch Two Sample t-test
##
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean in group 1 mean in group 2
##            0.75            2.33

Permutation test instead of t-test

set.seed(1)
Tactual = t.test(extra ~ group, data=sleep)$statistic  # observed statistic
permT = function() {
  index = sample(1:nrow(sleep), replace=FALSE)  # permute the group labels
  t.test(extra ~ group[index], data=sleep)$statistic
}
Tr = replicate(999, permT())
(sum(abs(Tr) > abs(Tactual)) + 1) / (length(Tr) + 1)

## [1] 0.079

Bootstrap

The Bootstrap Figure 5: Schematic of the Bootstrap

Uses of the Bootstrap
- The Bootstrap is a very general approach to estimating sampling uncertainty, e.g. standard errors
- It can be applied to a very wide range of models and statistics
- It is robust to outliers and violations of model assumptions

How to perform the Bootstrap
The basic approach:
1. Using the available sample (size n), generate a new sample of size n by drawing with replacement
2. Calculate the statistic of interest
3. Repeat many times
4. Use the repeated estimates to assess the variability of your statistic of interest
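As a minimal sketch of these four steps (a toy example of ours, not from the slides): bootstrap the standard error of a sample median, a statistic with no simple closed-form standard error.

```r
set.seed(1)
x <- rexp(50)                 # the available sample, size n = 50
bootmed <- replicate(1000, {  # step 3: repeat many times
  xstar <- sample(x, replace = TRUE)  # step 1: resample of size n, with replacement
  median(xstar)                       # step 2: statistic of interest
})
sd(bootmed)  # step 4: bootstrap estimate of the standard error of the median
```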

Example: bootstrap in the sleep dataset
We used a permutation test to estimate a p-value; we will use the bootstrap to estimate a confidence interval.

t.test(extra ~ group, data=sleep)

##
##  Welch Two Sample t-test
##
## data:  extra by group
## t = -1.8608, df = 17.776, p-value = 0.07939
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -3.3654832  0.2054832
## sample estimates:
## mean in group 1 mean in group 2
##            0.75            2.33

Example: bootstrap in the sleep dataset

set.seed(2)
bootdiff = function() {
  boot = sleep[sample(1:nrow(sleep), replace = TRUE), ]
  mean(boot$extra[boot$group==1]) - mean(boot$extra[boot$group==2])
}
bootr = replicate(1000, bootdiff())
bootr[match(c(25, 975), rank(bootr))]

## [1] -3.32083333  0.02727273

Note: in practice it is better to use library(boot).

Example: oral carcinoma recurrence risk
- Oral carcinoma patients were treated with surgery
- The surgeon takes margins of normal-looking tissue around the tumor to be safe
- The number of margins varies for each patient
- Can an oncogenic gene signature in histologically normal margins predict recurrence?
Reis PP, Waldron L, et al.: A gene signature in histologically normal surgical margins is predictive of oral carcinoma recurrence. BMC Cancer 2011, 11:437.

Example: oral carcinoma recurrence risk Model was trained and validated using the maximum expression of each of 4 genes from any margin Figure 6: Oral carcinoma with histologically normal margins

Bootstrap estimation of HR for only one margin Figure 7: Bootstrap re-sample with randomly selected margin

Example: oral carcinoma recurrence risk From results: Simulating the selection of only a single margin from each patient, the 4-gene signature maintained a predictive effect in both the training and validation sets (median HR = 2.2 in the training set and 1.8 in the validation set, with 82% and 99% of bootstrapped hazard ratios greater than the no-effect value of HR = 1)

Monte Carlo

What is a Monte Carlo simulation?
- Resampling is done from a known theoretical distribution
- Simulated data are used to estimate the probability of possible outcomes
- The most useful application for me is power estimation
- Also used for Bayesian estimation of posterior distributions

How to conduct a Monte Carlo simulation
Steps of a Monte Carlo simulation:
1. Sample randomly from the simple distributions underlying each step
2. Estimate the complex function of interest for that sample
3. Repeat this a large number of times

Random distributions form the basis of Monte Carlo simulation
Figure 8: Credit: Markus Gesmann, http://www.magesblog.com/2011/12/fitting-distributions-with-r.html

Power calculation for a follow-up sleep study
What sample size do we need in a future study to detect the same effect on sleep, with 90% power and α = 0.05?

power.t.test(power=0.9, delta=(2.33-0.75), sd=1.9, sig.level=0.05,
             type="two.sample", alternative="two.sided")

##
##      Two-sample t test power calculation
##
##               n = 31.38141
##           delta = 1.58
##              sd = 1.9
##       sig.level = 0.05
##           power = 0.9
##     alternative = two.sided
##
## NOTE: n is number in *each* group

The same calculation by Monte Carlo simulation
- Use the rnorm() function to draw samples
- Use the t.test() function to get a p-value
- Repeat many times; what fraction of the p-values is less than 0.05?

R script

set.seed(1)
montePval = function(n) {
  group1 = rnorm(n, mean=0.75, sd=1.9)
  group2 = rnorm(n, mean=2.33, sd=1.9)
  t.test(group1, group2)$p.value
}
sum(replicate(1000, montePval(n=32)) < 0.05) / 1000

## [1] 0.895

Summary: resampling methods
Cross-validation
- Procedure: data are randomly divided into subsets; results are validated across the sub-samples
- Application: model tuning; estimation of prediction accuracy
Permutation test
- Procedure: samples of size N are drawn at random without replacement
- Application: hypothesis testing

Summary: resampling methods (cont'd)
Bootstrap
- Procedure: samples of size N are drawn at random with replacement
- Application: confidence intervals, hypothesis testing
Monte Carlo
- Procedure: data are sampled from a known distribution
- Application: power estimation, Bayesian posterior probabilities