R-Packages for Robust Asymptotic Statistics

Similar documents
Package RobLoxBioC. August 3, 2018

Package ROptEst. August 3, 2018

Package distrmod. R topics documented: September 4, Version Date

R Package distrmod: S4 Classes and Methods for Probability Models

Package distrteach. R topics documented: January 20, Version 2.3 Date

Visualizing Gene Clusters using Neighborhood Graphs in R

Codelink Legacy: the old Codelink class

affyqcreport: A Package to Generate QC Reports for Affymetrix Array Data

How to generate new distributions in packages "distr", "distrex"

/ Computational Genomics. Normalization

Package robustreg. R topics documented: April 27, Version Date Title Robust Regression Functions

Preprocessing -- examples in microarrays

How to generate new distributions in packages "distr", "distrex"

LFCseq: a nonparametric approach for differential expression analysis of RNA-seq data - supplementary materials

Package distrrmetrics

R Primer for Introduction to Mathematical Statistics 8th Edition Joseph W. McKean

Monte Carlo Methods and Statistical Computing: My Personal E

Microarray Data Analysis (VI) Preprocessing (ii): High-density Oligonucleotide Arrays

PARAMETERIZATIONS OF PLANE CURVES

Assessing the Quality of the Natural Cubic Spline Approximation

affypara: Parallelized preprocessing algorithms for high-density oligonucleotide array data

Drug versus Disease (DrugVsDisease) package

IDENTIFYING GEOMETRICAL OBJECTS USING IMAGE ANALYSIS

Evaluating Error Functions for Robust Active Appearance Models

Package RobLox. R topics documented: September 5, Version 1.0 Date

Package RobLox. September 13, Version 0.9. Date Title Optimally robust influence curves and estimators for location and scale

10.4 Measures of Central Tendency and Variation

10.4 Measures of Central Tendency and Variation

Convexization in Markov Chain Monte Carlo

Parallelized preprocessing algorithms for high-density oligonucleotide arrays

GLM II. Basic Modeling Strategy CAS Ratemaking and Product Management Seminar by Paul Bailey. March 10, 2015

RNA-Seq. Joshua Ainsley, PhD Postdoctoral Researcher Lab of Leon Reijmers Neuroscience Department Tufts University

Topic 5.1: Line Elements and Scalar Line Integrals. Textbook: Section 16.2

Package st. July 8, 2015

Package distrellipse

Bioconductor tutorial

Package Modeler. May 18, 2018

Package citools. October 20, 2018

Course on Microarray Gene Expression Analysis

Package madsim. December 7, 2016

Bayesian Analysis of Extended Lomax Distribution

BIOINFORMATICS. A comparison of normalization methods for high density oligonucleotide array data based on variance and bias

CANCER PREDICTION USING PATTERN CLASSIFICATION OF MICROARRAY DATA. By: Sudhir Madhav Rao &Vinod Jayakumar Instructor: Dr.

Lecture 3 - Object-oriented programming and statistical programming examples

Overview of the TREC 2005 Spam Track. Gordon V. Cormack Thomas R. Lynam. 18 November 2005

Using FARMS for summarization Using I/NI-calls for gene filtering. Djork-Arné Clevert. Institute of Bioinformatics, Johannes Kepler University Linz

Stochastic Simulation: Algorithms and Analysis

Confidence Intervals: Estimators

SVM Classification in -Arrays

Package GLDreg. February 28, 2017

Theoretical Concepts of Machine Learning

DS Machine Learning and Data Mining I. Alina Oprea Associate Professor, CCIS Northeastern University

Package gee. June 29, 2015

Package UnivRNG. R topics documented: January 10, Type Package

Why use R? Getting started. Why not use R? Introduction to R: Log into tak. Start R R or. It s hard to use at first

Fitting Flexible Parametric Regression Models with GLDreg in R

Using R for statistics and data analysis

Frozen Robust Multi-Array Analysis and the Gene Expression Barcode

DEPARTMENT - Mathematics. Coding: N Number. A Algebra. G&M Geometry and Measure. S Statistics. P - Probability. R&P Ratio and Proportion

Pattern Recognition. Kjell Elenius. Speech, Music and Hearing KTH. March 29, 2007 Speech recognition

The Bootstrap and Jackknife

Robust Kernel Methods in Clustering and Dimensionality Reduction Problems

Package parody. December 31, 2017

Why is Statistics important in Bioinformatics?

Cosmic Ray Shower Profile Track Finding for Telescope Array Fluorescence Detectors

CT79 SOFT COMPUTING ALCCS-FEB 2014

Resampling Methods. Levi Waldron, CUNY School of Public Health. July 13, 2016

Package affyplm. January 6, 2019

Review of feature selection techniques in bioinformatics by Yvan Saeys, Iñaki Inza and Pedro Larrañaga.

IASC-ERS Summer School CoDaCourse 2014

Statistical Analysis of Metabolomics Data. Xiuxia Du Department of Bioinformatics & Genomics University of North Carolina at Charlotte

mistr: A Computational Framework for Mixture and Composite Distributions

Combinatorial Methods in Density Estimation

Lecture 24: Generalized Additive Models Stat 704: Data Analysis I, Fall 2010

Bayesian Estimation for Skew Normal Distributions Using Data Augmentation

Modelling and Quantitative Methods in Fisheries

Lecture 8 Fitting and Matching

Package smoothmest. February 20, 2015

Shape fitting and non convex data analysis

Introduction to the xps Package: Comparison to Affymetrix Power Tools (APT)

Product Catalog. AcaStat. Software

Overview of R. Biostatistics

Package misclassglm. September 3, 2016

Overview. Background. Locating quantitative trait loci (QTL)

Minitab 17 commands Prepared by Jeffrey S. Simonoff

500K Data Analysis Workflow using BRLMM

Feature Detectors and Descriptors: Corners, Lines, etc.

aroma.affymetrix: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory

Math 32, August 20: Review & Parametric Equations

Supplementary Materials for

Response to API 1163 and Its Impact on Pipeline Integrity Management

Introduction to R: Using R for Statistics and Data Analysis. BaRC Hot Topics

Recent advances in Metamodel of Optimal Prognosis. Lectures. Thomas Most & Johannes Will

Package LMGene. R topics documented: December 23, Version Date

Combine the PA Algorithm with a Proximal Classifier

Computational Cognitive Science

CPIB SUMMER SCHOOL 2011: INTRODUCTION TO BIOLOGICAL MODELLING

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

Improving Convergence in Cartesian Genetic Programming Using Adaptive Crossover, Mutation and Selection

Lecture 9 Fitting and Matching

Transcription:

Chair for Stochastics joint work with Dr. Peter Ruckdeschel Fraunhofer ITWM user! The R User Conference 2008 Dortmund August 12

Outline Robust Asymptotic Statistics 1 Robust Asymptotic Statistics 2 3

Outline Robust Asymptotic Statistics 1 Robust Asymptotic Statistics 2 3

Setup I Robust Asymptotic Statistics Ideal model: L 2 -differentiable parametric family of probability measures, parameter space: Θ R k (open) Estimator class: asymptotically linear estimators (ALEs) S n S n (x 1,..., x n ) = θ + 1 n ψ θ (x i ) + R n n i=1 x 1,..., x n : sample ψ θ : influence curve/function (IC) at θ Θ R n : asymptotically negligible remainder E.g. as. normal M-, L-, R-, S- and MD-estimators

Setup I Robust Asymptotic Statistics Ideal model: L 2 -differentiable parametric family of probability measures, parameter space: Θ R k (open) Estimator class: asymptotically linear estimators (ALEs) S n S n (x 1,..., x n ) = θ + 1 n ψ θ (x i ) + R n n i=1 x 1,..., x n : sample ψ θ : influence curve/function (IC) at θ Θ R n : asymptotically negligible remainder E.g. as. normal M-, L-, R-, S- and MD-estimators

Setup II Robust Asymptotic Statistics Infinitesimal neighborhood: deviations (gross errors, outliers, etc.) from the ideal model P θ of form d (P θ, Q) = r =: r n Q M 1 n M 1 : set of all probability measures d : some distance or pseudo-distance r: radius in [0, n] E.g. Tukey s gross error model Q = (1 r n )P θ + r n H n H n M 1

Optimally robust ALEs Optimization problem: G ( asbias (S n ), asvar (S n ) ) = min! G: positive, convex, strictly increasing in both args asbias (S n ): some function of ψ θ (IC) asvar (S n ): some function of ψ θ (IC) Hence: minimum is taken over all ICs ψ θ Optimal solutions: Rieder (1994) [3], Ruckdeschel and Rieder (2004) [10], Kohl (2005) [2] Unknown radius: radius-minimax estimator; cf. Rieder et al. (2008) [8]

Optimally robust ALEs Optimization problem: G ( asbias (S n ), asvar (S n ) ) = min! G: positive, convex, strictly increasing in both args asbias (S n ): some function of ψ θ (IC) asvar (S n ): some function of ψ θ (IC) Hence: minimum is taken over all ICs ψ θ Optimal solutions: Rieder (1994) [3], Ruckdeschel and Rieder (2004) [10], Kohl (2005) [2] Unknown radius: radius-minimax estimator; cf. Rieder et al. (2008) [8]

Optimally robust ALEs Optimization problem: G ( asbias (S n ), asvar (S n ) ) = min! G: positive, convex, strictly increasing in both args asbias (S n ): some function of ψ θ (IC) asvar (S n ): some function of ψ θ (IC) Hence: minimum is taken over all ICs ψ θ Optimal solutions: Rieder (1994) [3], Ruckdeschel and Rieder (2004) [10], Kohl (2005) [2] Unknown radius: radius-minimax estimator; cf. Rieder et al. (2008) [8]

Optimally robust estimation Possible steps to compute an optimally robust estimator: 1 Decide on ideal model, neighborhood and risk 2 Try to find a rough estimate for the amount r n [0, 1] of gross errors such that r n [r n, r n ]. 3 Choose and evaluate appropriate initial estimate; e.g., Kolmogorov or Cramér von Mises MD-estimator 4 Estimate the parameter(s) of interest by means of the corresponding radius-minimax estimator (cf. Rieder et al. (2008) [8]) using a k-step (k 1) construction.

Optimally robust estimation Possible steps to compute an optimally robust estimator: 1 Decide on ideal model, neighborhood and risk 2 Try to find a rough estimate for the amount r n [0, 1] of gross errors such that r n [r n, r n ]. 3 Choose and evaluate appropriate initial estimate; e.g., Kolmogorov or Cramér von Mises MD-estimator 4 Estimate the parameter(s) of interest by means of the corresponding radius-minimax estimator (cf. Rieder et al. (2008) [8]) using a k-step (k 1) construction.

Optimally robust estimation Possible steps to compute an optimally robust estimator: 1 Decide on ideal model, neighborhood and risk 2 Try to find a rough estimate for the amount r n [0, 1] of gross errors such that r n [r n, r n ]. 3 Choose and evaluate appropriate initial estimate; e.g., Kolmogorov or Cramér von Mises MD-estimator 4 Estimate the parameter(s) of interest by means of the corresponding radius-minimax estimator (cf. Rieder et al. (2008) [8]) using a k-step (k 1) construction.

Optimally robust estimation Possible steps to compute an optimally robust estimator: 1 Decide on ideal model, neighborhood and risk 2 Try to find a rough estimate for the amount r n [0, 1] of gross errors such that r n [r n, r n ]. 3 Choose and evaluate appropriate initial estimate; e.g., Kolmogorov or Cramér von Mises MD-estimator 4 Estimate the parameter(s) of interest by means of the corresponding radius-minimax estimator (cf. Rieder et al. (2008) [8]) using a k-step (k 1) construction.

Outline Robust Asymptotic Statistics 1 Robust Asymptotic Statistics 2 3

Some examples Normal (Gaussian): location and scale Binomial: probability of success Poisson: positive mean Gamma: shape and scale Gumbel: location and scale all smoothly parameterized exponential families of full rank Approach also works for other smoothly parametrized families!

Some examples Normal (Gaussian): location and scale Binomial: probability of success Poisson: positive mean Gamma: shape and scale Gumbel: location and scale all smoothly parameterized exponential families of full rank Approach also works for other smoothly parametrized families!

Some examples Normal (Gaussian): location and scale Binomial: probability of success Poisson: positive mean Gamma: shape and scale Gumbel: location and scale all smoothly parameterized exponential families of full rank Approach also works for other smoothly parametrized families!

Basic R-Packages distr: S4-classes for distributions. distrex: Functionals on distributions. RandVar: S4-classes and methods for random variables. distrmod: S4-classes for parametric families of probability measures, minimum distance (MD) estimators. RobAStBase: S4-classes for ICs and infinitesimal neighborhoods. cf. Ruckdeschel et al. (2006) [9], Kohl (2005) [2], http://r-forge.r-project.org/projects/distr/, http://r-forge.r-project.org/projects/robast/

Basic R-Packages distr: S4-classes for distributions. distrex: Functionals on distributions. RandVar: S4-classes and methods for random variables. distrmod: S4-classes for parametric families of probability measures, minimum distance (MD) estimators. RobAStBase: S4-classes for ICs and infinitesimal neighborhoods. cf. Ruckdeschel et al. (2006) [9], Kohl (2005) [2], http://r-forge.r-project.org/projects/distr/, http://r-forge.r-project.org/projects/robast/

R-Packages for optimally robust estimation Devel version 0.6 (version 0.5 on CRAN) ROptEst: Optimally robust estimation for L2 differentiable parametric families. RobLox: Optimally robust estimation for normal (Gaussian) location and scale (optimized for speed). cf. Ruckdeschel et al. (2006) [9], Kohl (2005) [2], http://r-forge.r-project.org/projects/distr/, http://r-forge.r-project.org/projects/robast/

R-Packages for optimally robust estimation Devel version 0.6 (version 0.5 on CRAN) ROptEst: Optimally robust estimation for L2 differentiable parametric families. RobLox: Optimally robust estimation for normal (Gaussian) location and scale (optimized for speed). cf. Ruckdeschel et al. (2006) [9], Kohl (2005) [2], http://r-forge.r-project.org/projects/distr/, http://r-forge.r-project.org/projects/robast/

Example 1: Poisson Decay counts of polonium by Rutherford and Geiger (1910); cf. Feller (1968)[1] R > table(x) x 0 1 2 3 4 5 6 7 8 9 10 11 13 14 57 203 383 525 532 408 273 139 45 27 10 4 1 1 R > ## ML-estimate R > mean(x) [1] 3.871549 R > ## or with package distrmod R > MLest <- MLEstimator(x, PoisFamily(), interval = c(0, 10)) R > estimate(mlest) lambda 3.871547 R > ## Optimally robust 3-step estimate from package ROptEst (version 0.6.0) R > ## takes about 4 sec (Centrino Duo 1.66 GHz) R > ROest <- roptest(x, PoisFamily(), eps.upper = 0.05, interval = c(0, 10), steps = 3) R > estimate(roest) lambda 3.907973

Example 1: Poisson - comparison of results Decay counts of polonium counts per 1/8 minute 0 100 200 300 400 500 mean roptest 0 2 4 6 8 10 12 14 decays

Example 2: Normal location and scale Copper in wholemeal flour; cf. MASS [4] R > chem [1] 2.90 3.10 3.40 3.40 3.70 3.70 2.80 2.50 2.40 2.40 2.70 2.20 [13] 5.28 3.37 3.03 3.03 28.95 3.77 3.40 2.20 3.50 3.60 3.70 3.70 R > ## ML-estimate (mean and sd) from package distrmod R > MLest <- MLEstimator(chem, NormLocationScaleFamily()) R > ## median and MAD R > initial.est <- c(median(chem), mad(chem)) R > ## Optimally robust 3-step estimate from package ROptEst (version 0.6.0) R > ## takes about 80 sec (Centrino Duo 1.66 GHz) R > ROest1 <- roptest(chem, NormLocationScaleFamily(), eps.upper = 0.05, steps = 3, + initial.est = initial.est) R > ## Use package RobLox (version 0.6.0) which is optimized for speed! R > ## takes about 0.12 sec (Centrino Duo 1.66 GHz) R > ROest2 <- roblox(chem, eps.upper = 0.05, k = 3, returnic = TRUE)

Example 2: Normal location and scale chem data (28.95 omitted) copper [parts per million] 0 2 4 6 8 10 mean +/ sd roblox: loc +/ scale 5 10 15 20 index

Example 3: Affymetrix gene expression data Extract log-pm (perfect match) data from a HG U133+ 2.0 array R > library(maqcsubsetafx) R > data(refa) R > ex.data <- refa[,1] R > CDFINFO <- getcdfinfo(ex.data) R > ids <- featurenames(ex.data) R > INDEX <- sapply(ids, get, envir = CDFINFO) R > NROW <- unlist(lapply(index, nrow)) R > table(nrow) NROW 8 9 10 11 13 14 15 16 20 69 5 1 6 54130 4 4 2 482 40 1 R > rawdata <- intensity(ex.data) R > fun <- function(index, x) log2(x[index[,1], ]) R > logpm <- lapply(index, fun, x = rawdata)

Example 3: Affymetrix gene expression data Optimally robust estimation of location and scale for each Affymetrix ID via roblox and rowroblox R > ## takes about 17 minutes (Centrino Duo 1.66 GHz) R > ROest1 <- lapply(logpm, function(x) estimate(roblox(x))) R > ## takes about 1.3 sec (Centrino Duo 1.66 GHz) R > nr <- as.integer(names(table(nrow))) R > ROest2 <- matrix(na, ncol = 2, nrow = length(nrow)) R > for(k in nr){ + ind <- which(nrow == k) + temp <- do.call(rbind, logpm[ind]) + ROest2[ind, 1:2] <- estimate(rowroblox(temp)) + } R > ## maximum deviation roblox vs. rowroblox: location R > max(abs(unlist(roest1)[seq(1, 2*54675-1, 2)] - ROest2[,1])) [1] 5.640855e-06 R > ROest12 <- unlist(roest1)[seq(2, 2*54675, 2)] R > ## maximum deviation roblox vs. rowroblox: scale R > max(abs(unlist(roest1)[seq(2, 2*54675, 2)] - ROest2[,2])) [1] 2.591696e-06

Example 3: Affymetrix gene expression data Affymetrix log expression data 6 8 10 12 14 roblox rowroblox

Outline Robust Asymptotic Statistics 1 Robust Asymptotic Statistics 2 3

Devel version 0.6 (version 0.5 on CRAN) ROptRegTS: Optimally robust estimation for regression and time series models. RobRex: Optimally robust estimation for linear regression with normal errors. Kohl (2005) [2], http://r-forge.r-project.org/projects/robast/

Devel version 0.6 (version 0.5 on CRAN) ROptRegTS: Optimally robust estimation for regression and time series models. RobRex: Optimally robust estimation for linear regression with normal errors. Kohl (2005) [2], http://r-forge.r-project.org/projects/robast/

Current developments Confidence intervals Diagnostic plots Simpler user interfaces for regression models Thank you!

Current developments Confidence intervals Diagnostic plots Simpler user interfaces for regression models Thank you!

Bibliography I M. Feller (1968). An introduction to probability theory and its applications. I. John Wiley and Sons. M. Kohl (2005). Numerical Contributions to the Asymptotic Theory of Robustness. Dissertation. University of Bayreuth. H. Rieder (1994). Robust Asymptotic Statistics. Springer. Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S. Fourth edition. Springer. L. Gatto (2008). MAQCsubsetAFX: MAQC data subset for the Affymetrix platform. R package version 1.0.0. http://www.slashhome.be/maqcsubsetafx.php L. Gautier, L. Cope, B.M. Bolstad, and R.A. Irizarry (2004). affy analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 3 (Feb. 2004), 307-315. R. Gentleman, V.J. Carey, D.M. Bates et al. (2004) Bioconductor: Open software development for computational biology and bioinformatics. Genome Biology, Vol. 5, R80

Bibliography II H. Rieder, M. Kohl and P. Ruckdeschel (2008). The Costs of not Knowing the Radius. Statistical Methods and Application 17(1):13 40. P. Ruckdeschel, M. Kohl, T. Stabla and F. Camphausen (2006). S4 classes for distributions. R-News 6(2):2 6. P. Ruckdeschel and H. Rieder (2004). Optimal influence curves for general loss functions. Stat. Decis. 22:201 223. Rutherford, E. and Geiger, H. (1910). The Probability Variations in the Distribution of alpha Particles. Philosophical Magazine 20:698 704. R Development Core Team (2008). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, http://www.r-project.org.