A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn


CHAPTER 7

Density Estimation: Erupting Geysers and Star Clusters

7.1 Introduction

7.2 Density Estimation

The three kernel functions are implemented in R as shown in lines 1-3 of Figure 7.1. For some grid x, the kernel functions are plotted using the R statements in lines 5-11 (Figure 7.1).

The kernel estimator f̂ is a sum of 'bumps' placed at the observations. The kernel function determines the shape of the bumps, while the window width h determines their width. Figure 7.2 (redrawn from a similar plot in Silverman, 1986) shows the individual bumps n^{-1} h^{-1} K((x - x_i)/h), as well as the estimate f̂ obtained by adding them up, for an artificial set of data points

R> x <- c(0, 1, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)
R> n <- length(x)

For a grid

R> xgrid <- seq(from = min(x) - 1, to = max(x) + 1, by = 0.01)

on the real line, we can compute the contribution of each measurement in x, with h = 0.4, by the Gaussian kernel (defined in Figure 7.1, line 3) as follows:

R> h <- 0.4
R> bumps <- sapply(x, function(a) gauss((xgrid - a)/h)/(n * h))

A plot of the individual bumps and their sum, the kernel density estimate f̂, is shown in Figure 7.2.

7.3 Analysis Using R

7.3.1 A Parametric Density Estimate for the Old Faithful Data

R> logL <- function(param, x) {
+     d1 <- dnorm(x, mean = param[2], sd = param[3])
+     d2 <- dnorm(x, mean = param[4], sd = param[5])
+     -sum(log(param[1] * d1 + (1 - param[1]) * d2))
+ }
R> startparam <- c(p = 0.5, mu1 = 50, sd1 = 3, mu2 = 80, sd2 = 3)
R> opp <- optim(startparam, logL, x = faithful$waiting,

1  R> rec <- function(x) (abs(x) < 1) * 0.5
2  R> tri <- function(x) (abs(x) < 1) * (1 - abs(x))
3  R> gauss <- function(x) 1/sqrt(2*pi) * exp(-(x^2)/2)
4  R> x <- seq(from = -3, to = 3, by = 0.001)
5  R> plot(x, rec(x), type = "l", ylim = c(0, 1), lty = 1,
6  +        ylab = expression(K(x)))
7  R> lines(x, tri(x), lty = 2)
8  R> lines(x, gauss(x), lty = 3)
9  R> legend(-3, 0.8, legend = c("Rectangular", "Triangular",
10 +        "Gaussian"), lty = 1:3, title = "kernel functions",
11 +        bty = "n")

Figure 7.1  Three commonly used kernel functions.
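As a quick cross-check (a sketch not in the original text), summing the bumps column-wise should reproduce a direct evaluation of the kernel estimator f̂(x) = (nh)^{-1} Σ K((x - x_i)/h); the definitions below restate the ones from the text so the snippet runs on its own:

```r
## Sketch (not from the book): verify that the sum of the bumps equals a
## direct evaluation of the kernel estimator at every grid point.
gauss <- function(x) 1/sqrt(2*pi) * exp(-(x^2)/2)
x <- c(0, 1, 1.1, 1.5, 1.9, 2.8, 2.9, 3.5)   # artificial data from the text
n <- length(x)
h <- 0.4
xgrid <- seq(from = min(x) - 1, to = max(x) + 1, by = 0.01)
bumps <- sapply(x, function(a) gauss((xgrid - a)/h)/(n * h))
## direct evaluation of fhat at each grid point
fhat <- sapply(xgrid, function(g) sum(gauss((g - x)/h))/(n * h))
stopifnot(isTRUE(all.equal(rowSums(bumps), fhat)))
```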

1 R> plot(xgrid, rowSums(bumps), ylab = expression(hat(f)(x)),
2 +      type = "l", xlab = "x", lwd = 2)
3 R> rug(x, lwd = 2)
4 R> out <- apply(bumps, 2, function(b) lines(xgrid, b))

Figure 7.2  Kernel estimate showing the contributions of Gaussian kernels evaluated for the individual observations with bandwidth h = 0.4.

+     method = "L-BFGS-B",
+     lower = c(0.01, rep(1, 4)),
+     upper = c(0.99, rep(200, 4)))
R> opp
$par
         p        mu1        sd1        mu2        sd2
 0.3608912 54.6121380  5.8723776 80.0934106  5.8672829

$value
[1] 1034.002

$counts

R> epa <- function(x, y)
+      ((x^2 + y^2) < 1) * 2/pi * (1 - x^2 - y^2)
R> x <- seq(from = -1.1, to = 1.1, by = 0.05)
R> epavals <- sapply(x, function(a) epa(a, x))
R> persp(x = x, y = x, z = epavals, xlab = "x", ylab = "y",
+      zlab = expression(K(x, y)), theta = -35, axes = TRUE,
+      box = TRUE)

Figure 7.3  Epanechnikov kernel for a grid between (-1.1, -1.1) and (1.1, 1.1).
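As a brief sanity check (not part of the original text), one can verify numerically that this bivariate Epanechnikov kernel integrates to one over the unit disc, i.e., that it is a genuine density:

```r
## Sketch (not from the book): Riemann-sum approximation of the integral of
## the bivariate Epanechnikov kernel over its support, the unit disc.
epa <- function(x, y) ((x^2 + y^2) < 1) * 2/pi * (1 - x^2 - y^2)
step <- 0.001
g <- seq(from = -1, to = 1, by = step)
total <- sum(outer(g, g, epa)) * step^2   # should be close to 1
stopifnot(abs(total - 1) < 0.01)
```

Analytically, in polar coordinates the integral is 4 * integral of (r - r^3) from 0 to 1, which equals exactly 1.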

1  R> data("faithful", package = "datasets")
2  R> x <- faithful$waiting
3  R> layout(matrix(1:3, ncol = 3))
4  R> hist(x, xlab = "Waiting times (in min.)", ylab = "Frequency",
5  +      probability = TRUE, main = "Gaussian kernel",
6  +      border = "gray")
7  R> lines(density(x, width = 12), lwd = 2)
8  R> rug(x)
9  R> hist(x, xlab = "Waiting times (in min.)", ylab = "Frequency",
10 +      probability = TRUE, main = "Rectangular kernel",
11 +      border = "gray")
12 R> lines(density(x, width = 12, window = "rectangular"), lwd = 2)
13 R> rug(x)
14 R> hist(x, xlab = "Waiting times (in min.)", ylab = "Frequency",
15 +      probability = TRUE, main = "Triangular kernel",
16 +      border = "gray")
17 R> lines(density(x, width = 12, window = "triangular"), lwd = 2)
18 R> rug(x)

Figure 7.4  Density estimates of the geyser eruption data imposed on a histogram of the data.

R> library("KernSmooth")
R> data("CYGOB1", package = "HSAUR")
R> CYGOB1d <- bkde2D(CYGOB1, bandwidth = sapply(CYGOB1, dpik))
R> contour(x = CYGOB1d$x1, y = CYGOB1d$x2, z = CYGOB1d$fhat,
+      xlab = "log surface temperature",
+      ylab = "log light intensity")

Figure 7.5  A contour plot of the bivariate density estimate of the CYGOB1 data, i.e., a two-dimensional graphical display for a three-dimensional problem.

R> persp(x = CYGOB1d$x1, y = CYGOB1d$x2, z = CYGOB1d$fhat,
+      xlab = "log surface temperature",
+      ylab = "log light intensity",
+      zlab = "estimated density",
+      theta = -35, axes = TRUE, box = TRUE)

Figure 7.6  The bivariate density estimate of the CYGOB1 data, here shown in a three-dimensional fashion using the persp function.

function gradient
      55       55

$convergence
[1] 0

Of course, optimising the appropriate likelihood by hand is not very convenient. In fact, (at least) two packages offer high-level functionality for estimating mixture models. The first one is package mclust (Fraley et al., 2006), implementing the methodology described in Fraley and Raftery (2002). Here,

a Bayesian information criterion (BIC) is applied to choose the form of the mixture model:

R> library("mclust")
R> mc <- Mclust(faithful$waiting)
R> mc
'Mclust' model object:
 best model: univariate, equal variance (E) with 2 components

and the estimated means are

R> mc$parameters$mean
       1        2
54.61675 80.09239

with estimated standard deviation (found to be equal within both groups)

R> sqrt(mc$parameters$variance$sigmasq)
[1] 5.868639

The proportion is p̂ = 0.36. The second package is called flexmix, whose functionality is described by Leisch (2004). A mixture of two normals can be fitted using

R> library("flexmix")
R> fl <- flexmix(waiting ~ 1, data = faithful, k = 2)

with p̂ = 0.36 and estimated parameters

R> parameters(fl, component = 1)
                    Comp.1
coef.(Intercept) 54.628701
sigma             5.895234
R> parameters(fl, component = 2)
                    Comp.2
coef.(Intercept) 80.098582
sigma             5.871749

We can get standard errors for the five parameter estimates by using a bootstrap approach (see Efron and Tibshirani, 1993). The original data are slightly perturbed by drawing n out of n observations with replacement; these artificial replications of the original data are called bootstrap samples. We can then fit the mixture for each bootstrap sample and assess the variability of the estimates, for example using confidence intervals. Some suitable R code based on the Mclust function follows. First, we define a function that, for a bootstrap sample indx, fits a two-component mixture model and returns p̂ and the estimated means (note that we need to make sure that we always get an estimate of p, not 1 - p):

R> library("boot")
R> fit <- function(x, indx) {
+     a <- Mclust(x[indx], minG = 2, maxG = 2)$parameters
+     if (a$pro[1] < 0.5)
+         return(c(p = a$pro[1], mu1 = a$mean[1],

R> opar <- as.list(opp$par)
R> rx <- seq(from = 40, to = 110, by = 0.1)
R> d1 <- dnorm(rx, mean = opar$mu1, sd = opar$sd1)
R> d2 <- dnorm(rx, mean = opar$mu2, sd = opar$sd2)
R> f <- opar$p * d1 + (1 - opar$p) * d2
R> hist(x, probability = TRUE, xlab = "Waiting times (in min.)",
+      border = "gray", xlim = range(rx), ylim = c(0, 0.06),
+      main = "")
R> lines(rx, f, lwd = 2)
R> lines(rx, dnorm(rx, mean = mean(x), sd = sd(x)), lty = 2,
+      lwd = 2)
R> legend(50, 0.06, lty = 1:2, bty = "n",
+      legend = c("Fitted two-component mixture density",
+          "Fitted single normal density"))

Figure 7.7  Fitted normal density and two-component normal mixture for geyser eruption data.
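To make the resampling step concrete — a minimal sketch, not from the original text — one bootstrap sample is simply the data indexed by n draws with replacement from 1:n; the boot function constructs exactly such an index vector and passes it to fit as indx:

```r
## Sketch (illustration only, not from the book): one bootstrap replication
## 'by hand', mimicking what boot() does internally for each sample.
data("faithful", package = "datasets")
waiting <- faithful$waiting
n <- length(waiting)
set.seed(1)                          # reproducibility of this sketch only
indx <- sample(n, replace = TRUE)    # n out of n indices, with replacement
bsample <- waiting[indx]             # one bootstrap sample
stopifnot(length(bsample) == n, all(bsample %in% waiting))
```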

+         mu2 = a$mean[2]))
+     return(c(p = 1 - a$pro[1], mu1 = a$mean[2],
+         mu2 = a$mean[1]))
+ }

The function fit can now be fed into the boot function (Canty and Ripley, 2006) for bootstrapping (here 1000 bootstrap samples are drawn):

R> bootpara <- boot(faithful$waiting, fit, R = 1000)

We assess the variability of our estimate p̂ by means of adjusted bootstrap percentile (BCa) confidence intervals, which for p̂ can be obtained from

R> boot.ci(bootpara, type = "bca", index = 1)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = bootpara, type = "bca", index = 1)

Intervals :
Level       BCa
95%   ( 0.3041,  0.4233 )
Calculations and Intervals on Original Scale

We see that there is reasonable variability in the mixture model; however, the means in the two components are rather stable, as can be seen from

R> boot.ci(bootpara, type = "bca", index = 2)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = bootpara, type = "bca", index = 2)

Intervals :
Level       BCa
95%   (53.42, 56.07 )
Calculations and Intervals on Original Scale

for µ̂1, and for µ̂2 from

R> boot.ci(bootpara, type = "bca", index = 3)
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = bootpara, type = "bca", index = 3)

Intervals :
Level       BCa
95%   (79.05, 81.01 )
Calculations and Intervals on Original Scale

Finally, we show a graphical representation of both the bootstrap distribution of the mean estimates and the corresponding confidence intervals. For convenience, we define a function for plotting, namely

R> bootplot <- function(b, index, main = "") {
+     dens <- density(b$t[,index])
+     ci <- boot.ci(b, type = "bca", index = index)$bca[4:5]

R> layout(matrix(1:2, ncol = 2))
R> bootplot(bootpara, 2, main = expression(mu[1]))
R> bootplot(bootpara, 3, main = expression(mu[2]))

Figure 7.8  Bootstrap distribution and confidence intervals for the mean estimates of a two-component mixture for the geyser data.

+     est <- b$t0[index]
+     plot(dens, main = main)
+     y <- max(dens$y) / 10
+     segments(ci[1], y, ci[2], y, lty = 2)
+     points(ci[1], y, pch = "(")
+     points(ci[2], y, pch = ")")
+     points(est, y, pch = 19)
+ }

The element t of an object created by boot contains the bootstrap replications of our estimates, i.e., the values computed by fit for each of the 1000 bootstrap samples of the geyser data. First, we plot a simple density estimate and then construct a line representing the confidence interval. We apply this function to the bootstrap distributions of our estimates µ̂1 and µ̂2 in Figure 7.8.
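As a rough cross-check — a sketch under stated assumptions, not part of the original analysis — a plain percentile interval is just the empirical 2.5% and 97.5% quantiles of the bootstrap replications (boot.ci also reports this via type = "perc"); it is typically close to, but not identical with, the bias- and skewness-adjusted BCa interval. Simulated replications stand in for bootpara$t here so the snippet is self-contained:

```r
## Sketch (not from the book): simple percentile interval from bootstrap
## replications via quantile(). 'reps' is a hypothetical stand-in for
## bootpara$t[, 2], with roughly the spread seen for mu1 above.
set.seed(42)
reps <- rnorm(1000, mean = 54.6, sd = 0.68)
ci <- quantile(reps, probs = c(0.025, 0.975))   # 95% percentile interval
stopifnot(ci[1] < 54.6, 54.6 < ci[2])
```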

Bibliography

Canty, A. and Ripley, B. D. (2006), boot: Bootstrap R (S-PLUS) Functions (Canty), URL http://cran.r-project.org, R package version 1.2-29.

Efron, B. and Tibshirani, R. J. (1993), An Introduction to the Bootstrap, London, UK: Chapman & Hall/CRC.

Fraley, C. and Raftery, A. E. (2002), "Model-based clustering, discriminant analysis, and density estimation," Journal of the American Statistical Association, 97, 611-631.

Fraley, C., Raftery, A. E., and Wehrens, R. (2006), mclust: Model-based Cluster Analysis, URL http://www.stat.washington.edu/mclust, R package version 3.1-1.

Leisch, F. (2004), "FlexMix: A general framework for finite mixture models and latent class regression in R," Journal of Statistical Software, 11, URL http://www.jstatsoft.org/v11/i08/.

Silverman, B. (1986), Density Estimation, London, UK: Chapman & Hall/CRC.