Cumulative Distribution Function (CDF) Deconvolution

Similar documents
Cumulative Distribution Function (CDF) Analysis

Package spsurvey. May 22, 2015

Package spsurvey. August 19, 2016

GRTS Survey Designs for a Linear Resource

Chapter 3. Bootstrap. 3.1 Introduction. 3.2 The general idea

Week 4: Describing data and estimation

Kernel Density Estimation (KDE)

Learning Objectives. Continuous Random Variables & The Normal Probability Distribution. Continuous Random Variable

Unit 7 Statistics. AFM Mrs. Valentine. 7.1 Samples and Surveys

1. Estimation equations for strip transect sampling, using notation consistent with that used to

Nonparametric Density Estimation

Today s Lecture. Factors & Sampling. Quick Review of Last Week s Computational Concepts. Numbers we Understand. 1. A little bit about Factors

Chapter 1. Looking at Data-Distribution

VALIDITY OF 95% t-confidence INTERVALS UNDER SOME TRANSECT SAMPLING STRATEGIES

Chapter 6 Normal Probability Distributions

Introduction to R. UCLA Statistical Consulting Center R Bootcamp. Irina Kukuyeva September 20, 2010

A Random Number Based Method for Monte Carlo Integration

Regression III: Advanced Methods

Tutorial 3: Probability & Distributions Johannes Karreth RPOS 517, Day 3

Probability Models.S4 Simulating Random Variables

Unit 1 Review of BIOSTATS 540 Practice Problems SOLUTIONS - Stata Users

Table of Contents (As covered from textbook)

NONPARAMETRIC REGRESSION WIT MEASUREMENT ERROR: SOME RECENT PR David Ruppert Cornell University

Statistics I 2011/2012 Notes about the third Computer Class: Simulation of samples and goodness of fit; Central Limit Theorem; Confidence intervals.

Chapter 2. Descriptive Statistics: Organizing, Displaying and Summarizing Data

CREATING THE DISTRIBUTION ANALYSIS

R Programming Basics - Useful Builtin Functions for Statistics

Statistics 251: Statistical Methods

Lab 5 - Risk Analysis, Robustness, and Power

Minitab on the Math OWL Computers (Windows NT)

On Kernel Density Estimation with Univariate Application. SILOKO, Israel Uzuazor

Part I, Chapters 4 & 5. Data Tables and Data Analysis Statistics and Figures

Quantitative - One Population

Bluman & Mayer, Elementary Statistics, A Step by Step Approach, Canadian Edition

Organizing and Summarizing Data

An Overview of High-Speed Serial Bus Simulation Technologies

Generalized Random Tessellation Stratified (GRTS) Spatially-Balanced Survey Designs for Aquatic Resources

Data Mining: Exploring Data. Lecture Notes for Chapter 3

Learner Expectations UNIT 1: GRAPICAL AND NUMERIC REPRESENTATIONS OF DATA. Sept. Fathom Lab: Distributions and Best Methods of Display

We have seen that as n increases, the length of our confidence interval decreases, the confidence interval will be more narrow.

MULTIVARIATE ANALYSIS USING R

STAT 503 Fall Introduction to SAS

Assignments. Math 338 Lab 1: Introduction to R. Atoms, Vectors and Matrices

Chapter 6: DESCRIPTIVE STATISTICS

Basic Statistical Terms and Definitions

Package EMDomics. November 11, 2018

Using the DATAMINE Program

Minitab 17 commands Prepared by Jeffrey S. Simonoff

IN-CLASS EXERCISE: INTRODUCTION TO THE R "SURVEY" PACKAGE

A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection (Kohavi, 1995)

MATH11400 Statistics Homepage

Use of Extreme Value Statistics in Modeling Biometric Systems

Page 1. Graphical and Numerical Statistics

VARIANCE REDUCTION TECHNIQUES IN MONTE CARLO SIMULATIONS K. Ming Leung

SEMI-BLIND IMAGE RESTORATION USING A LOCAL NEURAL APPROACH

1 Matrices and Vectors and Lists

CSE 417T: Introduction to Machine Learning. Lecture 6: Bias-Variance Trade-off. Henry Chai 09/13/18

Mean Tests & X 2 Parametric vs Nonparametric Errors Selection of a Statistical Test SW242

Six Sigma Green Belt Part 5

AP Statistics. Study Guide

1 Overview of Statistics; Essential Vocabulary

Excel 2010 with XLSTAT

IQC monitoring in laboratory networks

2.1: Frequency Distributions and Their Graphs

2) familiarize you with a variety of comparative statistics biologists use to evaluate results of experiments;

What s Normal Anyway?

Intro to R h)p://jacobfenton.s3.amazonaws.com/r- handson.pdf. Jacob Fenton CAR Director InvesBgaBve ReporBng Workshop, American University

Econ Stata Tutorial I: Reading, Organizing and Describing Data. Sanjaya DeSilva

Unit I Supplement OpenIntro Statistics 3rd ed., Ch. 1

Package SHELF. August 16, 2018

Math 14 Lecture Notes Ch. 6.1

Eksamen ERN4110, 6/ VEDLEGG SPSS utskrifter til oppgavene (Av plasshensyn kan utskriftene være noe redigert)

An Excel Add-In for Capturing Simulation Statistics

Topic (3) SUMMARIZING DATA - TABLES AND GRAPHICS

Chapter 2 - Descriptive Statistics

Chapter 2 Modeling Distributions of Data

Fathom Dynamic Data TM Version 2 Specifications

Package fitur. March 11, 2018

Testing Random- Number Generators

Describing PPS designs to R

Probability and Statistics. Copyright Cengage Learning. All rights reserved.

Nonparametric Estimation of Distribution Function using Bezier Curve

Gauss for Econometrics: Simulation

a. divided by the. 1) Always round!! a) Even if class width comes out to a, go up one.

Multivariate probability distributions

Name: Date: Period: Chapter 2. Section 1: Describing Location in a Distribution

Effective probabilistic stopping rules for randomized metaheuristics: GRASP implementations

Package bayesdp. July 10, 2018

Maths Class 9 Notes for Statistics

A Simple Metal Detector Model to Predict the Probability of Detection in Landmine Detection

Chapter 12: Statistics

Digital Image Processing, 2nd ed. Digital Image Processing, 2nd ed. The principal objective of enhancement

STATISTICS Chapter (1) Introduction

Data606 Lab3 Yuen Chun Wong September 17, 2017

ON THE POTENTIAL OF COMPUTATIONALLY RENDERED SCENES FOR LIGHTING QUALITY EVALUATION

In Minitab interface has two windows named Session window and Worksheet window.

CHAPTER 6. The Normal Probability Distribution

Random integer partitions with restricted numbers of parts

Unit 5: Estimating with Confidence

A (very) brief introduction to R

Transcription:

Cumulative Distribution Function (CDF) Deconvolution Thomas Kincaid September 20, 2013 Contents 1 Introduction 1 2 Preliminaries 1 3 Read the Simulated Variables Data File 2 4 Illustration of Extraneous Variance 3 5 Deconvolution 7 1 Introduction This document presents deconvolution of a cumulative distribution function (CDF). Convolution is a term that refers to a distribution that is a mixture of two or more component distributions. Specifically, we are interested in the situation where the CDF is a mixture of the distribution for a variable of interest and measurement error. The presence of measurement error will cause the CDF to occur across a wider range of variable values, which will lead to biased estimation of the CDF. Deconvolution is the name of the procedure for removing the measurement error from the CDF. In order to illustrate the deconvolution process, simulated data is used rather than data generated by a probability survey. For additional discussion of measurement error and deconvolution see Kincaid and Olsen (2012). 2 Preliminaries The initial step is to use the library function to load the spsurvey package. After the package is loaded, a message is printed to the R console indicating that the spsurvey package was 1

loaded successfully. Load the spsurvey package > # Load the spsurvey package > library(spsurvey) > Version 2.5 of the spsurvey package was loaded successfully. 3 Read the Simulated Variables Data File A data file was created that contains simulated species richness variables. The initial variable was created using the Normal distribution with mean 20 and standard deviation 5. Addition variables were created by adding 25%, 50%, and 100% measurement error variance to the original variable, where the percent values refer to the variance of the original variable. The data file also includes x-coordinates and y-coordinates, which were simulated using the Uniform distribution with a range from zero to one. The data file contains 500 records. The next step is to read the data file. The read.delim function is used to read the tabdelimited file and assign it to a data frame named decon data. The nrow function is used to determine the number of rows in the decon data data frame, and the resulting value is assigned to an object named nr. Finally, the initial six lines and the final six lines in the decon data data frame are printed using the head and tail functions, respectively. Read the survey design and analytical variables data file > # Read the data file and determine the number of rows in the file > decon_data <- read.delim("decon_data.tab") > nr <- nrow(decon_data) > Display the initial six lines in the data file. > # Display the initial six lines in the data file > head(decon_data) xcoord ycoord Richness Richness_25 Richness_50 Richness_100 1 0.2156870 0.1040707 9.195568 10.413331 2.291196 2.819672 2 0.7253217 0.4319726 8.853398 9.521474 4.521184 8.183711 3 0.3200359 0.4419541 8.631306 3.747440 4.815115 5.909885 4 0.9686209 0.8958557 10.361718 10.973729 5.192450 9.874936 5 0.0974488 0.1890712 12.424943 10.121154 5.235458 14.121246 6 0.8153121 0.6643893 9.635846 9.413341 5.872094 16.581228 2

> Display the final six lines in the data file. > # Display the final six lines in the data file > tail(decon_data) xcoord ycoord Richness Richness_25 Richness_50 Richness_100 495 0.7431377 0.68816008 28.40369 29.78458 33.61083 27.16529 496 0.8238306 0.92598006 30.17266 31.48048 33.89747 34.05622 497 0.8884304 0.07949327 26.96940 27.06877 34.58659 33.63208 498 0.5394230 0.60650557 29.80406 26.15533 34.93280 22.27781 499 0.4200749 0.28636323 28.68552 29.03119 36.22896 32.96995 500 0.5356669 0.64663409 30.65390 30.58069 37.46364 30.13545 > 4 Illustration of Extraneous Variance Measurement error is an example of extraneous variance, additional variance that is convoluted with the population frequency distribution for a variable of interest. The impact of extraneous variance on the CDF will be investigated in this section. As an initial step, the summary function is used to summarize the data structure of the species richness values for the original variable. > summary(decon_data$richness) Summarize the data structure of the species richness variable: Min. 1st Qu. Median Mean 3rd Qu. Max. 8.631 15.960 20.090 20.000 23.610 30.660 Note that the species richness values range between approximately 8 and 31. By way of comparison, consider the range of value for the original variable plus 100% additional measurement error. > summary(decon_data$richness_100) Summarize the data structure of the species richness variable plus 100% measurrement error: 3

Min. 1st Qu. Median Mean 3rd Qu. Max. 1.274 15.310 20.030 20.150 25.080 39.650 The range of species richness values now has increased to 1 to 40, which reflects the bias that is introduced into the CDF. To illustrate the bias induced by extraneous variance, CDFs for the species richness variables will be graphed. The first step is to assign a vector of values at which the CDFs will be calculated. The sequence (seq) function is used to create the set of values using the range 0 to 40. Output from the seq function is assigned to an object named cdfvals. > cdfvals <- seq(0,40,length=25) The next step is to calculate a CDF estimate for the original species richness variable. The cdf.est function in spsurvey will be used to calculate the estimate. Since values for the species richness variables were not selected in a probability survey, the survey design weights argument (wgt) for the function is assigned a vector of ones using the repeat (rep) function. Recall that the object named nr contains the number of rows in the decon data data frame. > CDF_org <- cdf.est(z=decon_data$richness, + wgt=rep(1, nr), Similarly, the cdf.est function will be used to calculate CDF estimates for the species richness variables that include measurement error. > CDF_25 <- cdf.est(z=decon_data$richness_25, + wgt=rep(1, nrow(decon_data)), > CDF_50 <- cdf.est(z=decon_data$richness_50, + wgt=rep(1, nrow(decon_data)), > CDF_100 <- cdf.est(z=decon_data$richness_100, + wgt=rep(1, nrow(decon_data)), 4

Density estimates facilitate visualizing the increased variance induced in a CDF by measurement error. The ash1.wgt function in spsurvey, which implements the average shifted histogram algorithm (Scott 1985), will be used to calculate density estimates for each of the species richness variables. > Density_org <- ash1.wgt(decon_data$richness, nbin=25) > Density_25 <- ash1.wgt(decon_data$richness_25, nbin=25) > Density_50 <- ash1.wgt(decon_data$richness_50, nbin=25) > Density_100 <- ash1.wgt(decon_data$richness_100, nbin=25) CDF and density estimates are displayed in Figure 1. Estimates for the original species richness variable are displayed using red. Estimates for variables containing measurement error are displayed using blue. 5

CDF Estimates Percent 0 20 40 60 80 Original Variable 25% Added Variance 50% Added Variance 100% Added Variance 0 5 10 15 20 25 30 35 40 Species Richness Density Estimates Resource Density 0.00 0.02 0.04 0.06 0 5 10 15 20 25 30 35 40 Species Richness Figure 1: CDF and Density Estimates for Species Richness Variables. 6

5 Deconvolution Deconvolution will be demonstrated using the species richness variable that includes 100% additional measurement error variance. The decon.est function in spsurvey will be used to calculate a deconvoluted CDF estimate, which uses an algorithm based on the procedure developed by Stefansky and Bay (1996). Argument sigma for that function specifies the extraneous variance. For an actual survey design, a value for sigma would be estimated from the survey data or obtained from some other source. Since we are using simulated data, a direct estimate of sigma is available. Specifically, the variance estimation function (var) will be used to estimate the additional variance for the species richness variable with added measurement error, and the result will be assigned to an object named extvar. > extvar <- var(decon_data$richness_100) - var(decon_data$richness) > CDF_decon <- cdf.decon(z=decon_data$richness_100, + wgt=rep(1,nr), + sigma=extvar, The original CDF and deconvoluted CDF estimates for the species richness variable that includes 100% additional measurement error variance are displayed in Figure 2. The CDF estimate and confidence bounds for the original variable are displayed in red. The deconvoluted CDF estimate and confidence bound are displayed in blue. One consequence of deconvolution is increased confidence bound width, which can be observed in the CDFs in Figure 2. For example, for a species richness value of 25, the confidence bounds for the original variable are 78.7% to 84.5%, which is a width of 5.8%. For the deconvoluted CDF, the confidence bounds are 74.2% to 83.9%, which is a width of 9.7%. 7

Percent 0 20 40 60 80 100 Original CDF Confidence Bounds Deconvoluted CDF Confidence Bounds 0 5 10 15 20 25 30 35 40 Species Richness Figure 2: Original and Deconvoluted CDF Estimates for a Species Richness Variable with Added Measurement Error. 8

References Kincaid, T. M. and A. R. Olsen (2012). Survey analysis in natural resource monitoring programs with a focus on cumulative distribution functions. In R. A. Gitzen, J. J. Millspaugh, A. B. Cooper, and D. S. Licht (Eds.), Design and Analysis of Long-term Ecological Monitoring Studies, pp. 313 324. Cambridge University Press. Scott, D. W. (1985). Averaged shifted histograms: effective nonparametric density estimators in several dimensions. Annals of Statistics 13, 1024 1040. Stefansky, F. A. and J. M. Bay (1996). Simulation extrapolation deconvolution of finite population cumulative distribution function estimators. Biometrika 83, 407 417. 9